Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora
2021
Rolands Laucis, Gints Jēkabsons

Nowadays, natural language processing (NLP) is increasingly relying on pre-trained word embeddings for use in various tasks. However, there is little research devoted to Latvian – a language that is much more morphologically complex than English. In this study, several experiments were carried out in three NLP tasks on four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future research on the Latvian language in NLP. The main conclusions are the following: First, in the part-of-speech task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the overall best effectiveness. Third, the best results for all methods were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
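As an illustration of the kind of setup the abstract describes, the sketch below trains fastText embeddings with a 200-dimensional vector size (the dimension the abstract reports as best-performing) using the gensim library. This is a minimal, assumed example: the toy sentences, window size, epoch count and other parameters are hypothetical and do not reproduce the authors' actual corpora or pipeline.

```python
from gensim.models import FastText

# Hypothetical tokenized Latvian sentences; the paper's actual corpora are not shown here.
sentences = [
    ["latvijas", "universitāte", "atrodas", "rīgā"],
    ["dabiskās", "valodas", "apstrāde", "izmanto", "vārdu", "iegultnes"],
]

# Train fastText embeddings with a 200-dimensional vector size,
# matching the dimension the abstract reports as best-performing.
model = FastText(
    sentences=sentences,
    vector_size=200,   # embedding dimension
    window=5,          # context window size (assumed, not from the paper)
    min_count=1,       # keep all words in this tiny toy corpus
    sg=1,              # skip-gram training objective
    epochs=10,         # assumed training epochs
)

# Query a vector; fastText can also compose vectors for unseen words
# from character n-grams, which helps with rich Latvian morphology.
vector = model.wv["universitāte"]
print(vector.shape)  # (200,)
```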


Keywords
Named entity recognition, natural language processing, part-of-speech tagging, word analogy, word embeddings
DOI
10.2478/acss-2021-0016
Hyperlink
https://sciendo.com/article/10.2478/acss-2021-0016

Laucis, R., Jēkabsons, G. Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora. Applied Computer Systems, 2021, Vol. 26, No. 2, pp. 132–138. e-ISSN 2255-8691. Available from: doi:10.2478/acss-2021-0016

Publication language
English (en)