Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora
2021
Rolands Laucis, Gints Jēkabsons

Nowadays, natural language processing (NLP) is increasingly relying on pre-trained word embeddings for use in various tasks. However, there is little research devoted to Latvian – a language that is much more morphologically complex than English. In this study, several experiments were carried out in three NLP tasks on four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram and ngram2vec. The obtained results can serve as a baseline for future research on the Latvian language in NLP. The main conclusions are the following: First, in the part-of-speech task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the overall best effectiveness. Third, the best results for all methods were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
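As an illustration of the kind of setup the abstract describes, the sketch below trains fastText embeddings with a 200-dimensional vector size (the dimension the abstract reports as best-performing) using the gensim library. This is a minimal, assumed example: the toy sentences, window size, epoch count and other parameters are hypothetical and do not reproduce the authors' actual corpora or pipeline.

```python
from gensim.models import FastText

# Hypothetical tokenized Latvian sentences; the paper's actual corpora are not shown here.
sentences = [
    ["latvijas", "universitāte", "atrodas", "rīgā"],
    ["dabiskās", "valodas", "apstrāde", "izmanto", "vārdu", "iegultnes"],
]

# Train fastText embeddings with a 200-dimensional vector size,
# matching the dimension the abstract reports as best-performing.
model = FastText(
    sentences=sentences,
    vector_size=200,   # embedding dimension
    window=5,          # context window size (assumed, not from the paper)
    min_count=1,       # keep all words in this tiny toy corpus
    sg=1,              # skip-gram training objective
    epochs=10,         # assumed training epochs
)

# Query a vector; fastText can also compose vectors for unseen words
# from character n-grams, which helps with rich Latvian morphology.
vector = model.wv["universitāte"]
print(vector.shape)  # (200,)
```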


Keywords
Named entity recognition, natural language processing, part-of-speech tagging, word analogy, word embeddings
DOI
10.2478/acss-2021-0016
Hyperlink
https://sciendo.com/article/10.2478/acss-2021-0016

Laucis, R., Jēkabsons, G. Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora. Applied Computer Systems, 2021, Vol. 26, No. 2, pp. 132–138. e-ISSN 2255-8691. Available from: doi:10.2478/acss-2021-0016

Publication language
English (en)