Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora
2021
Rolands Laucis, Gints Jēkabsons

Nowadays, natural language processing (NLP) increasingly relies on pre-trained word embeddings for use in various tasks. However, little research has been devoted to Latvian, a language that is morphologically much more complex than English. In this study, several experiments were carried out in three NLP tasks using four different methods of creating word embeddings: word2vec, fastText, Structured Skip-Gram, and ngram2vec. The obtained results can serve as a baseline for future research on the Latvian language in NLP. The main conclusions are the following: First, in the part-of-speech tagging task, using a training corpus 46 times smaller than in a previous study, the accuracy was 91.4 % (versus 98.3 % in the previous study). Second, fastText demonstrated the best overall effectiveness. Third, the best results for all methods were observed for embeddings with a dimension size of 200. Finally, word lemmatization generally did not improve results.
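
For illustration only (this is a minimal sketch, not the authors' actual pipeline), the following Python example shows how fastText embeddings with the reported best dimension size of 200 could be trained with the gensim library; the corpus file name, tokenization format, and remaining hyperparameters are assumptions made for the example.

# Illustrative sketch: training 200-dimensional fastText embeddings with gensim.
# The corpus path, its one-sentence-per-line format, and most hyperparameters
# are assumptions for demonstration and do not reproduce the paper's setup.
from gensim.models import FastText

# Assumed corpus format: one sentence per line, tokens separated by whitespace.
with open("latvian_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = FastText(
    sentences=sentences,
    vector_size=200,   # dimension size reported as best in the paper
    window=5,          # assumed context window
    min_count=5,       # assumed minimum word frequency
    sg=1,              # skip-gram training variant
    epochs=5,          # assumed number of passes over the corpus
)

# Query nearest neighbours of an arbitrary Latvian word.
print(model.wv.most_similar("valoda", topn=5))

Analogous calls with gensim's Word2Vec class (or the original ngram2vec and Structured Skip-Gram tools) would produce embeddings for the other methods compared in the study.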


Keywords
Named entity recognition, natural language processing, part-of-speech tagging, word analogy, word embeddings
DOI
10.2478/acss-2021-0016
Hyperlink
https://sciendo.com/article/10.2478/acss-2021-0016

Laucis, R., Jēkabsons, G. Evaluation of Word Embedding Models in Latvian NLP Tasks Based on Publicly Available Corpora. Applied Computer Systems, 2021, Vol. 26, No. 2, pp. 132-138. e-ISSN 2255-8691. Available from: doi:10.2478/acss-2021-0016

Publication language
English (en)