Word Embeddings for Low Resource Languages: The Case of Buryat

vaskonov/burvec

Learning Word Embeddings for Low Resource Languages: The Case of Buryat

Word vector representations have been studied extensively on large text datasets. However, only a few studies analyze semantic representations for low-resource languages, particularly when only a small corpus is available. In most cases, low-resource languages lack traditional natural language processing tools such as lemmatizers and stemmers. In this study, we introduce a methodology for building word embeddings for low-resource languages. The methodology consists of defining accurate preprocessing steps, applying a language-independent stemmer, and introducing techniques for building word vector representations. In addition, we propose a simple word embedding evaluation scheme that can be easily adapted to any language. Using this methodology, we trained word embeddings for the Buryat language. We have made the source code and the resulting word embeddings corpus publicly available to promote further research.

Buryat Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Erzya Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Komi Language Embeddings:

|     | 2 | 5 | 10 |
| --- | --- | --- | --- |
| 50  | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 100 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |
| 500 | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD | CBOW, SG, GloVe, SVD |

Files for evaluation: bxr (Buryat), myv (Erzya), kv (Komi)
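
Once an embedding file has been downloaded, a quick sanity check is to look up a word's nearest neighbours by cosine similarity. The sketch below is only an illustration: it assumes the vectors are stored in the plain-text word2vec layout (an optional `<vocab_size> <dim>` header, then one word per line followed by its components), which this README does not confirm. The helper names `load_vectors`, `cosine`, and `nearest`, and the toy vectors, are hypothetical and not part of the repository.

```python
import io
import math


def load_vectors(handle):
    """Read word vectors from a text stream (assumed word2vec text layout)."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 3:
            # Skips blank lines and an optional "<vocab_size> <dim>" header.
            # Assumes every real vector has at least two components.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, k=3):
    """Return the k words most similar to `word`, most similar first."""
    query = vectors[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Toy data standing in for a downloaded file (invented values, not real vectors).
toy = io.StringIO("4 2\nw1 1.0 0.0\nw2 0.9 0.1\nw3 0.0 1.0\nw4 -1.0 0.0\n")
vectors = load_vectors(toy)
neighbours = nearest(vectors, "w1", k=2)
print(neighbours)
```

With a real file, the same functions could be applied by opening it with `open(path, encoding="utf-8")` and passing the handle to `load_vectors`, provided the file really uses this text layout.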

Contact

For any questions, please contact [email protected]

Cite

@inproceedings{konovalov2018learning,
  title={Learning word embeddings for low resource languages: the case of Buryat},
  author={Konovalov, VP and Tumunbayarova, ZB},
  booktitle={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
  pages={331--341},
  year={2018}
}
