Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:

  • The repository's owners explicitly say that "this library is not maintained".
  • It has not been committed to for a long time (2-3 years).

Table of Contents

Datasets

The repository currently consists of 80213 handwritten images (50x50 pixels) representing all 36 letters of the Kyrgyz alphabet.

Raw text

  • kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
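Since the kloop corpus ships as a sqlite3 DB file, a first step is usually to see which tables and columns it contains. The sketch below does not assume anything about the corpus schema (this list does not document it); it only reads SQLite's own catalog. The function name `list_tables` is made up for illustration.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Note: sqlite3.connect() creates an empty file if db_path does not exist,
    so check the path first when working with a downloaded corpus.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for name in names:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
            schema[name] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling `list_tables("kloop.sqlite3")` (a hypothetical file name) would return a dictionary of table names and their columns, which is enough to write the actual `SELECT` for the article text.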

Several corpora are also mentioned in research works:

  • TODO

Syntax

Machine-readable dictionaries

Pretrained models

  • Polyglot morfessor — pretrained morfessor model, number 6
  • fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
  • BERT-based NER: bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and was built only as a proof of concept. Will be updated later.
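The txt variant of the fastText vectors can be read without the fastText library. The sketch below assumes the standard fastText text layout: a header line with the vocabulary size and dimension, then one word per line followed by its space-separated values. The function name `load_vec` and the `limit` parameter are illustrative, not part of any library.

```python
def load_vec(lines, limit=None):
    """Parse fastText text-format vectors from an iterable of lines.

    Returns (dim, {word: [float, ...]}). `limit` caps how many words are read,
    which helps when the full file does not fit comfortably in memory.
    """
    it = iter(lines)
    # Header line: "<vocabulary size> <dimension>"
    header = next(it).split()
    dim = int(header[1])
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip rows that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return dim, vectors
```

With a real file this would be used as `with open(path, encoding="utf-8") as f: dim, vectors = load_vec(f, limit=50000)`, where `path` points at the downloaded txt file; for the vectors listed above `dim` should come back as 300.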

Methods/Software

  • spaCy basic support: tokenization, stopwords, like_num

Morphology

Mentioned in papers:

  • TODO

Hate Speech detection

Other

Online Demos

Miscellaneous