Awesome Kyrgyz NLP

A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.

The main focus is on open source tools, downloadable data and research papers with code.

If you want to contribute to this list (please do), send me a pull request. A listed repository should be tagged as deprecated if:

The repository's owners explicitly say that "this library is not maintained".
It has not been committed to for a long time (2-3 years).

Datasets

kkWaC: Kyrgyz corpus from the web, 19M words, Jan 2012
Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
Kyrgyz language hand-written letters (Kyrgyz MNIST): A repository of images (in CSV format) of hand-written Kyrgyz alphabet letters for machine learning applications. The original images were resized to 50x50 pixels and then converted to CSV format.

The repository currently contains 80213 hand-written images (50x50 pixels) representing all 36 letters of the Kyrgyz alphabet.
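
Since the letters are stored as CSV rather than as image files, each row has to be reshaped back into a 50x50 grid before it can be displayed or fed to a model. The sketch below shows one way to do that with the standard library only. The column layout is an assumption (a label in the first column followed by 2500 pixel values); check the repository's files before relying on it. The `row_to_image` helper and the synthetic row are hypothetical and only stand in for real data.

```python
import csv
import io

SIDE = 50  # images are 50x50 pixels, per the dataset description


def row_to_image(row, side=SIDE, has_label=True):
    """Split one CSV row into (label, 2-D list of pixel ints).

    Assumes the label (if present) is the first column and the remaining
    side*side columns are pixel intensities in row-major order.
    """
    label = row[0] if has_label else None
    values = row[1:] if has_label else row
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} pixel values, got {len(values)}")
    pixels = [int(v) for v in values]
    # Cut the flat pixel list into `side` rows of `side` pixels each.
    image = [pixels[r * side:(r + 1) * side] for r in range(side)]
    return label, image


# Synthetic single-row CSV standing in for a real file from the repository.
fake_row = ["7"] + [str(i % 256) for i in range(SIDE * SIDE)]
buffer = io.StringIO(",".join(fake_row) + "\n")
label, image = row_to_image(next(csv.reader(buffer)))
```

If the real files carry no label column, pass `has_label=False`; the length check will fail loudly if the assumed layout does not match.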

Raw text

kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
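
The table and column names inside the kloop SQLite file are not described here, so a sensible first step is to inspect the schema rather than guess it. The sketch below lists every table and its columns using only Python's built-in `sqlite3` module. The in-memory database and the `articles` table are hypothetical stand-ins; with the real corpus you would pass the path of the downloaded DB file to `sqlite3.connect`.

```python
import sqlite3


def describe_database(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (name,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        schema[name] = [col[1] for col in columns]
    return schema


# Hypothetical stand-in for the real corpus file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
schema = describe_database(conn)
conn.close()
```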

Several corpora are also mentioned in research works:

TODO

Syntax

UD project comments on difficulties in Turkish language processing, which might shed light on why parsing Kyrgyz is hard as well
KTMU's UD Treebank, 781 sentences

Machine-readable dictionaries

Country names table: Kyrgyz-Russian-English
Thesaurus KyrSpell (however, unpacking it seems to break the license)

Pretrained models

Polyglot morfessor — pretrained morfessor model, number 6
fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
BERT-based NER — bert-base-multilingual-cased fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and is built just as a proof of concept. Will be updated later.
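
The `txt` variant of the fastText vectors can be read without the fastText library. The standard fastText text layout is a header line with the vocabulary size and the dimensionality, followed by one line per word: the word, then its space-separated vector components. The sketch below parses that layout from any file-like object. The `load_vectors` helper and the toy 3-dimensional data are made up for illustration; the real file is 300-dimensional and much larger, so the optional `limit` argument may be useful.

```python
import io


def load_vectors(handle, limit=None):
    """Parse fastText text-format vectors into (dim, {word: [floats]})."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        values = [v for v in parts[1:] if v]  # tolerate a trailing space
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Toy data standing in for the real vector file.
toy = io.StringIO("2 3\nкитеп 0.1 0.2 0.3\nсуу 0.4 0.5 0.6\n")
dim, vectors = load_vectors(toy)
```

With the downloaded file, the same function can be called on `open(path, encoding="utf-8")`, where `path` is wherever you saved the vectors.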

Methods/Software

spaCy basic support: tokenization, stopwords, like_num

Morphology

Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh.
[DEPRECATED] kymopl: Kyrgyz morphology in Prolog

Mentioned in papers:

TODO

Hate Speech detection

Jupyter Notebook for hate speech detection

Other

Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
ӨҮҢизатор: proof-of-concept demo code for a letter replacement Telegram bot that fixes incorrect usages of 'О', 'У', 'Н' by replacing them with 'Ө', 'Ү', 'Ң'
Number-to-words conversion (JavaScript) by @AzamatSooldaev
Number-to-words conversion (TypeScript) by @timursaurus
Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
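
To give a feel for the letter-restoration task that ӨҮҢизатор addresses, here is a minimal dictionary-based sketch. It is not the bot's actual method, which is not described here: it simply assumes a word list is available, tries every combination of 'о'/'ө', 'у'/'ү' and 'н'/'ң', and accepts a fix only when exactly one candidate is a known word. The `restore` and `candidates` helpers and the two-word lexicon are hypothetical, and only lowercase input is handled.

```python
from itertools import product

# Letters commonly typed in place of the Kyrgyz-specific ones.
SUBSTITUTES = {"о": "ө", "у": "ү", "н": "ң"}


def candidates(word):
    """Yield every spelling obtained by optionally swapping о/у/н for ө/ү/ң."""
    options = [(ch, SUBSTITUTES[ch]) if ch in SUBSTITUTES else (ch,) for ch in word]
    return ("".join(combo) for combo in product(*options))


def restore(word, lexicon):
    """Return the corrected word, or the input unchanged if unsure."""
    if word in lexicon:
        return word  # already a known word, leave it alone
    matches = [c for c in candidates(word) if c in lexicon]
    # Only correct when the lexicon offers exactly one possibility.
    return matches[0] if len(matches) == 1 else word


LEXICON = {"өлкө", "мугалим"}  # toy word list, an assumption for this sketch
restored = [restore(w, LEXICON) for w in ["олко", "мугалим", "тест"]]
```

A purely lexical approach like this cannot resolve pairs where both spellings are valid words; those cases need context, which is beyond this sketch.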

Online Demos

Cyrillic-to-Latin online converter based on this resource.

Miscellaneous

Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
Apertium's useful list of tools and other resources
Online dictionaries and other useful resources on el-sozduk.kg
Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
