A curated list of awesome Kyrgyz language processing software, models and datasets. Inspired by awesome-ML.
The main focus is on open source tools, downloadable data and research papers with code.
If you want to contribute to this list (please do), send me a pull request. Also, a listed repository should be tagged as deprecated if:
- Repository's owners explicitly say that "this library is not maintained".
- The repository has had no commits for a long time (2-3 years).
- kkWaC: Kyrgyz corpus from the web, 19M words, Jan 2012
- Kyrgyz in Leipzig Corpora Collection: Community data / Newscrawl (1M sentences) / Wikipedia (300K sentences)
- Verbal paradigms for Kyrgyz (100 Kyrgyz verbs fully conjugated in all tenses) by Aytnatova Alima, annotation for Unimorph by E. Chodroff
- Kyrgyz language hand-written letters (Kyrgyz MNIST): a repository of images (in CSV format) of hand-written Kyrgyz alphabet letters for machine learning applications. The original images were resized to 50x50 pixels and then converted to CSV format.
The repository currently consists of 80213 hand-written images (50x50 pixels) representing all 36 letters of the Kyrgyz alphabet.
- kloop corpus: 16'826 articles (sqlite3 DB file) + crawler code
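Since the kloop corpus is distributed as an SQLite file, a quick way to get oriented is to list its tables and row counts. The sketch below does that with Python's standard `sqlite3` module. It assumes nothing about the corpus schema: the `articles` table in the demo is a made-up stand-in, not the real layout.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every table in an open SQLite connection."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    summary = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        summary[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return summary


# Demo on an in-memory database; the table name and columns are hypothetical.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
demo.executemany("INSERT INTO articles (body) VALUES (?)", [("биринчи",), ("экинчи",)])
print(summarize_tables(demo))  # {'articles': 2}
demo.close()
```

For the downloaded corpus, open the connection with `sqlite3.connect(...)` pointing at the DB file and pass it to `summarize_tables` in the same way.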
Several corpora are also mentioned in research works:
- TODO
- UD project comments on difficulties in Turkish language processing, which may shed light on why parsing Kyrgyz is hard as well
- KTMU's UD Treebank, 781 sentences
- Country names table: Kyrgyz-Russian-English
- Thesaurus KyrSpell (however, unpacking it seems to break the license)
- Polyglot morfessor — pretrained morfessor model, number 6
- fastText — 300-dimensional fastText vectors provided by the authors: bin, txt.
- BERT-based NER: `bert-base-multilingual-cased` fine-tuned on Wikiann for NER on Kyrgyz. The author warns that this model is not usable and was built only as a proof of concept; it will be updated later.
- spaCy basic support: tokenization, stopwords, `like_num`
- Kyrgyz for Apertium: morphological analysis and generation, PoS-tagging; installation script: install_apertium_kir.sh.
- [DEPRECATED] kymopl: Kyrgyz morphology in Prolog
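The fastText vectors listed above come in a binary and a text variant. As a minimal sketch, assuming the text download follows the usual fastText text layout (a header line with vocabulary size and dimensionality, then one word per line followed by space-separated floats), it can be read with the standard library alone. The function names and the toy 3-dimensional sample are illustrative assumptions; the real vectors are 300-dimensional.

```python
import io
import math


def load_text_vectors(lines, limit=None):
    """Parse fastText text-format vectors into {word: [float, ...]}."""
    lines = iter(lines)
    # The first line holds two integers: vocabulary size and dimensionality.
    _, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy sample with made-up values, standing in for the downloaded text file.
sample = io.StringIO("2 3\nкитеп 0.1 0.2 0.3\nмектеп 0.2 0.1 0.3\n")
vectors = load_text_vectors(sample)
print(round(cosine(vectors["китеп"], vectors["мектеп"]), 3))
```

With the real file, pass an open file object (opened with `encoding="utf-8"`) instead of the `StringIO` sample; `limit` keeps memory use down by reading only the first words of the file.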
Mentioned in papers:
- TODO
- Tilchi electronic Russian-Kyrgyz dictionary, open source desktop application
- ӨҮҢизатор: demo code for a proof-of-concept letter replacement Telegram bot that fixes incorrect usages of 'О', 'У', 'Н' in place of 'Ө', 'Ү', 'Ң'
- Number-to-words conversion (JavaScript) by @AzamatSooldaev
- Number-to-words conversion (TypeScript) by @timursaurus
- Telegram bot for Kyrgyz morphological analysis by @sasha-kir based on Apertium data for Kyrgyz
- Turkic Interlingua community and SIGTURK (ACL Turkic languages special interest group)
- Apertium's useful list of tools and other resources
- Online dictionaries and other useful resources on el-sozduk.kg
- Turkic languages-related resources compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
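To illustrate the letter-replacement task that ӨҮҢизатор (listed above) addresses, here is a minimal dictionary-based sketch. It is not the bot's actual method, which this list does not describe: it simply tries every combination of О/У/Н and Ө/Ү/Ң in a word and accepts a variant only when exactly one of them appears in a known vocabulary. The function names and the two-word vocabulary are hypothetical.

```python
from itertools import product

# Plain Cyrillic letters that are often typed in place of the Kyrgyz-specific ones.
REPLACEMENTS = {"о": "ө", "у": "ү", "н": "ң", "О": "Ө", "У": "Ү", "Н": "Ң"}


def candidates(word):
    """Yield every variant of `word` with any subset of О/У/Н swapped for Ө/Ү/Ң."""
    options = [(ch, REPLACEMENTS[ch]) if ch in REPLACEMENTS else (ch,) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def restore_letters(word, vocabulary):
    """Return the single in-vocabulary variant of `word`, or `word` unchanged."""
    if word.lower() in vocabulary:
        return word  # already a known word, leave it alone
    matches = [c for c in candidates(word) if c.lower() in vocabulary]
    # Only rewrite when the choice is unambiguous.
    return matches[0] if len(matches) == 1 else word


# Hypothetical vocabulary; a real one would come from a corpus or dictionary.
vocabulary = {"көңүл", "өмүр"}
print(restore_letters("конул", vocabulary))
print(restore_letters("Омур", vocabulary))
```

The number of candidates doubles with every replaceable letter, so this brute-force search is only reasonable for single words. Pairs where both spellings are valid words cannot be resolved without context, which is why already-known words and ambiguous matches are returned unchanged.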