This dataset is based on the Kyrgyz_News_Corpus.
Preliminary processing has been carried out:
- All characters other than Kyrgyz letters, Latin letters, and digits have been removed.
- Various variants of dashes/hyphens have been replaced with a single type of dash, different variants of quotation marks have been replaced with a single type of quotation mark, and extra spaces have been removed.
- Long news articles have been divided into lines so that mean(len) = 102.45 and std(len) = 56.72.
- Rows with languages other than Kyrgyz have been excluded.
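The character-level cleaning steps above can be sketched roughly as follows. This is only an illustration, not the script used to build the dataset: the function name `normalize`, the exact set of dash and quotation-mark variants, and the list of punctuation marks that are kept are all assumptions, since the original whitelist is not published here. Line splitting and language filtering are not covered.

```python
import re

# Assumption: Kyrgyz letters, Latin letters, and digits are kept,
# plus a small set of punctuation marks (the real whitelist is not given).
ALLOWED_CHARS = "а-яёңөүА-ЯЁҢӨҮa-zA-Z0-9"
PUNCT = re.escape('.,!?:;"()-')
DISALLOWED = re.compile(f"[^{ALLOWED_CHARS}{PUNCT} ]")

# Assumption: these are the dash and quote variants worth unifying.
DASHES = re.compile("[\u2010\u2011\u2012\u2013\u2014\u2015\u2212]")
QUOTES = re.compile("[\u00ab\u00bb\u201c\u201d\u201e]")


def normalize(text):
    """Unify dashes and quotes, drop disallowed characters, squeeze spaces."""
    text = DASHES.sub("-", text)
    text = QUOTES.sub('"', text)
    # Turn tabs and newlines into spaces before filtering characters.
    text = re.sub(r"\s+", " ", text)
    text = DISALLOWED.sub("", text)
    # Removing characters can leave double spaces behind.
    return re.sub(r" +", " ", text).strip()
```

Whitespace is collapsed twice on purpose: once so that tabs and newlines survive as spaces, and once more to clean up gaps left by removed characters.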
Misspelled (trash) text was created using various approaches:
- 1 million trash lines were generated using a probabilistic noiser. The probabilistic noiser was trained on a "golden dataset" of real errors, which is not public.
- 500 thousand trash lines were generated using a different probabilistic noiser.
- The remaining trash lines were created using a random noiser, which, for words longer than 5 letters, has a 20% probability of applying one of the following: deleting a letter, swapping letters, replacing a letter with another letter, or inserting an arbitrary letter.
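A random noiser of the kind described in the last item could look roughly like the sketch below. It is not the original implementation: the names `noise_word` and `noise_line` are hypothetical, and it assumes that the 20% probability is applied once per word, that the four operations are equally likely, that "swapping" means exchanging two adjacent letters, and that inserted or replacement letters are drawn from the lowercase Kyrgyz alphabet.

```python
import random

# The 36 lowercase letters of the Kyrgyz Cyrillic alphabet.
KYRGYZ_LETTERS = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"


def noise_word(word, rng, p=0.2):
    """With probability p, corrupt a word longer than 5 characters."""
    if len(word) <= 5 or rng.random() >= p:
        return word
    # Assumption: the four error types are chosen uniformly.
    op = rng.choice(["delete", "swap", "replace", "insert"])
    i = rng.randrange(len(word) - 1)  # leaves room for an adjacent swap
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    letter = rng.choice(KYRGYZ_LETTERS)
    if op == "replace":
        return word[:i] + letter + word[i + 1:]
    return word[:i] + letter + word[i:]  # insert


def noise_line(line, seed=None, p=0.2):
    """Apply noise_word to every space-separated token of a line."""
    rng = random.Random(seed)
    return " ".join(noise_word(w, rng, p) for w in line.split(" "))
```

Note that this sketch measures word length on raw tokens, so attached punctuation counts toward the 5-letter threshold, and a swap or replacement can occasionally reproduce the original word.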
Text with punctuation errors (punc_trash) was created using a random noiser, which has a 20% probability of deleting or inserting a comma and of replacing the period at the end of the sentence with another punctuation mark, such as "!" or "?".
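As a rough illustration of such a punctuation noiser, here is a minimal sketch. The function name `noise_punctuation` is hypothetical, and several details are assumptions: the 20% probability is applied independently at each existing comma, at each word boundary where a comma could be inserted, and once for the final period.

```python
import random


def noise_punctuation(line, rng, p=0.2):
    """Randomly delete/insert commas and alter the final period."""
    tokens = line.split(" ")
    out = []
    for idx, tok in enumerate(tokens):
        is_last = idx == len(tokens) - 1
        if tok.endswith(","):
            # Existing comma: delete it with probability p.
            if rng.random() < p:
                tok = tok[:-1]
        elif not is_last and tok and tok[-1].isalpha() and rng.random() < p:
            # No comma after this word: insert one with probability p.
            tok = tok + ","
        out.append(tok)
    result = " ".join(out)
    # Replace a sentence-final period with "!" or "?" with probability p.
    if result.endswith(".") and rng.random() < p:
        result = result[:-1] + rng.choice("!?")
    return result
```

Calling `noise_punctuation(line, random.Random(0))` on a clean line returns a version with, on average, a fifth of these positions altered under the assumptions above.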
Train and test datasets were created with train_test_split, using a train size of 2 million:
- Train size = 2000000
- Test size = 66223
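The two sizes add up to 2,066,223 lines in total. The README does not say which train_test_split was used (scikit-learn and Hugging Face `datasets` both provide one), so the sketch below reproduces the same shuffle-and-slice idea with the standard library only. The function name `split_by_train_size` and the seed are hypothetical.

```python
import random


def split_by_train_size(rows, train_size, seed=0):
    """Shuffle rows and cut them into train/test by an absolute train size."""
    rows = list(rows)
    if not 0 < train_size < len(rows):
        raise ValueError("train_size must be between 1 and len(rows) - 1")
    # Assumption: the split is a plain random shuffle with a fixed seed.
    random.Random(seed).shuffle(rows)
    return rows[:train_size], rows[train_size:]
```

With the full dataset, `split_by_train_size(lines, 2_000_000)` would leave the remaining 66,223 lines as the test set.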
Misspelled-KG-dataset can be downloaded from here.
This work was made possible by the robust AI community in Kyrgyzstan and the contributions of individuals within the AkylAI project (by TheCramer.com). We also express our gratitude to the Kyrgyz news agencies for their work, which allowed us to create this dataset.
We are working on a Kyrgyz spell checker and grammar corrector. Please feel free to reach out at [email protected] or [email protected] if you are interested in any form of collaboration!
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.