awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

News

2024/01 -- UNLP 2024 shared task has been announced

1. Datasets / Corpora

Monolingual

Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News.
Brown-UK — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words
UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
Wikipedia
OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. Ukrainian portion of it is 28GB deduplicated.
CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
mC4 — another filtered Common Crawl corpus, 196GB of Ukrainian text.
Ukrainian Twitter corpus - Ukrainian Twitter corpus for toxic text detection.
Ukrainian forums — 250k sentences scraped from forums.
Ukrainian news headlines — 5.2M news headlines.

Parallel

OPUS
Tatoeba MT Challenge data sets
Polish-Ukrainian Parallel Corpus
Back-translated monolingual Wiki data
Wiki Edits — 5M sentence edits extracted from the Ukrainian Wikipedia revision history.

See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.
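
Edit corpora such as Wiki Edits pair an earlier and a later version of a sentence. As a minimal sketch of how token-level edits could be recovered from such a pair, the snippet below uses Python's standard `difflib`. The whitespace tokenization, the function name `extract_edits`, and the sample sentences are illustrative assumptions, not the method used to build any of the corpora above.

```python
import difflib


def extract_edits(before: str, after: str):
    """Return (operation, old_tokens, new_tokens) for every changed span."""
    old, new = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only insert / delete / replace spans
            edits.append((tag, old[i1:i2], new[j1:j2]))
    return edits


# Hypothetical sentence pair, not taken from any real corpus.
edits = extract_edits("Я люблю читати книга", "Я дуже люблю читати книги")
for operation, old_tokens, new_tokens in edits:
    print(operation, old_tokens, new_tokens)
```

A real pipeline would likely use a proper tokenizer (for example one of the tools listed in section 2) instead of `str.split`.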

Labeled

ZNO — ~4000 questions and answers from Ukrainian External independent testing (ЗНО/ZNO).
UA-GEC — grammatical error correction (GEC) and fluency corpus.
NER-uk — Brown-UK labeled for named entities.
Yakaboo Book Reviews — book reviews, ratings and descriptions.
Universal Dependencies — dependency trees corpus.
ua-news — 150k news articles in 5 categories.
UA-SQuAD — Ukrainian version of Stanford Question Answering Dataset.
Ukrainian Winograd schema challenge (WSC) Dataset — manually translated.
Ukrainian OntoNotes Dataset — scripts to build large silver dataset for coreference resolution.

Dictionaries

ВЕСУМ — POS tag dictionary. Can generate a list of all word forms valid for spelling.
Tonal dictionary
Multilingualsentiment, includes Ukrainian - a list of positive/negative words
obscene-ukr — profanity dictionary
Word stress dictionary — word stress for 2.7M word forms. See ukrainian-word-stress
Heteronyms — words that share the same spelling but have different meaning/pronunciation.
Abbreviations — map abbreviation to expansion
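
An abbreviation-to-expansion mapping can be applied to text with a single regular expression. The sketch below is only an illustration: the three entries in `ABBREVIATIONS` and the function name `expand_abbreviations` are hypothetical, and a real mapping would be loaded from the dictionary itself. It does not inflect the expansion to fit the sentence.

```python
import re

# Hypothetical sample entries; load the real mapping from the dictionary file.
ABBREVIATIONS = {
    "вул.": "вулиця",
    "обл.": "область",
    "див.": "дивись",
}

# Longest keys first so that longer abbreviations win over their prefixes.
# The lookbehind stops a match from starting in the middle of a word.
_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
    + ")"
)


def expand_abbreviations(text: str) -> str:
    """Replace every known abbreviation with its dictionary expansion."""
    return _PATTERN.sub(lambda match: ABBREVIATIONS[match.group(1)], text)


print(expand_abbreviations("див. вул. Хрещатик, Київська обл."))
```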

Prompts

Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.

2. Tools

tree_stem — stemmer
pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
LanguageTool — grammar, style and spell checker
Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
NLP-Cube - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.

3. Pretrained models

Language models

Autoregressive:

aya-101 — massively multilingual LM, 13B parameters
pythia-uk — Pythia finetuned on wiki and oasst1 for chats in Ukrainian.
UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
Tereveni-AI/GPT-2
uk4b and haloop inference toolkit - GPT-2 small, medium and large-style models trained on UberText 2.0 wikipedia, news and books.

Masked:

xlm-roberta-base-uk — truncated version of XLM-RoBERTa with only Ukrainian and English embeddings left.
youscan/ukr-roberta-base

Mixed:

Electra

Machine translation

Helsinki-NLP / OPUS-MT models — Ukrainian to/from 25 languages.
- OPUS-MT models at HuggingFace
- OPUS-MT models evaluated on flores101
M2M-100 — Ukrainian to/from 100 languages.
Uk-En folktale corpus — small sentence-aligned corpus of fairy tales.

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

mBART50
mT5

Named-entity recognition (NER)

Part-of-speech tagging (POS)

lang-uk/flair-uk-pos

Word embeddings

fastText
- Official fastText trained on CommonCrawl and Wiki — 157 languages, including Ukrainian.
- Older official fastText trained on Wiki — 294 languages, including Ukrainian.
- fastText_multilingual — 78 languages, aligned to the same vector space.
- fasttext_uk (2023) and cbow — trained on UberText 2.0
Word2Vec
GloVe
LexVec
BPEmb: Subword Embeddings, includes Ukrainian - easy to use with Flair
Flair — Ukrainian added in 2022.
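
Pretrained fastText vectors are distributed, among other formats, as plain-text `.vec` files: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. The sketch below reads that format with the standard library only and compares two words by cosine similarity. The helper names `load_vec` and `cosine` and the three toy two-dimensional vectors are assumptions made for illustration.

```python
import io
import math


def load_vec(lines):
    """Parse .vec text: a 'count dim' header, then 'word v1 ... vN' rows."""
    rows = iter(lines)
    _count, dim = map(int, next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real file; the values are made up.
toy_file = io.StringIO("3 2\nкіт 1.0 0.0\nкішка 0.8 0.6\nстіл 0.0 1.0\n")
vectors = load_vec(toy_file)
print(round(cosine(vectors["кіт"], vectors["кішка"]), 3))
```

For a real download, pass an open file object (opened with `encoding="utf-8"`) instead of the `StringIO`. Loading a full vocabulary into Python lists is memory-hungry, so dedicated libraries are a better fit beyond quick inspection.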

Other

uk-punctcase — punctuation and case restoration model based on XLM-RoBERTa-Uk.
punctuation_uk_bert — another punctuation and case restoration model based on bert-base-multilingual-cased.
ukrainian-word-stress — adds word stress.
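
One common convention for marking stress in Ukrainian text is a combining acute accent (U+0301) placed after the stressed vowel. The sketch below shows how such marks can be added and removed with plain string operations. It is not the API of any tool listed here: the function names and the vowel-index interface are assumptions. The removal step deliberately avoids Unicode decomposition, because NFD would also split letters such as "й" and "ї" into a base letter plus a combining mark.

```python
STRESS_MARK = "\u0301"  # combining acute accent
VOWELS = set("аеєиіїоуюяАЕЄИІЇОУЮЯ")


def add_stress(word: str, vowel_index: int) -> str:
    """Insert a stress mark after the vowel_index-th vowel (0-based)."""
    seen = -1
    for position, char in enumerate(word):
        if char in VOWELS:
            seen += 1
            if seen == vowel_index:
                return word[: position + 1] + STRESS_MARK + word[position + 1 :]
    raise ValueError(f"{word!r} has no vowel number {vowel_index}")


def strip_stress(text: str) -> str:
    """Remove stress marks without touching letters such as 'й' or 'ї'."""
    return text.replace(STRESS_MARK, "")


# A heteronym pair: the same spelling with stress on different syllables.
castle = add_stress("замок", 0)
lock = add_stress("замок", 1)
print(castle, lock, strip_stress(castle) == strip_stress(lock))
```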

4. Paid

LORELEI Ukrainian Representative Language Pack - Ukrainian monolingual text, Ukrainian-English parallel text, partially annotated for named entities

5. Other resources and links

Helsinki-NLP/UkrainianLT — another collection of links to Ukrainian language tools.
egorsmkv / speech-recognition-uk — speech recognition and text-to-speech models and datasets

6. Workshops and conferences

Ukrainian Natural Language Processing Workshop
UNLP 2023 shared task — shared task (competition) in grammatical error correction for Ukrainian
- Training data and evaluation scripts
- Public leaderboard
UNLP 2024 shared task — shared task (competition) on fine-tuning large language models (LLMs) for Ukrainian
