osyvokon/awesome-ukrainian-nlp

awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

News

2024/01 -- UNLP 2024 shared task has been announced

1. Datasets / Corpora

Monolingual

  • Malyuk — 113GB of text, compilation of UberText 2.0, OSCAR, Ukrainian News.
  • Brown-UK — carefully curated corpus of the modern Ukrainian language with disambiguated tokens, 1 million words
  • UberText 2.0 — over 5 GB of news, Wikipedia, social, fiction, and legal texts
  • Wikipedia
  • OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28GB after deduplication.
  • CC-100 — documents extracted from Common Crawl, automatically classified and filtered. Ukrainian part is 200M sentences or 10GB of deduplicated text.
  • mC4 — another filtered Common Crawl extract, 196GB of Ukrainian text.
  • Ukrainian Twitter corpus - Ukrainian Twitter corpus for toxic text detection.
  • Ukrainian forums — 250k sentences scraped from forums.
  • Ukrainian news headlines — 5.2M news headlines.
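
Several of the corpora above are tens or hundreds of gigabytes, so it usually makes sense to stream them rather than download them in full. The sketch below assumes the Hugging Face `datasets` library and uses the `wikimedia/wikipedia` dataset with the `20231101.uk` config as an illustrative Ukrainian Wikipedia source. That dataset name, the config, and the `text` field are assumptions and are not taken from this list; substitute the identifier of whichever corpus you actually use. The helper names `preview` and `stream_corpus` are hypothetical.

```python
from itertools import islice


def preview(records, n=3, width=80):
    """Return the first `n` text snippets from an iterable of {"text": ...} records."""
    return [rec["text"][:width].replace("\n", " ") for rec in islice(records, n)]


def stream_corpus(name="wikimedia/wikipedia", config="20231101.uk"):
    """Open a corpus in streaming mode (dataset name and config are assumptions)."""
    # Imported lazily so `preview` works even without the library installed.
    from datasets import load_dataset  # pip install datasets

    # streaming=True iterates over remote shards instead of downloading everything first.
    return load_dataset(name, config, split="train", streaming=True)
```

Calling `preview(stream_corpus())` should return a few short snippets without fetching the whole dump; `preview` works on any iterable of dictionaries with a `text` key.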

Parallel

See Helsinki-NLP/UkrainianLT for more data and machine translation resources links.

Labeled

Dictionaries

Prompts

  • Aya — crowd-sourced prompts and reference outputs. Ukrainian part is ~500 prompts.

2. Tools

  • tree_stem — stemmer
  • pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
  • LanguageTool — grammar, style and spell checker
  • Stanza — Python package for tokenization, multi-word-tokenization, lemmatization, POS, dependency parsing, NER
  • nlp-uk — Tools for cleaning and normalizing texts, tokenization, lemmatization, POS, disambiguation
  • NLP-Cube - Python package for tokenization, sentence splitting, multi-word-tokenization, lemmatization, part-of-speech tagging and dependency parsing.
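
As a rough illustration of how one of these tools is typically called, here is a minimal sketch using Stanza's Python API with its default Ukrainian pipeline. It assumes `stanza` is installed and that the `uk` models can be downloaded on first use. The helper names `token_rows` and `analyze` are hypothetical and only serve this example.

```python
def token_rows(doc):
    """Flatten a parsed document into (text, lemma, upos) tuples, one per word."""
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def analyze(text, lang="uk"):
    """Run the default Stanza pipeline for `lang` and return one row per word."""
    # Imported lazily so `token_rows` can be used without Stanza installed.
    import stanza  # pip install stanza

    stanza.download(lang)        # fetches the models on first use
    nlp = stanza.Pipeline(lang)  # default processors for the language
    return token_rows(nlp(text))
```

For example, `analyze("Київ є столицею України.")` should return one tuple per word with its lemma and universal POS tag. Building the pipeline is slow, so in real use create it once and reuse it across texts.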

3. Pretrained models

Language models

Autoregressive:

  • aya-101 — massively multilingual LM, 13B parameters
  • pythia-uk — Pythia fine-tuned on wiki and oasst1 for chats in Ukrainian.
  • UAlpaca — Llama fine-tuned for instruction following on the machine-translated Alpaca dataset.
  • XGLM — multilingual autoregressive LM, the 4.5B checkpoint includes Ukrainian.
  • Tereveni-AI/GPT-2
  • uk4b and haloop inference toolkit - GPT-2 small, medium and large-style models trained on UberText 2.0 wikipedia, news and books.
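
Most of the autoregressive models above can be tried through the Hugging Face `transformers` text-generation pipeline. The sketch below assumes `transformers` and a backend such as PyTorch are installed, and uses `facebook/xglm-4.5B` as the Hub identifier for the XGLM checkpoint mentioned above. That identifier is an assumption, and the helper names `build_generator` and `complete` are hypothetical.

```python
def build_generator(model_name="facebook/xglm-4.5B"):
    """Create a text-generation pipeline (the model name is an assumption)."""
    # Imported lazily; loading a multi-billion-parameter model needs a lot of memory.
    from transformers import pipeline  # pip install transformers

    return pipeline("text-generation", model=model_name)


def complete(generator, prompt, max_new_tokens=40):
    """Greedily continue `prompt` and return the generated text as a string."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

A call such as `complete(build_generator(), "Столиця України")` continues the prompt. Passing the generator in as an argument makes it easy to swap in a different checkpoint from the list.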

Masked:

Mixed:

Machine translation

See Helsinki-NLP/UkrainianLT for more.

Sequence-to-sequence models

Named-entity recognition (NER)

Part-of-speech tagging (POS)

Word embeddings

Other

4. Paid

5. Other resources and links

6. Workshops and conferences
