Many techniques can be employed when building a new machine translation model for a low-resource language or improving an existing baseline. Which techniques apply generally depends on the availability of parallel and monolingual corpora for the target language, and on the availability of parallel corpora for related languages or domains.
Scenario #1 - The data you have is super noisy (e.g., scraped from the web), and you aren't sure which sentence pairs are "good"
Papers:
- Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
- Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Resources / examples:
- Implementation - fast_align: creates word alignments that can be used to score sentence pairs
- Implementation - zipporah: parallel corpus cleaner
- Implementation - bicleaner: parallel corpus cleaner
- Implementation - LASER: Language-Agnostic SEntence Representations
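The idea behind embedding-based filtering with LASER is to embed both sides of each pair into a shared multilingual space and keep only pairs whose vectors are close. Below is a minimal sketch using the third-party laserembeddings package (install with pip, then run `python -m laserembeddings download-models`); the 0.8 threshold is an arbitrary starting point that you would tune on data you trust.

```python
# Minimal sketch of embedding-based corpus filtering with LASER.
import numpy as np
from laserembeddings import Laser

def filter_pairs(src_sents, tgt_sents, src_lang, tgt_lang, threshold=0.8):
    laser = Laser()
    src_vecs = laser.embed_sentences(src_sents, lang=src_lang)
    tgt_vecs = laser.embed_sentences(tgt_sents, lang=tgt_lang)
    # Cosine similarity between each source/target embedding pair.
    sims = np.sum(src_vecs * tgt_vecs, axis=1) / (
        np.linalg.norm(src_vecs, axis=1) * np.linalg.norm(tgt_vecs, axis=1))
    return [(s, t) for s, t, sim in zip(src_sents, tgt_sents, sims)
            if sim >= threshold]
```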
Scenario #2 - You don't have any parallel data for the source-target language pair; you only have monolingual target data
Papers:
- Phrase-Based & Neural Unsupervised Machine Translation
- Word Translation Without Parallel Data
- Unsupervised Statistical Machine Translation
Resources / examples:
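The common thread in these papers is aligning independently trained monolingual word embeddings into a shared space, then translating by nearest-neighbor lookup. Here is a rough numpy sketch of the mapping step behind Word Translation Without Parallel Data, under the assumption that you already have embedding matrices and some seed pairs (identical strings, or an adversarially learned initialization in the fully unsupervised case); all names and shapes are illustrative.

```python
# Sketch: learn an orthogonal map W between embedding spaces, then
# translate by nearest neighbor. X_seed/Y_seed are (n, d) row matrices
# of paired seed vectors.
import numpy as np

def procrustes(X_seed, Y_seed):
    # Closed-form solution of min ||W X - Y||_F over orthogonal W.
    U, _, Vt = np.linalg.svd(Y_seed.T @ X_seed)
    return U @ Vt

def translate_word(word_vec, W, tgt_matrix, tgt_words, k=5):
    # Map the source vector into target space, rank targets by cosine.
    mapped = W @ word_vec
    sims = tgt_matrix @ mapped / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```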
Scenario #3 - You only have a small amount of parallel data for the source-target language pair, but you have lots of parallel data for a related source-target language pair
Papers:
- Rapid Adaptation of Neural Machine Translation to New Languages
- Neural Machine Translation with Pivot Languages
- Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation
- Transfer Learning for Low-Resource Neural Machine Translation
- Trivial Transfer Learning for Low-Resource Neural Machine Translation
- Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages
Resources / examples:
- Implementation - rapid adaptation methods (Neubig)
- Video - rapid adaptation methods (Neubig)
- Implementation - transfer learning for low resource languages (Zoph)
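At its simplest ("trivial transfer"), the recipe in these papers is: train a parent model on the related high-resource pair, then continue training the same weights on the low-resource child data. A hedged PyTorch sketch follows; the model and data loader are placeholders for whatever NMT toolkit you use, and the `model(**batch).loss` call assumes a Hugging Face-style interface. Real toolkits differ mainly in how they reconcile parent and child vocabularies.

```python
# Sketch of parent->child transfer: warm-start from a checkpoint
# trained on the related pair, then fine-tune on the low-resource data.
import torch

def transfer_finetune(model, parent_ckpt, child_loader, lr=1e-4, epochs=5):
    state = torch.load(parent_ckpt, map_location="cpu")
    model.load_state_dict(state, strict=False)  # embedding rows may differ

    # A smaller learning rate helps keep what the parent learned.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in child_loader:
            loss = model(**batch).loss  # assumes HF-style forward
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```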
Scenario #4 - You only have a small amount of parallel data for the source-target language pair, but you have lots of monolingual data for the target and/or source language
Papers:
- Improving Neural Machine Translation Models with Monolingual Data
- Iterative Back-Translation for Neural Machine Translation
- Generalizing Back-Translation in Neural Machine Translation
- Improving Back-Translation with Uncertainty-based Confidence Estimation
- Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation
Resources / examples:
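The core move in back-translation (Sennrich et al.) is to decode your monolingual target text with a target-to-source model, producing synthetic source sides, and then mix those synthetic pairs into the real bitext. A minimal sketch, where `decode` is a hypothetical stand-in for your toolkit's batch translation call:

```python
# One round of back-translation: synthetic source + real target pairs
# are appended to the genuine parallel data.

def decode(reverse_model, sentences):
    """Hypothetical: translate target-language text into the source language."""
    raise NotImplementedError("plug in your trained target->source model")

def back_translate(reverse_model, mono_tgt, real_bitext):
    synthetic_src = decode(reverse_model, mono_tgt)
    synthetic_pairs = list(zip(synthetic_src, mono_tgt))
    # Common refinements: tag synthetic pairs, upsample the real bitext,
    # or iterate (retrain both directions on the augmented data).
    return real_bitext + synthetic_pairs
```

Iterative back-translation simply repeats this loop, retraining the forward and reverse models on each other's synthetic output.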
Scenario #5 - You have a small amount of parallel data for the source-target language pair, but you also have a lot of parallel data for other language pairs
Papers:
- Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
- Multilingual Neural Machine Translation With Soft Decoupled Encoding
- Meta-Learning for Low-Resource Neural Machine Translation
- Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies
Resources / examples:
- Video - Meta-learning for low resource MT
- Blog - Exploring Massively Multilingual, Massive Neural Machine Translation
- Blog - Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System
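A key piece of data preparation behind Google-style many-to-many multilingual NMT is prepending a target-language token to every source sentence, so a single shared model learns all pairs at once; this is what enables transfer into the low-resource pair and even zero-shot directions. A sketch, using the common `<2xx>` token convention (the exact token format varies by system):

```python
# Mix corpora for many language pairs into one training set for a
# single shared encoder-decoder.

def add_language_token(src_sentence, tgt_lang):
    return f"<2{tgt_lang}> {src_sentence}"

def build_multilingual_corpus(corpora):
    """corpora: dict mapping (src_lang, tgt_lang) -> list of (src, tgt) pairs."""
    mixed = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            mixed.append((add_language_token(src, tgt_lang), tgt))
    return mixed  # shuffle, then train one shared model on everything
```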
Scenario #6 - You don't have any data for the source-target language pair, not even monolingual data, but you have a linguist or a speaker
Papers:
- Apertium: a free/open-source platform for rule-based machine translation (Machine Translation 24(1), pp. 1–18)
Resources / examples:
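Rule-based systems like Apertium work without any corpus at all: a linguist writes dictionaries and transfer rules, and translation proceeds through morphological analysis, structural transfer, and generation. The toy sketch below only illustrates the shape of that pipeline; the dictionary, tags, and rule are invented examples, whereas a real system encodes thousands of entries and rules.

```python
# Toy illustration of an analysis -> transfer -> generation pipeline.

LEXICON = {"casa": "house", "blanca": "white"}  # toy bilingual dictionary

def transfer(tagged_words):
    # Toy structural rule: Spanish noun-adjective -> English adjective-noun.
    if len(tagged_words) == 2 and tagged_words[0][1] == "n":
        return [tagged_words[1], tagged_words[0]]
    return tagged_words

def translate(sentence, analyses):
    tagged = [(word, analyses[word]) for word in sentence.split()]
    return " ".join(LEXICON[word] for word, _ in transfer(tagged))

print(translate("casa blanca", {"casa": "n", "blanca": "adj"}))  # white house
```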
General papers and resources about African languages or African language MT:
- Towards Neural Machine Translation for African Languages
- A Focus on Neural Machine Translation for African Languages
- Parallel Corpora for bi-lingual English-Ethiopian Languages Statistical Machine Translation
- cocohub.cc, tools for crowdsourcing parallel corpora
- Bitextor tool for mining parallel corpora from websites
- CommonCrawl split by language. If the language isn't supported by CLD2, build a language model and ask the maintainers to run a perplexity filter on CommonCrawl (see the sketch after this list).
- OLAC
- Glottolog
- Ethnologue (free up to a certain number of page views, but many universities have subscriptions if you happen to be affiliated with one)
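For the perplexity filter mentioned above, the usual approach is to score each CommonCrawl line with a language model trained on trusted in-language text and keep only low-perplexity lines. A minimal sketch using the KenLM Python bindings; the threshold is corpus-dependent and must be tuned.

```python
# Keep only lines the in-language LM finds plausible.
import kenlm

def perplexity_filter(lines, lm_path, max_ppl=1000.0):
    model = kenlm.Model(lm_path)  # ARPA or binary LM you trained yourself
    return [line for line in lines if model.perplexity(line) <= max_ppl]
```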