PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation
PEACH is a sequence-to-sequence multilingual transformer model trained with semi-supervised pseudo-parallel document generation (SPDG), our proposed pre-training objective for multilingual models.
Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents' quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH's ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel data, PEACH can be valuable for low-resource languages.
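As a rough illustration of how SPDG produces a pseudo-parallel pair, the sketch below first translates a monolingual document word by word with a bilingual dictionary and then passes the result through a denoising model that reorders, adds, removes, and substitutes words. All names here (`word_by_word_translate`, `denoise_fn`, the toy dictionary) are hypothetical and only meant to convey the idea; the actual pipeline lives in the `pretrain` directory.

```python
# Illustrative sketch of SPDG pseudo-parallel data generation (hypothetical
# names; see the pretrain directory for the actual implementation).

def word_by_word_translate(document, dictionary):
    """Translate a document word by word with a bilingual dictionary.
    Words missing from the dictionary are kept unchanged."""
    return " ".join(dictionary.get(word, word) for word in document.split())

def make_pseudo_parallel(document, dictionary, denoise_fn):
    """Pair a monolingual document with its generated pseudo-translation.

    denoise_fn stands in for the pre-trained denoising model, which reorders,
    adds, removes, and substitutes words to make the rough word-by-word
    translation more fluent."""
    rough = word_by_word_translate(document, dictionary)
    return document, denoise_fn(rough)

# Toy example with a tiny en->de dictionary and an identity "denoiser":
en_de = {"the": "die", "cat": "Katze", "sleeps": "schläft"}
print(make_pseudo_parallel("the cat sleeps", en_de, lambda text: text))
# ('the cat sleeps', 'die Katze schläft')
```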
The files in the repository are organized as follows:

    |
    |__ models
    |    |
    |    |__ peach
    |    |    |
    |    |    |__ bin
    |    |    |__ data
    |    |    |__ datasets
    |    |    |__ eval
    |    |    |__ layers
    |    |    |__ models
    |    |    |__ ops
    |    |    |__ params
    |    |    |__ requirements.txt
    |    |    |__ setup.py
    |    |
    |    |__ requirements.txt
    |
    |__ pretrain
         |
         |__ T5
         |__ mBART
         |__ peach
              |
              |__ denoising
              |__ translation
The `models` directory contains the TensorFlow code for building the models and configuring their parameters. There is a README in the repository that shows exactly how the code works and how it can be used.
In the `pretrain` directory, we have the implementation of our pre-training objective, as well as mT5's and mBART's objectives. For our objective, there are two pre-training methods (a minimal sketch of the denoising step is given after this list):

- word-by-word translation, which can be found in the `translation` directory
- denoising, which can be found in the `denoising` directory
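To give a sense of the denoising method, the sketch below corrupts a monolingual sentence by removing, substituting, adding, and locally reordering words; the denoising model is then pre-trained to reconstruct the original sentence from the corrupted one. The function name and noise probabilities are illustrative assumptions, not the repository's exact settings.

```python
import random

def corrupt(tokens, vocab, p_drop=0.1, p_sub=0.1, p_add=0.1, window=3):
    """Corrupt a token list for denoising pre-training (illustrative only)."""
    noisy = []
    for token in tokens:
        r = random.random()
        if r < p_drop:                       # remove the word
            continue
        if r < p_drop + p_sub:               # substitute the word
            noisy.append(random.choice(vocab))
        else:
            noisy.append(token)
        if random.random() < p_add:          # add a spurious word
            noisy.append(random.choice(vocab))
    # locally reorder words within small windows
    for i in range(0, len(noisy), window):
        chunk = noisy[i:i + window]
        random.shuffle(chunk)
        noisy[i:i + window] = chunk
    return noisy

sentence = "multilingual pre-training improves machine translation".split()
noisy = corrupt(sentence, vocab=sentence)
# Training pair for the denoiser: (noisy input, original target)
print(" ".join(noisy), "->", " ".join(sentence))
```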
To find out how to change the hyperparameters and parameters of the models, read the README files in the `models` directory.
The `peach_training_finetuning.ipynb` notebook shows how to generate the data for the different models (pre-training), how to train the models, and how to fine-tune them. You can use the following checkpoints if you do not want to train the models from scratch (a short sketch for inspecting a downloaded checkpoint is given after the link tables below).
Here are the links to the denoising models:
language | model | vocab |
---|---|---|
German (de) | download | download
English (en) | download | download
French (fr) | download | download
Macedonian (mk) | download | download
Pre-trained and fine-tuned models for MLM objective:
Pre-trained model links:
language | model | vocab |
---|---|---|
en, fr, and de | download | download |
en, fr, and de (XNLI) | download | download
en and mk | download | download |
Fine-tuned models:
language pairs | model | vocab |
---|---|---|
de-en | download | download |
de-fr | download | download |
en-de | download | download |
en-fr | download | download |
fr-de | download | download |
fr-en | download | download |
en-mk | download | download |
mk-en | download | download |
Pre-trained and fine-tuned models for MLM with Reordering objective:
Pre-trained model links:
language | model | vocab |
---|---|---|
en, fr, and de | download | download |
en and mk | download | download |
XNLI for MLM with Reordering is available here:
model | vocab |
---|---|
download | download |
Fine-tuned models:
language pairs | model | vocab |
---|---|---|
de-en | download | download |
de-fr | download | download |
en-de | download | download |
en-fr | download | download |
fr-de | download | download |
fr-en | download | download |
en-mk | download | download |
mk-en | download | download |
Pre-trained model checkpoints for the SPDG objective:
checkpoint | model | vocab |
---|---|---|
checkpoint-100000 | download | download |
checkpoint-200000 | download | download |
checkpoint-300000 | download | download |
checkpoint-400000 | download | download |
checkpoint-500000 | download | download |
XNLI for SPDG is available here:
model | vocab |
---|---|
download | download |
Pre-trained language-pair models:
language pair | model | vocab |
---|---|---|
en and de | download | download |
en and fr | download | download |
en and mk | download | download |
fr and de | download | download |
Fine-tuned language-pair models:
language pair | model | vocab |
---|---|---|
en-de | download | download |
de-en | download | download |
en-fr | download | download |
fr-en | download | download |
de-fr | download | download |
fr-de | download | download |
en-mk | download | download |
mk-en | download | download |
Transformer models:
languages | model | vocab |
---|---|---|
de-en | download | download |
de-fr | download | download |
en-de | download | download |
en-fr | download | download |
fr-de | download | download |
fr-en | download | download
Translation models:
checkpoint | languages | model | vocab |
---|---|---|---|
checkpoint-100000 | de-en | download | download |
checkpoint-100000 | de-fr | download | download |
checkpoint-100000 | en-de | download | download |
checkpoint-100000 | en-fr | download | download |
checkpoint-100000 | fr-de | download | download |
checkpoint-100000 | fr-en | download | download |
checkpoint-200000 | de-en | download | download |
checkpoint-200000 | de-fr | download | download |
checkpoint-200000 | en-de | download | download |
checkpoint-200000 | en-fr | download | download |
checkpoint-200000 | fr-de | download | download |
checkpoint-200000 | fr-en | download | download |
checkpoint-300000 | de-en | download | download |
checkpoint-300000 | de-fr | download | download |
checkpoint-300000 | en-de | download | download |
checkpoint-300000 | en-fr | download | download |
checkpoint-300000 | fr-de | download | download |
checkpoint-300000 | fr-en | download | download |
checkpoint-400000 | de-en | download | download |
checkpoint-400000 | de-fr | download | download |
checkpoint-400000 | en-de | download | download |
checkpoint-400000 | en-fr | download | download |
checkpoint-400000 | fr-de | download | download |
checkpoint-400000 | fr-en | download | download |
checkpoint-500000 | de-en | download | download |
checkpoint-500000 | de-fr | download | download |
checkpoint-500000 | en-de | download | download |
checkpoint-500000 | en-fr | download | download |
checkpoint-500000 | fr-de | download | download |
checkpoint-500000 | fr-en | download | download |
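If you download any of the checkpoints linked above, the short sketch below shows one way to sanity-check them locally. It assumes the model files are standard TensorFlow checkpoints and the vocab files are SentencePiece models; both are assumptions, and the paths are placeholders you should replace with wherever you saved the files.

```python
# Hedged sketch for inspecting a downloaded checkpoint and its vocabulary.
# Assumes TensorFlow checkpoints and SentencePiece vocabularies; adjust if the
# downloaded files use a different format.
import tensorflow as tf
import sentencepiece as spm

ckpt_path = "downloads/peach/model.ckpt"     # placeholder path
vocab_path = "downloads/peach/vocab.model"   # placeholder path

# List a few variables stored in the checkpoint to confirm it loads correctly.
reader = tf.train.load_checkpoint(ckpt_path)
for name, shape in sorted(reader.get_variable_to_shape_map().items())[:10]:
    print(name, shape)

# Tokenize a sentence with the matching vocabulary.
sp = spm.SentencePieceProcessor(model_file=vocab_path)
print(sp.encode("PEACH generates pseudo-parallel documents.", out_type=str))
```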
@inproceedings{salemi-etal-2023-peach,
title = "{PEACH}: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation",
author = "Salemi, Alireza and
Abaskohi, Amirhossein and
Tavakoli, Sara and
Shakery, Azadeh and
Yaghoobzadeh, Yadollah",
booktitle = "Proceedings of the The Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.loresmt-1.3",
pages = "32--46",
abstract = "Multilingual pre-training significantly improves many multilingual NLP tasks, including machine translation. Most existing methods are based on some variants of masked language modeling and text-denoising objectives on monolingual data. Multilingual pre-training on monolingual data ignores the availability of parallel data in many language pairs. Also, some other works integrate the available human-generated parallel translation data in their pre-training. This kind of parallel data is definitely helpful, but it is limited even in high-resource language pairs. This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training. First, a denoising model is pre-trained on monolingual data to reorder, add, remove, and substitute words, enhancing the pre-training documents{'} quality. Then, we generate different pseudo-translations for each pre-training document using dictionaries for word-by-word translation and applying the pre-trained denoising model. The resulting pseudo-parallel data is then used to pre-train our multilingual sequence-to-sequence model, PEACH. Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks, including supervised, zero- and few-shot scenarios. Moreover, PEACH{'}s ability to transfer knowledge between similar languages makes it particularly useful for low-resource languages. Our results demonstrate that with high-quality dictionaries for generating accurate pseudo-parallel, PEACH can be valuable for low-resource languages.",
}