Trained models

Training strategies differ primarily in which data were used for training. They also fall into two distinct groups depending on whether the model was trained in a single step, or in two steps where a subset of the data is used for pre-training followed by a fine-tuning step.

Data

We used two datasets:

  • The critical error annotations released at WMT 2021 for translations of English Wikipedia comments into Czech, German, Japanese and Chinese (Specia et al., 2021)
  • The DEMETR dataset (Karpinska et al., 2022) of synthetically created critical errors in news data (10 language pairs)

The WMT 2021 authentic data was used in two ways:

  • monolingual: the language pairs were treated independently and a model was trained on each separately
  • multilingual: the data for all 4 language pairs was combined into a single dataset
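
Purely as an illustration of the two settings, here is a minimal Python sketch with toy stand-in examples (the field names and data are hypothetical, not the repository's actual loading code):

```python
# Minimal sketch (not the repository's actual data handling): toy examples stand in
# for the WMT 2021 critical error annotations; field names are hypothetical.
LANG_PAIRS = ["en-cs", "en-de", "en-ja", "en-zh"]

# Monolingual setting: one dataset (and later one model) per language pair.
monolingual = {
    lp: [{"src": f"{lp} source sentence", "mt": f"{lp} translation", "label": 0}]
    for lp in LANG_PAIRS
}

# Multilingual setting: concatenate all four pairs into a single training set.
multilingual = [example for lp in LANG_PAIRS for example in monolingual[lp]]

print(len(multilingual))  # 4 toy examples, one per language pair
```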

Training strategies

In the one-step group, we primarily used the authentic WMT data. In the monolingual case, a model was trained on the original monolingual data: one model for each language pair (English-Czech, English-German, English-Japanese and English-Chinese). In the multilingual case, one multilingual model was trained on the combined multilingual authentic data and then used to make predictions for all language pairs.
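
A minimal sketch of the one-step setting is shown below; `Model`, `train` and the toy datasets are hypothetical placeholders rather than the repository's actual training API:

```python
# Sketch of the one-step setting only: one model per language pair in the
# monolingual case, a single model over all pairs in the multilingual case.
class Model:
    """Placeholder for the actual critical error detection model."""
    def fit(self, data, epochs=100):
        # stand-in for a real training call
        return self

def train(data):
    return Model().fit(data)

monolingual = {"en-cs": [], "en-de": [], "en-ja": [], "en-zh": []}  # toy per-pair datasets
multilingual = [ex for data in monolingual.values() for ex in data]

# Monolingual one-step: one model for each language pair.
monolingual_models = {lp: train(data) for lp, data in monolingual.items()}

# Multilingual one-step: a single model trained on all pairs, then used for every pair.
multilingual_model = train(multilingual)
```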

In the two-step group, we first pre-trained the model using either:

  • the multilingual authentic data
  • the synthetic DEMETR data
  • the multilingual authentic data combined with the DEMETR data

Each of the three base models was then fine-tuned on each of the four sets of authentic monolingual data, as sketched below.
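
The flow of the two-step strategy can be pictured as follows; again, `Model`, `train` and the dataset variables are hypothetical placeholders, not the repository's API:

```python
# Sketch of the two-step strategy (step 1: pre-train a base model, step 2: fine-tune per pair).
import copy

class Model:
    """Placeholder for the actual critical error detection model."""
    def fit(self, data, epochs=1):
        # stand-in for a real training call; a real model would update its weights here
        return self

def train(model, data, epochs):
    return model.fit(data, epochs=epochs)

# Hypothetical datasets (lists of training examples), reusing names from the text above.
multilingual_auth, demetr_synthetic = [], []
monolingual = {"en-cs": [], "en-de": [], "en-ja": [], "en-zh": []}

# The three pre-training sets, named after the experiment names used later on this page.
pretrain_sets = {
    "base_auth": multilingual_auth,
    "base_demetr": demetr_synthetic,
    "base_demetr_auth": multilingual_auth + demetr_synthetic,
}

finetuned = {}
for base_name, pretrain_data in pretrain_sets.items():
    base = train(Model(), pretrain_data, epochs=100)           # step 1: pre-train a base model
    for lp, mono_data in monolingual.items():                  # step 2: fine-tune it on each pair
        finetuned[(base_name, lp)] = train(copy.deepcopy(base), mono_data, epochs=100)
# 3 base models x 4 language pairs = 12 fine-tuned two-step models
```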

Each of the five experiments provides a model for each of the 4 language pairs, giving 20 models in total. For all experiments, we repeated training over five random seeds. For each training run of 100 epochs, we selected the epoch that achieved the highest MCC (Matthews correlation coefficient) on the development dataset.
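
The selection rule can be illustrated with a short sketch; `evaluate_epoch` is a hypothetical stand-in that returns development-set labels and predictions, while `matthews_corrcoef` is the standard scikit-learn implementation of MCC:

```python
# Sketch of epoch selection: for every seed, keep the epoch with the best dev-set MCC.
import random
from sklearn.metrics import matthews_corrcoef

def evaluate_epoch(seed: int, epoch: int, n: int = 200):
    """Hypothetical stand-in: returns (gold labels, predictions) on the dev set."""
    rng = random.Random(seed * 1000 + epoch)
    gold = [rng.randint(0, 1) for _ in range(n)]
    pred = [g if rng.random() < 0.7 else 1 - g for g in gold]  # noisy toy predictions
    return gold, pred

best_per_seed = {}
for seed in range(5):                      # five random seeds per experiment
    scored_epochs = []
    for epoch in range(1, 101):            # each training run lasts 100 epochs
        gold, pred = evaluate_epoch(seed, epoch)
        scored_epochs.append((matthews_corrcoef(gold, pred), epoch))
    best_per_seed[seed] = max(scored_epochs)   # keep (best dev MCC, epoch) for this seed
```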

Config files

The tables below give an overview of the trained models and the corresponding config files used to train them.

For the two-step experiments, we first pre-trained a new base model. All of the base models are defined in the same config file (base_models) but each with a different experiment name:

| Experiment name | Base model config file experiment name |
| --- | --- |
| Two-step > Multilingual auth data | base_auth |
| Two-step > Synthetic | base_demetr |
| Two-step > Multilingual auth + synthetic data | base_demetr_auth |
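
The exact layout of the config files is not reproduced here; purely as an illustration, a single file defining several experiments by name could be read like this (the YAML structure below is hypothetical, not the repository's actual format):

```python
# Hypothetical illustration only: assumes the base_models config is a YAML file
# with one section per experiment name, and selects one experiment from it.
import yaml

BASE_MODELS_YAML = """
base_auth:
  train_data: [wmt21_en-cs, wmt21_en-de, wmt21_en-ja, wmt21_en-zh]
base_demetr:
  train_data: [demetr]
base_demetr_auth:
  train_data: [wmt21_en-cs, wmt21_en-de, wmt21_en-ja, wmt21_en-zh, demetr]
"""

experiments = yaml.safe_load(BASE_MODELS_YAML)
config = experiments["base_auth"]   # pick one pre-training experiment by name
print(config["train_data"])
```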

For models that were trained on monolingual data, the En-Ja language pair was run in a separate config group, as it required a smaller batch size of 32; the other language pairs could all run with a batch size of 64.

| Experiment name | Experiment description | Config file |
| --- | --- | --- |
| One-step > Monolingual auth data | A model was trained using the WMT 2021 monolingual data, one model for each language pair. | train_monolingual_auth_data and train_monolingual_auth_data_enja |
| One-step > Multilingual auth data | One model was trained on the combined multilingual authentic data. | train_multilingual_auth_data_all |
| Two-step > Multilingual auth data | A model was first pre-trained on all multilingual authentic data, then fine-tuned on each of the four sets of authentic monolingual data. | second_step_base_auth_data and second_step_base_auth_data_enja |
| Two-step > Synthetic | A model was first pre-trained on all multilingual synthetic data, then fine-tuned on each of the four sets of authentic monolingual data. | second_step_base_demetr_data and second_step_base_demetr_data_enja |
| Two-step > Multilingual auth + synthetic data | A model was first pre-trained on all multilingual authentic and synthetic data, then fine-tuned on each of the four sets of authentic monolingual data. | second_step_base_demetr_auth_data and second_step_base_demetr_auth_data_enja |