The implementation of *Learning Deep Transformer Models for Machine Translation* (ACL 2019) by Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao.

This code is based on Fairseq v0.5.0.
Install with:

- `pip install -r requirements.txt`
- `python setup.py develop`
- `python setup.py install`

NOTE: tested with torch==0.4.1.
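
A quick, optional way to confirm the environment matches this tested setup (just an illustrative check, not part of the repo):

```python
# Optional sanity check: this repo is only tested with torch==0.4.1.
import torch

print("torch version:", torch.__version__)
if not torch.__version__.startswith("0.4.1"):
    print("warning: this code is only tested with torch==0.4.1")
```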

Prepare the data:

- Download the preprocessed WMT'16 En-De dataset provided by Google to the project root dir.
- Generate the binary dataset at `data-bin/wmt16_en_de_google` by running `bash runs/prepare-wmt-en2de.sh`.
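
To confirm the binarized data ended up where the training scripts expect it, a rough check (the file names below are assumptions based on fairseq's usual binarization output, not taken from `prepare-wmt-en2de.sh`):

```python
# Rough check that data-bin/wmt16_en_de_google looks binarized.
# The dictionary file names are assumed, not verified against this repo's script.
import os

data_dir = "data-bin/wmt16_en_de_google"
expected = ["dict.en.txt", "dict.de.txt"]  # assumed fairseq dictionary files
missing = [f for f in expected if not os.path.isfile(os.path.join(data_dir, f))]
print("missing:", missing or "nothing -- the dataset directory looks in place")
```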

Train one of the three configurations:

- `bash runs/train-wmt-en2de-deep-prenorm-baseline.sh`
- `bash runs/train-wmt-en2de-deep-postnorm-dlcl.sh`
- `bash runs/train-wmt-en2de-deep-prenorm-dlcl.sh`

NOTE: BLEU is calculated automatically when training finishes.
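
The `*-dlcl` scripts train deep models with a dynamic linear combination of layers (DLCL): each layer reads a learned, layer-specific weighted sum of the outputs of all earlier layers, including the embedding. Below is only a rough PyTorch sketch of that idea, not the code in this repo: it uses `nn.TransformerEncoderLayer` (which requires a newer PyTorch than the torch==0.4.1 tested above), and the averaging initialization and pre-norm-style normalization are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


class DLCLEncoder(nn.Module):
    """Toy encoder where each layer's input is a learned mix of all earlier layers."""

    def __init__(self, num_layers: int = 6, dim: int = 512, heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim)
            for _ in range(num_layers)
        )
        # weights[l, k] mixes history entry k (k <= l) into the input of layer l.
        # Initialized to a plain average as a placeholder, not the paper's scheme.
        init = torch.zeros(num_layers, num_layers)
        for l in range(num_layers):
            init[l, : l + 1] = 1.0 / (l + 1)
        self.weights = nn.Parameter(init)
        # One LayerNorm per history entry, applied before mixing (pre-norm style).
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        history = [emb]  # y_0: the embedding output
        for l, layer in enumerate(self.layers):
            # Input of layer l: sum over k <= l of weights[l, k] * LayerNorm_k(y_k).
            x = sum(self.weights[l, k] * self.norms[k](y) for k, y in enumerate(history))
            history.append(layer(x))  # y_{l+1}
        return history[-1]


if __name__ == "__main__":
    # nn.TransformerEncoderLayer's default layout is (seq_len, batch, dim).
    enc = DLCLEncoder(num_layers=6, dim=512, heads=8)
    out = enc(torch.randn(10, 2, 512))
    print(out.shape)  # torch.Size([10, 2, 512])
```

Roughly speaking, the pre-norm and post-norm DLCL variants differ in whether layer normalization is applied to each history entry before the weighted sum or to the sum itself; the three `runs/` scripts above select between these and the deep pre-norm baseline without DLCL.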

Results on the WMT'16 En-De task:

Model | #Param. | Epoch* | BLEU |
---|---|---|---|
Transformer (base) | 65M | 20 | 27.3 |
Transparent Attention (base, 16L) | 137M | - | 28.0 |
Transformer (big) | 213M | 60 | 28.4 |
RNMT+ (big) | 379M | 25 | 28.5 |
Layer-wise Coordination (big) | 210M* | - | 29.0 |
Relative Position Representations (big) | 210M | 60 | 29.2 |
Deep Representation (big) | 356M | - | 29.2 |
Scaling NMT (big) | 210M | 70 | 29.3 |
Our deep pre-norm Transformer (base, 20L) | 106M | 20 | 28.9 |
Our deep post-norm DLCL (base, 25L) | 121M | 20 | 29.2 |
Our deep pre-norm DLCL (base, 30L) | 137M | 20 | 29.3 |

NOTE: * denotes approximate values.