Speech Translation on MuST-C

MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations.

The final performance of speech translation on the 8 languages of MuST-C (tst-COMMON) is reported in the tables below.

See RESULTS for the comparison with counterparts.

The benchmark models:

Language   Models
DE         [ASR] [MT] [ST] [ST+SpecAug]
ES         [ASR] [MT] [ST] [ST+SpecAug]
FR         [ASR] [MT] [ST] [ST+SpecAug]
IT         [ASR] [MT] [ST] [ST+SpecAug]
NL         [ASR] [MT] [ST] [ST+SpecAug]
PT         [ASR] [MT] [ST] [ST+SpecAug]
RO         [ASR] [MT] [ST] [ST+SpecAug]
RU         [ASR] [MT] [ST] [ST+SpecAug]
  • ASR (d_model=256, WER)

    Model            DE    ES    FR    IT    NL    PT    RO    RU
    Transformer ASR  13.6  13.0  12.9  13.5  13.8  14.4  13.7  13.4
  • MT/ST (d_model=256, case-sensitive, tokenized BLEU/detokenized BLEU)

    Model                                           DE         ES         FR         IT         NL         PT         RO         RU
    Transformer MT                                  27.9/27.8  32.9/32.8  42.2/40.2  29.0/28.5  32.9/32.7  34.4/34.0  27.5/26.4  19.3/19.1
    Cascade ST (Transformer ASR -> Transformer MT)  23.5/23.4  28.1/28.0  35.8/33.9  24.3/23.8  27.3/27.1  28.6/28.3  23.3/22.2  16.2/16.0
    Transformer ST + ASR pretrain                   21.9/21.9  26.9/26.8  34.2/32.3  22.6/22.2  26.5/26.4  27.8/27.6  21.9/20.9  15.0/15.2
    Transformer ST + ASR pretrain + SpecAug         22.8/22.8  27.5/27.4  35.2/33.3  23.4/22.9  27.4/27.2  29.0/28.7  23.2/22.2  15.2/15.1

In this recipe, we introduce how to preprocess the MuST-C corpus and train/evaluate a speech translation model using NeurST.

Contents

  • Requirements
  • Data preprocessing
  • Training and evaluation

Requirements

apt

  • libsndfile1

pip

  • TensorFlow >=2.3.0
  • soundfile
  • python_speech_features
  • subword-nmt
  • pyyaml
  • sacrebleu
  • sacremoses

others

$ git clone https://github.com/moses-smt/mosesdecoder.git

Data preprocessing

Take the English-German portion as an example.

Step 1: Download Data

First, we download the original tgz files into the directory /path_to_data/raw/, so that we have

/path_to_data/
└── raw
    ├── MUSTC_v1.0_en-de.tar.gz
    ├── MUSTC_v1.0_en-es.tar.gz
    └── ......

Step 2: Extract audio features

The speech translation corpus contains raw source audio files, texts in the target language and other optional information (e.g. transcriptions of the corresponding audio files). Here we pre-compute audio features (that is, log-mel filterbank coefficients) because the computation is time-consuming and the features are usually fixed during training and evaluation.

Though NeurST supports preprocessing audio inputs on-the-fly, we recommend packing the extracted features into TF Records to alleviate the I/O and CPU overhead.
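For illustration, the following is a minimal sketch of such feature extraction using the python_speech_features package; the wav path is hypothetical, and the recipe script below handles the whole corpus and additionally packs the features into TF Records:

import soundfile as sf
from python_speech_features import logfbank

# Read a raw audio segment (hypothetical path, for illustration only).
audio, sample_rate = sf.read("/path_to_data/raw/example.wav")

# 80-channel log-mel filterbank coefficients with 25ms windows and 10ms
# steps, matching the defaults described below.
features = logfbank(audio, samplerate=sample_rate,
                    winlen=0.025, winstep=0.01, nfilt=80)
print(features.shape)  # (num_frames, 80)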

We can extract audio features with

$ ./examples/speech_to_text/must-c/02-audio_feature_extraction.sh /path_to_data de --untar

Here, we use the --untar option to extract the tgz file first, because repeatedly iterating over the compressed file (which contains a huge .h5 file) is quite time-consuming.

By default, the script extracts 80-channel log-mel filterbank coefficients using the lightweight Python package python_speech_features, with windows of 25ms and steps of 10ms. Then we have

/path_to_data/
├── devtest
│   ├── dev.en-de.tfrecords-00000-of-00001
│   └── tst-COMMON.en-de.tfrecords-00000-of-00001
├── train
│   └── de
│       ├── train.tfrecords-00000-of-00128
│       ├── ......
│       └── train.tfrecords-00127-of-00128
└── transcripts
    └── de
        ├── dev.de.txt
        ├── dev.en.txt
        ├── train.de.txt
        ├── train.en.txt
        ├── tst-COMMON.de.txt
        └── tst-COMMON.en.txt

where the directory /path_to_data/train/de/ (resp. /path_to_data/devtest/) contains the extracted audio features and the corresponding transcriptions (and translations) in TF Record format for training (resp. evaluation). Transcriptions and translations in txt format are stored in /path_to_data/transcripts/de/.

Furthermore, to examine the elements in the TF Record files, we can simply run the command-line tool view_tfrecord:

$ python3 -m neurst.cli.view_tfrecord /path_to_data/train/de/

features {
  feature {
    key: "audio"
    value {
      float_list {
        ......
        value: -0.033860281109809875
        value: -0.025679411366581917
      }
    }
  }
  feature {
    key: "transcript"
    value {
      bytes_list {
        value: "She took our order, and then went to the couple in the booth next to us, and she lowered her voice so much, I had to really strain to hear what she was saying."
      }
    }
  }
  feature {
    key: "translation"
    value {
      bytes_list {
        value: "Sie nahm unsere Bestellung auf, ging dann zum Paar in der Nische neben uns und senkte ihre Stimme so sehr, dass ich mich richtig anstrengen musste, um sie zu verstehen."
      }
    }
  }
}

elements: {
    "transcript": bytes (str)
    "translation": bytes (str)
    "audio": float32
}
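For reference, here is a minimal sketch of parsing these elements with plain TensorFlow; the feature names follow the printout above and the glob path follows the directory layout from Step 2:

import tensorflow as tf

# Feature spec matching the element types printed by view_tfrecord.
feature_spec = {
    "audio": tf.io.VarLenFeature(tf.float32),
    "transcript": tf.io.FixedLenFeature([], tf.string),
    "translation": tf.io.FixedLenFeature([], tf.string),
}

dataset = tf.data.TFRecordDataset(
    tf.io.gfile.glob("/path_to_data/train/de/train.tfrecords-*"))
for record in dataset.take(1):
    example = tf.io.parse_single_example(record, feature_spec)
    # The audio feature is stored as a flattened float list
    # (num_frames x 80 channels).
    audio = tf.sparse.to_dense(example["audio"])
    print(audio.shape, example["transcript"].numpy())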

Step 3: Preprocess transcriptions and translations

As mentioned above, we map word tokens to IDs beforehand to speed up the training process.

By running the preprocessing script with the arguments MOSES_DIR, ROOT_DATA_PATH and TRG_LANG:

$ ./examples/speech_to_text/must-c/03-preprocess.sh /path_to_moses /path_to_data de

we learn a vocabulary based on BPE rules with 8,000 merge operations. The learnt BPE codes and vocabulary are shared across the ASR, MT and ST tasks. Note that we lowercase the transcriptions and remove all punctuation, while the case and punctuation of the translations are preserved and we simply apply the Moses tokenizer. As a result, we obtain

/path_to_data/
├── asr_st
│   └── de
│       ├── asr_prediction_args.yml
│       ├── asr_training_args.yml
│       ├── asr_validation_args.yml
│       ├── codes.bpe
│       ├── st_prediction_args.yml
│       ├── st_training_args.yml
│       ├── st_validation_args.yml
│       ├── train
│       │   ├── train.tfrecords-00000-of-00064
│       │   ├── ......
│       │   └── train.tfrecords-00063-of-00064
│       ├── vocab.en
│       └── vocab.de
└── mt
    └── de
        ├── codes.bpe
        ├── mt_prediction_args.yml
        ├── mt_training_args.yml
        ├── mt_validation_args.yml
        ├── train
        │   ├── train.en.clean.tok.bpe.txt
        │   └── train.de.tok.bpe.txt
        ├── vocab.en
        └── vocab.de

Here, we use txt files (not TF Record) for MT tasks, while the pre-processed training samples for ASR/ST are stored in TF Record files (/path_to_data/asr_st/de/train/).
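To make the text pipeline concrete, here is a minimal sketch of the preprocessing applied to ASR transcriptions in this step, using sacremoses and subword-nmt from the requirements; the sample sentence is illustrative and the exact normalization in the script may differ:

import string
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

tokenizer = MosesTokenizer(lang="en")
# Apply the shared BPE codes learnt in this step (path follows the tree above).
with open("/path_to_data/asr_st/de/codes.bpe") as codes:
    bpe = BPE(codes)

line = "She took our order, and then went to the couple in the booth next to us."
tokens = tokenizer.tokenize(line, return_str=True, escape=False)

# For ASR transcriptions: lowercase and strip punctuation before applying BPE.
# Translations instead keep case and punctuation and only get tokenized + BPE.
lowered = tokens.lower().translate(str.maketrans("", "", string.punctuation))
print(bpe.process_line(" ".join(lowered.split())))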

In addition, configuration files (*.yml) are generated for the subsequent training/evaluation processes. In detail:

  • *_training_args.yml: defines the arguments for training, such as batch size, optimizer, paths of training data and data pre-processing pipelines.
  • *_validation_args.yml: defines the arguments for validation during training, containing the validation dataset, the interval between two validation procedures, metrics and the configuration of automatic checkpoint averaging.
  • *_prediction_args.yml: defines the arguments for inference and evaluation, containing test sets, inference options (like beam size) and metrics.
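Since these are plain YAML files, they can be inspected with pyyaml from the requirements; the concrete keys are produced by the recipe, so this sketch only lists them:

import yaml

# Load one of the generated configuration files (path follows the tree above).
with open("/path_to_data/asr_st/de/asr_training_args.yml") as f:
    training_args = yaml.safe_load(f)

# Print the top-level option names defined by the recipe.
print(sorted(training_args.keys()))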

Training and evaluation

The training and evaluation procedures are the same as those of the AugmentedLibrispeech recipe.