# Textless Speech-to-Speech Translation (S2ST) on Real Data

We provide instructions and pre-trained models for the work "Textless Speech-to-Speech Translation on Real Data (Lee et al. 2021)".

## Pre-trained Models

### HuBERT

| Model | Pretraining Data | Checkpoint | Quantizer |
|---|---|---|---|
| mHuBERT Base | VoxPopuli En, Es, Fr speech from the 100k subset | download | L11 km1000 |
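For orientation, the discrete units used throughout this example come from quantizing mHuBERT layer-11 features with the 1000-centroid k-means quantizer above. The following is a minimal, unofficial sketch of that pipeline; the checkpoint paths and audio filename are placeholders, and it assumes 16 kHz mono input and a joblib-serialized k-means model (the usual fairseq HuBERT setup).

```python
import joblib
import torch
import torchaudio
from fairseq import checkpoint_utils

# Placeholder paths: substitute the mHuBERT checkpoint and the
# L11 km1000 quantizer downloaded from the table above.
HUBERT_CKPT = "/path/to/mhubert_base.pt"
KM_MODEL = "/path/to/l11_km1000.bin"

models, _, _ = checkpoint_utils.load_model_ensemble_and_task([HUBERT_CKPT])
hubert = models[0].eval()

wav, sr = torchaudio.load("sample.wav")  # placeholder file; expects 16 kHz mono
assert sr == 16000 and wav.size(0) == 1

with torch.no_grad():
    # Layer-11 features, matching the "L11 km1000" quantizer config.
    feats, _ = hubert.extract_features(
        wav, padding_mask=None, mask=False, output_layer=11
    )

km = joblib.load(KM_MODEL)  # assumed: sklearn k-means with 1000 centroids
units = km.predict(feats.squeeze(0).cpu().numpy())
print(" ".join(str(u) for u in units))  # space-separated unit sequence
```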

### Unit-based HiFi-GAN vocoder

| Unit config | Unit size | Vocoder language | Dataset | Model |
|---|---|---|---|---|
| mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
| mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |

### Speech normalizer

| Language | Training data | Target unit config | Model |
|---|---|---|---|
| En | 10 mins | mHuBERT, layer 11, km1000 | download |
| En | 1 hr | mHuBERT, layer 11, km1000 | download |
| En | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Es | 10 mins | mHuBERT, layer 11, km1000 | download |
| Es | 1 hr | mHuBERT, layer 11, km1000 | download |
| Es | 10 hrs | mHuBERT, layer 11, km1000 | download |
| Fr | 10 mins | mHuBERT, layer 11, km1000 | download |
| Fr | 1 hr | mHuBERT, layer 11, km1000 | download |
| Fr | 10 hrs | mHuBERT, layer 11, km1000 | download |

* Refer to the paper for the details of the training data.

## Inference with Pre-trained Models

### Speech normalizer

1. Download the pre-trained models, including the dictionary, to `DATA_DIR`.
2. Format the audio data (an illustrative manifest example follows the command below).

```bash
# AUDIO_EXT: audio extension, e.g. wav, flac, etc.
# Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}

python examples/speech_to_speech/preprocessing/prep_sn_data.py \
  --audio-dir ${AUDIO_DIR} --ext ${AUDIO_EXT} \
  --data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
  --for-inference
```
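For reference, fairseq-style audio manifests put the audio root directory on the first line, followed by one tab-separated `relative_path<TAB>num_samples` entry per line. Assuming `prep_sn_data.py` follows this convention, a hypothetical `${DATA_DIR}/${GEN_SUBSET}.tsv` could look like the following (filenames and sample counts are made up):

```
/path/to/audio
utt0001.wav	241760
utt0002.wav	186080
```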
3. Run the speech normalizer and post-process the output.
```bash
mkdir -p ${RESULTS_PATH}

python examples/speech_recognition/new/infer.py \
    --config-dir examples/hubert/config/decode/ \
    --config-name infer_viterbi \
    task.data=${DATA_DIR} \
    task.normalize=false \
    common_eval.results_path=${RESULTS_PATH}/log \
    common_eval.path=${DATA_DIR}/checkpoint_best.pt \
    dataset.gen_subset=${GEN_SUBSET} \
    '+task.labels=["unit"]' \
    +decoding.results_path=${RESULTS_PATH} \
    common_eval.post_process=none \
    +dataset.batch_size=1 \
    common_eval.quiet=True

# Post-process and generate output at ${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
  --in-unit ${RESULTS_PATH}/hypo.units \
  --in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
  --output-root ${RESULTS_PATH}
```
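As a quick sanity check, the post-processed `${RESULTS_PATH}/${GEN_SUBSET}.txt` can be inspected before vocoding. The sketch below only assumes one whitespace-separated unit sequence per line, the format the vocoder's `IN_CODE_FILE` expects in the next section.

```python
import os

# Placeholders mirroring the shell variables used above.
RESULTS_PATH = os.environ.get("RESULTS_PATH", "results")
GEN_SUBSET = os.environ.get("GEN_SUBSET", "test")

# Assumption: one whitespace-separated unit sequence per line.
with open(os.path.join(RESULTS_PATH, f"{GEN_SUBSET}.txt")) as f:
    for i, line in enumerate(f):
        print(f"sequence {i}: {len(line.split())} units")
```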

### Unit-to-waveform conversion with unit vocoder

The pre-trained vocoders support generating audio from both full unit sequences and reduced unit sequences (i.e., with consecutive duplicate units removed; see the sketch after the command below). Set `--dur-prediction` to generate audio from reduced unit sequences.

```bash
# IN_CODE_FILE contains one unit sequence per line. Units are separated by spaces.

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${IN_CODE_FILE} \
  --vocoder ${VOCODER_CKPT} --vocoder-cfg ${VOCODER_CFG} \
  --results-path ${RESULTS_PATH} --dur-prediction
```
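To make the full vs. reduced distinction concrete, here is a minimal sketch (not part of the released tooling) of how a full unit sequence collapses into its reduced form; with `--dur-prediction`, the vocoder's duration predictor re-expands each reduced unit before synthesis. The example sequence is made up.

```python
from itertools import groupby

def reduce_units(units):
    """Collapse consecutive duplicate units (full -> reduced sequence)."""
    return [u for u, _ in groupby(units)]

full = [24, 24, 24, 7, 7, 981, 981, 981, 981, 7]
print(reduce_units(full))  # [24, 7, 981, 7]
```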

## Training new models

To be updated.