Trained model can generate correct text but incorrect speech #13

chentuochao opened this issue Jul 27, 2024 · 13 comments

@chentuochao

I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh.
The model training seems to go fine, but during evaluation of my trained model (using ./simuleval.simul-s2st.sh) some strange behavior appears.
Here is the training log:
(screenshot of the training log)
During inference, when I run the eval script on the example you provided, the strange thing happens: the model outputs the correct text translation, but the output speech is incorrect (almost silent). I print the text output and speech-unit output below:
(screenshot of the printed text and unit outputs)

Do you know what the problem may be?

Thank you
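
(As a quick sanity check for this symptom, you can measure the duration of the synthesized wav; the path below is only a hypothetical example and depends on where your eval script writes its outputs.)

# Hypothetical output path -- adjust to wherever simuleval.simul-s2st.sh writes the wavs.
# Requires sox; prints the duration in seconds.
soxi -D results/wavs/sample_0_pred.wav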

@zhangshaolei1998
Collaborator

Have you tried testing directly with the model we provide, and does this issue happen there as well?

If not, it may be a problem with the training setup. Could you share your training script?
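
(As a concrete way to do this comparison -- just a sketch, and the checkpoint path is an assumption about where the released model was downloaded -- run the same eval script twice, changing only the checkpoint it points at:)

# Sketch: evaluate the released checkpoint first, then your own, with identical settings.
# The path below is assumed; use wherever you saved the provided model.
CKPT=pretrained_models/streamspeech.simul-s2st.fr-en.pt
# edit the checkpoint variable inside simuleval.simul-s2st.sh to $CKPT, then:
bash simuleval.simul-s2st.sh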

@chentuochao
Author

Thank you for your kind reply!
I also tried your provided pretrained model and it works well. The weird issue only happens with my own trained model.
Here is the training script I am using:

export CUDA_VISIBLE_DEVICES=0

LANG=fr
DATA_ROOT=/scr/data/zhangshaolei/datasets/cvss/cvss-c
DATA=$DATA_ROOT/${LANG}-en/fbank2unit
model=streamspeech.simul-s2st.${LANG}-en

fairseq-train $DATA \
  --user-dir researches/ctc_unity \
  --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
  --task speech_to_speech_ctc --target-is-code --target-code-size 1000 --vocoder code_hifigan  \
  --criterion speech_to_unit_2pass_ctc_asr_st --label-smoothing 0.1 --rdrop-alpha 0.0 \
  --arch streamspeech --share-decoder-input-output-embed \
  --encoder-layers 12 --encoder-embed-dim 256 --encoder-ffn-embed-dim 2048 --encoder-attention-heads 4 \
  --translation-decoder-layers 4 --synthesizer-encoder-layers 2 \
  --decoder-layers 2  --decoder-embed-dim 512 --decoder-ffn-embed-dim 2048 --decoder-attention-heads 8 \
  --k1 0 --k2 0 --n1 1 --n2 -1 \
  --chunk-size 8 --multichunk \
  --uni-encoder \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --ctc-upsample-rate 25 \
  --save-dir checkpoints/$model \
  --validate-interval 1000 --validate-interval-updates 1000 \
  --save-interval 1 --save-interval-updates 1000 \
  --keep-last-epochs 15 \
  --no-progress-bar --log-format json --log-interval 100 \
  --lr 0.001 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 1.0 \
  --max-tokens 22000 --max-target-positions 1200 --update-freq 2 \
  --attn-type espnet --pos-enc-type rel_pos \
  --keep-interval-updates 40 \
  --keep-best-checkpoints 20 \
  --seed 1 --fp16 --num-workers 8 

config_gcmvn.yaml

global_cmvn:
  stats_npz_path: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocoder:
  checkpoint: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
  config: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
  type: code_hifigan

config_mtl_asr_st_ctcst.yaml

target_unigram:
   decoder_type: transformer
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 8.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 4
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
source_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
ctc_target_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1

I also attach the model we trained here (https://drive.google.com/file/d/1rdOEt1NSt8oxUBHL0WfM_CCtKczt6TzO/view?usp=share_link)

@zhangshaolei1998
Collaborator

There seems to be no problem with the training script. Problems with generating short speech are often caused by the non-autoregressive text-to-unit generation module. Have you modified this part of the code?
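
(One way to rule out accidental local edits is to diff the working tree against the upstream repository; the URL and branch name below are assumptions about the upstream remote.)

# Sketch: compare the local ctc_unity code (which contains the text-to-unit module) against upstream.
git remote add upstream https://github.com/ictnlp/StreamSpeech.git   # assumed upstream URL
git fetch upstream
git diff upstream/main -- researches/ctc_unity                       # assumed default branch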

@chentuochao
Author

Yes, I think the problem should be in the non-autoregressive text-to-unit generation module. I did not change any part of the training code or model. Do you have any idea what is happening?
I will re-download the GitHub repo and train again to see whether I still face the problem, and I will update this issue.

@zhangshaolei1998
Collaborator

Sorry, I haven't encountered this problem before, so I don't have any experience with solving it yet.

Maybe you can retrain with the latest code and record the final loss, so we can see whether the loss after convergence is within the normal range.
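
(Since training runs with --no-progress-bar --log-format json, the logged losses can be pulled straight out of the training log for a rough convergence check; the log file name below is hypothetical.)

# Sketch: show the last few logged training records containing a loss value.
grep '"loss"' train.log | tail -n 3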

@Lili-q

Lili-q commented Aug 2, 2024

Hello, I also trained a fr-en streaming S2ST model strictly following the tutorial, without making any changes to the code, and ran into a similar problem.

The streaming ASR result is normal, but the simultaneous translation result is incorrect: the generated tokens are also abnormal (very short), and the synthesized audio is less than 1 s long with almost no sound.

I tested the same source audio with my own trained model and with the pre-trained model provided by the author. See the screenshots below.

a. Result on my own trained model:
(screenshot: incorrect output from my own trained model)

b. Results on the pre-trained model provided by the author
(screenshot: correct output from the pre-trained model)

Did you solve your problem?

@chentuochao
Author

Dear authors,
I tried to redo the whole pipeline again, but I still have the same issue.
Here are all the commands we ran after installing the environment:

bash 0.download_pretrain_models.sh

# changed the env variables
bash preprocess.sh

# changed the paths in config_gcmvn.yaml

# copy and paste config_mtl_asr_st_ctcst.yaml to fbank2unit

# changed paths in train.simul-s2st.sh
bash train.simul-s2st.sh

# changed paths in simuleval.simul-s2st.sh
bash simuleval.simul-s2st.sh

Do you know what the potential problem is?

@EmreOzkose

I have the same issue. Is there any update?

@chentuochao
Author

Hi Emre,
I found that this bug is related to the loss function, and the author pushed a fix for the loss in the most recent commit. Just pull it and the problem should be solved.
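
(For reference, updating an existing clone would look roughly like the sketch below; the branch name is an assumption, and the model has to be retrained from scratch for the corrected loss to take effect.)

# Sketch: pull the commit containing the loss fix, then retrain.
git pull origin main          # assumed default branch
bash train.simul-s2st.sh      # checkpoints trained with the old loss are not fixed by pulling alone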

@EmreOzkose

I am training on my own data and I applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio outputs after the loss bug fix. They are very short and sound like noise.

@EmreOzkose

I use a different HuBERT model to extract the source units. Could that affect this situation?

@EmreOzkose

That was the problem :). I had misunderstood some part of the model. When I changed back to the original HuBERT, the problem was solved.
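
(For anyone hitting the same mismatch: the vocoder configured in config_gcmvn.yaml is the mHuBERT.layer11.km1000.en unit HiFi-GAN and training uses --target-code-size 1000, so the target units must come from that same mHuBERT layer-11 / 1000-cluster quantizer. A quick, hypothetical check that extracted units at least fall in the expected 0-999 range, assuming a whitespace-separated unit file:)

# Sketch: flag any unit id outside the 0-999 vocabulary expected by the km1000 vocoder.
# 'train.unit' is a hypothetical file with one utterance of space-separated unit ids per line.
awk '{for(i=1;i<=NF;i++) if($i<0||$i>999) print NR": out-of-range unit "$i}' train.unit | head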

@nasirudeenraheem

> I am training on my own data and I applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio outputs after the loss bug fix. They are very short and sound like noise.

Can you please provide a link to the data?
