Trying to recreate training results for the german voice trained on Thorsten Dataset #498
Replies: 2 comments 4 replies
-
Hi, thanks for the thread. This audio sample is without the vocoder, right? Also, maybe @thorstenMueller can help too?
-
Hi. @olafthiele trained a DDC model on my dataset, and we kept the Tacotron 2 training running for about 460k training steps. If you're satisfied with the speech flow and all words are easy to understand, you can move forward to training the vocoder. This will increase output quality (but has no influence on the speech flow). Maybe this comparison page is helpful for choosing your vocoder; you could try HiFi-GAN, as it should have a really good RTF and nice quality (I didn't train it myself):
-
Setup
I started to train the Tacotron 2 architecture on the Thorsten Dataset according to the TTS-recipes, so the model_config.json.zip almost stayed the same. I reduced the number of epochs to make the training incremental, so that I can continue it over a few iterations with the `--continue_path` parameter of the `train_tts.py` script. I slightly modified the learning rate and batch size to escape potential local minima. I started/continued my model with these commands:

```shell
CUDA_VISIBLE_DEVICES="0" python TTS/mozilla_voice_tts/bin/train_tts.py --config_path model_config.json
CUDA_VISIBLE_DEVICES="0" python TTS/mozilla_voice_tts/bin/train_tts.py --continue_path tts_model/training_run/
```
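For reference, the fields I varied between runs live in `model_config.json`. A minimal sketch with illustrative values (not my exact settings), using field names from the Mozilla TTS Tacotron 2 config:

```json
{
  "batch_size": 32,
  "lr": 0.0001,
  "epochs": 1000,
  "r": 7,
  "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]
}
```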
Training
I have already trained 250 epochs with the standard batch size and learning rate, following the `gradual_training` schedule. The outputs for the `test_sentences` are already comprehensible; one can understand the words. The best model, however, was obtained and saved at epoch 150 and has not improved since.

Current eval stats:
🔊 TEST SENTENCE AUDIO
P.S.: I have not yet started the vocoder training as recommended in the recipe.
Question
Since this is a test training (to get a feel for how well/easily a model can be trained on a German dataset), my goal is to estimate the influence of epochs as well as model parameters on the model's performance, and to see how different configs lead to different results. Therefore my question is: what steps can I take, and which parameters can I tune, to improve or speed up my training and get results similar to (not necessarily the same quality as) those in our Colab? Or is training for up to 1000 epochs the main/only way to improve quality? In other words, can I train my model a bit smarter than just accumulating as many epochs as possible? I've seen that the number of output frames `r` seems to be an important parameter for getting to the "next stage" of the training process, by reducing it once the training converges.

When is a good time to proceed with the vocoder training?
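To illustrate the role of `r`: in the Mozilla TTS convention, `gradual_training` is a list of `[start_step, r, batch_size]` entries, and the active entry is the last one whose start step has been reached, so `r` shrinks (and the decoder predicts fewer frames per step) as training progresses. A minimal sketch of that lookup (the schedule values are illustrative, not taken from my config):

```python
# Sketch: pick the active (r, batch_size) for a given global step from a
# gradual_training schedule of [start_step, r, batch_size] entries.
# Schedule values below are illustrative, not from an actual config.

def gradual_params(step, schedule):
    """Return (r, batch_size) from the last entry with start_step <= step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, sched_r, sched_bs in schedule:
        if step >= start_step:
            r, batch_size = sched_r, sched_bs
    return r, batch_size

schedule = [[0, 7, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]

print(gradual_params(0, schedule))       # (7, 64)
print(gradual_params(60000, schedule))   # (3, 32)
print(gradual_params(300000, schedule))  # (1, 32)
```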