Trying to recreate training results for the german voice trained on Thorsten Dataset #498
Replies: 2 comments 4 replies
-
Hi, thanks for the thread. This audio sample is without the vocoder, right? Also, maybe @thorstenMueller can help too?
-
Hi. @olafthiele trained a DDC model on my dataset, and we kept the Tacotron 2 training running for about 460k training steps. If you're satisfied with the speech flow and all words are easy to understand, you can move forward to training the vocoder. This will increase output quality (but has no influence on the speech flow). Maybe this comparison page is helpful for choosing your vocoder; you could try HiFi-GAN, as it should have a really good RTF and nice quality (I didn't train it myself):
-
Setup
I started to train the Tacotron 2 architecture on the Thorsten Dataset according to the TTS-recipes, so the model_config.json.zip almost stayed the same. I reduced the number of epochs to make the training incremental, so that I can continue it over a few iterations with the `--continue_path` parameter of the `train_tts.py` script. I slightly modified the learning rate and batch size to escape potential local minima. I started/continued my model with these commands:

```shell
CUDA_VISIBLE_DEVICES="0" python TTS/mozilla_voice_tts/bin/train_tts.py --config_path model_config.json
CUDA_VISIBLE_DEVICES="0" python TTS/mozilla_voice_tts/bin/train_tts.py --continue_path tts_model/training_run/
```
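For reference, the fields I varied between runs live in `model_config.json`. A minimal sketch with illustrative values (not my exact settings), using field names from the Mozilla TTS Tacotron 2 config:

```json
{
  "batch_size": 32,
  "lr": 0.0001,
  "epochs": 1000,
  "r": 7,
  "gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]
}
```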
Training
I have already trained 250 epochs with the standard batch size and learning rate, following the `gradual_training` schedule. The outputs for the `test_sentences` are already comprehensible; one can understand the words. The best model, however, was obtained and saved at epoch 150 and has not improved since.

Current eval stats:
🔊 TEST SENTENCE AUDIO
P.S.: I have not yet started the vocoder training as recommended in the recipe.
Question
Since this is a test training (to get a feel for how well/easily a model can be trained on a German dataset), my goal is to estimate the influence of epochs as well as model parameters on the model's performance, and to see how different configs lead to different results. Therefore my question is: what steps can I take, and which parameters can I tune, to improve or speed up my training and get results similar to (not necessarily the same quality as) those in our Colab? Or is training for up to 1000 epochs the main/only way to improve quality? In other words, can I train my model a bit smarter than just accumulating as many epochs as possible? I've seen that the number of output frames `r` seems to be an important parameter for getting to the "next stage" of the training process, by reducing it once the training converges.

When is a good time to proceed with the vocoder training?
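To illustrate the role of `r`: in the Mozilla TTS convention, `gradual_training` is a list of `[start_step, r, batch_size]` entries, and the active entry is the last one whose start step has been reached, so `r` shrinks (and the decoder predicts fewer frames per step) as training progresses. A minimal sketch of that lookup (the schedule values are illustrative, not taken from my config):

```python
# Sketch: pick the active (r, batch_size) for a given global step from a
# gradual_training schedule of [start_step, r, batch_size] entries.
# Schedule values below are illustrative, not from an actual config.

def gradual_params(step, schedule):
    """Return (r, batch_size) from the last entry with start_step <= step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, sched_r, sched_bs in schedule:
        if step >= start_step:
            r, batch_size = sched_r, sched_bs
    return r, batch_size

schedule = [[0, 7, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]

print(gradual_params(0, schedule))       # (7, 64)
print(gradual_params(60000, schedule))   # (3, 32)
print(gradual_params(300000, schedule))  # (1, 32)
```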