Help extending to MAILabs data - Warbly speech - MoL, 1000k steps #183
Comments
Hi, sorry for the late reply. If I remember correctly, samples in M-AI Labs have a low signal-to-noise ratio, so WaveNet may struggle to learn the distribution of clean speech from them. To diagnose the cause, could you share some generated audio samples and your training configuration?
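As a rough sketch of how one might check this, the snippet below estimates a heuristic per-clip SNR by comparing the loudest frames against the quietest (assumed noise-only) frames. The use of librosa and the M-AILABS directory layout are assumptions for illustration, not part of either project:

```python
import glob
import numpy as np
import librosa

def rough_snr_db(path, frame=2048, hop=512):
    # Load at the file's native sample rate and compute frame-wise RMS energy.
    y, _ = librosa.load(path, sr=None)
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    noise = np.percentile(rms, 10) + 1e-8   # quietest frames ~ noise floor
    signal = np.percentile(rms, 95) + 1e-8  # loudest frames ~ speech level
    return 20 * np.log10(signal / noise)

# Hypothetical M-AILABS layout; adjust the glob to your dataset location.
for wav in sorted(glob.glob("m_ailabs/es_ES/male/*/wavs/*.wav"))[:20]:
    print(wav, f"{rough_snr_db(wav):.1f} dB")
```

Clips that come out well below the rest are candidates for the noisy recordings mentioned above; this is only a coarse screen, not a proper SNR measurement.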
Hey, no worries. I trained with the mixture-of-logistics configuration, using data from a single male Spanish speaker. I followed your recommendations elsewhere and decreased the allowed log_min as training progressed. Here is a sample after ~1.6M steps: https://github.com/adhamel/samples/blob/master/response.wav For evaluation, I'm using .npy features generated by this Transformer (https://github.com/espnet/espnet/blob/master/egs/m_ailabs/tts1/RESULTS.md): v.0.5.3 / Transformer
Could you also share the config file(s) for WaveNet? In the generated sample, the signal gain seems too high. I suspect a mismatch between the acoustic features at training time and those at evaluation. Did you carefully normalize the acoustic features? Did you make sure you use the same acoustic feature pipeline for training both the Transformer and the WaveNet?
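A quick way to look for such a mismatch is to compare per-dimension statistics of the mel features WaveNet was trained on against the features the Transformer produces at evaluation time. This is a minimal diagnostic sketch; the file patterns and the assumption that both sets are saved as (frames, num_mels) .npy arrays are hypothetical and should be adapted to your preprocessing output:

```python
import glob
import numpy as np

def feature_stats(pattern):
    # Stack all feature files and summarize them per mel dimension.
    feats = np.concatenate([np.load(p) for p in glob.glob(pattern)], axis=0)
    return feats.mean(axis=0), feats.std(axis=0), feats.min(), feats.max()

train_mean, train_std, train_min, train_max = feature_stats("wavenet_data/*-feats.npy")
eval_mean, eval_std, eval_min, eval_max = feature_stats("transformer_out/*.npy")

# Large differences here (e.g., one set in [0, 1] and the other roughly
# zero-mean/unit-variance) indicate a normalization mismatch between the
# Transformer and WaveNet feature pipelines.
print("train range:", train_min, train_max, "eval range:", eval_min, eval_max)
print("max |mean diff|:", np.abs(train_mean - eval_mean).max())
print("max |std diff| :", np.abs(train_std - eval_std).max())
```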
Absolutely. Here are the overwritten hparams. I also tried an fmin value of 125. I did not take care to normalize the acoustic features; however, the WaveNet is trained on the same data subset as the Transformer.
The hparams look okay. I'd recommend double-checking for acoustic feature normalization differences (if any), and also checking analysis/synthesis quality (not TTS). Pre-emphasis at the data preprocessing stage changes the signal gain, so you might want to tune global_gain_scale; 0.55 was chosen for LJSpeech if I remember correctly. Another suggestion is to use a higher log scale min (e.g., -9 or -11). As suggested in the ClariNet paper, a smaller variance bound requires more training iterations and can be unstable.
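To illustrate the gain point, here is a small numpy sketch of how pre-emphasis raises the peak level of a waveform and how a global gain scale applied at preprocessing time pulls it back inside [-1, 1]. The coefficient and the 0.55 factor mirror the values mentioned above for LJSpeech; treat them as assumptions rather than recommended settings for M-AILABS:

```python
import numpy as np

def preemphasis(x, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]: boosts high frequencies and changes the
    # overall gain of the signal.
    return np.append(x[0], x[1:] - coef * x[:-1])

x = np.random.uniform(-0.5, 0.5, 16000).astype(np.float32)
y = preemphasis(x)
print("peak before:", np.abs(x).max(), "peak after:", np.abs(y).max())

# Scaling the waveform by a global gain factor at preprocessing keeps the
# pre-emphasized signal comfortably within [-1, 1] for raw-audio MoL targets.
global_gain_scale = 0.55
y_scaled = y * global_gain_scale
print("peak after scaling:", np.abs(y_scaled).max())
```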
Thank you, you are correct. I will test reducing the log scale min. (As a strange aside, I found significant drops in loss at intervals of ~53 epochs.) I hope y'all are staying safe over there.
Dear @r9y9,
I've trained a MoL WaveNet to 1000k steps on ~30,000 audio samples from the M-AI Labs data. I am using a pre-trained Transformer from @kan-bayashi.
The resulting audio has fairly intelligible speech, but it has a bit of a warble to it that I would like to clear up. I'm happy to share generated samples or configurations to help diagnose. Do you have any experience training on that dataset, or recommendations on what might move me in the right direction?
Best,
Andy