
Help extending to MAILabs data - Warbly speech - MoL, 1000k steps #183

Open · adhamel opened this issue Mar 18, 2020 · 6 comments
adhamel commented Mar 18, 2020

Dear @r9y9,
I've trained a MoL wavenet to 1000k steps on ~30,000 audio samples from M-AI Labs data. I am using a pre-trained transformer from @kan-bayashi.

The resulting audio has rather intelligible speech, but has a bit of a warble to it that I would like to clear up. Happy to share generated samples or configurations to help diagnose. Do you have any experience training on that data set or recommendations on what might move me in the right direction?

Best,
Andy

@adhamel adhamel changed the title Warbly speech - MoL, 1000k steps Help extending to MAILabs data - Warbly speech - MoL, 1000k steps Mar 18, 2020
r9y9 (Owner) commented Mar 24, 2020

Hi, sorry for the late reply. If I remember correctly, samples in M-AI Labs have a rather low signal-to-noise ratio, so WaveNet may struggle to learn the distribution of clean speech. To help diagnose the cause, could you share some generated audio samples and your training configuration?

adhamel (Author) commented Mar 24, 2020

Hey, no worries. I trained with the mixture-of-logistics configuration, using data from a single male Spanish speaker. Following your recommendations elsewhere, I decreased the allowed log_scale_min as training progressed.
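For context, the schedule looked roughly like the following (a minimal sketch; the step boundaries and intermediate values here are illustrative, not my exact settings):

import math

def log_scale_min_schedule(step):
    # Start with a loose lower bound on the logistic log-scale and
    # tighten it as training progresses. Boundaries are hypothetical.
    if step < 200000:
        return -7.0
    elif step < 500000:
        return -14.0
    else:
        return math.log(1e-14)  # ~ -32.24, the value in my hparams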

Here is a sample after ~1.6M steps: https://github.com/adhamel/samples/blob/master/response.wav

For evaluation, I'm using .npy features generated by this Transformer (https://github.com/espnet/espnet/blob/master/egs/m_ailabs/tts1/RESULTS.md):

v.0.5.3 / Transformer
Silence trimming
FFT in points: 1024
Shift in points: 256
Frequency limit: 80-7600
Fast-GL 64 iters
Environments
date: Sun Sep 29 21:20:05 JST 2019
python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
espnet version: espnet 0.5.1
chainer version: chainer 6.0.0
pytorch version: pytorch 1.0.1.post2
Git hash: 6b2ff45d1e2c624691f197014b8fe71a5e70bae9
Commit date: Sat Sep 28 14:33:32 2019 +0900
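
For reference, here is a minimal sketch of feature extraction matching those analysis parameters, assuming librosa (the actual espnet/wavenet_vocoder pipelines may differ, e.g. in normalization and silence trimming):

import librosa
import numpy as np

def logmel(wav_path, sr=16000, n_fft=1024, hop_length=256,
           n_mels=80, fmin=80, fmax=7600):
    # Log-mel spectrogram with the parameters listed above:
    # FFT 1024 points, shift 256 points, 80-7600 Hz.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, fmin=fmin, fmax=fmax)
    return np.log(np.maximum(mel, 1e-10)).T  # (frames, n_mels)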

r9y9 (Owner) commented Mar 25, 2020

Could you also share the config file(s) for WaveNet?

For the generated sample, it seems that the signal gain is too high. I suspect a mismatch between the acoustic features at training time and those at evaluation time. Did you carefully normalize the acoustic features? Did you make sure you used the same acoustic feature pipeline for training both the Transformer and the WaveNet?
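
One quick way to check for such a mismatch is to compare the statistics of the features WaveNet was trained on against the features the Transformer generates at evaluation time. A minimal sketch, with hypothetical file paths:

import glob
import numpy as np

def feature_stats(pattern):
    # Per-dimension mean/std over all .npy feature files matching a glob.
    feats = np.concatenate([np.load(p) for p in glob.glob(pattern)], axis=0)
    return feats.mean(axis=0), feats.std(axis=0)

train_mean, train_std = feature_stats("data/train/*-feats.npy")  # hypothetical
eval_mean, eval_std = feature_stats("generated/*-feats.npy")     # hypothetical
print("max mean offset:", np.abs(train_mean - eval_mean).max())
print("max scale ratio:", (eval_std / np.maximum(train_std, 1e-8)).max())

Large offsets, or scale ratios far from 1.0, would point to a normalization mismatch.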

adhamel (Author) commented Mar 25, 2020

Absolutely. Here are the overridden hparams. I also tried an fmin value of 125. I did not take care to normalize the acoustic features; however, the WaveNet is trained on the same data subset as the Transformer.

{
  "name": "wavenet_vocoder",
  "input_type": "raw",
  "quantize_channels": 65536,
  "preprocess": "preemphasis",
  "postprocess": "inv_preemphasis",
  "global_gain_scale": 0.55,
  "sample_rate": 16000,
  "silence_threshold": 2,
  "num_mels": 80,
  "fmin": 80,
  "fmax": 7600,
  "fft_size": 1024,
  "hop_size": 256,
  "frame_shift_ms": null,
  "win_length": 1024,
  "win_length_ms": -1.0,
  "window": "hann",
  "highpass_cutoff": 70.0,
  "output_distribution": "Logistic",
  "log_scale_min": -32.23619130191664,
  "out_channels": 30,
  "layers": 24,
  "stacks": 4,
  "residual_channels": 128,
  "gate_channels": 256,
  "skip_out_channels": 128,
  "dropout": 0.0,
  "kernel_size": 3,
  "cin_channels": 80,
  "cin_pad": 2,
  "upsample_conditional_features": true,
  "upsample_net": "ConvInUpsampleNetwork",
  "upsample_params": {
    "upsample_scales": [4, 4, 4, 4]
  },
  "gin_channels": -1,
  "n_speakers": 7,
  "pin_memory": true,
  "num_workers": 2,
  "batch_size": 8,
  "optimizer": "Adam",
  "optimizer_params": {
    "lr": 0.001,
    "eps": 1e-08,
    "weight_decay": 0.0
  },
  "lr_schedule": "step_learning_rate_decay",
  "lr_schedule_kwargs": {
    "anneal_rate": 0.5,
    "anneal_interval": 200000
  },
  "max_train_steps": 1000000,
  "nepochs": 2000,
  "clip_thresh": -1,
  "max_time_sec": null,
  "max_time_steps": 10240,
  "exponential_moving_average": true,
  "ema_decay": 0.9999,
  "checkpoint_interval": 100000,
  "train_eval_interval": 100000,
  "test_eval_epoch_interval": 50,
  "save_optimizer_state": true
}
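
One sanity check worth running on a config like this: the product of upsample_scales must equal hop_size, since the conditioning features are upsampled from frame rate to sample rate. A minimal sketch, assuming the JSON above is saved as hparams.json:

import json
from functools import reduce

with open("hparams.json") as f:  # hypothetical path to the JSON above
    hp = json.load(f)

scales = hp["upsample_params"]["upsample_scales"]
product = reduce(lambda a, b: a * b, scales, 1)
# 4 * 4 * 4 * 4 = 256 == hop_size, so this config is consistent.
assert product == hp["hop_size"], (product, hp["hop_size"])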

r9y9 (Owner) commented Mar 30, 2020

The hparams look okay. I'd recommend double-checking for acoustic feature normalization differences (if any), and also checking analysis/synthesis quality (i.e., vocoding from ground-truth features, not TTS).

Pre-emphasis at the data preprocessing stage changes the signal gain, so you might want to tune global_gain_scale. 0.55 was chosen for LJSpeech, if I remember correctly.
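
To get a feel for how much pre-emphasis changes the gain on your data, you could compare RMS before and after the filter. A minimal sketch using scipy (0.97 is a common pre-emphasis coefficient; your preprocessing may use a different one):

import numpy as np
from scipy.signal import lfilter

def preemphasis(x, coef=0.97):
    # y[n] = x[n] - coef * x[n-1]
    return lfilter([1.0, -coef], [1.0], x)

x = np.random.randn(16000)  # stand-in for a real waveform
rms_in = np.sqrt(np.mean(x ** 2))
rms_out = np.sqrt(np.mean(preemphasis(x) ** 2))
print("gain change:", rms_out / rms_in)  # informs a sensible global_gain_scale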

Another suggestion is to use a higher log_scale_min (e.g., -9 or -11). As noted in the ClariNet paper, a smaller variance bound requires more training iterations and can be unstable.
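
For reference, log_scale_min acts as a lower clamp on the logistic log-scales predicted by the network, roughly like this (a sketch, not the repository's exact code):

import torch

def clamp_log_scales(log_s, log_scale_min=-9.0):
    # A very negative bound (e.g. -32) permits near-zero variances,
    # which can make mixture-of-logistics training slow and unstable.
    return torch.clamp(log_s, min=log_scale_min)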

adhamel (Author) commented Apr 2, 2020

Thank you, you are correct. I will test a less restrictive (higher) log_scale_min. (As a strange aside, I found significant drops in loss at intervals of ~53 epochs.) I hope y'all are staying safe over there.
