Questions about training Soundstream: poor intelligibility and gradients explosion after 10k steps. (sr=16k, B=96) #204

Makiyuyuko · 2023-06-29T07:04:31Z

Very nice repo! Thank you authors for your contribution.

And here is my situation: I have been trying to use about 20000 hours of open-source speech data to follow this repo (version 1.2.7) and start training Soundstream from scratch. I basically made no changes to this repo except setting the batch_size as:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,
        batch_size=12,
        grad_accum_every=8,  # effective batch size of 12*8==96
        data_max_length_seconds=2,  # train on 2 second audio
        num_train_steps=1_000_000,
    ).cuda()
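As a rough sanity check on the settings above, here is a minimal sketch of how much audio each optimizer update covers. It assumes that one trainer "step" is one optimizer update after gradient accumulation, and that the 4 GPUs do not multiply the batch further. Neither is confirmed in this thread. The helper name `effective_batch` and the `num_processes` parameter are hypothetical, for illustration only.

```python
SAMPLE_RATE = 16_000      # sr=16k, from the issue title
CLIP_SECONDS = 2          # data_max_length_seconds
DATASET_HOURS = 20_000    # approximate dataset size stated above


def effective_batch(batch_size, grad_accum_every, num_processes=1):
    """Clips contributing to one optimizer update (hypothetical helper).

    num_processes is an assumption: set it above 1 only if each GPU
    process draws its own batch of `batch_size` clips.
    """
    return batch_size * grad_accum_every * num_processes


clips_per_step = effective_batch(12, 8)               # 96 clips
seconds_per_step = clips_per_step * CLIP_SECONDS      # 192 s of audio
samples_per_clip = CLIP_SECONDS * SAMPLE_RATE         # 32000 samples

# Audio seen after 10k steps, in hours, and as a fraction of the dataset.
hours_seen = 10_000 * seconds_per_step / 3600
fraction_seen = hours_seen / DATASET_HOURS

print(clips_per_step, seconds_per_step, samples_per_clip)
print(round(hours_seen, 1), round(fraction_seen, 4))
```

Under these assumptions, 10k steps cover about 533 hours of audio, which is under 3% of the 20000-hour dataset.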

I have been running this on 4xA100 GPUs for a couple of days, and after a little over 10k steps this is the kind of audio I get (samples below). There are some signs of speech forming, but the noise is heavy. The total loss has stayed around ~20, perhaps gradually decreasing toward ~10. Based on my experience training vocoders such as HIFIGAN/WAVEGAN, I suspect that the number of training steps is not yet enough and that the high-frequency information has not been learned. However, I am a newbie at large model training, so I'm not confident that I'm on the right track. Do I just need more training steps, or has something gone wrong?

If anyone has met with/solved a similar problem, please share some information.

8k steps:

9k steps:

And the gradients just went out of control after 10500 steps. I think the run has definitely failed, but I don't know why.

10.5k steps:
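For anyone trying to pin down where a run like this blows up, here is a minimal sketch of a spike detector over logged values such as the total loss or the gradient norm. It is not part of this repo. The function name `first_divergence`, the window size, and the threshold factor are all assumptions, and the numbers in the example are made up for illustration, not taken from this run. In PyTorch, one possible source of the values is the total norm returned by `torch.nn.utils.clip_grad_norm_`.

```python
import math
from statistics import median


def first_divergence(values, window=5, factor=10.0):
    """Return the index of the first diverging value, or None.

    A value counts as diverging if it is NaN/inf, or if it exceeds
    `factor` times the median of the preceding `window` values.
    """
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
        if i >= window:
            baseline = median(values[i - window:i])
            if baseline > 0 and v > factor * baseline:
                return i
    return None


# Illustrative numbers only: a slowly decreasing loss followed by a jump.
logged = [20.0, 19.0, 18.0, 17.0, 16.0, 15.0, 400.0]
print(first_divergence(logged))
```

Saving a checkpoint shortly before the flagged index makes it possible to resume from a healthy state and inspect what changed.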


Makiyuyuko · 2023-06-29T07:08:49Z

Additional information:

    soundstream = SoundStream(
        codebook_size=1024,
        rq_num_quantizers=8,
        rq_groups=2,
        # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
        attn_window_size=128,  # local attention receptive field at bottleneck
        attn_depth=2
    )
    lr = 2e-4

(I didn't change these, should I?)

Makiyuyuko changed the title ~~Questions about training Soundstream: poor intelligibility after 10k steps. (sr=16k, B=96)~~ Questions about training Soundstream: poor intelligibility and gradients explosion after 10k steps. (sr=16k, B=96) Jun 29, 2023
