Questions about training Soundstream: poor intelligibility and gradients explosion after 10k steps. (sr=16k, B=96) #204

Open
Makiyuyuko opened this issue Jun 29, 2023 · 1 comment


Makiyuyuko commented Jun 29, 2023

Very nice repo! Thank you authors for your contribution.

Here is my situation: I am using about 20,000 hours of open-source speech data to train SoundStream from scratch with this repo (version 1.2.7). I made essentially no changes to the repo except setting the batch size:

    trainer = SoundStreamTrainer(
        soundstream,
        audio_path_list=audio_path_list,
        batch_size=12,
        grad_accum_every=8,  # effective batch size of 12*8==96
        data_max_length_seconds=2,  # train on 2 second audio
        num_train_steps=1_000_000,
    ).cuda()

I have been running this on 4xA100 GPUs for a couple of days. After a little over 10k steps, I get the kind of audio shown below: there are some signs of speech forming, but the noise is heavy. The total loss has stayed around ~20, perhaps gradually decreasing to ~10. Based on my experience training vocoders such as HiFi-GAN/WaveGAN, I suspect that the number of training steps is not yet enough and that the high-frequency information has not been learned. However, I am a newbie at large-model training, so I'm not confident that I'm on the right track. Do I just need more training steps, or has something gone wrong?
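To put the step count in perspective, here is a rough back-of-the-envelope calculation using only the numbers above. It is a sketch under two assumptions: that one training step is one optimizer update over the full effective batch, and that the effective batch of 96 is global (if each of the 4 GPUs processes its own batch of 12, the audio seen would be about 4x larger). The variable names are only for illustration.

```python
# Back-of-the-envelope: how much audio has the model seen after 10k steps?
# Assumption: the effective batch of 96 clips is the global batch
# (multiply by 4 if each of the 4 GPUs processes its own batch).
batch_size = 12
grad_accum_every = 8
clip_seconds = 2
steps = 10_000
dataset_hours = 20_000

clips_per_step = batch_size * grad_accum_every      # 96 clips per update
seconds_per_step = clips_per_step * clip_seconds    # 192 seconds of audio per update
hours_seen = steps * seconds_per_step / 3600
fraction_of_dataset = hours_seen / dataset_hours

print(f"{hours_seen:.0f} hours seen, {fraction_of_dataset:.1%} of the dataset")
# prints "533 hours seen, 2.7% of the dataset"
```

Under those assumptions, 10k steps covers only a few percent of one pass over the data.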

If anyone has run into or solved a similar problem, please share what you found.

8k steps:
image

9k steps:
image

And the gradients went out of control after 10,500 steps. I think the run has definitely failed, but I don't know why.
image

10.5k steps:
image
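One way to pin down where the blow-up starts, so that a checkpoint from before that point can be inspected, is to scan the logged gradient norms for the first value far above the recent baseline. The snippet below is only a minimal sketch: `first_spike`, its `window` and `factor` parameters, and the synthetic `norms` list are all hypothetical and not part of this repo.

```python
from statistics import median

def first_spike(grad_norms, window=50, factor=10.0):
    """Return the index of the first norm that exceeds `factor` times the
    median of the preceding `window` norms, or None if there is no such value.
    (Hypothetical helper, not part of the repo.)"""
    for i in range(window, len(grad_norms)):
        baseline = median(grad_norms[i - window:i])
        if grad_norms[i] > factor * baseline:
            return i
    return None

# Synthetic example values (not real training logs): a flat run, then a blow-up.
norms = [1.0] * 100 + [50.0, 400.0]
spike_at = first_spike(norms)
print(spike_at)  # 100
```

Using the median rather than the mean keeps a single earlier outlier from inflating the baseline.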

@Makiyuyuko
Author

Additional information:

    soundstream = SoundStream(
        codebook_size=1024,
        rq_num_quantizers=8,
        rq_groups=2,  # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
        attn_window_size=128,  # local attention receptive field at bottleneck
        attn_depth=2,
    )

lr = 2e-4

(I didn't change these; should I?)

@Makiyuyuko Makiyuyuko changed the title Questions about training Soundstream: poor intelligibility after 10k steps. (sr=16k, B=96) Questions about training Soundstream: poor intelligibility and gradients explosion after 10k steps. (sr=16k, B=96) Jun 29, 2023