Out-of-memory during fine-tuning #59

Closed
sch0ngut opened this issue Nov 22, 2023 · 5 comments

Comments

@sch0ngut

I'm trying to run fine-tuning on a dataset of roughly 1.5 h of audio with an average clip length of ~7.5 s. My hardware is 4 GeForce RTX 3090 GPUs with 24 GB of VRAM each. Unfortunately, training always crashes with an OOM (out-of-memory) error after the first couple of steps. I've checked both the README and the discussion in issue #10, but none of the suggestions seem to work, i.e. even values such as

  • max_len: 50
  • batch_percentage: 0.125

are not working. I'm using the command as suggested in the README, i.e.

python train_finetune.py --config_path ./Configs/config_ft.yml

where config_ft.yml is unchanged except for the values above and my own dataset.
Any other suggestions for how to get training to run without hitting OOM?

@Kreevoz

Kreevoz commented Nov 22, 2023

The repo author mentioned in issue #48 that a max_len of at least 80 (80 * 300 / 24000 = 1 second of audio) is required as a bare minimum. Any shorter and you will not get useful results.

batch_size can be reduced, but should be larger than 1.
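
To make the arithmetic explicit, here is a small sketch of the frame-to-seconds conversion (assuming the hop length of 300 and 24 kHz sampling rate used in the numbers above):

# Convert max_len (in mel frames) to audio duration in seconds,
# assuming hop length 300 and a 24 kHz sampling rate as above.
def max_len_to_seconds(max_len, hop_length=300, sample_rate=24000):
    return max_len * hop_length / sample_rate

print(max_len_to_seconds(80))  # 1.0   -> suggested bare minimum
print(max_len_to_seconds(50))  # 0.625 -> too short to give useful results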

@yl4579

yl4579 (Owner) commented Nov 22, 2023

Please check the colab demo: https://github.com/yl4579/StyleTTS2/blob/main/Colab/StyleTTS2_Finetune_Demo.ipynb. You can fine-tune with a batch size of only 2, but try not to reduce max_len, because quality degrades significantly if it's too short. Also, the parameters you listed are for the SLM adversarial training run, so they don't matter for the first few epochs.
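
For illustration only (not taken from the colab, and the exact key names in config_ft.yml may differ from what is assumed here), a sketch of lowering the memory-related settings while leaving max_len alone:

# Hypothetical sketch: reduce the batch size in config_ft.yml without touching max_len.
# Key names (batch_size, slmadv_params.batch_percentage) are assumed from the
# repo's example config; verify them against your own copy.
import yaml

with open("Configs/config_ft.yml") as f:
    cfg = yaml.safe_load(f)

cfg["batch_size"] = 2  # a batch size of 2 is reported to be enough for fine-tuning
# max_len is deliberately left at its default; shortening it hurts quality.
if "slmadv_params" in cfg:
    cfg["slmadv_params"]["batch_percentage"] = 0.5  # only affects the SLM adversarial run

with open("Configs/config_ft.yml", "w") as f:
    yaml.safe_dump(cfg, f)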

yl4579 closed this as completed Nov 22, 2023
@cnlinxi

cnlinxi commented Nov 25, 2023

@yl4579 Thank you for contributing such great work.

BTW, why does StyleTTS-2 require so much GPU memory?

I tried to fine-tune it on an A800 (80 GB), and the only change I made was setting the batch size to 4; by epoch 15 it needs nearly 68 GB. At the beginning of fine-tuning it seems to update only BERT/TextEncoder/ASR/StyleEncoderx2/ProsodyPredictor/Decoder/Diffusion, and in the joint training phase it even needs to update the parameters of WavLM.

Is this normal?

@yl4579

yl4579 (Owner) commented Nov 25, 2023

@cnlinxi It doesn’t update the parameters of WavLM, but it does use its gradients to train the generator. This is unfortunately one of the limitations of using large speech language models; future work may resolve it. You can also skip the joint training part, but it will significantly worsen the quality, as discussed earlier in this thread.
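
As a rough illustration, following the same pattern as the config sketch above (and assuming the fine-tuning config gates the joint stage on a loss_params.joint_epoch key and a top-level epochs key, which may not match your copy exactly), skipping joint training could look like pushing that epoch past the end of training:

# Hypothetical sketch: defer the joint (SLM adversarial) stage so it never runs.
# Assumes config_ft.yml exposes loss_params.joint_epoch and a top-level epochs
# key; check the actual key names before using this.
import yaml

with open("Configs/config_ft.yml") as f:
    cfg = yaml.safe_load(f)

cfg["loss_params"]["joint_epoch"] = cfg["epochs"] + 1  # never reached during training

with open("Configs/config_ft.yml", "w") as f:
    yaml.safe_dump(cfg, f)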

@Curlypla

I tried using the bitsandbytes 8-bit optimizer, but it didn't help (I don't know much about it, so I may have done something wrong): I ended up with the same VRAM usage and slower training.
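
For context, a typical bitsandbytes 8-bit optimizer swap looks roughly like the sketch below (not the exact change tried above; note that AdamW8bit shrinks optimizer-state memory only, so activation memory from long utterances is unaffected, which may explain the unchanged VRAM usage):

# Sketch: swap a standard AdamW for the bitsandbytes 8-bit variant.
# This reduces optimizer-state memory only; activations are unchanged.
import torch
import bitsandbytes as bnb

module = torch.nn.Linear(512, 512)  # placeholder for one of the trained modules

# optimizer = torch.optim.AdamW(module.parameters(), lr=1e-4)  # original
optimizer = bnb.optim.AdamW8bit(module.parameters(), lr=1e-4)  # 8-bit optimizer states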
