Has anyone trained successfully with CycleGAN unpaired? #87
Comments
I was sceptical at first too but I also managed to reproduce the results. I did not dive too deep into it but here are my current findings:
The only thing that really confuses me is the high VRAM usage. @GaParmar, you stated somewhere that you trained the 512x512 models on a GPU with 48GB of RAM. In my experiments I could not get close to this value: on an H100 it needs ~56GB of RAM for a BS of 1. I'm also very confused why gradient accumulation, xformers and TF32 have nearly no effect and only reduce the occupied RAM by at most 5-10%, which is much less of an effect than I see when training other SD models. And finally, training 512x models on an H100 with BS 1 (with gradient accumulation of 8) would take about 12 days to reach 25000 steps, which seems insane to me.
@tim-kuechler interesting observations! I wonder if I can take a model trained on 256 and use it as is for higher resolutions. If not, maybe finetuning it with a method that doesn't tune the full weights will work? I regularly train vision transformers to good performance by only training the attention layers + top layers. On the other hand, LoRA adapters are already used here, so maybe there are no more memory gains to be had?
Hi @tfriedel, depending on your dataset and task, you can try training your model on random crops and then running it at full resolution at test time.
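A minimal sketch of the random-crop idea above, in plain Python so it runs without any image library. The helper names `random_crop_box` and `crop` and the toy 8x6 image are made up for illustration and are not part of the repository; in a real pipeline the same box would be applied to the input image (and, for paired data, to the target image as well).

```python
import random

def random_crop_box(width, height, crop_size, rng=random):
    """Return (left, top, right, bottom) of a random crop_size x crop_size window."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must fit inside the image")
    left = rng.randint(0, width - crop_size)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_size)
    return left, top, left + crop_size, top + crop_size

def crop(image, box):
    """Crop a row-major image (a list of rows) to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

# Toy 8x6 "image" whose pixels record their own (x, y) position.
image = [[(x, y) for x in range(8)] for y in range(6)]

train_patch = crop(image, random_crop_box(8, 6, crop_size=4))  # training: random patch
test_input = image                                             # test time: full image
```

Training memory then depends on the crop size rather than on the full image size. Whether the model generalizes from crops to full resolution depends on the dataset and task, as the comment says.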
Hi @GaParmar, the point @tim-kuechler brought up is really hindering my training. It seems like vae.encode() and vae.decode() in the forward pass of Pix2Pix_Turbo consume enormous amounts of memory, which makes it impossible for me to train with a batch size larger than five on an A100 with 80 GB of RAM. I have followed the guide on how to train pix2pix_turbo on my own paired dataset. Please describe how you can train with a significantly higher batch size.
@swold99 That low batch size is expected. I'm interested in how you found out that the high memory usage is due to the VAE. To increase the effective batch size you can use gradient accumulation (--gradient_accumulation x). If you have, e.g., a native batch size of 4 and use gradient accumulation of 2, this gives an effective batch size of 8, although it also doubles your training time. Normally there should be no difference between a native BS of x and an accumulated BS of x, at least not in model performance. For my trainings I can use a native BS of 8 because I use a version of the H100 with 100GB of VRAM.
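To illustrate the effective-batch-size arithmetic above, here is a small self-contained sketch on a toy one-parameter least-squares model. The data and the helper `grad_mse` are invented for the example and have nothing to do with the actual training script. It checks that two accumulated micro-batches of 4 give the same gradient as one native batch of 8.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# One step with a native batch size of 8.
full_grad = grad_mse(w, xs, ys)

# Two micro-batches of 4, accumulated before a single optimizer step.
native_bs, accum_steps = 4, 2
accum_grad = 0.0
for i in range(accum_steps):
    lo, hi = i * native_bs, (i + 1) * native_bs
    # Scale each micro-batch gradient so the sum is a mean over all 8 samples.
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

effective_bs = native_bs * accum_steps
print(effective_bs, abs(full_grad - accum_grad) < 1e-9)
```

Each micro-batch still runs its own forward and backward pass at the native batch size, so accumulation raises the effective batch size without lowering peak memory per step. The equivalence can also break for layers that use per-batch statistics, such as batch normalization.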
@tim-kuechler I printed the memory usage in many places in the code and noticed that allocated and reserved GPU memory each increased by around 30 GB (batch_size=5) after both vae.encode() and vae.decode().
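A rough sketch of how such measurements could be taken, assuming PyTorch. The helpers `format_mem` and `log_cuda_memory` are hypothetical names, not functions from the repository; `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` are the standard PyTorch calls for allocated and reserved memory.

```python
def format_mem(tag, allocated_bytes, reserved_bytes):
    """Format byte counts as a one-line GiB report."""
    gib = 1024 ** 3
    return (f"[{tag}] allocated={allocated_bytes / gib:.2f} GiB "
            f"reserved={reserved_bytes / gib:.2f} GiB")

def log_cuda_memory(tag):
    """Print current CUDA memory usage; returns None if PyTorch or a GPU is missing."""
    try:
        import torch
    except ImportError:
        print(f"[{tag}] PyTorch not installed")
        return None
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device")
        return None
    torch.cuda.synchronize()  # wait for pending kernels so the numbers are current
    line = format_mem(tag, torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
    print(line)
    return line

# Intended placement (hypothetical call sites around the VAE calls):
# log_cuda_memory("before vae.encode")
# ... vae.encode(x) ...
# log_cuda_memory("after vae.encode")
```

The difference between two consecutive reports gives the memory attributed to the code in between. Allocated memory counts live tensors, while reserved memory also includes blocks cached by PyTorch's allocator.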
Hi @GaParmar, I just copied your code, including the accelerate config and the training code, but I ran into the problems below. Could you please help me analyze them? Thank you very much!
Same here. I can train with batch size 1 and image size 256x256 on an RTX A6000 with 50GB of memory. It consumes 26GB of memory. When I try to increase the resolution to 512, it never succeeds. The largest resolution I can use is 384x384.
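For a sense of scale, the pixel counts of the resolutions mentioned above can be compared directly. This is only back-of-the-envelope arithmetic under the assumption that activation memory grows at least roughly in proportion to the number of pixels; it does not predict actual VRAM usage, since weights and optimizer state stay fixed while attention layers can grow faster than linearly. The helper `pixel_ratio` is a made-up name.

```python
def pixel_ratio(res, base=256):
    """How many times more pixels a res x res image has than a base x base image."""
    return (res * res) / (base * base)

for res in (256, 384, 512):
    print(f"{res}x{res}: {pixel_ratio(res):.2f}x the pixels of 256x256")
```

Going from 256 to 384 already more than doubles the pixel count, and 512 quadruples it.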
I have been reading the open comments here, and I am growing skeptical that the training code works.
Does anyone have it working? Has anyone successfully trained the Zebra example?