Flux Controlnet Train Example, will run out of memory on validation step #9546
Comments
Can you try the following? In place of
do:
del pipeline
gc.collect()
torch.cuda.empty_cache()
There is a problem with
@sayakpaul I tried the newest commit, and now Flux training crashes before training even starts.
Cc @PromeAIpro in that case.
Same here. Did you find the reason?
Sorry for the late response; I just tested a few minutes ago. I tried with 10 validation images and didn't need more CUDA memory, so can you provide a more detailed config? Especially
@PromeAIpro Training no longer crashes on start, but OOM is still happening even with one 512x512 validation image. I followed the Flux README to the letter; here are the readout and launch params.
This really confuses me. I've tried installing the same transformers/accelerate versions as yours, and it works fine.
@PromeAIpro 1 A100 on RunPod, PyTorch 2.4
Maybe it is about precision. My guess is that accelerate tries to convert the params to bf16 but fails, leaving them in fp32. It may be a device-related issue, which would need testing on a RunPod machine. Can you run a simple test to check whether a bf16 conversion takes effect, and paste the nvidia-smi results?
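A simple test like the one asked for above might look as follows. This is a hedged sketch with assumed names (`bf16_cast_works` is not from the training script); it only checks that a bf16 cast takes effect on the given device.

```python
# Minimal check that a bf16 cast actually takes effect on the target device.
# Function name is illustrative; run with device="cuda" on the RunPod machine.
import torch


def bf16_cast_works(device: str = "cuda") -> bool:
    x = torch.randn(4, 4, device=device)  # tensors start in fp32 by default
    y = x.to(torch.bfloat16)              # request a bf16 copy
    # If the cast silently failed, the dtype would still be torch.float32.
    return y.dtype == torch.bfloat16
```

Running `print(bf16_cast_works("cuda"))` alongside `nvidia-smi` output would show whether bf16 is taking effect on that device.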
Dear team (@sayakpaul @PromeAIpro): I followed the instructions in #9543 but still get an OOM issue when running train_controlnet_sd3.py on one A100 80GB GPU. Any ideas?
I also hit this OOM issue during the validation stage.
Consider using these two options
Same problem here: Flux ControlNet always OOMs when running log_validation. I tested it on an A100 (80GB). How can this be solved?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Closing due to inactivity. |
Describe the bug
With the default settings provided in the Flux train example README and 10 validation images, training errors out with an out-of-memory error during the validation step, on an A100 80GB.
Reproduction
Run the Flux ControlNet training example with the default args from the Flux README and 10 validation images.
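A sketch of such a launch, assuming the flag names commonly used by the diffusers ControlNet training scripts; the exact flags and paths in the Flux README may differ, and the model ID, output directory, and image/prompt values below are placeholders.

```shell
# Hedged reproduction sketch; flag names assumed from typical diffusers
# controlnet training scripts, values are placeholders.
accelerate launch train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --output_dir="flux-controlnet-out" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --validation_image "cond_1.png" "cond_2.png" \
  --validation_prompt "prompt one" "prompt two"
```

With 10 validation images passed this way, the OOM described above occurs during the validation step.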
Logs
No response
System Info
Who can help?
@sayakpaul @PromeAIpro