
Fixed the issue on flux dreambooth lora training #9549

Closed
jeongiin wants to merge 5 commits

Conversation

jeongiin (Contributor)

What does this PR do?

Fixes #9548

I resolved the issue by changing the autocast context from nullcontext() to torch.autocast(accelerator.device.type, dtype=torch_dtype). This ensures validation runs in the correct mixed-precision mode, preventing the errors I encountered.
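For context, a minimal sketch of the change inside the validation helper (the function shape and argument names here are assumptions modeled on the training script, not the exact diff):

```python
import torch
from contextlib import nullcontext

def log_validation(pipeline, args, accelerator, torch_dtype):
    # Before the fix: autocast_ctx = nullcontext(), so validation ran
    # outside autocast and could hit dtype-mismatch errors.
    # After the fix: an explicit autocast with the training dtype.
    autocast_ctx = torch.autocast(accelerator.device.type, dtype=torch_dtype)

    with autocast_ctx:
        images = [
            pipeline(args.validation_prompt).images[0]
            for _ in range(args.num_validation_images)
        ]
    return images
```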

I tested with the dog images as suggested in README_flux.md, and I confirmed that the results were successfully uploaded to wandb:
[screenshot: validation results logged to wandb]


Who can review?

@sayakpaul

@linoytsaban (Collaborator)

Thanks @jeongiin! Looks like this will also fix #9476.

@linoytsaban (Collaborator) left a review comment


Looks good to me; I'm just unsure about the commented line with the condition on the last validation (also mentioned in #9476 (comment)).
@sayakpaul, if it looks OK to you, we can merge.

@jeongiin (Contributor, Author)

jeongiin commented Sep 30, 2024

Thank you for reviewing, @linoytsaban!!

I'm not sure if this will be helpful, but when I used just autocast_ctx = torch.autocast(accelerator.device.type) (without an explicit dtype) to solve the issue, a problem similar to issue #9558 occurred: while using wandb, black images were uploaded. Like this:

[screenshot: black validation images in wandb]

@icsl-Jeon (Contributor)

@jeongiin did you check that the prompt_embeds from T5 are free of NaNs?
This might be due to NaN values in your pipeline output.
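A quick way to run that check (an illustrative sketch; how you obtain prompt_embeds depends on your setup):

```python
import torch

def has_nans(t: torch.Tensor) -> bool:
    # True if any element is NaN; a NaN T5 embedding will propagate
    # through the transformer and typically yields black images.
    return bool(torch.isnan(t).any())

# Example (names assumed): check the T5 output before denoising.
# prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(...)
# assert not has_nans(prompt_embeds), "T5 output contains NaNs"
```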

@sayakpaul (Member)

To use autocast successfully with Flux during validation inference, we need to pre-compute the text embeddings because, otherwise, T5-xxl doesn't really work under autocast.

See how it's handled in:

# pre calculate prompt embeds, pooled prompt embeds, text ids because t5 does not support autocast
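A hedged sketch of that pattern (pipeline, accelerator, and torch_dtype are assumed to come from the surrounding training script, and encode_prompt's exact signature may differ):

```python
import torch

def run_validation(pipeline, accelerator, torch_dtype, prompt):
    # 1) Compute text embeddings OUTSIDE autocast: T5-xxl tends to
    #    produce NaNs when run under reduced-precision autocast.
    with torch.no_grad():
        prompt_embeds, pooled_prompt_embeds, _text_ids = pipeline.encode_prompt(
            prompt=prompt, prompt_2=None
        )

    # 2) Run only denoising/decoding under autocast, passing the
    #    pre-computed embeddings so the text encoders are skipped.
    with torch.autocast(accelerator.device.type, dtype=torch_dtype):
        image = pipeline(
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
        ).images[0]
    return image
```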

@jeongiin (Contributor, Author)

jeongiin commented Oct 2, 2024

Thank you for the good advice, @icsl-Jeon and @sayakpaul!
I will follow @sayakpaul's advice and apply it to train_dreambooth_lora_flux.py.

@sayakpaul (Member)

@jeongiin there's another adjacent PR here: #9565

cc: @icsl-Jeon

@sayakpaul (Member)

@jeongiin thank you for the changes, but as mentioned in #9549 (comment), we need to handle the autocasting a bit differently. Let me know if anything is unclear.

@jeongiin (Contributor, Author)

Hello! I apologize for the delay! @sayakpaul

If I understand correctly, you're suggesting that further revisions may be needed, referring to #9565 and the comment on #9549.

Would the issue not be resolved with just the addition of torch.autocast(accelerator.device.type, dtype=torch_dtype)?
In my testing, there didn't seem to be any problems with my modification.

Would you mind clarifying if there's something I might have overlooked?

@sayakpaul (Member)

@jeongiin thanks a lot for your contributions! Could you maybe check if #9565 solves the problems this PR is trying to address?


github-actions bot commented Nov 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label (Issues that haven't received updates) on Nov 5, 2024
@yiyixuxu (Collaborator)

yiyixuxu commented Nov 5, 2024

Should we close this now if it is fixed by #9565? @sayakpaul @linoytsaban

@sayakpaul (Member)

Yes, this can be closed. Sorry @jeongiin for the delay on our side, but we appreciate your willingness to help us.

sayakpaul closed this on Nov 5, 2024
@jeongiin (Contributor, Author)

jeongiin commented Nov 6, 2024

I haven't had enough GPU capacity available lately, so I can't verify this. :(
It's disappointing, but I'll check it out next time I get a chance! Thank you for your help!

Labels: stale (Issues that haven't received updates)
Linked issues that merging may close: Still Issue on flux dreambooth lora training #9237
5 participants