Optimizations to PAG and t2i-zero #43
Conversation
This is great, thanks so much. I am getting an error when I run with CTNMS enabled, perhaps related to making `ema_factor` CPU?
Yeah... those are probably better off as Python scalars. I don't have time to test it now, but what I just committed should be correct. I can look over it again tomorrow if needed.

Works great. Thanks again!
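The scalar fix discussed above boils down to the following pattern (a hedged sketch; only the name `ema_factor` comes from the thread, the surrounding code is illustrative):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4, device=device)

# Before (illustrative): a 0-dim CPU tensor factor. Mixing it into
# device math can error on some setups and forces host<->device
# traffic either way.
ema_factor_tensor = torch.tensor(0.95)

# After: a plain Python scalar broadcasts into device ops with no
# extra tensor allocation and no device sync.
ema_factor = 0.95
smoothed = x * ema_factor
```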
I've made a few optimizations to PAG and t2i-zero.
For PAG, you might not notice the changes much speed-wise, since it's already unavoidably very slow. One tensor was being created on CPU and then moved to the device. This was a problem particularly when using some proposed performance optimizations for A1111 (AUTOMATIC1111/stable-diffusion-webui#15821). With these changes, those optimizations will no longer break PAG.
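The PAG fix follows this general pattern (a minimal sketch; the actual tensor shapes and variable names in the extension differ):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Before: the tensor is created on CPU, then copied to the device.
# The intermediate CPU allocation and host-to-device copy can break
# optimizations that assume tensors are created on the target device.
mask = torch.ones(8, 8).to(device)

# After: create the tensor directly on the target device.
mask = torch.ones(8, 8, device=device)
```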
For t2i-zero, the changes are more substantial. There were a lot of forced device syncs happening. Summarizing the changes:
- Several values were held as CUDA tensors and `aten::item` is called on them, which forces a device sync. None of them needed to be CUDA tensors for anything else they were doing, so I set them all to be CPU tensors, which removed the blocking syncs.
- Where the indices were `None` or `[]`, this was creating a list of all tokens using `torch.tensor(list(range(1, token_count.item())))`, which causes a device sync for every single token in the sequence (!). `torch.arange(1, token_count, device=output.device)` is much more suitable, and creates the tensor directly on-device.

Once the associated torchvision patch and the a1111 optimizations go through, you can expect t2i-zero to run with almost no significant overhead compared to having it disabled.