Implement Deepcache Optimization #14210
Conversation
To test this on SDXL, go to forward_timestep_embed_patch.py and replace "ldm" with "sgm".
Sounds nice. HyperTile didn't bring any speedup on SDXL; how about this one?
Enormous speed boost. Around 2-3x faster when it kicks in. However, I'm currently unable to get good-quality results with it; I think the forward timestep embed patch might need to be further adapted to the SGM version, though I'm not sure.
@gel-crabs I will do more tests within 18 hours, but I expect this to work, as they share the same structure.
@gel-crabs I guess we might have to adjust the indexes of the in/out blocks; the XL UNet is deeper, so using the shallow parts earlier would cache 'noisy' semantic information. Note: the current implementation is quite different from the original paper; it follows the gist snippet, and is more suitable for frequently used samplers.
I adapted it to use the SGM code and the results are exactly the same, so it doesn't need further adaptation for SGM. I'm going to do some testing with the in/out blocks and see how it goes.
Temporary update: I think the implementation should be modified to follow the original paper again. The paper says we should reuse cached values only for nearby steps, not on a duration basis. Although that means we can only optimize the final steps, for SDXL I don't think the current approach is accurate... so this should be fixed again.
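The step-based policy described above can be sketched roughly like this (a framework-free illustration; `cache_interval` and the step count are made up, not the actual patch's names):

```python
# Hypothetical sketch of the paper's step-proximity cache policy: deep
# features are recomputed every `cache_interval` sampler steps, and the
# cached copy is reused for the steps in between.
def should_refresh(step, last_cached_step, cache_interval):
    """Refresh when no cache exists or the cached step is too far away."""
    if last_cached_step is None:
        return True
    return step - last_cached_step >= cache_interval

schedule = []
last = None
for step in range(10):
    if should_refresh(step, last, cache_interval=3):
        last = step
        schedule.append("compute")
    else:
        schedule.append("reuse")
# schedule alternates: one full compute, then two reused steps
```

With this policy the cache is never older than `cache_interval - 1` steps, which matches the paper's "nearby steps" framing rather than a fixed wall-clock or sigma-duration window.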
Very interesting results. Thanks for your effort @aria1th! If you need any assistance, please feel free to reach out to us at any time.
The cache looks like it degrades quality significantly? @aria1th Also, HyperTile doesn't seem to degrade quality, right?
@FurkanGozukara Yes, quality is degraded in XL-type models - it requires more experiments, or maybe a re-implementation. It did not happen with 1.5-type models, though.
I have a feeling it has something to do with the extra IN/MID/OUT blocks in SDXL. For instance, in SD 1.5 IN710 corresponds to one layer, while in SDXL the equivalent is IN710-719 (so 10 blocks compared to 1). The Elements tab in the SuperMerger extension is really good for showing this information. The middle block has 9 extra blocks in SDXL as well, so I'm betting it has something to do with that.
Oops, didn't see the new update. MUCH less quality loss than before. I'm going to keep testing and see what I can find. So the settings are this, right? In block index: 0
Sorry for the spam; results and another question. With these settings on SDXL (In block index: 8) I get next to no quality loss (even an upgrade!); however, the speedup is smaller, pretty much equivalent to a second HyperTile. So my question is: does the block cache index have any effect on the blocks before or after it? For instance, if the out block index is set to 8, does it cache the ones before it as well? I ask because there is another output block with the same resolution, which could be cached in addition to the one already cached. I've gotten similarly high-quality (and faster) results with in-blocks set to 7 and 8, which are the same resolution on SDXL. If it gels with DeepCache, I think a second Cache Out Block Index could give a further speedup.
@gel-crabs I fixed some explanations - for the in types, the setting applies from the index onward, so -1 means cache everything. The timestep is a rather important setting: if we use 1000, we never refresh the cache once we have it. That holds for 1.5-type models - which means they seem to already know what to draw at the first cache point(!). This somehow explains a few more things too... anyway. However, XL models seem to have a problem with this - they have to refresh the cache frequently; they are very dynamic. Unfortunately, refreshing the cache directly increases the cache failure rate, and thus reduces the performance gain... I'll test with mid blocks too.
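A minimal sketch of the in-block index semantics described above, under my reading of it (the function name and block counts are illustrative, not the actual webui option names):

```python
# Hypothetical illustration: the in-block setting applies from the index
# onward, so -1 caches every input block and a high index caches only the
# deepest ones. Block counts are made up for the example.
def cached_in_blocks(in_block_index, num_in_blocks):
    """Return the input-block indices that would be served from cache."""
    return [i for i in range(num_in_blocks) if i > in_block_index]

all_cached = cached_in_blocks(-1, 4)   # every block is cached
deep_only = cached_in_blocks(8, 12)    # only the blocks after index 8
```

Under this reading, raising the index shrinks the cached region toward the deepest (lowest-resolution) blocks, which is why the speedup drops as the index goes up.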
I should also explain why quality degrades even when we cache less than everything: it's input-output mismatching. To summarize, each cache has a corresponding pair (as UNet blocks). In other words, if we increase the input block index, we have to decrease the output block index. (Images will be attached for further reference.) However, I guess I should use a more recent implementation, or convert from the diffusers pipeline... I'll be able to do this in about 12-24 hours.
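The pairing can be illustrated with the skip-connection stack of a U-Net: input block i pushes a skip activation that output block N-1-i pops, which is why moving the input cache index one way forces the output index the other way. A toy, framework-free sketch:

```python
# Toy illustration of U-Net skip-connection pairing: input blocks push
# skip activations onto a stack, output blocks pop them in reverse order,
# so input block i pairs with output block (N - 1 - i).
def skip_pairs(num_blocks):
    """Return (input_block, output_block) pairs joined by a skip connection."""
    stack = []
    for i in range(num_blocks):      # encoder side: push skips in order
        stack.append(i)
    pairs = []
    for o in range(num_blocks):      # decoder side: pop in reverse order
        pairs.append((stack.pop(), o))
    return pairs

pairs = skip_pairs(4)  # deepest input pairs with shallowest output
```

If a cached input block feeds a non-cached output block (or vice versa), the skip connection mixes fresh and stale activations - the "mismatching" described above.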
New implementation - should be tested, though. SD 1.5, 512x704 test, with caching disabled for the initial 40% of steps:
Vanilla with Hypertile / Vanilla without Hypertile
SDXL:
**DeepCache + HR + Hypertile**: 1.47 it/s
Commits:
- maybe... some invalid interrupt method?
- move to paper implementation
- fix descriptions, KeyError
- handle sgm for XL
- fix ruff, change default for out_block
- Implement Deepcache Optimization
@gel-crabs Now it should work for both models!
Yeah, it works great! What Cache Resnet level did you use for SDXL? (Also, what is your Hypertile VAE max tile size?) Oh yeah, and another thing: I'm getting this in the console. But yeah, the speedup here is absolutely immense. Do not miss out on this.
@gel-crabs Resnet level 0, which is the maximum, as intended - VAE max tile size was set to 128, swap size 6.
Ahh, thank you! One more thing: perhaps another step percentage for HR fix? Also, this literally halves the time it takes to generate an image, and it barely even changes the image at all. Thank you so much for your work.
@gel-crabs HR fix will use 100% cache (if the option is enabled; also, the success/failure rate now requires rework - some counts are steps and some are function calls...)
Dang, I just checked with ControlNet and it makes the image go full orange. Dynamic Thresholding works perfectly though.
https://github.com/Mikubill/sd-webui-controlnet/blob/main/scripts/hook.py#L425
https://github.com/aria1th/sd-webui-controlnet/tree/maybe-deepcache-wont-work I tried various implementations, including the diffusers pipeline, and I conclude it does not work well with ControlNet... ControlNet obviously handles timestep-dependent embeddings, which change the output of the U-Net drastically, so this is expected output. Also, I had to patch the ControlNet extension: somehow the hook override was not working when I offered the patched function in place - even though it executed correctly, it completely ignored ControlNet. So, at this point, I will just continue to release this as an extension - unless someone comes up with great compatible code, you should only use it without ControlNet 😢
Aww man, that sucks. This is seriously a game changer. :( Also, it doesn't appear to work with FreeU. The HR fix only speeds up after the original step percentage, I assume because it doesn't cache the steps before that percentage.
@gel-crabs Yeah, most of the U-Net forward-hijacking functions won't work with this; it assumes that nearby steps' effects are similar. Some more academic notes: DDIM works well with this - its hidden states change smoothly, so we can reuse nearby weights. It means that whenever the UNet values have to change abruptly, the caching will mess up. But I guess this could be kinda useful for training - we could force the model to denoise under the cache assumption?
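The "nearby steps are similar" assumption can be made concrete with toy numbers: when deep features drift smoothly, a slightly stale cache is a tiny error; when a hook injects an abrupt, timestep-dependent change (as ControlNet effectively does), the cached value is far off. Everything below is illustrative, not real model values:

```python
# Toy illustration of the assumption DeepCache relies on. Smooth drift
# keeps the cache error small; an abrupt timestep-dependent jump (standing
# in for a ControlNet-style residual) makes the cached value badly stale.
def deep_features(step, abrupt=False):
    base = step * 0.01                       # smooth drift across steps
    return base + (5.0 if abrupt and step >= 5 else 0.0)

def cache_error(abrupt):
    cached = deep_features(4, abrupt)        # cached at step 4...
    return abs(deep_features(6, abrupt) - cached)  # ...reused at step 6

smooth_err = cache_error(False)  # small: only the smooth drift is missed
abrupt_err = cache_error(True)   # large: the jump lands between the steps
```

This is why DDIM-style smooth trajectories tolerate caching well, while anything that perturbs the U-Net output per-step does not.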
Is DeepCache available in the WebUI now? How can we use it?
@bigmover https://github.com/aria1th/sd-webui-deepcache-standalone
Appreciate your hard and awesome work! I'd like to know when or whether ControlNet will be usable together with DeepCache, or if there is any plan to develop that.
Now it does not seem to be working |
Description
DeepCache, Yet another optimization
For adjacent timesteps, the result of each layer can be considered 'almost the same' in some cases.
We can just cache them.
Note: this is more beneficial when we have very many steps, such as in a DDIM environment.
It won't produce a dramatic improvement in few-step inference, especially with LCM.
The implementation was modified from the gist, and compatibility was patched in as well.
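As a rough, framework-free sketch of the idea (not the actual patch; all names and numbers are illustrative): wrap the expensive deep portion of the network, recompute it only every few steps, reuse the stored result in between, and skip caching for the initial steps.

```python
# Hedged sketch of DeepCache-style caching. `deep_fn` stands in for the
# expensive deep U-Net blocks; `cache_interval` and `start_step` mirror the
# kind of options discussed in this PR, but are illustrative names only.
class DeepCacheSketch:
    def __init__(self, deep_fn, cache_interval=3, start_step=0):
        self.deep_fn = deep_fn
        self.cache_interval = cache_interval
        self.start_step = start_step   # no caching during early steps
        self.cache = None
        self.cached_step = None
        self.computed = 0              # how many real deep passes we ran

    def __call__(self, x, step):
        fresh_enough = (
            self.cache is not None
            and step - self.cached_step < self.cache_interval
        )
        if step < self.start_step or not fresh_enough:
            self.cache = self.deep_fn(x)   # full deep computation
            self.cached_step = step
            self.computed += 1
        return self.cache                  # otherwise: reuse cached result

dc = DeepCacheSketch(deep_fn=lambda x: x * 2, cache_interval=3, start_step=2)
outs = [dc(x=1.0, step=s) for s in range(8)]
```

Over 8 steps this runs the deep pass only 4 times (steps 0, 1, 4, 7); the speedup comes from the skipped passes, at the cost of slightly stale deep features in between.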
Speed benchmark with 1.5 models (more results will be added):
512x704, 23 steps, DPM++ SDE Karras sampler, 2x hires with Anime6B, 5-sample inference

| Configuration | Speed |
| --- | --- |
| Vanilla | 2.67 it/s |
| Hypertile (All) | 3.74 it/s |
| DeepCache | 3.02 it/s |
| DeepCache + HyperTile | 4.59 it/s |
Compatibility
The optimization is compatible with ControlNet, at least.
(2.6 it/s at 512x680, 2x hires, vs 2.0 it/s without.)
With both, we can achieve 4.7 it/s - yes, it is faster, because it reuses the whole cache in the hires pass.
Should be tested
We can currently change the checkpoint with Refiner / Hires.fix.
Should we then invalidate the cache, or should we just use it?
Screenshots/videos:
Works with Hypertile too.
Checklist: