
Implement Deepcache Optimization #14210

Draft · wants to merge 5 commits into base: dev
Conversation

@aria1th (Collaborator) commented Dec 5, 2023

Description

DeepCache, yet another optimization.

For adjacent timesteps, the output of each layer can be considered 'almost the same' in some cases.

We can simply cache them.

Note: this is more beneficial when we have many steps, such as in a DDIM setup. It won't produce a dramatic improvement in few-step inference, especially LCM.

The implementation was adapted from the gist and patched for compatibility as well.
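For illustration, here is a minimal sketch of the idea, not the PR's actual code (the class and argument names are made up): run the cheap shallow layers every step, but only recompute the expensive deep blocks when the cache is refreshed.

```python
# Illustrative sketch of the DeepCache idea; names are hypothetical, not the PR code.

class DeepFeatureCache:
    def __init__(self, refresh_interval=3):
        self.refresh_interval = refresh_interval
        self.deep_features = None
        self.steps_since_refresh = 0

    def forward(self, x, t, shallow_in, deep_blocks, shallow_out):
        h = shallow_in(x, t)  # shallow layers are cheap, always recomputed
        if self.deep_features is None or self.steps_since_refresh >= self.refresh_interval:
            self.deep_features = deep_blocks(h, t)  # expensive deep blocks, refreshed rarely
            self.steps_since_refresh = 0
        else:
            self.steps_since_refresh += 1  # adjacent timesteps: reuse the cached deep features
        return shallow_out(h, self.deep_features, t)
```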

Speed benchmark with SD 1.5 models (512x704, 23 steps, DPM++ SDE Karras sampler, 2x hires with Anime6B, 5-sample inference); results will be added:

- Vanilla: 2.67 it/s
- HyperTile (all): 3.74 it/s
- DeepCache: 3.02 it/s
- DeepCache + HyperTile: 4.59 it/s

Compatibility

The optimization is compatible with ControlNet, at least (2.6 it/s at 512x680 2x vs. 2.0 it/s without).
With both, we can achieve 4.7 it/s - yes, it is faster, because it reuses the whole cache in the hires pass.

Should be tested

We can currently change the checkpoint with Refiner / Hires. fix.
Should we then invalidate the cache, or just keep using it?

Screenshots/videos:

(screenshot omitted)

Works with HyperTile too.


@gel-crabs (Contributor)

To test this on SDXL, go to forward_timestep_embed_patch.py and replace "ldm" with "sgm"
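For context, a rough sketch of what that swap amounts to (assuming the patch targets the Stability repos bundled with the webui; the actual contents of forward_timestep_embed_patch.py may differ): SD 1.5 models live under the `ldm` package and SDXL under `sgm`, so the patch has to import the class it overrides from the matching package.

```python
# Hypothetical sketch: pick the module tree that matches the loaded model.
# SD 1.5 ships under `ldm.*`, SDXL under `sgm.*`; the patched class is assumed
# to be TimestepEmbedSequential from the corresponding openaimodel module.
try:
    from sgm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential  # SDXL
except ImportError:
    from ldm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential  # SD 1.5
```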

@FurkanGozukara

Sounding nice.

HyperTile didn't bring any speedup on SDXL.

How about this one?

@gel-crabs (Contributor) commented Dec 5, 2023

> Sounding nice. HyperTile didn't bring any speedup on SDXL. How about this one?

Enormous speed boost, around 2-3x faster when it kicks in. However, I'm currently unable to get good-quality results with it; I think the forward timestep embed patch might need to be further adapted to the SGM version, though I'm not sure.

@aria1th (Collaborator, Author) commented Dec 5, 2023

@gel-crabs I will do more tests within 18 hours, but I guess this should work (as they share the same structure).
@FurkanGozukara The XL code was released 5 hours ago, but I will only have a chance to implement this within a day... not immediately. The code seems to be very large...

@aria1th (Collaborator, Author) commented Dec 5, 2023

@gel-crabs I guess we might have to adjust the indexes of the in/out blocks; the XL UNet is deeper, so using shallow parts earlier would lead to caching 'noisy' semantic information.

Note: the current implementation is quite different from the original paper and follows the gist snippet... it is more suitable for frequently used samplers.

@gel-crabs (Contributor)

> @gel-crabs I will do more tests within 18 hours, but I guess this should work (as they share the same structure). [...]

I adapted it to use the SGM code and the results are exactly the same, so it doesn't need to be further adapted to SGM. I'm going to do some testing with the in/out blocks and see how it goes.

@aria1th (Collaborator, Author) commented Dec 6, 2023

Temporary update: I think the implementation should be modified to follow the original paper again.

The original paper says that we should sample the values at nearby steps, not on a duration basis.

Although we can then only optimize the last few steps, for SDXL I don't think the current one is accurate... thus this should be fixed again.

Block indexes: 0, 0, 0 (screenshots omitted)

aria1th marked this pull request as draft December 6, 2023 02:39
@gel-crabs (Contributor) commented Dec 6, 2023

Alright, I think I've gotten the correct blocks for SDXL (screenshot omitted):
So pretty much just the Cache In Block Indexes changed to 8 and 7.

Still some quality loss; the contrast is noticeably higher, which I've found is caused by the cache mid.

@aria1th (Collaborator, Author) commented Dec 6, 2023

768x768 test:

- HyperTile only: 7.86 it/s
- Index 0, 0, 0: cache rate 27.23%, 8.03 it/s
- Index 8, 8, 8: cache rate 27.23%, 8.61 it/s
- Index 0, 0, 5: cache rate 42.37%, 10.8 it/s
- Index 0, 0, 6: cache rate 45.4%, 11.1 it/s
- Index 0, 0, 8: cache rate 51.45%, 11.51 it/s
- Index 0, 0, 8 + cache out start timestep 600: cache rate 46.2%, 10.42 it/s
- Index 0, 0, 8 + cache out start timestep 600 + interval 50: cache rate 34.9%, 9.18 it/s

@gel-crabs I think we can use 0, 0, 8 for most cases.

@VainF commented Dec 6, 2023

Very interesting results. Thanks for your effort @aria1th! If you need any assistance, please feel free to reach out to us at any time.

@FurkanGozukara

The cache looks like it degrades quality significantly? @aria1th

Also, HyperTile looks like it does not degrade quality, right?

@aria1th (Collaborator, Author) commented Dec 6, 2023

@FurkanGozukara Yes, quality is degraded in XL-type models - it requires more experiments or... maybe re-implementation. It did not happen with 1.5-type models though.

@gel-crabs (Contributor)

> @FurkanGozukara Yes, quality is degraded in XL-type models - it requires more experiments or... maybe re-implementation. It did not happen with 1.5-type models though.

I have a feeling it has something to do with the extra IN/MID/OUT blocks in SDXL. For instance, in SD 1.5, IN710 corresponds to a layer, while in SDXL the equivalent is IN710-719 (so 10 blocks compared to 1).

The Elements tab in the SuperMerger extension is really good for showing this information. The middle block has 9 extra blocks in SDXL as well, so I'm betting it has something to do with that.

@gel-crabs (Contributor) commented Dec 6, 2023

Oops, didn't see the new update. MUCH less quality loss than before. I'm gonna keep testing and see what I can find.

So the settings are this, right?

In block index: 0
In block index 2: 0
Out block index: 8

@gel-crabs (Contributor)

Sorry for the spam, results and another question:

So with these settings on SDXL:

In block index: 8
In block index 2: 8
Out block index: 0
All starts set to 800, plus timestep refresh set to 50

I get next to no quality loss (even an upgrade!); however, the speedup is smaller, pretty much equivalent to a second HyperTile. So my question is: does the block cache index have any effect on the blocks before or after it? For instance, if the out block index is set to 8, does it cache the ones before it as well?

I ask this because there is another output block with the same resolution, which could be cached in addition to the output block cached already. I've gotten similarly high quality (and faster) results with in-blocks set to 7 and 8, which are the same resolution on SDXL.

If it gels with DeepCache I think a second Cache Out Block Index could result in a further speedup.

@aria1th (Collaborator, Author) commented Dec 6, 2023

@gel-crabs I fixed some explanations - for the 'in' types, caching applies to blocks after the index, so -1 means all caching.
For the 'out' types, it applies to blocks before the index, so 9 means all.

The timestep is a fairly important feature - if we use 1000, it means we won't refresh the cache at all once we have it.

This holds for 1.5-type models, which means they seem to already know what to draw at the first cache point (!). This somehow explains a few more things too... anyway.

However, XL models seem to have a problem with this - they have to refresh the cache frequently; they are very dynamic.

Unfortunately, refreshing the cache directly increases the cache failure rate, and thus reduces the performance gain...

I'll test with mid blocks too.
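A rough sketch of the gating described above (the option names here are assumptions, not necessarily the PR's): 'in' indexes cache everything after the configured index, 'out' indexes cache everything before it, and the cache is only rebuilt once the diffusion timestep has moved far enough.

```python
# Sketch of the caching gates described above; option names are assumptions.

def cache_in_block(block_index: int, cache_in_index: int) -> bool:
    # 'in' blocks: caching applies after the index, so -1 caches every input block.
    return block_index > cache_in_index

def cache_out_block(block_index: int, cache_out_index: int) -> bool:
    # 'out' blocks: caching applies before the index, so (per the comment above)
    # 9 caches every output block.
    return block_index < cache_out_index

def should_refresh(last_cached_t: int, current_t: int, refresh_interval: int) -> bool:
    # Diffusion timesteps count down from ~1000, so an interval of 1000 means the
    # cache is filled once and never refreshed.
    return last_cached_t - current_t >= refresh_interval
```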

@aria1th (Collaborator, Author) commented Dec 6, 2023

I should also explain why quality gets degraded even when we use less caching than all-caching: it is about input-output mismatching.

To summarize, the caches come in corresponding pairs (as UNet blocks).

In other words, if we increase the input block index level, then we have to decrease the output block index level.

(Images will be attached for further reference.)

However, I guess I should use a more recent implementation - or convert from the pipeline... I'll be able to do this in about 12-24 hours.
https://gist.github.com/laksjdjf/435c512bc19636e9c9af4ee7bea9eb86
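To illustrate the pairing (a simplified UNet forward, not the webui code): each input block pushes its activation onto a skip stack that the mirrored output block pops, so the cache boundaries on the input and output side have to move together, or stale and fresh features get concatenated.

```python
# Simplified UNet forward showing why in/out cache indexes are paired; illustrative only.
import torch

def unet_forward(x, t_emb, input_blocks, middle_block, output_blocks):
    skips = []
    h = x
    for block in input_blocks:
        h = block(h, t_emb)
        skips.append(h)                      # input block i feeds a skip connection
    h = middle_block(h, t_emb)
    for block in output_blocks:
        skip = skips.pop()                   # output block j consumes the skip of its mirror
        h = block(torch.cat([h, skip], dim=1), t_emb)
    return h
```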

@aria1th (Collaborator, Author) commented Dec 7, 2023

New implementation - should be tested though:
https://github.com/aria1th/sd-webui-deepcache-standalone

SD 1.5

512x704 test, with caching disabled for the first 40% of steps.

Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 512x704, Model hash: 8c838299ab, VAE hash: 79e225b92f, VAE: blessed2.vae.pt, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679

- Enabled, reusing cache for HR steps: 5.68 it/s
- Enabled: 4.66 it/s
- Vanilla with HyperTile: 2.21 it/s
- Vanilla without HyperTile: 1.21 it/s
- Vanilla with DeepCache only: 2.83 it/s

SD XL:

1girl
Negative prompt: easynegative, nsfw
Steps: 23, Sampler: DPM++ SDE Karras, CFG scale: 8, Seed: 3335110679, Size: 768x768, Model hash: 9a0157cad2, VAE hash: 235745af8d, VAE: sdxl_vae(1).safetensors, Denoising strength: 0.5, Hypertile U-Net: True, Hypertile U-Net max depth: 2, Hypertile U-Net max tile size: 64, Hypertile U-Net swap size: 12, Hypertile VAE: True, Hypertile VAE swap size: 2, Hires upscale: 2, Hires upscaler: R-ESRGAN 4x+ Anime6B, Version: v1.7.0-RC-16-geb2b1679

- DeepCache + HR + HyperTile: 2.65 it/s, 16.41 GB (fp16)
- Without optimization: 1.47 it/s

maybe... some invalid interrupt method?

Commits:
- move to paper implementation
- fix descriptions, KeyError
- handle sgm for XL
- fix ruff, change default for out_block
- Implement Deepcache Optimization
@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs Now it should work for both models!

aria1th marked this pull request as ready for review December 7, 2023 16:54
@gel-crabs (Contributor) commented Dec 7, 2023

> @gel-crabs Now it should work for both models!

Yeah, it works great! What Cache Resnet level did you use for SDXL?

(Also, what is your Hypertile VAE max tile size?)

Oh yeah, and another thing: I'm getting this in the console (screenshot omitted).

But yeah, the speedup here is absolutely immense. Do not miss out on this.

@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs Resnet level 0, which is the max, as it is supposed to be - VAE max tile size was set to 128, swap size 6.
The logs are removed!

@gel-crabs (Contributor)

> @gel-crabs Resnet level 0, which is the max, as it is supposed to be - VAE max tile size was set to 128, swap size 6. The logs are removed!

Ahh, thank you! One more thing, perhaps another step percentage for HR fix?

Also, this literally halves the time it takes to generate an image. And it barely even changes the image at all. Thank you so much for your work.

@aria1th (Collaborator, Author) commented Dec 7, 2023

@gel-crabs HR fix will use the cache 100% (if the option is enabled; also, the success/failure rate reporting now requires rework, since some counts are steps and some are function calls...).
But I guess it has to be checked with ControlNet / other extensions too.

@gel-crabs (Contributor)

> @gel-crabs HR fix will use the cache 100% (if the option is enabled...). But I guess it has to be checked with ControlNet / other extensions too.

Dang, I just checked with ControlNet and it makes the image go full orange. Dynamic Thresholding works perfectly though.

aria1th marked this pull request as draft December 7, 2023 18:22
@aria1th (Collaborator, Author) commented Dec 7, 2023

https://github.com/Mikubill/sd-webui-controlnet/blob/main/scripts/hook.py#L425
Okay, this explains why we have a bunch more big code...

@aria1th (Collaborator, Author) commented Dec 8, 2023

https://github.com/aria1th/sd-webui-controlnet/tree/maybe-deepcache-wont-work

I was trying various implementations, including the diffusers pipeline, and I guess it does not work well with ControlNet...

horseee/DeepCache#4

ControlNet obviously handles timestep-dependent embeddings, which change the output of the U-Net drastically.

Thus, this is the expected output.

(screenshots omitted: the degraded ControlNet + DeepCache result compared to the normal output)

Also, I had to patch the ControlNet extension; somehow the hook override was not working if I offered the patched function in-place - even though it executed correctly, it completely ignored ControlNet.

Thus, at this level, I will just continue to release this as an extension - unless someone comes up with great compatible code, you should only use it without ControlNet 😢
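Roughly, the ControlNet conflict described above looks like this (a hedged sketch with made-up names, not the extension's code): ControlNet recomputes its residuals from the control image at every step, but whenever DeepCache reuses its cached deep features, those fresh residuals never make it into the output.

```python
# Illustrative only: why cached deep features drop ControlNet's per-step residuals.

def cached_unet_step(x, t, cache, controlnet, control, shallow_in, deep_blocks, shallow_out):
    residuals = controlnet(x, t, control)        # depends on the current timestep t
    h = shallow_in(x, t)
    if cache.get("deep") is None:                # refresh step: residuals are baked in
        cache["deep"] = deep_blocks(h, t, residuals)
    # cached step: stale deep features are reused, the residuals computed above are discarded
    return shallow_out(h, cache["deep"], t)
```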

@gel-crabs (Contributor)

Aww man, that sucks. This is seriously a game changer. :(

Also, it doesn't appear to work with FreeU. The HR fix only speeds up after the original step percentage; I assume this is because it doesn't cache the steps before that percentage.

@aria1th (Collaborator, Author) commented Dec 8, 2023

@gel-crabs Yeah, most of the U-Net forward-hijacking functions won't work with this; it assumes that nearby steps' effects are similar.

Some more academic stuff:

- DDIM works well with this: its hidden states change smoothly, so we can use nearby weights.
- LCM won't even work with this.
- Some schedulers change drastically during the initial steps, so we can safely disable caching for those steps - yes, that's what you see as a parameter (see the sketch below).

It means that whenever the UNet values have to change, the caching will mess up.

But I guess for training this could be kind of useful - we could force the model to denoise under the cache assumption?
(Meanwhile, HyperTile is already useful for training.)
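A small sketch of that parameter (names assumed): caching stays off for an initial fraction of the sampling steps, since the latents change too quickly there for cached features to be a good substitute.

```python
# Sketch of the "disable caching for initial steps" option; names are assumptions.

def caching_enabled(current_step: int, total_steps: int, disable_ratio: float = 0.4) -> bool:
    # Early steps change the latents too fast for cached deep features to stand in,
    # so caching only kicks in after the configured fraction of steps has passed.
    return current_step >= int(total_steps * disable_ratio)

# With 23 steps and the 40% setting used in the earlier benchmark,
# caching becomes active from step 9 onward.
```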

@bigmover commented Mar 4, 2024

> @gel-crabs Yeah, most of the U-Net forward-hijacking functions won't work with this; it assumes that nearby steps' effects are similar. [...]

Is DeepCache available in the WebUI now? How can we use it?

@aria1th (Collaborator, Author) commented Mar 4, 2024

@bigmover https://github.com/aria1th/sd-webui-deepcache-standalone
Please use the extension, and note that it can't be used with ControlNet or some other extensions that hijack the U-Net.

@bigmover commented Apr 16, 2024

> @bigmover https://github.com/aria1th/sd-webui-deepcache-standalone Please use the extension, and note that it can't be used with ControlNet or some other extensions that hijack the U-Net.

Appreciate your hard and awesome work! I'd like to know whether ControlNet can be used with DeepCache now, or whether there is any plan to develop that.

@Bocchi-Chan2023

> @bigmover https://github.com/aria1th/sd-webui-deepcache-standalone Please use the extension, and note that it can't be used with ControlNet or some other extensions that hijack the U-Net.

It does not seem to be working now.
