
Support FLUX nf4 & pf8 for GPUs with 6GB/8GB VRAM (method and checkpoints by lllyasviel) #9149

tin2tin opened this issue Aug 11, 2024 · 14 comments

tin2tin commented Aug 11, 2024

lllyasviel/stable-diffusion-webui-forge#981

Flux Checkpoints
The currently supported Flux checkpoints are:

- flux1-dev-bnb-nf4.safetensors: full flux-dev checkpoint with the main model in NF4. <- Recommended
- flux1-dev-fp8.safetensors: full flux-dev checkpoint with the main model in FP8.
Basic facts:

(i) NF4 is significantly faster than FP8. For GPUs with 6GB/8GB VRAM, the speed-up is about 1.3x to 2.5x (pytorch 2.4, cuda 12.4) or about 1.3x to 4x (pytorch 2.1, cuda 12.1). I just tested a 3070 Ti laptop (8GB VRAM): FP8 runs at 8.3 seconds per iteration, while NF4 runs at 2.15 seconds per iteration (in my case, 3.86x faster). This is because NF4 uses the native bnb.matmul_4bit rather than torch.nn.functional.linear: casts are avoided and computation is done with many low-bit cuda tricks. (Update 1: bnb's speed-up is less salient on pytorch 2.4, cuda 12.4; newer pytorch may use an improved fp8 cast.) (Update 2: the above numbers are not a benchmark - I only tested a few devices, and other devices may perform differently.) (Update 3: I just tested more devices; the speed-up is somewhat random, but I always see speed-ups - I will give more reliable numbers later!)

(ii) NF4 weights are about half the size of FP8 weights.
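
For diffusers specifically, here is a minimal sketch of what NF4 loading could look like. This is an assumption, not the Forge code path described above: it presumes a diffusers version with bitsandbytes quantization support (BitsAndBytesConfig) and the official black-forest-labs/FLUX.1-dev repository rather than the single-file Forge checkpoint.

```python
# Minimal sketch (assumes diffusers with BitsAndBytesConfig support plus
# bitsandbytes installed). Not the Forge code path described above.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 rather than plain FP4
    bnb_4bit_compute_dtype=torch.bfloat16,  # store 4-bit, compute in bf16
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # helps keep peak VRAM within 6-8 GB budgets

image = pipe("a photo of a corgi wearing sunglasses", num_inference_steps=28).images[0]
image.save("corgi.png")
```

Only the stored weights are 4-bit here; compute still happens in bf16, which is what keeps the memory footprint close to the NF4 checkpoint size without changing the math the model runs.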

@chuck-ma

I think if NF4 is much better than FP8, maybe we can make it usable for all models. (Including SDXL)

Ednaordinary commented Aug 12, 2024

@chuck-ma NF4 is better in some cases and worse in others compared to FP8 (it's hard to tell when, though). NF4 is essentially FP4 (4 bits per weight) plus additional data and changes that calibrate it closer to the original model. Here, that's done by mixing precisions (e.g. using higher-precision bf16 where it matters and lower-precision int4 where it matters a lot less). FP8 is essentially casting down to 8 bits per weight without any sort of calibration, so it's faster to produce but larger, and quite different from NF4.

In addition, there's no real reason other models shouldn't support this, since it doesn't rely on any Flux-specific quirk (Flux just motivated the development of quantization techniques like this for image models).
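
To make the FP8-vs-NF4 distinction concrete, here is a small illustrative sketch. It is a toy comparison under stated assumptions (torch with the float8_e4m3fn dtype, bitsandbytes with a CUDA GPU), not how the checkpoints above were actually produced:

```python
# Toy comparison: elementwise FP8 cast vs. blockwise NF4 quantization.
# Assumes torch>=2.1 (float8_e4m3fn) and bitsandbytes with a CUDA GPU.
import torch
import bitsandbytes.functional as F

w = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# FP8: a straight cast; no calibration statistics are kept.
w_fp8 = w.to(torch.float8_e4m3fn)                          # ~1 byte per weight

# NF4: 4-bit blockwise quantization; quant_state holds the per-block
# absmax statistics used to map values back close to the originals.
w_nf4, quant_state = F.quantize_4bit(w, quant_type="nf4")  # ~0.5 byte per weight

# Reconstruction error relative to the original bf16 weights.
err_fp8 = (w - w_fp8.to(torch.bfloat16)).abs().mean().item()
err_nf4 = (w - F.dequantize_4bit(w_nf4, quant_state)).abs().mean().item()
print(f"mean abs error  fp8: {err_fp8:.5f}  nf4: {err_nf4:.5f}")
```

The real checkpoints additionally keep selected layers in higher precision, which is the "mixing precisions" part described above.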

Swarzox commented Aug 12, 2024

It seems I can't get it to work with diffusers. Do you have any simple example code to make it work?

Many thanks.

@Ednaordinary

@Swarzox This is an issue, not a PR. The code to make this work in diffusers has not been contributed or created yet.

@Ednaordinary

Please see #9165, #9174

sayakpaul commented Aug 15, 2024

Both NF4 and llm.int8() can be done ad hoc with some code changes:
#8746

Serialization and direct loading support will be done through the plan proposed in #9174.

Directly loading the said checkpoint can lead to some problematic results because of the reasons explained in #9165 (comment).

If you want to obtain the text encoders and VAE from that checkpoint, you can use the snippet from #9165 (comment) and then use something like #9177 so that computations run in a higher-precision data-type while the params are kept in a lower-precision data-type such as FP8.

You can also do a direct llm.int8() or NF4 style loading of the bulky T5-xxl and use it within a diffusers pipeline. See: https://gist.github.com/sayakpaul/82acb5976509851f2db1a83456e504f1

There are many options to run things in a memory-efficient way. So, with a programmatic approach, we let you choose what's best for you :)
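
In the spirit of the linked gist (this is a hedged sketch, not the gist's exact code), loading only the bulky T5-xxl encoder in NF4 through transformers and reusing it inside a diffusers FluxPipeline could look roughly like this:

```python
# Sketch: NF4-quantize only the T5-xxl text encoder, then plug it into FluxPipeline.
# Assumes transformers, bitsandbytes, and accelerate are installed.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel
from diffusers import FluxPipeline

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)
# Offloading behavior around quantized modules can differ across versions;
# adjust or drop this line if it errors in your setup.
pipe.enable_model_cpu_offload()
```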

@lonngxiang

How do I use this with diffusers? Is there any example code?

@lonngxiang

Please see #9165, #9174

The code appears to be complex, and the GPU usage far exceeds 8GB.

@Ednaordinary

@lonngxiang

I can't be sure of what you are doing, but I get 9 GiB with CPU offload and ~18 GiB without

CPU offload: [screenshot]

GPU only: [screenshot]

This may also get even better when #9174 is complete/has a working demo
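
For anyone trying to reproduce numbers like these, here is a minimal sketch for measuring peak VRAM of a single generation. It assumes the NF4 transformer setup from the earlier snippet; toggle the offload line to compare the two paths:

```python
# Sketch: measure peak VRAM for one generation, with or without CPU offload.
# Assumes diffusers with BitsAndBytesConfig support and bitsandbytes installed.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # swap for pipe.to("cuda") to measure the GPU-only path

torch.cuda.reset_peak_memory_stats()
_ = pipe("a lighthouse at dusk", num_inference_steps=28)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```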

@lonngxiang

I am very much looking forward to it, hoping for an extremely simple implementation with just a few lines of code.

@dylanisreversing

Hey @Ednaordinary, I saw your comment on the other thread about the error you were encountering and wanted to know how (or if) you resolved it. I am talking about: 'Error in FluxImageGenerator initialization: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.'

I am stuck on the same problem right now. Thank you!

@kanarch66

NF4 support was removed from ComfyUI in the latest update.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sayakpaul

#9213

The flexibility of DiffusionPipeline is that we can use whichever quantization scheme is best suited to each of the individual models involved in a pipeline. I haven't considered the framework overhead of doing that, but in the end we may want to optimize the trade-off between memory and latency, and if so, this flexibility would be good to have.

To illustrate the point, consider FluxPipeline, which has two text encoders, a transformer denoiser, and a VAE.

Quantization strategies like llm.int8() or NF4, or the ones provided in torchao, are suitable for models composed mostly of nn.Linear layers. Conv1D layers are fine because they can be expressed as linear layers in most cases. But for a VAE that has conv layers, quanto might be a better library to use, as it provides better operators and primitives for dealing with conv layers.

https://x.com/RisingSayak/status/1836679359521820704 gives a visual of how this might look.
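
A hedged sketch of what that per-component mix-and-match could look like in practice. The scheme-to-component mapping below is purely illustrative, and it assumes diffusers with bitsandbytes support plus optimum-quanto installed:

```python
# Illustrative sketch: a different quantization scheme per pipeline component.
# Assumes diffusers (with BitsAndBytesConfig), transformers, bitsandbytes,
# and optimum-quanto; the particular choices below are not a recommendation.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel
from transformers import BitsAndBytesConfig as HFBitsAndBytesConfig, T5EncoderModel
from optimum.quanto import freeze, qint8, quantize

# Transformer denoiser (mostly nn.Linear): NF4 via bitsandbytes.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)

# T5-xxl text encoder: llm.int8() via bitsandbytes through transformers.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=HFBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    text_encoder_2=text_encoder_2,
    torch_dtype=torch.bfloat16,
)

# Conv-heavy VAE: int8 weights via quanto, which handles conv layers well.
quantize(pipe.vae, weights=qint8)
freeze(pipe.vae)

pipe.enable_model_cpu_offload()
```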
