
[Quantization] Add quantization support for bitsandbytes #9213

Open · wants to merge 84 commits into main
Conversation

@sayakpaul (Member) commented Aug 19, 2024

What does this PR do?

Come back later.

  • Quantization config class (base and bitsandbytes)
  • Quantizer class (base and bitsandbytes)
  • Utilities related to bitsandbytes
  • from_pretrained() at the ModelMixin level and related changes
  • save_pretrained()
  • NF4 tests
  • INT8 (llm.int8()) tests
  • Docs

Notes

  • Even though I alluded to having a separate QuantizationLoaderMixin in [Quantization] bring quantization to diffusers core #9174, I realized that is not an approach we can take because loading and saving a quantized model is very much baked into the arguments of ModelMixin.save_pretrained() and ModelMixin.from_pretrained(). It is deeply entangled.
  • For the initial quantization support, I think it's okay to not allow passing device_map, because for a pipeline, multiple device_maps can get ugly. This will be dealt with in a follow-up PR by @SunMarc and myself.
  • For the point above, for checkpoints that are found to be sharded (Flux, for example), I have decided to merge them on CPU to simplify the implementation (see the sketch after this list). This will be dealt with in a follow-up PR by @SunMarc.
  • The PR has an extensive testing suite covering training, too. However, I have decided not to add it to our CI yet. We should first let this feature flow into the community and then add the tests to our nightly CI.
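
To give an idea of the shard merging mentioned above, here is a minimal sketch. _merge_sharded_checkpoints is the name used later in this thread; the weight_map layout follows the standard Hub sharding index, and the actual implementation in the PR may differ.

import os
import safetensors.torch

def _merge_sharded_checkpoints(cached_folder, sharded_metadata):
    # every shard file referenced by the index, deduplicated
    shard_files = set(sharded_metadata["weight_map"].values())
    merged_state_dict = {}
    for shard_file in shard_files:
        # load each shard on CPU and fold it into a single state dict
        shard = safetensors.torch.load_file(os.path.join(cached_folder, shard_file), device="cpu")
        merged_state_dict.update(shard)
    return merged_state_dict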

No-frills code snippets

Serialization
import torch 
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline
from accelerate.utils import compute_module_sizes

model_id = "black-forest-labs/FLUX.1-dev"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer", quantization_config=nf4_config, torch_dtype=torch.bfloat16
)
assert model_nf4.dtype == torch.uint8, model_nf4.dtype
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)
print(compute_module_sizes(model_nf4)[""] / 1024 / 1024)

push_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4.push_to_hub(push_id)

Serialized checkpoint: https://huggingface.co/sayakpaul/flux.1-dev-nf4-with-bnb-integration.

NF4 checkpoints of Flux transformer and T5: https://huggingface.co/sayakpaul/flux.1-dev-nf4-pkg (has Colab Notebooks, too).
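
For the llm.int8() path mentioned in the checklist, the config swap is minimal. A sketch reusing the same model_id as above; the NF4-specific arguments simply drop away:

int8_config = BitsAndBytesConfig(load_in_8bit=True)
model_int8 = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer", quantization_config=int8_config, torch_dtype=torch.float16
)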

Inference
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")


@SunMarc (Member) left a comment

Thanks for adding this! I see that you used a lot of things from transformers. Do you think it is possible to import these (or inherit) from transformers? This would help reduce the maintenance burden. I'm also fine doing it this way, since there aren't too many follow-up PRs after a quantizer has been added. About the HfQuantizer class: there are a lot of methods that were created to fit the transformers structure, and I'm not sure we will need every one of them in diffusers. Of course, we can still do a follow-up PR to clean up.

src/diffusers/quantizers/base.py (review thread, resolved)
@sayakpaul (Member, Author) commented Aug 20, 2024

@SunMarc I am guilty as charged, but we don't have transformers as a hard dependency for loading models in diffusers. Pinging @DN6 to seek his opinion.

Update: Chatted with @DN6 as well. We think it's better to redefine inside diffusers without the transformers specific bits which we can clean in this PR.

@sayakpaul (Member, Author) commented:

@SunMarc I think this PR is ready for another review.

@SunMarc (Member) left a comment

Thanks for adding this @sayakpaul !

src/diffusers/quantizers/base.py (review thread)
@yiyixuxu (Collaborator) left a comment

I don't think it makes sense to have a separate PR just to add a base class, because it's hard to understand what methods are needed - we should only introduce a minimal base class and gradually add functionality as needed.

Can we have a PR with a minimal working example?

@sayakpaul (Member, Author) commented Aug 22, 2024

Okay, so do you want me to add everything needed for the bitsandbytes integration in this PR? But do note that this won't be very different from what we have in transformers.

@yiyixuxu (Collaborator) commented Aug 22, 2024

@sayakpaul
I think so because:

  1. It is easier to review that way.
  2. We don't need this class in diffusers on its own because it cannot be used yet, no?

@bghira (Contributor) commented Aug 22, 2024

Sometimes we can make a feature branch that a bunch of PRs get merged into before one big honkin' PR is pushed to main at the end; the pieces are all individually reviewed and can be tested. Is this a viable approach for including quantisation?

@sayakpaul (Member, Author) commented:

Okay I will update this branch. @yiyixuxu

@SunMarc (Member) commented Aug 23, 2024

cc @MekkCyber for visibility

@DN6 (Collaborator) commented Aug 28, 2024

Just a few considerations for the quantization design.

I would say the initial design should start with loading/inference at just the model level and then proceed to add functionality (pipeline-level loading, etc.).

The feature needs to perform the following functions:

  1. Perform on-the-fly quantization of large models so that they can be loaded in a low-memory dtype
     a. with from_pretrained
     b. with from_single_file
  2. Dynamically upcast to the appropriate compute dtype when running inference
  3. Save/load already-quantized versions of these large models (FP8, NF4)
  4. Allow loading/inference with LoRAs in these quantized models (this we have to figure out in more detail)

At the moment, the most common ask seems to be the ability to load models into GPU using the FP8 dtype and run inference in a supported dtype by dynamically upcasting the necessary layers. NF4 is another format that's gaining attention.

So perhaps we should focus on this first. This mostly applies to the DiT models, but large models like CogVideo might also benefit from this approach.

Some example quantized versions of models have been doing the rounds.

To cover these initial cases, we can rely on Quanto (FP8) and BitsandBytes (NF4).

Example API:

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, DiffusersQuantoConfig

# Load the model in FP8 with Quanto and perform compute in the configured dtype.
quantization_config = DiffusersQuantoConfig(weights="float8", compute_dtype=torch.bfloat16)

transformer = FluxTransformer2DModel.from_pretrained(
    "<either diffusers format or quanto format weights>", quantization_config=quantization_config
)

pipe = FluxPipeline.from_pretrained("...", transformer=transformer)

The quantization config should probably take the following arguments

DiffusersQuantoConfig(
    weights_dtype="",  # dtype to store weights
    compute_dtype="",  # dtype to perform inference
    skip_quantize_modules=["ResBlock"],
)

I think initially we can rely on the dynamic upcasting operations performed by Quanto and BnB under the hood to start and then expand on them if needed.
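
As a rough illustration of what "dynamic upcasting" means here, a minimal hand-rolled sketch follows. UpcastLinear is a hypothetical name; the real backends (Quanto, bitsandbytes) do this under the hood with far more efficient fused kernels.

import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcastLinear(nn.Module):
    def __init__(self, linear: nn.Linear, compute_dtype=torch.bfloat16):
        super().__init__()
        # weights stay stored in a low-memory dtype between calls
        self.weight = linear.weight
        self.bias = linear.bias
        self.compute_dtype = compute_dtype

    def forward(self, x):
        # upcast to the compute dtype only for the duration of the matmul
        w = self.weight.to(self.compute_dtype)
        b = self.bias.to(self.compute_dtype) if self.bias is not None else None
        return F.linear(x.to(self.compute_dtype), w, b)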

Some other considerations

  1. Since we have transformers models in diffusers that can also benefit from quantized loading, we might want to consider adding a Diffusers prefix to the quantization configs, e.g. DiffusersQuantoConfig, so that when we import quantization configs from transformers there aren't any conflicts.
  2. For saving and loading models, we can start with models saved in Quanto/BnB format.
  3. One possible challenge with pipeline-level quantized loading is that we have a mix of transformers/diffusers models, so a single config to quantize/load both types might not be possible.
  4. Single-file loading has its own set of issues, such as dealing with checkpoints that have been naively quantized, e.g. safetensors.torch.save_file(model.to(torch.float8_e4m3fn), "model-fp8.safetensors"), which applies to some of the Flux single-file checkpoints, and loading full-pipeline single-file checkpoints. But we can address these later.

@sayakpaul (Member, Author) commented Aug 28, 2024

This PR will stay at the model level. And we should not add multiple backends in a single PR: this PR aims to add bitsandbytes, and we can do other backends using this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.

Concretely, I would like to stick to the outline of the changes laid out in #9174 (along with anything related) for this PR.

The feature needs to perform the following functions

I won't advocate doing all of that in a single PR because it makes things very hard to review. We'd rather move faster with something more minimal and confirm its effectiveness.

Allow loading/inference with LoRAs in these quantized models. (This we have to figure out in more detail)

Well, note that if the underlying LoRA wasn't trained with the base quantization precision, it might not perform as expected.

So perhaps we should focus on this first. This mostly applies to the DiT models but large models like CogVideo might also benefit with this approach.

Please note that bitsandbytes quantization mostly applies to nn.Linear, whereas quanto is broader in scope (i.e., quanto can be applied to an nn.Conv2d as well); see the sketch below.
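
A concrete way to see the scope difference (a sketch, not the libraries' actual replacement logic):

import torch.nn as nn

def modules_bnb_would_convert(model):
    # bitsandbytes int8/4-bit replaces nn.Linear layers only
    return [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]

def modules_quanto_could_convert(model):
    # quanto can also quantize convolutions
    return [name for name, m in model.named_modules() if isinstance(m, (nn.Linear, nn.Conv2d))]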

@DN6 (Collaborator) commented Aug 28, 2024

This PR will be at the model-level itself. And we should not add multiple backends in a single PR. This PR aims to add bitsandbytes. We can do other backends taking this PR as a reference. I would like us to mutually agree on this before I start making progress on this PR.

Sounds good to me.

For this PR let's do:

  1. from_pretrained only
  2. bnb quantization

@stevhliu (Member) left a comment

Thanks, this looks really good! 🔥

Review threads (resolved) on:
  • docs/source/en/api/quantization.md
  • docs/source/en/quantization/bitsandbytes.md (several threads)
  • docs/source/en/quantization/overview.md
@@ -526,7 +526,8 @@ def extract_init_dict(cls, config_dict, **kwargs):
init_dict[key] = config_dict.pop(key)

# 4. Give nice warning if unexpected values have been passed
if len(config_dict) > 0:
only_quant_config_remaining = len(config_dict) == 1 and "quantization_config" in config_dict
A Collaborator commented:
I think it is better not to add to config_dict if it is not going into __init__, i.e. at line 511:

# remove private attributes
config_dict = {k: v for k, v in config_dict.items() if not k.startswith("_")}
# remove quantization_config
config_dict = {k: v for k, v in config_dict.items() if k != "quantization_config"}

src/diffusers/models/modeling_utils.py (review thread)
if hf_quantizer is not None and not _hf_peft_config_loaded and not quantization_serializable:
raise ValueError(
f"The model is quantized with {hf_quantizer.quantization_config.quant_method} and is not serializable - check out the warnings from"
" the logger on the traceback to understand the reason why the quantized model is not serializable."
A Collaborator commented:

But we raised a ValueError here; they are not going to get a traceback, no?

@sayakpaul (Member, Author) replied:

I think it would still print the warnings to the console, hence the wording.

src/diffusers/models/modeling_utils.py (review thread, resolved)
@@ -99,6 +131,8 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike], variant: Optional[
"""
Reads a checkpoint file, returning properly formatted errors if they arise.
"""
if isinstance(checkpoint_file, dict):
A Collaborator commented:

Why are we making this change? When will checkpoint_file be passed as a dict?

@sayakpaul (Member, Author) replied:

We merge the sharded checkpoints (as stated in the PR description and mutually agreed upon internally) in case we're doing quantization:

model_file = _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata)

^ model_file becomes a state dict, which is then loaded by load_state_dict:

state_dict = load_state_dict(model_file, variant=variant)

Hence this change; a sketch of the resulting passthrough follows.
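
Concretely, based on the diff context above, the change amounts to a small passthrough at the top of the function (a sketch; the surrounding file-reading logic is elided):

def load_state_dict(checkpoint_file, variant=None):
    """Reads a checkpoint file, returning properly formatted errors if they arise."""
    if isinstance(checkpoint_file, dict):
        # a merged state dict was passed in directly (sharded-checkpoint case); nothing to read
        return checkpoint_file
    # ... existing file-reading logic follows ...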

src/diffusers/quantizers/bitsandbytes/bnb_quantizer.py (three review threads, resolved)
for k, v in state_dict.items():
# `startswith` to counter for edge cases where `param_name`
# substring can be present in multiple places in the `state_dict`
if param_name + "." in k and k.startswith(param_name):
A Collaborator commented:

k.split('.')[0] == param_name ?

@sayakpaul (Member, Author) replied:

Do you mean if param_name + "." in k and k.split('.')[0] == param_name:?
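
A tiny example of the edge cases the double check guards against (hypothetical keys):

param_name = "proj"
state_dict = {
    "proj.weight": 1,        # contains "proj." and starts with "proj" -> matched
    "block.proj.weight": 2,  # contains "proj." but doesn't start with "proj" -> skipped
    "proj_out.weight": 3,    # starts with "proj" but lacks "proj." -> skipped
}
matched = [k for k in state_dict if param_name + "." in k and k.startswith(param_name)]
assert matched == ["proj.weight"]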

# Unlike `transformers`, we don't know if we should always keep certain modules in FP32
# in case of diffusion transformer models. For language models and others alike, `lm_head`
# and tied modules are usually kept in FP32.
self.modules_to_not_convert = list(filter(None.__ne__, self.modules_to_not_convert))
A Collaborator commented:

Can you provide examples of when this list would contain None?

@sayakpaul (Member, Author) replied:

It is configured via llm_int8_skip_modules within the BitsAndBytesConfig object. It defaults to None in our case because, unlike for language models, we don't know if a default is required. An example follows.
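
For illustration, skipping a module from conversion would look like this (the module name here is made up):

from diffusers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["proj_out"],  # hypothetical module kept in higher precision
)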

@sayakpaul (Member, Author) commented:

@yiyixuxu thanks for your reviews; they were very helpful. I have gone ahead and re-run the tests on audace, and everything is green.

I have addressed your comments and made changes. PTAL.

@chuck-ma commented:
Hi, looks like everything is great. I don't know why the approving review is still processing.
