[Quantization] Add quantization support for bitsandbytes #9213

Open

sayakpaul wants to merge 84 commits into main from quantization-config
Changes from 69 commits
Commits (84)
e634ff2
quantization config.
sayakpaul Aug 19, 2024
02a6dff
fix-copies
sayakpaul Aug 19, 2024
c385a2b
Merge branch 'main' into quantization-config
sayakpaul Aug 20, 2024
0355875
Merge branch 'main' into quantization-config
sayakpaul Aug 20, 2024
e41b494
Merge branch 'main' into quantization-config
sayakpaul Aug 20, 2024
dfb33eb
Merge branch 'main' into quantization-config
sayakpaul Aug 21, 2024
e492655
Merge branch 'main' into quantization-config
sayakpaul Aug 22, 2024
6e86cc0
fix
sayakpaul Aug 22, 2024
58a3d15
modules_to_not_convert
sayakpaul Aug 22, 2024
1d477f9
Merge branch 'main' into quantization-config
sayakpaul Aug 22, 2024
bd7f46d
Merge branch 'main' into quantization-config
sayakpaul Aug 23, 2024
d5d7bb6
Merge branch 'main' into quantization-config
sayakpaul Aug 28, 2024
44c8a75
Merge branch 'main' into quantization-config
sayakpaul Aug 28, 2024
6a0fcdc
add bitsandbytes utilities.
sayakpaul Aug 28, 2024
e4590fa
make progress.
sayakpaul Aug 28, 2024
77a1438
Merge branch 'main' into quantization-config
sayakpaul Aug 29, 2024
335ab6b
fixes
sayakpaul Aug 29, 2024
d44ef85
quality
sayakpaul Aug 29, 2024
210fa1e
up
sayakpaul Aug 29, 2024
f4feee1
up
sayakpaul Aug 29, 2024
e8c1722
Merge branch 'main' into quantization-config
sayakpaul Aug 29, 2024
7f86a71
Merge branch 'main' into quantization-config
sayakpaul Aug 29, 2024
ba671b6
minor
sayakpaul Aug 30, 2024
c1a9f13
up
sayakpaul Aug 30, 2024
4489c54
Merge branch 'main' into quantization-config
sayakpaul Aug 30, 2024
f2ca5e2
up
sayakpaul Aug 30, 2024
d6b8954
fix
sayakpaul Aug 30, 2024
45029e2
provide credits where due.
sayakpaul Aug 30, 2024
4eb468a
make configurations work.
sayakpaul Aug 30, 2024
939965d
fixes
sayakpaul Aug 30, 2024
8557166
Merge branch 'main' into quantization-config
sayakpaul Aug 30, 2024
d098d07
fix
sayakpaul Aug 30, 2024
c4a0074
update_missing_keys
sayakpaul Aug 30, 2024
ee45612
fix
sayakpaul Aug 30, 2024
b24c0a7
fix
sayakpaul Aug 31, 2024
473505c
make it work.
sayakpaul Aug 31, 2024
c795c82
fix
sayakpaul Aug 31, 2024
c1d5b96
Merge branch 'main' into quantization-config
sayakpaul Aug 31, 2024
af7caca
provide credits to transformers.
sayakpaul Aug 31, 2024
80967f5
empty commit
sayakpaul Sep 1, 2024
3bdf25a
handle to() better.
sayakpaul Sep 2, 2024
27415cc
tests
sayakpaul Sep 2, 2024
51cac09
change to bnb from bitsandbytes
sayakpaul Sep 2, 2024
15f3032
fix tests
sayakpaul Sep 2, 2024
77c9fdb
better safeguard.
sayakpaul Sep 2, 2024
ddc9f29
change merging status
sayakpaul Sep 2, 2024
44c4109
courtesy to transformers.
sayakpaul Sep 2, 2024
27666a8
move upper.
sayakpaul Sep 2, 2024
3464d83
better
sayakpaul Sep 2, 2024
b106124
Merge branch 'main' into quantization-config
sayakpaul Sep 2, 2024
330fa0a
Merge branch 'main' into quantization-config
sayakpaul Sep 2, 2024
abc8607
make the unused kwargs warning friendlier.
sayakpaul Sep 3, 2024
31725aa
harmonize changes with https://github.com/huggingface/transformers/pu…
sayakpaul Sep 3, 2024
e5938a6
style
sayakpaul Sep 3, 2024
444588f
trainin tests
sayakpaul Sep 3, 2024
d3360ce
Merge branch 'main' into quantization-config
sayakpaul Sep 3, 2024
d8b35f4
Merge branch 'main' into quantization-config
sayakpaul Sep 3, 2024
859f2d7
Merge branch 'main' into quantization-config
sayakpaul Sep 4, 2024
3b2d6e1
feedback part i.
sayakpaul Sep 4, 2024
5799954
Add Flux inpainting and Flux Img2Img (#9135)
Gothos Sep 4, 2024
8e4bd08
Revert "Add Flux inpainting and Flux Img2Img (#9135)"
sayakpaul Sep 6, 2024
835d4ad
tests
sayakpaul Sep 6, 2024
27075fe
don
sayakpaul Sep 6, 2024
5c00c1c
Merge branch 'main' into quantization-config
sayakpaul Sep 6, 2024
5d633a0
Merge branch 'main' into quantization-config
sayakpaul Sep 8, 2024
c381fe0
Apply suggestions from code review
sayakpaul Sep 10, 2024
3c92878
Merge branch 'main' into quantization-config
sayakpaul Sep 10, 2024
acdeb25
contribution guide.
sayakpaul Sep 11, 2024
aa295b7
Merge branch 'main' into quantization-config
sayakpaul Sep 11, 2024
7f7c9ce
Merge branch 'main' into quantization-config
sayakpaul Sep 15, 2024
55f96d8
Merge branch 'main' into quantization-config
sayakpaul Sep 15, 2024
b28cc65
changes
sayakpaul Sep 17, 2024
8328e86
Merge branch 'main' into quantization-config
sayakpaul Sep 17, 2024
9758942
empty
sayakpaul Sep 17, 2024
b1a9878
fix tests
sayakpaul Sep 17, 2024
971305b
harmonize with https://github.com/huggingface/transformers/pull/33546.
sayakpaul Sep 18, 2024
f41adf1
numpy_cosine_distance
sayakpaul Sep 19, 2024
0bcb88b
Merge branch 'main' into quantization-config
sayakpaul Sep 19, 2024
55b3696
Merge branch 'main' into quantization-config
sayakpaul Sep 20, 2024
4cb3a6d
Merge branch 'main' into quantization-config
sayakpaul Sep 23, 2024
8a03eae
Merge branch 'main' into quantization-config
sayakpaul Sep 24, 2024
53f0a92
Merge branch 'main' into quantization-config
sayakpaul Sep 26, 2024
6aab47c
Merge branch 'main' into quantization-config
sayakpaul Sep 27, 2024
9b9a610
resolved conflicts,
sayakpaul Sep 29, 2024
8 changes: 8 additions & 0 deletions docs/source/en/_toctree.yml
@@ -146,6 +146,12 @@
       title: Reinforcement learning training with DDPO
     title: Methods
   title: Training
+- sections:
+  - local: quantization/overview
+    title: Getting Started
+  - local: quantization/bitsandbytes
+    title: bitsandbytes
+  title: Quantization Methods
 - sections:
   - local: optimization/fp16
     title: Speed up inference
@@ -205,6 +211,8 @@
     title: Logging
   - local: api/outputs
     title: Outputs
+  - local: api/quantization
+    title: Quantization
   title: Main Classes
 - isExpanded: false
   sections:
33 changes: 33 additions & 0 deletions docs/source/en/api/quantization.md
@@ -0,0 +1,33 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# Quantization

Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This makes it possible to load larger models that normally wouldn't fit into memory and to speed up inference. Diffusers supports 8-bit and 4-bit quantization with [bitsandbytes](https://huggingface.co/docs/bitsandbytes/en/index).

Quantization techniques that aren't already supported in Diffusers can be added with the [`DiffusersQuantizer`] class.

<Tip>

Learn how to quantize models in the [Quantization](../quantization/overview) guide.

</Tip>


## BitsAndBytesConfig

[[autodoc]] BitsAndBytesConfig

## DiffusersQuantizer

[[autodoc]] quantizers.base.DiffusersQuantizer
267 changes: 267 additions & 0 deletions docs/source/en/quantization/bitsandbytes.md
@@ -0,0 +1,267 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# bitsandbytes

[bitsandbytes](https://huggingface.co/docs/bitsandbytes/index) is the easiest option for quantizing a model to 8-bit and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance.
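A toy sketch of this mixed-precision decomposition is shown below (purely illustrative: float32 stands in for fp16, and the real bitsandbytes kernels use vector-wise scaling and fused CUDA ops):

```py
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)    # activations (float32 stands in for fp16 here)
W = torch.randn(8, 16)   # weights
threshold = 2.0

# feature columns with large-magnitude activations are the "outliers" and stay in higher precision
outlier_cols = X.abs().max(dim=0).values > threshold
X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]
X_in, W_in = X[:, ~outlier_cols], W[~outlier_cols, :]

def absmax_quantize(t):
    # per-tensor absmax quantization to int8 (the real kernels scale per row/column)
    scale = t.abs().max() / 127
    return (t / scale).round().to(torch.int8), scale

Xq, x_scale = absmax_quantize(X_in)
Wq, w_scale = absmax_quantize(W_in)

# int8 matmul (accumulated in int32), dequantized and added to the outlier matmul
int8_part = (Xq.to(torch.int32) @ Wq.to(torch.int32)).float() * (x_scale * w_scale)
out = X_out @ W_out + int8_part
print(torch.allclose(out, X @ W, atol=0.5))  # close to the full-precision result
```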

4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs.


To use bitsandbytes, make sure you have the following libraries installed:

```bash
pip install diffusers transformers accelerate bitsandbytes -U
```

Now you can quantize a model by passing a [`BitsAndBytesConfig`] to [`~ModelMixin.from_pretrained`]. This works for any model in any modality, as long as it supports loading with [Accelerate](https://hf.co/docs/accelerate/index) and contains `torch.nn.Linear` layers.

<hfoptions id="bnb">
<hfoption id="8-bit">

Quantizing a model in 8-bit halves the memory usage:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
# the non-quantized modules now hold float32 weights
print({p.dtype for p in model_8bit.parameters()})  # e.g. {torch.int8, torch.float32}
```

Once a model is quantized, you can push the model to the Hub with the [`~ModelMixin.push_to_hub`] method. The quantization `config.json` file is pushed first, followed by the quantized model weights.

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```
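For example, after quantizing the transformer as above, pushing it would look like this (the repository name is only a placeholder):

```py
model_8bit.push_to_hub("your-username/flux.1-dev-bnb-8bit")
```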

</hfoption>
<hfoption id="4-bit">

Quantizing a model in 4-bit reduces memory usage by 4x:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config
)
```

By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want:

```py
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.float32
)
# the non-quantized modules now hold float32 weights (4-bit weights are packed into uint8 storage)
print({p.dtype for p in model_4bit.parameters()})  # e.g. {torch.uint8, torch.float32}
```

To push a 4-bit model to the Hub, call [`~ModelMixin.push_to_hub`] after loading it in 4-bit precision. You can also save the serialized 4-bit model locally with [`~ModelMixin.save_pretrained`], as sketched below.
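A minimal sketch of saving and reloading the serialized 4-bit model locally (the directory name is just an example):

```py
model_4bit.save_pretrained("flux.1-dev-transformer-4bit")
model_4bit = FluxTransformer2DModel.from_pretrained("flux.1-dev-transformer-4bit")
```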

</hfoption>
</hfoptions>

<Tip warning={true}>

Training with 8-bit and 4-bit weights is only supported for training *extra* parameters (for example, LoRA adapters).

</Tip>

Check your memory footprint with the `get_memory_footprint` method:

```py
print(model.get_memory_footprint())
```
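The footprint is reported in bytes (assuming the method mirrors the Transformers implementation), so a quick conversion makes it easier to read:

```py
print(f"{model.get_memory_footprint() / 1024**3:.2f} GB")
```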

Quantized models can be loaded with the [`~ModelMixin.from_pretrained`] method without needing to specify the `quantization_config` parameter:

```py
from diffusers import FluxTransformer2DModel

model_4bit = FluxTransformer2DModel.from_pretrained(
    "sayakpaul/flux.1-dev-nf4-pkg", subfolder="transformer"
)
```
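As a rough end-to-end sketch, the quantized transformer can be passed to a pipeline; the prompt and generation settings below are illustrative:

```py
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=model_4bit,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # offload idle components to CPU to reduce peak VRAM usage

image = pipe(
    "a photo of an astronaut riding a horse on the moon",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("astronaut.png")
```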

## 8-bit (LLM.int8() algorithm)

<Tip>

Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)!

</Tip>

This section explores some of the specific features of 8-bit models, such as outlier thresholds and skipping module conversion.

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_threshold=10,
)

model_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, you don't need to quantize every module to 8-bit; in fact, quantizing certain modules can cause instability. For example, for diffusion models like [Stable Diffusion 3](../api/pipelines/stable_diffusion/stable_diffusion_3), the `proj_out` module can be skipped with the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True, llm_int8_skip_modules=["proj_out"],
)

model_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=quantization_config,
)
```
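To confirm the skip took effect, you can inspect the weight dtypes; the attribute paths below are assumptions about the SD3 transformer's module layout:

```py
# the skipped module keeps regular floating-point weights...
print(model_8bit.proj_out.weight.dtype)                          # e.g. torch.float16
# ...while converted torch.nn.Linear layers hold int8 data
print(model_8bit.transformer_blocks[-1].ff.net[2].weight.dtype)  # e.g. torch.int8
```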


## 4-bit (QLoRA algorithm)

<Tip>

Learn more about the details of 4-bit quantization in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).

</Tip>

This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.


### Compute data type

To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:

```py
import torch
from diffusers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=nf4_config,
)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should keep the `bnb_4bit_compute_dtype` and `torch_dtype` values consistent with the data type of the original model weights.

### Nested quantization

Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter.

```py
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
```
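These 4-bit options compose. Below is a sketch combining NF4, bf16 compute, and nested quantization (a common pairing, though the right combination depends on your model and hardware):

```py
import torch
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

combined_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_4bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=combined_config,
)
```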

## Dequantizing `bitsandbytes` models

Once quantized, you can dequantize a model back to its original precision, but this might result in a small loss of quality. Make sure you have enough GPU RAM to fit the dequantized model.

```python
from diffusers import SD3Transformer2DModel, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

double_quant_model = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="transformer",
    quantization_config=double_quant_config,
)
double_quant_model.dequantize()
```
35 changes: 35 additions & 0 deletions docs/source/en/quantization/overview.md
@@ -0,0 +1,35 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

-->

# Quantization

Quantization techniques focus on representing data with less information while also trying not to lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating point numbers and they're quantized to 16-bit floating point numbers, this halves the model size, which makes it easier to store and reduces memory usage. Lower precision can also speed up inference because it takes less time to perform calculations with fewer bits.
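As a back-of-the-envelope illustration (the 12B parameter count is just an example, not tied to any particular model), this is roughly the memory needed to store the weights alone at different precisions:

```py
num_params = 12e9  # example parameter count

for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"{name:>9}: ~{num_params * bytes_per_param / 1024**3:.0f} GB")
```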

<Tip>

Interested in adding a new quantization method to Diffusers? Refer to the Transformers [Contribute new quantization method guide](https://huggingface.co/docs/transformers/main/en/quantization/contribute) to learn more about the general approach.

</Tip>

<Tip>

If you are new to quantization, we recommend checking out these beginner-friendly courses on quantization, created in collaboration with DeepLearning.AI:

* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth/)

</Tip>

## When to use what?

This section will be expanded once Diffusers has multiple quantization backends. Currently, we only support `bitsandbytes`. [This resource](https://huggingface.co/docs/transformers/main/en/quantization/overview#when-to-use-what) provides a good overview of the pros and cons of different quantization techniques.
Review comment (Member):
Yes I think it will be nice to also have a table directly in this doc in the future
