Faster generation using AWQ + Fused modules #27411

Merged: 51 commits, Dec 5, 2023

Conversation

@younesbelkada (Contributor) commented Nov 9, 2023:

What does this PR do?

Introduces a new feature: fused module generation using the autoawq library. Users need to specify the modules they want to fuse inside fusing_mapping.

The API is as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AwqConfig, TextStreamer

model_id = "TheBloke/Mistral-7B-OpenOrca-AWQ"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

quantization_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=512,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt_template = """\
<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer([prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt), prompt_template.format(prompt=prompt)], return_tensors="pt", padding=True).to(torch_device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Before this PR: (demo GIF: before-fusing)

After this PR: (demo GIF: fast-awq-generate)

TODOs:

cc @amyeroberts @casper-hansen @SunMarc

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@younesbelkada (Contributor, Author) commented:

Before moving forward with tests and advanced docs, I would love early feedback on the API described in the PR description. cc @amyeroberts, whenever you have time, I would appreciate your feedback on this PR 🙏 Thanks!

@amyeroberts (Collaborator) left a comment:

Really nice piece of work! 🔥

My main comment is about the structure of the input arguments. It might just be my current understanding from the PR, so please do correct me if I'm wrong.

At the moment, it seems there are two types of arguments which configure the behaviour: the fusing_mapping and max_seq_len. Does max_seq_len have its own config value because we expect this to be the value users modify a lot?

For AwqConfig, do we foresee other fuse arguments being added to configure this behaviour? If so, we might want to bundle them all together into a single fuse_config which the AwqConfig owns.

Resolved review threads: src/transformers/modeling_utils.py (outdated), src/transformers/integrations/awq.py.
Comment on lines +148 to +151
current_fused_mapping["hidden_size"] = hidden_size
current_fused_mapping["num_attention_heads"] = num_attention_heads
current_fused_mapping["num_key_value_heads"] = num_key_value_heads
current_fused_mapping["max_seq_len"] = quantization_config.fuse_max_seq_len
Collaborator:

If hidden_size, num_attention_heads and num_key_value_heads are required arguments, it would be good to verify that these keys exist in the mapping from quantization_config.fusing_mapping.

younesbelkada (Contributor, Author):

I propose to do it in 0a08551, let me know what you think!

Collaborator:

Thanks!

# Handle hidden_size, num_attention_heads, num_key_value_heads on our own.
hidden_size = model.config.hidden_size
num_attention_heads = model.config.num_attention_heads
num_key_value_heads = getattr(model.config, "num_key_value_heads", num_attention_heads)
Collaborator:

This line indicates a possible area of brittleness in the current design: there are going to be many different variations of model attribute names which, if this becomes popular, we'll have to account for. Passing in quantization_config.fusing_mapping is great to provide flexibility for the mapping.

My understanding is that quantization_config.fuse_max_seq_len is passed in separately because the value will be specific to each model's configuration rather than e.g. architecture. The question is: why are params like "hidden_size" and "num_key_value_heads" not passed in with quantization_config.fusing_mapping? I think it would make more sense for them all to be passed in together, or to have two separate dictionaries: one to map layer names and the other to map model-specific configs.

younesbelkada (Contributor, Author):

I proposed to do that to handle different architecture variants (7B, 13B, etc.). I did not find a way to properly map that in AWQ_FUSED_MAPPINGS, so I decided to make it as architecture-agnostic as possible.
For now we only support Mistral and Llama if one passes fuse_modules; for other architectures, users need to manually create a mapping and pass it through AwqConfig. If we want to support other architectures in the future and face attribute errors, I propose to fix them directly in the corresponding config object by adding an attribute_map (https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/configuration_t5.py#L83) so that num_attention_heads and hidden_size are guaranteed to exist. What do you think?
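
A minimal sketch of the attribute_map idea, assuming a hypothetical MyArchConfig whose native attribute names differ from the canonical ones (the T5 config linked above uses the same mechanism to expose hidden_size as d_model):

from transformers import PretrainedConfig

class MyArchConfig(PretrainedConfig):
    # Hypothetical config: map the canonical names the fusing code asks for
    # to this architecture's own attribute names.
    attribute_map = {
        "hidden_size": "d_model",
        "num_attention_heads": "n_heads",
    }

    def __init__(self, d_model=4096, n_heads=32, **kwargs):
        self.d_model = d_model
        self.n_heads = n_heads
        super().__init__(**kwargs)

config = MyArchConfig()
print(config.hidden_size, config.num_attention_heads)  # 4096 32, resolved via attribute_map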

Collaborator:

Sounds good to me!

It handles almost all cases, and if there's anything more complex in the future we can handle it then, when we know more about the problem.

Resolved review thread: src/transformers/integrations/awq.py.
return current_fused_mapping


def fuse_awq_modules(model, quantization_config):
Collaborator:

This is a pretty big function. I'd split it up so that there are private functions for each layer replacement, which are then called within the big for-loop.
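
A minimal structural sketch of that split, with hypothetical helper names and stubbed bodies (the actual refactor is in cde53ef and may differ):

def _fuse_layernorm_layers(module, layernorm_names):
    # Replace each norm listed under the "layernorm" key with its fused counterpart.
    ...

def _fuse_mlp_layers(model, module, mlp_names):
    # Gather the projections listed under the "mlp" key and swap in a fused MLP.
    ...

def _fuse_attention_layers(model, module, attention_names, fused_mapping):
    # Concatenate q/k/v projections and swap in a fused attention module.
    ...

def fuse_awq_modules(model, quantization_config):
    fused_mapping = ...  # built from quantization_config, see current_fused_mapping above
    for name, module in model.named_modules():
        _fuse_layernorm_layers(module, fused_mapping["layernorm"])
        _fuse_mlp_layers(model, module, fused_mapping["mlp"])
        _fuse_attention_layers(model, module, fused_mapping["attention"], fused_mapping)
    return model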

younesbelkada (Contributor, Author):

Done in cde53ef

Collaborator:

Beautiful 🤩

).to(old_module.weight.device)
del old_module
# Replace MLP layers
if hasattr(module, fusing_mapping["mlp"][0]):
Collaborator:

Can we guarantee that fusing_mapping has an "mlp" key, and that it has at least one value?

younesbelkada (Contributor, Author):

I propose to do a more extensive check in b187c07, and I changed the logic there a bit so that these methods do nothing if an empty array is passed in these fields.

Comment on lines 193 to 195
gate_proj = getattr(module, fusing_mapping["mlp"][0])
up_proj = getattr(module, fusing_mapping["mlp"][1])
down_proj = getattr(module, fusing_mapping["mlp"][2])
Collaborator:

gate_, up_, down_ is highly specific to just a few models. Instead, we can generalise this to take a list of linear layers so that any MLP can be passed.
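
A minimal sketch of that generalisation, assuming the "mlp" entry becomes an arbitrary-length list of linear-layer names; the helper name is hypothetical:

def _collect_mlp_linears(module, mlp_layer_names):
    # Resolve every linear layer listed under the "mlp" key, in order,
    # instead of hard-coding gate_proj / up_proj / down_proj.
    layers = []
    for layer_name in mlp_layer_names:
        if not hasattr(module, layer_name):
            return None  # this module does not carry the MLP we are looking for
        layers.append(getattr(module, layer_name))
    return layers

# e.g. linears = _collect_mlp_linears(module, fusing_mapping["mlp"])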

down_proj = getattr(module, fusing_mapping["mlp"][2])

previous_device = gate_proj.qweight.device
activation_fn = ACT2FN[model.config.hidden_act]
Collaborator:

Same comment here regarding config param names: hidden_act is common, but it's not always the name used. Could we pass this in via the fuse config too?

@younesbelkada (Contributor, Author) commented Nov 22, 2023:

Hmm, do you think my suggestion here: #27411 (comment) could be applied in this case?
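
A minimal sketch of the two options under discussion, keeping the config lookup as a fallback and assuming a hypothetical optional "act" key in the fusing mapping; neither line is the merged implementation:

from transformers.activations import ACT2FN

def _resolve_activation(model_config, fusing_mapping):
    # Prefer an explicit (hypothetical) "act" entry in the mapping, otherwise fall
    # back to the common config attribute name; "silu" is the Llama/Mistral default.
    act_name = fusing_mapping.get("act", getattr(model_config, "hidden_act", "silu"))
    return ACT2FN[act_name]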

Resolved review thread (outdated): src/transformers/integrations/awq.py.
@amyeroberts (Collaborator) left a comment:

Thanks for iterating!

Great code, great tests, great docs 💯

Resolved review thread (outdated): src/transformers/utils/quantization_config.py.
Comment on lines +626 to +635
if self.do_fuse and self.modules_to_fuse is not None:
required_keys = [
"hidden_size",
"num_attention_heads",
"num_key_value_heads",
"mlp",
"attention",
"layernorm",
"use_alibi",
]
Collaborator:

Based on these keys, it would be cool to have a tool which automatically generates a config for a model, assuming you wanted to fuse all modules.
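
A minimal hand-written example of such a mapping, built from the required keys above; the layer names follow the Llama/Mistral convention and the sizes are illustrative for a Mistral-7B-style model:

from transformers import AwqConfig

# Custom fusing mapping covering every required key from the check above.
modules_to_fuse = {
    "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "mlp": ["gate_proj", "up_proj", "down_proj"],
    "layernorm": ["input_layernorm", "post_attention_layernorm", "norm"],
    "use_alibi": False,
    "hidden_size": 4096,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
}

quantization_config = AwqConfig(
    bits=4,
    do_fuse=True,
    fuse_max_seq_len=512,
    modules_to_fuse=modules_to_fuse,
)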

@SunMarc (Member) left a comment:

Thanks for making AWQ faster through fused modules 🔥. The design looks great and could easily be extended to other quantization schemes in the future. I left a few comments.

Resolved review threads: docs/source/en/main_classes/quantization.md (3, outdated), src/transformers/integrations/awq.py, src/transformers/modeling_utils.py (3, one outdated), tests/quantization/autoawq/test_awq.py (3).
@younesbelkada (Contributor, Author) commented:

Thanks @amyeroberts @SunMarc for your great reviews!
@SunMarc, I just want to get more clarification on this comment; otherwise, good to merge IMO!

@SunMarc (Member) left a comment:

Thanks for iterating on this! I've left a few comments about two points, but they are not blocking for this PR!

Resolved review threads: src/transformers/modeling_utils.py (2, one outdated).
@younesbelkada merged commit fdb85be into huggingface:main on Dec 5, 2023. 22 checks passed.
@younesbelkada deleted the awq-fused-modules branch on December 5, 2023 at 11:14.