[Awq] Add llava fused modules support #28239
Conversation
Thank you so much, this PR and #28032 got it working well for me now!
Thanks a lot! More discussion on the attended / non-attended tokens is needed IMO! 🤗
# In case a user passes a `AwqConfig` with `do_fuse=True` for models that have
# a `modules_to_not_convert` attribute we need to manually set that attribute into the
# passed `quantization_config`
elif (
    quantization_config.modules_to_not_convert is None
    and "modules_to_not_convert" in config.quantization_config
):
    quantization_config.modules_to_not_convert = config.quantization_config["modules_to_not_convert"]
That seems a bit odd to me. It should either be done in the integration (I know you don't have access to the config there) or when you init the `quantization_config`; you should use `config.quantization_config`, no? (At some point merging kwargs?)
I think we always rely on `quantization_config`; we do merge kwargs, but the other way around (from `quantization_config` to `config.quantization_config`) with `get_loading_attributes()`. The scenario above happens only in the specific case where users pass `do_fuse=True` and a non-`None` value is present in `config.quantization_config["modules_to_not_convert"]`. I think it is a good idea to think of a way to harmonize how kwargs are merged between `config.quantization_config` and `quantization_config`, but that might be slightly out of the scope of this PR since I would need to do it for all the quantization schemes we support. I propose to do that properly in a follow-up PR.
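For concreteness, here is a standalone sketch of that scenario, using plain Python stand-ins rather than the actual transformers objects (the module name is just an example value):

```python
# Stand-in for the user-passed AwqConfig: do_fuse=True, but the user did not
# set modules_to_not_convert themselves.
class DummyQuantConfig:
    def __init__(self, do_fuse=False, modules_to_not_convert=None):
        self.do_fuse = do_fuse
        self.modules_to_not_convert = modules_to_not_convert


# Stand-in for config.quantization_config as serialized in the checkpoint.
checkpoint_quant_config = {"modules_to_not_convert": ["multi_modal_projector"]}

quantization_config = DummyQuantConfig(do_fuse=True)

# Mirror of the branch quoted above: pull modules_to_not_convert from the
# checkpoint config only when the user left it unset.
if (
    quantization_config.modules_to_not_convert is None
    and "modules_to_not_convert" in checkpoint_quant_config
):
    quantization_config.modules_to_not_convert = checkpoint_quant_config["modules_to_not_convert"]

print(quantization_config.modules_to_not_convert)  # ['multi_modal_projector']
```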
Alright let's keep that in mind
valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
new_batch_index = batch_index[valid_indices]
new_non_attended_tokens = non_attended_tokens[valid_indices]
Not a fan of adding custom code only to handle custom usages. There should be a more general way of handling these things (why use the extended attention mask and not just the attention mask, why not use the past key value length, etc.).
This is not to handle custom usage; it happens when a past key value with pad tokens lands on indices that are larger than the extended attention mask shape: #28032 (comment) & #28239 (comment). This can mainly happen in batched generation with long sequence lengths, and it specifically happens for autoawq fused modules because the dummy past key values are initialized with all zeros: https://github.com/casper-hansen/AutoAWQ/blob/a3db8099a234a46a21bf5e46340da60da6992e0c/awq/modules/fused/attn.py#L238
In any case I don't think this will cause any harm since it just filters out indices of pad tokens (which are not attended anyway) that are out of the extended attention mask range, and I confirmed all slow tests pass.
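To make the edge case concrete, here is a small standalone sketch of what the filtering does (the tensor values are made up for illustration and are not taken from a real generation run):

```python
import torch

# Suppose the extended attention mask covers 8 positions, but the dummy past
# key values (pre-filled with zeros by the fused AWQ attention) make some
# padded positions look non-attended at indices >= 8.
extended_attention_mask = torch.ones(2, 8)
batch_index = torch.tensor([0, 0, 1, 1])
non_attended_tokens = torch.tensor([3, 9, 5, 10])  # 9 and 10 are out of range

# Keep only the non-attended indices that actually fall inside the mask;
# out-of-range entries correspond to pad tokens that are never attended anyway.
valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
new_batch_index = batch_index[valid_indices]
new_non_attended_tokens = non_attended_tokens[valid_indices]

print(new_batch_index.tolist(), new_non_attended_tokens.tolist())  # [0, 1] [3, 5]
```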
Alright, I thought this was already solved. No worries then, it's just that tensor indexing might slow things down a bit, but it is required anyway. I think a refactor might help:
- Init the embeddings with a different value (like -1, which might not happen as often as zeros) when we compute the image indexes (see the toy sketch below)
- Correctly update the attention mask when merging to make sure we keep track of what we computed

I'd be in favor of moving this fix to another PR maybe? WDYT?
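For illustration only, a toy sketch of the sentinel idea; the shapes, the -1 fill value, and the detection rule are assumptions for this example, not the actual Llava or AutoAWQ cache layout:

```python
import torch

# If the dummy cache is pre-filled with a value that real activations rarely
# take (e.g. -1) instead of zeros, unwritten / padded cache slots can be
# detected without colliding with genuine zero activations.
hidden_dim = 4
dummy_cache = torch.full((1, 2, 6, hidden_dim), -1.0)      # sentinel-filled cache
dummy_cache[0, :, :3, :] = torch.randn(2, 3, hidden_dim)   # 3 positions actually written

# Positions still equal to the sentinel are the non-attended ones.
batch_index, non_attended_tokens = torch.where(dummy_cache[:, 0, :, 0] == -1.0)
print(non_attended_tokens.tolist())  # [3, 4, 5]
```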
I see, thanks!
I am happy to explore:
> Init the embeddings with a different value (like -1, which might not happen as often as zeros) when we compute the image indexes

in another PR!
> I'd be in favor of moving this fix to another PR maybe? WDYT?

That might not be ideal, because if this fix is not introduced, users cannot run Llava + fused modules :/ I'll address the points you shared in a follow-up PR!
thanks
Thanks for your reviews @ArthurZucker! Merging! I'll address the points you shared in #28239 (comment) in another PR, as stated in my reply.
* add llava + fused modules
* Update src/transformers/models/llava/modeling_llava.py

Co-authored-by: Arthur <[email protected]>
What does this PR do?
This PR adds Llava + fused modules support for blazing fast text generation using Llava + AWQ!
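As a rough usage sketch: the checkpoint id below is a placeholder and the fusing arguments are assumptions for illustration, not a prescription (autoawq needs to be installed):

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AwqConfig, LlavaForConditionalGeneration

# Hypothetical AWQ-quantized Llava checkpoint; replace with a real one.
model_id = "org/llava-1.5-7b-awq"

# Enable fused modules on top of the AWQ checkpoint.
quantization_config = AwqConfig(do_fuse=True, fuse_max_seq_len=2048)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map="cuda:0",
)

image = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```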
This PR also fixes the issue #28032 (comment) pointed out by a user when a custom past key value is passed to the model: keeping only the indexes that fall inside the range of `extended_attention_mask` fixes it. A slow test was also added.
Can also confirm all Llava slow tests pass!
cc @casper-hansen