[WIP] Add FA2 for all Bart-like #26722
Conversation
Verified that FA2 works by checking Whisper. Bart's attention is exactly the same as Whisper's, so it should work as well. I will run some better benchmarks later. @ArthurZucker @younesbelkada could you do a first review here, just for the following files? It would be nice to agree on these files before running. Some comments:
Looks good, yeah!
Regarding your comments, I totally agree about the padding mask; this was my initial concern here. Llama needed less tolerance, but let's update it. Otherwise looks good. Let's make sure the attention is as clean as possible, as it will be the reference for cross attention.
def _flash_attention_forward(
Suggested change:
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward
def _flash_attention_forward(
This is copied from as well, no?
Actually, we need a new causal function argument here to differentiate between non-causal (encoder) and causal (decoder) attention.
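To make the encoder/decoder distinction concrete, here is a minimal sketch in plain NumPy (names and shapes are illustrative, not the PR's code) of how a `causal` flag would switch between full attention for the encoder and future-masked attention for the decoder:

```python
import numpy as np

def attention_weights(q, k, causal=False):
    """Toy scaled dot-product attention; `causal` masks out future key positions."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        seq_len = scores.shape[-1]
        # Upper-triangular entries (j > i) are future tokens for a decoder.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

q = k = np.eye(3)
enc = attention_weights(q, k, causal=False)  # every query sees every key
dec = attention_weights(q, k, causal=True)   # row i only sees keys j <= i
```

In the real forward pass this flag would be forwarded to the FlashAttention kernel, which applies the causal mask internally instead of materializing it.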
@@ -2797,16 +2797,35 @@ def test_flash_attn_2_inference(self):
dummy_input = torch.LongTensor([[1, 2, 3, 4, 5]]).to(torch_device)
Might need decoder input ids as well for cross-attention testing?
Update: the PR now works for Bart. @ArthurZucker @younesbelkada @fxmarty @LysandreJik could you take a look at the design chosen for BART here and, if it's OK, I'll apply it to all other Bart-like models. Please only review
Looks good! There are probably some docstrings (e.g. BartDecoderLayer) whose attention_mask doc should be modified accordingly.
@@ -148,7 +276,9 @@ def __init__(
    num_heads: int,
    dropout: float = 0.0,
    is_decoder: bool = False,
    is_causal: bool = False,
Personal taste, but I would add this arg after bias in case somebody is using positional arguments.
# BartFlashAttention2 attention does not support output_attentions
output_attentions = False
Don't know how it was for Llama, but I would raise an error here in case output_attentions is True.
Yes, fair. The problem, though, is that by now it would be backwards-breaking.
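One backwards-compatible middle ground (a sketch, not what the PR does) is to warn instead of raise, so existing callers that pass the flag keep working but are told it is ignored:

```python
import warnings

def flash_forward(hidden_states, output_attentions=False):
    """Toy stand-in for a FlashAttention forward pass (hypothetical helper)."""
    if output_attentions:
        # Raising here would break callers that already pass the flag,
        # so downgrade to a warning and silently ignore it instead.
        warnings.warn(
            "FlashAttention does not return attention weights; "
            "output_attentions=True is ignored.",
            UserWarning,
        )
        output_attentions = False
    attn_weights = None  # FA never materializes the full attention matrix
    return hidden_states, attn_weights

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out, weights = flash_forward([1.0, 2.0], output_attentions=True)
```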
# TODO: Bart does not have dropout in the config??
# It is recommended to use dropout with FA according to the docs
# when training.
dropout_rate = 0.0  # if not self.training else self.attn_dropout
I think Bart has some dropout:
attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
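In other words, the TODO likely resolves to gating the existing `self.dropout` on training mode rather than hard-coding 0.0. A small sketch (attribute names mirror the quoted line, the class itself is a toy stand-in):

```python
class AttnSketch:
    """Toy holder mirroring the quoted Bart attention attributes."""

    def __init__(self, dropout, training):
        self.dropout = dropout
        self.training = training

    def fa_dropout_rate(self):
        # FlashAttention applies dropout internally, but only during training;
        # at inference time the rate must be 0.0.
        return self.dropout if self.training else 0.0

train = AttnSketch(dropout=0.1, training=True)
infer = AttnSketch(dropout=0.1, training=False)
```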
Only reviewed the Bart modeling file, looks good overall!
It might be slightly better if the attention logic happened in the attention class, but otherwise it's nice that we expose more clearly how attention masks need to be processed! Thanks.
if attention_mask is not None:
    attention_mask = self.causal_attn_mask_converter.to_4d(
        attention_mask, input_shape[-1], key_value_length, dtype=inputs_embeds.dtype
    )
else:
    attention_mask = self.causal_attn_mask_converter.to_causal_4d(
        input_shape[0], input_shape[-1], key_value_length, dtype=inputs_embeds.dtype, device=inputs_embeds.device
    )
Would be nice if to_causal_4d supported feeding a mask and took care of this if/else, no?
Hmm, but the mask is None here, and then I need to pass all these shapes anyway.
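For illustration, a tiny NumPy version (function and argument names hypothetical, not the library's API) of a converter entry point that accepts an optional 2D padding mask and absorbs the if/else internally, at the cost of always taking the shape arguments:

```python
import numpy as np

def make_causal_4d(batch, tgt_len, kv_len, padding_mask=None):
    """Build a [batch, 1, tgt_len, kv_len] additive attention mask.

    Combines the causal triangle with an optional [batch, kv_len] padding
    mask (1 = keep, 0 = pad), so the call site doesn't need an if/else.
    """
    neg = -1e9  # stand-in for the dtype's minimum value
    # Causal part: query i may attend to keys j <= i + (kv_len - tgt_len).
    offset = kv_len - tgt_len
    causal = np.triu(np.full((tgt_len, kv_len), neg), k=offset + 1)
    mask = np.broadcast_to(causal, (batch, 1, tgt_len, kv_len)).copy()
    if padding_mask is not None:
        pad = np.where(padding_mask[:, None, None, :] == 0, neg, 0.0)
        mask = mask + pad
    return mask

m = make_causal_4d(1, 3, 3)                               # causal only
m_pad = make_causal_4d(1, 3, 3, np.array([[1, 1, 0]]))    # last key padded
```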
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
encoder_attention_mask = _expand_mask(encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1])
if getattr(self.config, "_flash_attn_2_enabled", False):
    encoder_attention_mask = encoder_attention_mask if (encoder_attention_mask is not None and 0 in encoder_attention_mask) else None
I think a comment would be nice to explain why we don't pass the mask to FA if there are no 0 values in it.
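The check being discussed amounts to: if the 2D mask contains no zeros, no token is padded, so the mask can be dropped and FlashAttention can take its faster unpadded path. A hedged sketch of that guard with the suggested comment inlined (helper name is made up):

```python
import numpy as np

def prepare_fa2_mask(attention_mask):
    """Return the padding mask, or None if it would be a no-op.

    FlashAttention-2 only needs a padding mask in order to unpad
    variable-length sequences. An all-ones mask means nothing is padded,
    and passing None lets the kernel skip the unpad/repad round-trip.
    """
    if attention_mask is not None and (attention_mask == 0).any():
        return attention_mask
    return None

full = prepare_fa2_mask(np.array([[1, 1, 1]]))   # all ones -> dropped
padded_in = np.array([[1, 1, 0]])
padded_out = prepare_fa2_mask(padded_in)          # has padding -> kept
```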
if getattr(config, "_flash_attn_2_enabled", False):
    self.encoder_attn = BartFlashAttention2(
        embed_dim=self.embed_dim,
        num_heads=config.decoder_attention_heads,
        dropout=config.attention_dropout,
        is_decoder=True,
        config=config,
    )
else:
    self.encoder_attn = BartAttention(
        embed_dim=self.embed_dim,
        num_heads=config.decoder_attention_heads,
        dropout=config.attention_dropout,
        is_decoder=True,
        config=config,
    )
I think a mapping like BART_ATTENTIONS["attention_class"] will be cleaner long term if we add SDPA, flash decoding, etc., specifically given that the init arguments are consistent (and we want this to always be the case).
Sure that makes sense!
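A sketch of the suggested mapping (class bodies and registry name are illustrative stubs): because the attention variants share constructor arguments, selecting an implementation collapses to a dict lookup instead of a growing if/elif chain.

```python
class BartAttention:
    """Stub of the eager attention implementation."""

    def __init__(self, embed_dim, num_heads, dropout=0.0, is_decoder=False, config=None):
        self.embed_dim = embed_dim
        self.num_heads = num_heads

class BartFlashAttention2(BartAttention):
    """Stub of the FlashAttention-2 implementation; same constructor."""

# One place to register new implementations (sdpa, flash decoding, ...).
BART_ATTENTION_CLASSES = {
    "eager": BartAttention,
    "flash_attention_2": BartFlashAttention2,
}

def build_attention(impl, **kwargs):
    # Works only because every registered class accepts the same kwargs.
    return BART_ATTENTION_CLASSES[impl](**kwargs)

attn = build_attention("flash_attention_2", embed_dim=16, num_heads=4, is_decoder=True)
```

Adding a new backend then only touches the dict, not every call site that constructs an attention layer.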
What does this PR do?
Add FA2 to all Bart-like models