
[Assistant Generation] Improve Encoder Decoder #26701

Merged: 13 commits merged into main from improve_assistant_generation_enc_dec on Oct 11, 2023

Conversation

@patrickvonplaten (Contributor) commented Oct 9, 2023

What does this PR do?

This PR speeds up assistant generation / speculative decoding for encoder-decoder models such as Distil-Whisper by roughly 20-30%.

Improvements:

  • If the assistant and the main model share the same encoder, allow the user to pass `assistant_encoder_outputs` so that the inputs are not encoded twice (gives a ~20% speed-up); see the sketch after this list.
  • In the small loop we don't need to allocate new attention-mask tensors on every iteration; the model builds the default mask itself when necessary (gives a ~3-4% speed-up).
  • The heuristic to increase / decrease the number of "look-ahead" tokens doesn't work well for Whisper; can we maybe allow the user to disable it, e.g. via a config attribute?
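
A rough usage sketch of the first point. The `assistant_encoder_outputs` keyword follows this PR's description; the checkpoint ids and the exact `generate()` call are illustrative assumptions rather than the released API:

```python
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
# Distil-Whisper reuses the teacher's encoder, so the encoder outputs can be shared.
assistant = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v2")

# Dummy one-second input; replace with real audio.
inputs = processor(torch.zeros(16_000).numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    # Encode the audio once with the shared encoder ...
    encoder_outputs = model.get_encoder()(inputs.input_features)

# ... and hand the result to both the main model and the assistant, so the input
# is not run through the encoder a second time (keyword assumed from this PR).
generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    assistant_model=assistant,
    assistant_encoder_outputs=encoder_outputs,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```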

@patrickvonplaten marked this pull request as draft October 9, 2023 17:53
if n_matches == int(assistant_model.max_assistant_tokens):
    assistant_model.max_assistant_tokens += 2.0
else:
    assistant_model.max_assistant_tokens = max(1.0, assistant_model.max_assistant_tokens - 1.0)
patrickvonplaten (Contributor, Author):

The heuristic to increase / decrease the number of "look-ahead" tokens doesn't work well for Whisper; can we maybe allow the user to disable it, e.g. via a config attribute?

@@ -4391,19 +4395,16 @@ def assisted_decoding(
# `new_token_len` can be 1 or 2 (next token in assistant + last token picked by the larger model)
new_token_len = candidate_input_ids.shape[1] - prev_seq_len
assist_inputs = candidate_input_ids[:, -new_token_len:]
assist_attn = torch.ones_like(candidate_input_ids)
patrickvonplaten (Contributor, Author):

Do we really need this, @gante? Allocating new memory here every time leads to slowdowns that are not insignificant for Distil-Whisper.

gante (Member):

Yes, this makes sense to remove, since it is the default attention mask! 👍
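
A small self-contained check of that point; the GPT-2 checkpoint is used purely for illustration, and the claim only concerns the unpadded case:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Without padding, omitting `attention_mask` is equivalent to passing the explicit
# all-ones mask that `torch.ones_like(...)` would have allocated, because the model
# builds that default mask internally.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("speculative decoding", return_tensors="pt").input_ids

with torch.no_grad():
    logits_default = model(input_ids).logits
    logits_explicit = model(input_ids, attention_mask=torch.ones_like(input_ids)).logits

print(torch.allclose(logits_default, logits_explicit))  # expected: True
```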

@HuggingFaceDocBuilderDev commented Oct 9, 2023

The documentation is not available anymore as the PR was closed or merged.

@@ -4484,18 +4485,18 @@ def assisted_decoding(
# 2.2. Process the new logits
new_logits = outputs.logits[:, -candidate_length - 1 :] # excludes the input prompt if present
if len(logits_processor) > 0:
-    for i in range(candidate_length):
+    for i in range(candidate_length + 1):
patrickvonplaten (Contributor, Author):

This was a bug previously: we forgot to apply the logits processors to the last logit here.
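
A toy shape sketch of the off-by-one, with made-up dimensions; the variable names mirror the diff above:

```python
import torch

# The main model returns logits for the `candidate_length` drafted tokens plus one
# extra position, so `new_logits` has `candidate_length + 1` steps and the logits
# processors must visit all of them.
batch_size, seq_len, vocab_size, candidate_length = 1, 20, 32, 5
outputs_logits = torch.randn(batch_size, seq_len, vocab_size)

new_logits = outputs_logits[:, -candidate_length - 1 :]  # same slicing as in the diff
assert new_logits.shape[1] == candidate_length + 1       # hence `range(candidate_length + 1)`
```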

gante (Member):

Good catch 👀

    selected_tokens = torch.multinomial(probs[0, :, :], num_samples=1).squeeze(1)[None, :]
else:
    selected_tokens = new_logits[:, -candidate_length - 1 :, :].argmax(dim=-1)
patrickvonplaten (Contributor, Author):

This is unnecessary here, as `new_logits` is already sliced to the last `candidate_length + 1` positions.

@patrickvonplaten marked this pull request as ready for review October 10, 2023 12:20
@gante (Member) left a comment:

Fantastic, thank you for the upgrades @patrickvonplaten 🔥

Only added two minor, optional nits.

@@ -227,6 +227,20 @@ class GenerationConfig(PushToHubMixin):
decoder_start_token_id (`int`, *optional*):
If an encoder-decoder model starts decoding with a different token than *bos*, the id of that token.

> Generation parameters exclusive to [assistant generation](https://arxiv.org/abs/2211.17192)

max_assistant_tokens (`int`, *optional*, defaults to 5):
@gante (Member) commented Oct 10, 2023:

Perhaps we can take the chance to give a better name to this variable: assistant_tokens or similar. max_assistant_tokens implies that the assistant will never cross this limit but, as we can see in max_assistant_tokens_schedule (which should also be renamed accordingly), that is not true :)

Poor original naming choice by me :D

patrickvonplaten (Contributor, Author):

Sounds good!

-    assistant_model.max_assistant_tokens = 5  # this value, which will be updated, persists across calls
+    if hasattr(assistant_model, "max_assistant_tokens"):
+        warnings.warn(
+            "Setting `max_assistant_tokens` via `assistant_model.max_assistant_tokens` is deprecated and will be removed in v5. Make sure to set `max_assistant_tokens` via the generation_config instead.",
gante (Member):

Perhaps we can deprecate this earlier (like in v4.37)?

I haven't seen users fiddling with this internal variable :)

patrickvonplaten (Contributor, Author):

Ok for me!
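
A minimal sketch of the configuration path the deprecation warning points to, assuming the renamed fields discussed in this thread (`num_assistant_tokens`, `num_assistant_tokens_schedule`) are the ones that ship:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    num_assistant_tokens=5,                     # renamed from `max_assistant_tokens`
    num_assistant_tokens_schedule="heuristic",  # the +2 / -1 schedule shown earlier
)

# At generation time the config is passed alongside the assistant, e.g.:
# model.generate(inputs, assistant_model=assistant, generation_config=generation_config)
```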


@patrickvonplaten changed the title from "[Assistant Generation] Improve enc dec" to "[Assistant Generation] Improve Encoder Decoder" on Oct 11, 2023
@patrickvonplaten (Contributor, Author) commented:

The failing Hub test seems to be flaky.

This PR is ready for a final review.

@@ -544,7 +544,11 @@ def forward(
inputs_embeds = self.embed_tokens(input) * self.embed_scale

if attention_mask is None:
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.bool, device=inputs_embeds.device)
patrickvonplaten (Contributor, Author):

This was a bug previously: the attention_mask length should equal the inputs_embeds length plus the past_key_values length.
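
A shape-only sketch of what the fix means, with made-up sizes; no model is involved:

```python
import torch

# With cached key/values, the default attention mask has to cover the cached
# positions as well, not just the new `inputs_embeds` positions.
batch_size, new_tokens, past_length, hidden_size = 2, 1, 7, 16
inputs_embeds = torch.randn(batch_size, new_tokens, hidden_size)

# Previous default: only as long as the new tokens.
old_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.bool)

# Fixed default: covers past_key_values length + new tokens.
new_mask = torch.ones(batch_size, past_length + new_tokens, dtype=torch.bool)

print(old_mask.shape, new_mask.shape)  # torch.Size([2, 1]) torch.Size([2, 8])
```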

@ArthurZucker (Collaborator) left a comment:

Very clean, thanks for improving the performance!

Comment on lines +240 to +241
- `"_heuristic_"`: When all _speculative_ tokens are correct, increase `num_assistant_tokens` by 2, else reduce by 1
ArthurZucker (Collaborator):

Wondering if the schedule parameters should be hard-coded, but this is fine for me.

@patrickvonplaten merged commit da69de1 into main on Oct 11, 2023
19 of 21 checks passed
@patrickvonplaten deleted the improve_assistant_generation_enc_dec branch October 11, 2023 13:52
helboukkouri pushed a commit to helboukkouri/transformers that referenced this pull request Oct 16, 2023
* [Assistant Generation] Improve enc dec

* save more

* Fix logit processor checks

* Clean

* make style

* fix deprecation

* fix generation test

* Apply suggestions from code review

* fix biogpt

* make style
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023