
fix assisted decoding assistant model inputs #27503

Merged · 23 commits · Nov 27, 2023

Conversation

jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented Nov 15, 2023

In the last PR, we didn't account for the decoder_attention_mask when updating model_kwargs (see here). This PR fixes it.

Furthermore, this PR also uses a cleaner way to process the assistant model's inputs.
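
For context, a minimal sketch of the kind of update this refers to (hypothetical tensors, not the exact code in generation/utils.py): when candidate tokens are appended to decoder_input_ids during assisted decoding, the decoder_attention_mask in model_kwargs has to be extended to the same length, otherwise the two go out of sync.

import torch

# Hypothetical state during one assisted-decoding round (encoder-decoder model).
decoder_input_ids = torch.tensor([[0, 5, 9]])                 # 3 tokens generated so far
decoder_attention_mask = torch.ones_like(decoder_input_ids)   # shape (1, 3)

candidate_ids = torch.tensor([[7, 2]])                        # 2 tokens proposed by the assistant

# Append the candidates to the decoder inputs...
decoder_input_ids = torch.cat([decoder_input_ids, candidate_ids], dim=-1)

# ...and extend the mask accordingly, which is the step this PR makes sure happens.
decoder_attention_mask = torch.cat(
    [decoder_attention_mask, decoder_attention_mask.new_ones((1, candidate_ids.shape[1]))], dim=-1
)
print(decoder_input_ids.shape, decoder_attention_mask.shape)  # torch.Size([1, 5]) torch.Size([1, 5])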

Hi @gante, could you please review this PR? Thanks!

@jiqing-feng changed the title from "fix assisted decoding attention_cat" to "fix assisted decoding assistant model inputs" on Nov 15, 2023
Member

@gante gante left a comment


@jiqing-feng thank you for promptly working on a fix! 🤗

I suspected this could be an issue, but I was a bit lazy and decided to rely on the tests. The catch is that the tests are stochastic (they are sample-based), so we had a lucky run in the past CI.

Regarding the PR itself: a few minor nits, and then it should be ready to go!

src/transformers/generation/utils.py (4 review comments, outdated and resolved)
@gante
Member

gante commented Nov 15, 2023

@jiqing-feng If possible, I would also like to revert these temporary changes in this PR :)

@ArthurZucker
Collaborator

🤗 Thanks for the fix! We had to skip it in #27508 as well (only the relevant test).

@jiqing-feng
Contributor Author

Hi @gante @ArthurZucker, I think I have addressed all the comments and also added the tests you mentioned. Could you please review it? Thanks!

BTW, the failing test is not related to my changes.

Member

@gante gante left a comment


Thank you for iterating 👍

@@ -759,10 +759,6 @@ def test_pt_tf_model_equivalence(self, allow_missing_keys=True):
# Allow missing keys since TF doesn't cache the sinusoidal embeddings in an attribute
super().test_pt_tf_model_equivalence(allow_missing_keys=allow_missing_keys)

@unittest.skip("Test failing, @RocketNight is looking into it")
Member


Actually, we need to keep this skip; removing it is what is causing the failure in CI!

Contributor Author


Done.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding!

Before we merge, can we update the non-slow test to be more robust? If the past CI was green because of a lucky sample, how do we know that this PR fixes it and that this wasn't just another lucky run? E.g., can we set a seed which we know causes a failure on main and passes here?

@gante
Member

gante commented Nov 16, 2023

@amyeroberts I'll have to think harder about assisted generation test robustness, as there are two conflicting effects in place:

  1. In theory, assisted generation should yield the exact same outputs
  2. In practice, due to the matrix multiplication being shape-dependent (see here), there will be tiny fluctuations (a minimal sketch of this effect follows this comment). With random models, this means that the odds of a simple assisted vs non-assisted output check failing are high.

On top of that, pinning a seed to a previous failure does not prevent bad failure checks in future models or flags.

My suggestion would be: I'll work on test robustness today, and we merge this fix as is. WDYT?
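
A minimal sketch of the shape-dependent matrix-multiplication effect described in point 2 above (assuming a PyTorch backend; whether the difference is exactly zero depends on the hardware and on the kernels selected for each shape):

import torch

torch.manual_seed(0)
x = torch.randn(8, 256)
w = torch.randn(256, 256)

# The same first row, multiplied once as part of a batch and once on its own.
batched = (x @ w)[:1]
single = x[:1] @ w

# Different shapes can dispatch to different kernels that accumulate in a different
# order, so the two results may differ by a tiny amount (or be identical on some setups).
print((batched - single).abs().max())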

@amyeroberts
Collaborator

In practice, due to the matrix multiplication being shape-dependent (see #25420 (comment)), there will be tiny fluctuations. With random models, this means that the odds of a simple assisted vs non-assisted output check failing are high.

For my own understanding, why wouldn't a seed resolve the randomness issues here? I'm guessing the tests are using hf-internal-testing/tiny-random-model-name, which can change?

On top of that, pinning a seed to a previous failure does not prevent bad failure checks in future models or flags.

Agreed - but it should make sure that this one passes! For any future models or flags we should add new tests.

In terms of tests to add, this relates back to my previous request here. It seems that the previous PR broke things for a specific model type (encoder-decoder). Are there tests we can add, ones that do not rely on randomness, that make sure the API itself works?
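
One possible shape for such a randomness-free check (a hypothetical sketch, not an existing test in the suite; the tiny checkpoint name is assumed for illustration). It only exercises the assisted-generation API on an encoder-decoder model and asserts on shapes rather than token values:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical smoke test: assisted generation with an encoder-decoder main model
# should run end-to-end, regardless of which tokens get sampled.
checkpoint = "hf-internal-testing/tiny-random-t5"  # assumed tiny checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
assistant = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer(["translate: hello world"], return_tensors="pt")
out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=5)

assert out.shape[0] == 1  # the API worked for the encoder-decoder case; no value comparison needed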

@patrickvonplaten
Contributor

Hey @jiqing-feng,

There is sadly still a bug with speculative decoding. The following doesn't work:

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM
from datasets import load_dataset
import time
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

assistant_model_id = "distil-whisper/distil-large-v2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

input_features = processor(sample["array"], return_tensors="pt").input_features.to(device).to(torch_dtype)

# warm-up
_ = model.generate(input_features, assistant_model=assistant_model)

start_time = time.time()
out = model.generate(input_features, assistant_model=assistant_model)
# out = model.generate(input_features)
print(time.time() - start_time)

@gante
Member

gante commented Nov 16, 2023

@amyeroberts There is something odd here. We have a mixin test that should be catching API issues. I'm looking into it to attempt to figure out what's wrong.

@patrickvonplaten
Contributor

The following code snippet also needs to work:

- assistant_model_id = "distil-whisper/distil-large-v2"
- assistant_model = AutoModelForCausalLM.from_pretrained(
-    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
+ assistant_model_id = "openai/whisper-tiny"
+ assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
+)

But I think it already does.

@jiqing-feng
Contributor Author

Hi @patrickvonplaten

I ran the following script on my CPU device, and it works well.

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers import AutoModelForCausalLM
from datasets import load_dataset
import time
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

assistant_model_id = "openai/whisper-tiny"
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
   assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

input_features = processor(sample["array"], return_tensors="pt").input_features.to(device).to(torch_dtype)

# warm-up
_ = model.generate(input_features, assistant_model=assistant_model)

start_time = time.time()
out = model.generate(input_features, assistant_model=assistant_model)
# out = model.generate(input_features)
print(time.time() - start_time)

@patrickvonplaten
Contributor

patrickvonplaten commented Nov 16, 2023

Hey @jiqing-feng,

Thanks so much for quickly jumping on fixing the problem here 🙏

Sadly, it still doesn't fix Whisper distillation as per the code snippet above. To make sure Distil-Whisper works again on "main", we have now reverted the PR in #27523 and also added two slow tests that should be run every time we change assisted decoding:

RUN_SLOW=1 pytest tests/models/whisper/test_modeling_whisper.py -k "distil" -sv

It would be amazing if you could open a new PR, rebased onto the current "main", with all your nice changes, in which all fast tests as well as the slow tests pass:

RUN_SLOW=1 pytest tests/models/whisper/test_modeling_whisper.py -k "distil" -sv

Very sorry about the duplicated work here

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 16, 2023

Hi @patrickvonplaten, there is no need to open a new PR; I have fixed the conflicts.

There might be a mistake (here): I see that you use distil-whisper/distil-large-v2 as the assistant model, but you load it with WhisperForCausalLM even though it is a WhisperForConditionalGeneration model. Since distil-whisper/distil-large-v2 is an encoder-decoder model, I load it with WhisperForConditionalGeneration (which is also the original architecture listed in the model card). After this change, I can successfully run RUN_SLOW=1 pytest tests/models/whisper/test_modeling_whisper.py -k "distil" -sv on my current changes.
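
For reference, a minimal sketch of the loading change described above (assuming torch_dtype is defined as in the earlier snippets; the model card for distil-whisper/distil-large-v2 lists WhisperForConditionalGeneration as its architecture):

import torch
from transformers import WhisperForConditionalGeneration

torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# distil-whisper/distil-large-v2 keeps the full encoder-decoder layout (with a shrunken decoder),
# so it loads with the seq2seq class rather than WhisperForCausalLM.
assistant_model = WhisperForConditionalGeneration.from_pretrained(
    "distil-whisper/distil-large-v2",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)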

@gante
Member

gante commented Nov 16, 2023

Hi @jiqing-feng 👋

I have strengthened the test suite for assisted generation and did a small post mortem on why we didn't catch the issue in our tests, in this PR.

Let's merge that PR first and then rebase here, to ensure we don't break CI again 🤗

Again, apologies on our end for not having robust enough test coverage!

@gante
Member

gante commented Nov 16, 2023

@jiqing-feng the improved assisted generation tests were merged 🤗

@jiqing-feng
Contributor Author

Hi @gante, I have also updated my code base. Could you please merge this PR? Thanks!

@gante
Member

gante commented Nov 17, 2023

Hi @jiqing-feng 👋

I got it working on my end, without the change you added to the Whisper test (which we must revert). It is a non-trivial set of changes, so I'm going to detail the entire diff :)

  1. Remove the self._extend_attention_mask and self._extend_token_type_ids functions from the GenerationMixin
  2. Replace them with the following stand-alone functions, which can be added at the bottom of the file:
def _prepare_attention_mask(model_kwargs: Dict[str, Any], new_length: int, is_encoder_decoder: bool) -> Dict[str, Any]:
    """Expands or crops the model's mask for decoding purposes, to the defined length"""

    mask_key = "decoder_attention_mask" if is_encoder_decoder else "attention_mask"
    if mask_key not in model_kwargs:
        return model_kwargs

    mask = model_kwargs[mask_key]
    mask_length_diff = new_length - mask.shape[1]

    if mask_length_diff < 0:
        model_kwargs[mask_key] = mask[:, :mask_length_diff]
    elif mask_length_diff > 0:
        model_kwargs[mask_key] = torch.cat([mask, mask.new_ones((mask.shape[0], mask_length_diff))], dim=-1)
    return model_kwargs


def _prepare_token_type_ids(model_kwargs: Dict[str, Any], new_length: int) -> Dict[str, Any]:
    """Expands or crops the model's token_type_ids for decoding purposes, to the defined length"""
    if "token_type_ids" not in model_kwargs or model_kwargs["token_type_ids"] is None:
        return model_kwargs

    token_type_ids = model_kwargs["token_type_ids"]
    final_token_type = token_type_ids[:, -1].unsqueeze(-1)
    type_length_diff = new_length - token_type_ids.shape[1]

    if type_length_diff < 0:
        # assign back to model_kwargs so the crop actually takes effect
        model_kwargs["token_type_ids"] = token_type_ids[:, :type_length_diff]
    elif type_length_diff > 0:
        token_type_copies = final_token_type.repeat(1, type_length_diff)
        model_kwargs["token_type_ids"] = torch.cat([model_kwargs["token_type_ids"], token_type_copies], dim=-1)
    return model_kwargs
  3. Replace the code after # Update assistant_kwargs for the assistant's next round of generations with:
            assistant_kwargs = _prepare_attention_mask(
                assistant_kwargs, new_cur_len, assistant_model.config.is_encoder_decoder
            )
            assistant_kwargs = _prepare_token_type_ids(assistant_kwargs, new_cur_len)
  4. Replace the code after # 2.1. Prepare the model inputs with:
            candidate_kwargs = copy.copy(model_kwargs)
            candidate_kwargs = _prepare_attention_mask(
                candidate_kwargs, candidate_input_ids.shape[1], self.config.is_encoder_decoder
            )
            candidate_kwargs = _prepare_token_type_ids(candidate_kwargs, candidate_input_ids.shape[1])

            model_inputs = self.prepare_inputs_for_generation(candidate_input_ids, **candidate_kwargs)
  5. Replace the code after # prepare assistant model's keys of inputs with:
        assistant_kwargs = copy.copy(model_kwargs)
        if assistant_model.config.is_encoder_decoder:
            # both are encoder-decoder
            input_ids_key = "decoder_input_ids"
            attention_key = "decoder_attention_mask"
            assistant_kwargs["encoder_outputs"] = assistant_kwargs.pop("assistant_encoder_outputs")
        elif "assistant_encoder_outputs" in assistant_kwargs:
            # special case for encoder-decoder with decoder-only assistant (like DistilWhisper)
            input_ids_key = "input_ids"
            attention_key = "attention_mask"
            assistant_kwargs["attention_mask"] = assistant_kwargs.get(
                "decoder_attention_mask",
                torch.ones((input_ids.shape[0], 1), device=input_ids.device, dtype=torch.long),
            )
            assistant_kwargs["encoder_outputs"] = assistant_kwargs.pop("assistant_encoder_outputs")
        else:
            # both are decoder-only
            input_ids_key = "input_ids"
            attention_key = "attention_mask"

All these changes will make assisted_generation compatible with all use cases, even the more complex DistilWhisper 🤗
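
As a quick illustration of how the stand-alone helpers above behave (a minimal sketch, assuming _prepare_attention_mask is defined exactly as in step 2):

import torch

# Hypothetical decoder-only kwargs with a length-5 attention mask.
model_kwargs = {"attention_mask": torch.ones(1, 5, dtype=torch.long)}

# Expanding to length 8 appends ones for the newly generated tokens...
expanded = _prepare_attention_mask(dict(model_kwargs), new_length=8, is_encoder_decoder=False)
print(expanded["attention_mask"].shape)  # torch.Size([1, 8])

# ...while cropping to length 3 discards the mask positions of rejected candidates.
cropped = _prepare_attention_mask(dict(model_kwargs), new_length=3, is_encoder_decoder=False)
print(cropped["attention_mask"].shape)  # torch.Size([1, 3])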

@jiqing-feng
Contributor Author

Hi @gante, thanks for your review. I have applied all the changes you proposed. Could you please check and merge it? Thanks!

Member

@gante gante left a comment


Perfect, thank you for working on the changes 💪

@jiqing-feng if possible, it would be nice to delete the now unused _extend_attention_mask and _extend_token_type_ids functions :)

@amyeroberts I've confirmed on my end that all relevant tests are passing:

  1. RUN_SLOW=1 py.test tests/models/whisper/ -k speculative
  2. py.test tests/ -k test_assisted_decoding_matches_greedy_search
  3. py.test tests/ -k test_assisted_decoding_sample

@jiqing-feng
Contributor Author

Perfect, thank you for working on the changes 💪

@jiqing-feng if possible, it would be nice to delete the now unused _extend_attention_mask and _extend_token_type_ids functions :)

@amyeroberts I've confirmed on my end that all relevant tests are passing:

  1. RUN_SLOW=1 py.test tests/models/whisper/ -k speculative
  2. py.test tests/ -k test_assisted_decoding_matches_greedy_search
  3. py.test tests/ -k test_assisted_decoding_sample

Do you mean delete these 2 functions and replace all _extend_xxx functions with our new _prepare_xxx functions?

@gante
Member

gante commented Nov 21, 2023

@jiqing-feng yes; _extend_attention_mask and _extend_token_type_ids are not used anywhere in the code.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for iterating!

@gante Thanks for running the tests! Do these also cover the tests that were breaking previously for whisper? Happy to merge once we know it's whisper compatible 🤗

@gante
Member

gante commented Nov 27, 2023

Do these also cover the tests that were breaking previously for whisper? Happy to merge once we know it's whisper compatible 🤗

Yes, it is RUN_SLOW=1 py.test tests/models/whisper/ -k speculative in the list of tests above :) Merging!

@gante gante merged commit 1d7f406 into huggingface:main Nov 27, 2023
20 checks passed
@gante
Member

gante commented Nov 27, 2023

@jiqing-feng thank you for bearing with us 🤗

@amyeroberts
Collaborator

@gante D'oh sorry - PR blindness 🤦 Thanks for merging and thanks again @jiqing-feng for all the work iterating on this PR!

@jiqing-feng jiqing-feng deleted the assisted branch December 13, 2023 07:03