Cache: dynamic cache with cross attention and UMT5 `Cache` support #28185

gante · 2023-12-21T15:47:55Z

What does this PR do?

#28065 was becoming messy due to all the Bart "copied from" dependencies, so this PR is a tiny version of it.

This PR:

Introduces DynamicCacheWithCrossAttention, which expands DynamicCache [cache object equivalent to the previous past_key_values input/output] with the ability to hold a cross-attention cache. This design was intentional: most LLMs (and now even multimodal models) tend to be decoder-only, so this separation keeps the cache class for decoder-only models simpler. It also enables us to be more strict -- in Cache: Bart and related architectures support Cache objects #28065 I've caught an unintended cache deletion in Whisper thanks to the increased specificity!
Adds Cache support to modeling_umt5.py, which serves as a way to test whether DynamicCacheWithCrossAttention is equivalent to the previous cache. These changes are the equivalent of the modeling changes in Generate: New Cache abstraction and Attention Sinks support #26681, but for encoder-decoder models.
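
To make the separation above concrete, here is a minimal sketch of the idea, not the code in this PR. The class names, method names, and attribute names are hypothetical, and plain Python lists stand in for tensors. It assumes layers are visited in order, and that the cross-attention entries are written once per layer because the encoder output does not change during decoding.

```python
from typing import Any, List, Tuple


class SketchDynamicCache:
    """Decoder-only cache: per-layer key/value states that grow with each new token."""

    def __init__(self) -> None:
        self.key_cache: List[List[Any]] = []
        self.value_cache: List[List[Any]] = []

    def update(self, key_states: List[Any], value_states: List[Any], layer_idx: int) -> Tuple[List[Any], List[Any]]:
        # Assumption: layers are updated in order, so a new layer is always appended at the end.
        if len(self.key_cache) <= layer_idx:
            self.key_cache.append(list(key_states))
            self.value_cache.append(list(value_states))
        else:
            self.key_cache[layer_idx].extend(key_states)
            self.value_cache[layer_idx].extend(value_states)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    def get_seq_length(self, layer_idx: int = 0) -> int:
        if len(self.key_cache) <= layer_idx:
            return 0
        return len(self.key_cache[layer_idx])


class SketchDynamicCacheWithCrossAttention(SketchDynamicCache):
    """Adds a second, write-once store for the cross-attention key/value states."""

    def __init__(self) -> None:
        super().__init__()
        self.cross_attention_key_cache: List[List[Any]] = []
        self.cross_attention_value_cache: List[List[Any]] = []

    def has_cached_cross_attentions(self, layer_idx: int) -> bool:
        return len(self.cross_attention_key_cache) > layer_idx

    def update_cross_attention(self, key_states: List[Any], value_states: List[Any], layer_idx: int) -> Tuple[List[Any], List[Any]]:
        # Only the first call for a layer stores anything; later calls return the stored states.
        if not self.has_cached_cross_attentions(layer_idx):
            self.cross_attention_key_cache.append(list(key_states))
            self.cross_attention_value_cache.append(list(value_states))
        return self.cross_attention_key_cache[layer_idx], self.cross_attention_value_cache[layer_idx]


cache = SketchDynamicCacheWithCrossAttention()
cache.update_cross_attention(["enc_k0", "enc_k1"], ["enc_v0", "enc_v1"], layer_idx=0)
cache.update(["k0"], ["v0"], layer_idx=0)
keys, values = cache.update(["k1"], ["v1"], layer_idx=0)
```

Keeping the cross-attention store in a subclass means a decoder-only model never sees those attributes, which is the strictness argument made above.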

Local tests run:

RUN_SLOW=1 py.test tests/models/umt5/test_modeling_umt5.py -vv [Note: adds a test to ensure we keep the same results as in main]

HuggingFaceDocBuilderDev · 2023-12-21T16:10:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

gante · 2023-12-21T17:59:17Z

src/transformers/models/umt5/modeling_umt5.py

@@ -240,41 +248,71 @@ def compute_bias(self, query_length, key_length, device=None):
        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
        return values

+    def _prepare_key_values(


This abstraction does not look particularly useful here. However, for models with multiple attention implementations, this abstraction is useful: all attention implementations can share it!

(e.g. in Bart the benefits are clear)

gante · 2023-12-21T18:01:17Z

src/transformers/models/umt5/modeling_umt5.py

@@ -481,6 +501,7 @@ class UMT5PreTrainedModel(PreTrainedModel):
    supports_gradient_checkpointing = True
    _no_split_modules = ["UMT5Block"]
    _keep_in_fp32_modules = ["wo"]
+    _supports_cache_class = True


This enables the test_new_cache_format test -> converting back and forth between the new cache and the legacy cache with cross attention is tested
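
As a rough illustration of what such a round trip involves (a sketch, not the test itself): the legacy format for encoder-decoder models is a tuple with one entry per layer, each holding the self-attention key/value and the cross-attention key/value. The `SketchCache` container and the `from_legacy` / `to_legacy` helpers below are hypothetical names, and strings stand in for tensors.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

# One legacy entry per layer: (self_key, self_value, cross_key, cross_value)
LegacyLayer = Tuple[Any, Any, Any, Any]


@dataclass
class SketchCache:
    """Hypothetical stand-in for a cache object with a cross-attention store."""

    self_keys: List[Any] = field(default_factory=list)
    self_values: List[Any] = field(default_factory=list)
    cross_keys: List[Any] = field(default_factory=list)
    cross_values: List[Any] = field(default_factory=list)


def from_legacy(past_key_values: Tuple[LegacyLayer, ...]) -> SketchCache:
    """Unpack the per-layer 4-tuples into the four per-layer lists."""
    cache = SketchCache()
    for self_k, self_v, cross_k, cross_v in past_key_values:
        cache.self_keys.append(self_k)
        cache.self_values.append(self_v)
        cache.cross_keys.append(cross_k)
        cache.cross_values.append(cross_v)
    return cache


def to_legacy(cache: SketchCache) -> Tuple[LegacyLayer, ...]:
    """Re-pack the four per-layer lists into one 4-tuple per layer."""
    return tuple(zip(cache.self_keys, cache.self_values, cache.cross_keys, cache.cross_values))


legacy = (
    ("k_layer0", "v_layer0", "cross_k_layer0", "cross_v_layer0"),
    ("k_layer1", "v_layer1", "cross_k_layer1", "cross_v_layer1"),
)
round_tripped = to_legacy(from_legacy(legacy))
```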

gante · 2023-12-21T18:03:58Z

tests/models/umt5/test_modeling_umt5.py

@@ -560,6 +560,27 @@ def test_training_gradient_checkpointing_use_reentrant_false(self):
 @require_sentencepiece
 @require_tokenizers
 class Umt5IntegrationTest(unittest.TestCase):
+    def test_generation(self):


Ensures there is no regression from main

I've double-checked that we get the same values in main. I've also checked the results with and without cache, in both main and this PR.

gante · 2023-12-21T18:06:47Z

src/transformers/models/umt5/modeling_umt5.py

+            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
+            if (
+                max_cache_length is not None
+                and decoder_attention_mask is not None
+                and cache_length + decoder_input_ids.shape[1] > max_cache_length
+            ):
+                decoder_attention_mask = decoder_attention_mask[:, -max_cache_length:]


logic copied from llama + sink cache -> this makes the model ready for caches like sink cache
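
A small sketch of the same condition on plain Python lists, to show the effect on the mask shape. The helper name `crop_attention_mask` is hypothetical, and nested lists stand in for a `(batch, sequence)` tensor. The assumption is that a bounded cache keeps at most `max_cache_length` positions, so the mask has to be trimmed to its last `max_cache_length` columns to line up with what the cache still holds.

```python
from typing import List, Optional


def crop_attention_mask(
    attention_mask: Optional[List[List[int]]],
    cache_length: int,
    num_new_tokens: int,
    max_cache_length: Optional[int],
) -> Optional[List[List[int]]]:
    """Keep only the last `max_cache_length` mask columns when the cache would overflow."""
    if (
        max_cache_length is not None
        and attention_mask is not None
        and cache_length + num_new_tokens > max_cache_length
    ):
        # Same slicing as `mask[:, -max_cache_length:]`, applied row by row.
        return [row[-max_cache_length:] for row in attention_mask]
    return attention_mask


# 4 cached positions + 2 new tokens exceeds a limit of 4, so the mask is cropped to 4 columns.
mask = [[0, 1, 1, 1, 1, 1]]
cropped = crop_attention_mask(mask, cache_length=4, num_new_tokens=2, max_cache_length=4)

# An unbounded cache (max_cache_length=None) leaves the mask untouched.
untouched = crop_attention_mask(mask, cache_length=4, num_new_tokens=2, max_cache_length=None)
```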

ArthurZucker

#27931 will shamble things up 👿

github-actions · 2024-02-23T08:04:52Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

umt5 with cache

91b0fa0

add test, fix test

b5fdea7

gante requested a review from ArthurZucker December 21, 2023 17:56

gante marked this pull request as ready for review December 21, 2023 17:56

make test slow

194c77e

huggingface deleted a comment from github-actions bot Jan 23, 2024

ArthurZucker reviewed Jan 30, 2024

github-actions bot closed this Mar 3, 2024
