Fix VLM generation issues #32836
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -853,7 +853,7 @@ def forward(
         inputs_embeds = self.get_input_embeddings()(input_ids)

         # Merge text and images in prefill stage
-        if past_key_values is None:
+        if input_ids.shape[1] != 1:
```
Suggested change:

```diff
-        if input_ids.shape[1] != 1:
+        if inputs_embeds.shape[1] != 1:
```
Should we not check `inputs_embeds` instead?
Nope, we don't expect a user to pass `inputs_embeds` and pixel values at the same time, because for merging we need to slice the image-token positions out of `input_ids`. Yes, we have a `ValueError` for the case when both are passed, but that is only on the latest `main`.

UPDATE: Okay, this was causing test failures, so I fixed it to check `inputs_embeds` to make CI happy. This should be enough for the patch; we'll get the new VLMs in v4.45 and I'll try to add mixin tests before then.
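For readers following along, a minimal sketch (a hypothetical helper, not the actual modeling code) of why the `shape[1] != 1` check separates the prefill step from cached decoding:

```python
import torch

def is_prefill(inputs_embeds: torch.Tensor) -> bool:
    # Prefill passes the whole prompt (text plus image placeholder tokens)
    # at once, so the sequence dimension is > 1; cached decoding passes
    # only the single newly generated token.
    return inputs_embeds.shape[1] != 1

prompt_embeds = torch.randn(1, 20, 4096)  # prefill: full prompt
step_embeds = torch.randn(1, 1, 4096)     # decoding: one new token

assert is_prefill(prompt_embeds)
assert not is_prefill(step_embeds)
```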
```diff
@@ -910,7 +910,7 @@ def forward(
             pass

         # generation with cache, decoding stage
-        elif past_key_values is not None and (pixel_values is not None or pixel_values_videos is not None):
+        elif pixel_values is not None or pixel_values_videos is not None:
```
Not sure why this is changed: the first `if` is `inputs_embeds.shape[1] != 1:`, so that branch is prefill / a simple forward (and indeed merging should always be done regardless of generation). But then here you need to index the past key values, which don't always exist (for example, in a forward call with `use_cache=False`).
In a forward call with no cache, we go into the first `if` and never reach the `elif`. Yeah, there are kind of too many dependencies here, but it shouldn't cause an error. I'll add a parameterized case to the new tests (use-cache or no-cache).
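To illustrate the point, a hedged sketch of the branch routing under the two cache settings; `route` is a hypothetical stand-in for the real forward logic, simplified from the diff above:

```python
import torch

def route(inputs_embeds, past_key_values=None,
          pixel_values=None, pixel_values_videos=None):
    if inputs_embeds.shape[1] != 1:
        # Prefill or a plain forward call. With use_cache=False the full
        # sequence is passed on every call, so execution always lands here
        # and past_key_values is never indexed.
        return "merge text and image features"
    elif pixel_values is not None or pixel_values_videos is not None:
        # Reached only during cached decoding (sequence length == 1),
        # where past_key_values is guaranteed to exist.
        return "extend attention mask using past_key_values"
    return "plain text-only decoding step"

# use_cache=False style call: full sequence, no cache -> first branch.
assert route(torch.randn(1, 12, 16), pixel_values=object()) \
    == "merge text and image features"
# Cached decoding step: one new token, pixels present -> elif branch.
assert route(torch.randn(1, 1, 16), past_key_values=["kv"],
             pixel_values=object()) == "extend attention mask using past_key_values"
```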
* fix in one commit
* add parameterized
* fix tests
* fix test flakiness
* maybe that's why flaky
* style
* flakiness...

Co-authored-by: raushan <[email protected]>
What does this PR do?
Fixes generation for llava-next-video. Apparently it started failing after we moved to the cache class, but some parts of the code were not modified. I checked all the Llava models; the others are working since the check is done on a different condition.
Yes, we can start using `cache_position` and rely on that, but we should note that `cache_position` for VLMs will not be correct and will contain positions only for text tokens. Support for cache position will come in the next PR, which is in progress. We'll have to deprecate many things before we can get rid of the current checks to "merge or expand".
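To make the caveat concrete, an illustrative sketch (all numbers made up) of how positions tracked over text tokens alone diverge from the merged sequence length once an image placeholder is expanded:

```python
import torch

text_len = 10             # prompt tokens, including one <image> placeholder
num_image_features = 576  # patch features the placeholder expands into

cache_position = torch.arange(text_len)             # positions tracked over text only
merged_seq_len = text_len - 1 + num_image_features  # length the model actually sees

print(len(cache_position), "!=", merged_seq_len)    # 10 != 585
```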
This PR ports the changes from #32527 to a proper patch.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.