Expand inputs in processors for VLMs #30962
Conversation
        )

        return BatchFeature(data={**text_inputs, **image_inputs})

    def _get_number_of_features(self, height: int, width: int) -> int:
Mostly copied from TGI with minor changes in calculations for unpadding, otherwise it won't work for low resolution images
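For readers following along, here is a rough, self-contained sketch of the kind of feature-count computation being discussed. It is not the PR's exact code: the tiling factor, default crop size, and names are assumptions for illustration of how unpadding affects the number of image tokens.

```python
def get_number_of_features(height: int, width: int, patch_size: int = 14,
                           crop_height: int = 336, crop_width: int = 336) -> int:
    # Illustrative sketch only: count how many placeholder tokens one image expands
    # to when the vision tower produces a base view plus a high-resolution grid
    # (LLaVA-NeXT style). Defaults and the 2x2 tiling below are assumptions.
    patches_h = crop_height // patch_size
    patches_w = crop_width // patch_size

    # The base (low-resolution) view always contributes a full grid of patches.
    base_features = patches_h * patches_w

    # High-resolution grid: the image is tiled, then padding rows/columns added to
    # preserve the aspect ratio are removed again ("unpadding").
    current_h, current_w = patches_h * 2, patches_w * 2  # assume a 2x2 tiling
    if width / height > current_w / current_h:
        new_h = (height * current_w) // width
        current_h -= 2 * ((current_h - new_h) // 2)
    else:
        new_w = (width * current_h) // height
        current_w -= 2 * ((current_w - new_w) // 2)

    unpadded_features = current_h * current_w
    newline_features = current_h  # one newline token per remaining row
    return unpadded_features + newline_features + base_features
```

With integer math like this, very small images can round the unpadded grid down to almost nothing, which is why the calculation needs care for low-resolution inputs.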
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Looking forward to seeing this expanded to other VLMs! Some might be trickier - PaliGemma incorporates causal mask computation in the merge method, for instance (thought about that when reading) - but it makes sense that most of this should belong in the processor, not the modeling.
@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the next few weeks. What changed:
Thanks for working on this - will be great to have some of this logic unified!
Main comment is about how we set the required arguments for processing in the processor
@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos as those already have expansion in processing).
Wow - a big piece of work!
Overall looks good to me, just a few comments here and there. I'd like to have a second review from @molbap and a run on the slow tests for all the models touched here
            special_image_mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
            inputs_embeds[special_image_mask] = language_model_inputs.flatten()
        else:
            logger.warning_once(
We should make sure the official checkpoints have been updated this way
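As an illustration of the kind of checkpoint update being referred to, a hypothetical sketch is below; the attribute names, token string, and repo id are assumptions, and the gist linked in the warning is the authoritative guide.

```python
from transformers import AutoConfig, AutoProcessor

# Hypothetical example checkpoint; any BLIP-2 style repo would follow a similar pattern.
model_id = "Salesforce/blip2-opt-2.7b"

config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Register an <image> placeholder token and record its id in the config, so the model
# can scatter vision embeddings into the pre-expanded input ids.
processor.tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
config.image_token_index = processor.tokenizer.convert_tokens_to_ids("<image>")

# In practice the model's token embeddings would also need resizing to cover the new
# token; see the linked gist for the full procedure.
config.save_pretrained("./blip2-updated")
processor.save_pretrained("./blip2-updated")
```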
It's a biiig piece of work, nicely done, tests and all! I left a few comments on some things I didn't understand well + paligemma masking in particular
            logger.warning_once(
                "Expanding inputs for image tokens in BLIP-2 should be done in processing. "
                "Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. "
                "Using processors without these attributes in the config is deprecated and will throw an error in v4.44."
Maybe this, or another version number, as we're on the 4.44 dev version
"Using processors without these attributes in the config is deprecated and will throw an error in v4.44." | |
"Using processors without these attributes in the config is deprecated and will throw an error in a later version" |
Yes, will update accordingly when we get one step away from merging. I think two or three major versions from the current one will work :)
Co-authored-by: Pablo Montalvo <[email protected]>
This should be done - addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from the config.
Alright cool, taking a look soon! For the config option, a quick&dirty solution could be to do something like
LGTM! Some minor comments remaining but seems good
        final_labels = torch.full(
            (batch_size, sequence_length), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
        )
        final_labels = torch.where(input_ids != self.pad_token_id, labels, final_labels)
If the labels are not defined in the same way - i.e. not nulled where padding tokens are - it'll break BC for existing FT scripts, right?
Hmm, do we not expect users to ignore padding while preparing the labels? We can bring this back for BC but afaik the general rule is that LLMs don't mask out pad tokens in labels
Brought back the masking, and added a warning that users should mask labels themselves.
Thanks! During training, padding for uneven batches is definitely masked in labels, iiuc.
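For illustration, a minimal sketch of the label masking being discussed, assuming the conventional -100 ignore index and a tokenizer-produced attention mask (the toy values are made up):

```python
import torch

# Toy batch: two sequences padded to length 6; -100 is the conventional ignore index
# that the loss functions in transformers skip.
input_ids = torch.tensor([[5, 6, 7, 0, 0, 0],
                          [5, 6, 7, 8, 9, 2]])
attention_mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

labels = input_ids.clone()
labels[attention_mask == 0] = -100  # no loss is computed on padding positions
```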
    def test_inputs_embeds_matches_input_ids(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()

        for model_class in self.all_model_classes:
            model = model_class(config)
            model.to(torch_device)
            model.eval()

            inputs = self._prepare_for_class(inputs_dict, model_class)
            input_ids = inputs["input_ids"]
            del inputs["input_ids"]
            del inputs["pixel_values"]

            inputs_embeds = model.get_input_embeddings()(input_ids)

            with torch.no_grad():
                out_ids = model(input_ids=input_ids, **inputs)[0]
                out_embeds = model(inputs_embeds=inputs_embeds, **inputs)[0]
            self.assertTrue(torch.allclose(out_embeds, out_ids))
Interesting, same remark - would be worth having this in common, or an option that checks the needed inputs for a given model to do this del on-demand? (nit)
Yeah, I tried, but it seems like some models require both ids and pixels, while others require only one. Will have to think about unifying Vision2Seq model tests somehow, in the scope of another PR.
Shouldn't be a big problem to repeat the code in tests.
del inputs["input_ids"] | ||
del inputs["pixel_values"] | ||
|
||
wte = model.get_input_embeddings() |
same remark for wte and the transformers version warning, to modify before merge!
Co-authored-by: Pablo Montalvo <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
Looks great - thanks for handling all of this!
        qformer_config=None,
        text_config=None,
        num_query_tokens=32,
        image_token_index=None,
index or token_id? Index would indicate a specific location, but the logic in 1776 looks like it's matching token_ids
It's the token index, same as in all LLaVA models.
I'll run slow tests and check everything is okay, will merge sometime next week.
* let it be
* draft
* should not have changed
* add warnings
* fix & add tests
* fix tests
* ipnuts embeds cannot be passed with pixels
* more updates
* paligemma ready!
* minor typos
* update blip-2
* fix tests & raise error
* docstring
* add blip2 test
* tmp
* add image seq length to config
* update docstring
* delete
* fix tests
* fix blip
* fix paligemma
* out-of-place scatter
* add llava-next-video
* Update src/transformers/models/blip_2/modeling_blip_2.py
  Co-authored-by: Pablo Montalvo <[email protected]>
* remove tmp
* codestyle
* nits
* more nits
* remove overriding in tests
* comprehension when merging video
* fix-copies
* revert changes for embeds test
* fix tests after making comprehension
* Update src/transformers/models/blip_2/processing_blip_2.py
  Co-authored-by: Pablo Montalvo <[email protected]>
* Update src/transformers/models/blip_2/processing_blip_2.py
  Co-authored-by: Pablo Montalvo <[email protected]>
* more updates
* fix tests
---------
Co-authored-by: Pablo Montalvo <[email protected]>
What does this PR do?
Fixes #30809. This PR moves the _merge_inputs_with_vision_embeds logic into processing, making VLMs more versatile in terms of generation strategies. All models were tested locally with different batch sizes and image resolutions; generation is the same as it was before the changes. The main idea is to compute the sequence length for the image features inside the processing files and expand the input ids by repeating the special image token. The same is already done for IDEFICS in transformers.
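To make the idea concrete, here is a minimal sketch of what such an expansion can look like in a processor. The helper name, token string, and sequence length are assumptions for illustration; each model's processor computes its own image sequence length.

```python
def expand_image_tokens(text: str, image_seq_length: int, image_token: str = "<image>") -> str:
    # Replace each single <image> placeholder the user wrote with as many copies as the
    # vision tower will produce embeddings for, so the tokenized input ids already
    # contain one slot per image feature.
    return text.replace(image_token, image_token * image_seq_length)

prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
expanded = expand_image_tokens(prompt, image_seq_length=576)  # e.g. a 24x24 patch grid
# The tokenizer then encodes `expanded`, and the model scatters image embeddings into
# the positions where input_ids == config.image_token_index.
```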