Adds VIP-llava to transformers #27932
Conversation
@@ -92,12 +92,19 @@ class LlavaCausalLMOutputWithPast(ModelOutput):
class LlavaMultiModalProjector(nn.Module):
    def __init__(self, config: LlavaConfig):
        super().__init__()
        if config.projector_layernorm:
not really transformers philosophy :|
I was thinking we could always add a layer norm and just set its parameters so it acts as the identity for the other checkpoints? (Kind of a hack, but it might stabilize training / fine-tuning?)
Agree, Sylvain would have never allowed this, so let's keep it that way.
New paper = new model = no new config attributes, but rather a separate model with copied from. I know that philosophy is sometimes a bit extreme, but it has always worked out in the past.
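To make that suggestion concrete, here is a minimal sketch of what a separate ViP-LLaVA projector could look like, with the layer norm baked in rather than gated behind a new Llava config attribute. The class name and the config fields used here (`vision_feature_layers`, `projector_layernorm_eps`, the nested `vision_config` / `text_config` sizes) are assumptions for illustration, not necessarily the final API:

```python
import torch.nn as nn


class VipLlavaMultiModalProjector(nn.Module):
    # Sketch only: config attribute names below are assumed, and the activation
    # is hard-coded to GELU instead of being read from the config.
    def __init__(self, config):
        super().__init__()
        num_feature_layers = len(config.vision_feature_layers)
        # Layer norm over the concatenated vision hidden states
        self.projector_layernorm = nn.LayerNorm(
            num_feature_layers * config.vision_config.hidden_size,
            eps=config.projector_layernorm_eps,
        )
        self.linear_1 = nn.Linear(
            num_feature_layers * config.vision_config.hidden_size,
            config.text_config.hidden_size,
        )
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size)

    def forward(self, hidden_states):
        hidden_states = self.projector_layernorm(hidden_states)
        hidden_states = self.linear_1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states
```

With this approach, `LlavaMultiModalProjector` stays untouched and the new behaviour lives entirely in the copied model.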
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in [-2, -5, -8, -11, 6]]
image_features = torch.cat(image_features, dim=-1)
Here is an important change with respect to llava
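For reference, a standalone sketch of that feature selection with dummy tensors standing in for the vision tower outputs (the shapes and number of layers below are illustrative assumptions):

```python
import torch

# Dummy stand-ins for the vision tower's per-layer hidden states:
# 25 entries of shape (batch, 1 CLS token + 576 patches, hidden_size).
hidden_states = [torch.randn(1, 577, 1024) for _ in range(25)]
vision_feature_layers = [-2, -5, -8, -11, 6]

# Drop the CLS token (index 0) from each selected layer, then concatenate the
# selected layers along the feature dimension.
image_features = [hidden_states[index][:, 1:] for index in vision_feature_layers]
image_features = torch.cat(image_features, dim=-1)
print(image_features.shape)  # torch.Size([1, 576, 5120]) -> 5 layers * 1024
```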
README.md
Outdated
@@ -504,6 +504,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
1. **[ViP-LLaVA](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM just needs a few nits
The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by etcx
OK, done!
```

The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
let's mention the only diffs with Llava here as well, as a tip or whatever
Makes sense, done!
        new_state_dict[key] = value
    return new_state_dict
missing copied from
Done!
missing copied from here as well whenever possible
Done!
        return model_embeds

    # Ignore copy
    def _merge_input_ids_with_image_features(
can you highlight the diff (where it is different from Llava) just to help me review + might need it in the code as a comment!
Thanks @ArthurZucker for the review, I left one open question
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in vision_feature_layers]
image_features = torch.cat(image_features, dim=-1)
the main diff is here @ArthurZucker
Just left a nit.
Not 100% sure we need to have it in the library this fast as it does not seem that popular yet.
Thanks otherwise 🤗
@slow
@require_bitsandbytes
def test_small_model_integration_test(self):
    from transformers import pipeline
if you want to add the pipeline test, put it in the image-to-text pipeline tests; here let's leverage generate!
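A rough sketch of what a `generate`-driven integration test could look like instead; the checkpoint id, prompt, dummy image and assertion below are placeholders rather than values from this PR:

```python
import unittest

from PIL import Image

from transformers import AutoProcessor, VipLlavaForConditionalGeneration
from transformers.testing_utils import require_bitsandbytes, slow


class VipLlavaIntegrationTest(unittest.TestCase):
    @slow
    @require_bitsandbytes
    def test_small_model_integration_test(self):
        # Placeholder checkpoint id; a real test would point at the converted ViP-LLaVA weights.
        model_id = "llava-hf/vip-llava-7b-hf"
        model = VipLlavaForConditionalGeneration.from_pretrained(model_id, load_in_4bit=True)
        processor = AutoProcessor.from_pretrained(model_id)

        prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
        image = Image.new("RGB", (336, 336), color="red")  # dummy image instead of a downloaded one
        inputs = processor(text=prompt, images=image, return_tensors="pt")

        output = model.generate(**inputs, max_new_tokens=20)
        decoded = processor.decode(output[0], skip_special_tokens=True)
        # Placeholder assertion: a real test would compare against a pinned reference string.
        self.assertIn("ASSISTANT:", decoded)
```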
* v1
* add-new-model-like
* revert
* fix forward and conversion script
* revert
* fix copies
* fixup
* fix
* Update docs/source/en/index.md
* Apply suggestions from code review
* push
* fix
* fixes here and there
* up
* fixup and fix tests
* Apply suggestions from code review
* add docs
* fixup
* fixes
* docstring
* add docstring
* fixup
* docstring
* fixup
* nit
* docs
* more copies
* fix copies
* nit
* update test
What does this PR do?
VIP-llava is a new Llava variant. The only differences between Llava and VIP-Llava appear to be that VIP-llava applies a layernorm before feeding the hidden states into the multi-modal projector, and that it concatenates the hidden states from several image encoder layers before passing them to that projector.
Also compatible with Flash Attention 2.
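For context, a minimal loading sketch with Flash Attention 2 enabled, assuming the final class name and a hypothetical Hub checkpoint id:

```python
import torch

from transformers import VipLlavaForConditionalGeneration

# Assumed checkpoint id; loads in half precision with Flash Attention 2.
# device_map="auto" requires `accelerate` to be installed.
model = VipLlavaForConditionalGeneration.from_pretrained(
    "llava-hf/vip-llava-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```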
https://github.com/mu-cai/ViP-LLaVA
cc @ArthurZucker @NielsRogge @mu-cai @haotian-liu