Adds VIP-llava to transformers #27932

Merged: 32 commits from add-vip-llava-model merged into huggingface:main on Dec 13, 2023

Conversation

younesbelkada (Contributor) commented on Dec 10, 2023
What does this PR do?

ViP-LLaVA is a new Llava variant. The only differences between Llava and ViP-LLaVA appear to be that ViP-LLaVA applies a layer norm to the hidden states before passing them into the multi-modal projector, and that it concatenates the hidden states from several image-encoder layers before feeding them to that projector.
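For reviewers who want a quick mental model, here is a minimal sketch of that projector difference. The class name, sizes and MLP layout below are illustrative assumptions, not the exact code added in this PR:

```python
import torch
from torch import nn

# Minimal sketch of the ViP-LLaVA projector idea described above.
# Class name, sizes and MLP layout are illustrative assumptions,
# not the exact implementation added in this PR.
class VipLlavaProjectorSketch(nn.Module):
    def __init__(self, vision_hidden_size=1024, num_feature_layers=5, text_hidden_size=4096):
        super().__init__()
        concat_size = num_feature_layers * vision_hidden_size
        # Layer norm applied to the concatenated multi-layer image features...
        self.projector_layernorm = nn.LayerNorm(concat_size)
        # ...followed by the usual Llava-style MLP projection into the LM hidden size.
        self.linear_1 = nn.Linear(concat_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        hidden = self.projector_layernorm(image_features)
        hidden = self.linear_1(hidden)
        hidden = self.act(hidden)
        return self.linear_2(hidden)
```

With the default sizes above, the projector input would be the tensor obtained by concatenating the five selected vision-layer outputs along the feature dimension.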

Usage example with the `image-to-text` pipeline:

```python
from transformers import pipeline
from PIL import Image
import requests

model_id = "ybelkada/vip-llava-7b"
# Load the checkpoint in 4-bit with Flash Attention 2 enabled
pipe = pipeline("image-to-text", model=model_id, model_kwargs={"load_in_4bit": True, "use_flash_attention_2": True})
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nCan you please describe this image?\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
print(outputs[0]["generated_text"])
```

```
USER: <image>
Can you please describe this image?
ASSISTANT: The image features a brown and white cat sitting on a green surface, with a red ball in its paw. The cat appears to be playing with the ball, possibly a sports ball, as it is positioned in a relaxed manner. The cat's eyes are wide open, indicating that it is focused on the ball and possibly in the middle of a playful moment.
```


Also compatible with Flash Attention 2.

https://github.com/mu-cai/ViP-LLaVA

cc @ArthurZucker @NielsRogge @mu-cai @haotian-liu

@@ -92,12 +92,19 @@ class LlavaCausalLMOutputWithPast(ModelOutput):
class LlavaMultiModalProjector(nn.Module):
    def __init__(self, config: LlavaConfig):
        super().__init__()
        if config.projector_layernorm:
Collaborator:
Not really transformers philosophy :|
I was thinking we could always add a layer norm, but just set its parameters so it acts as the identity for the other checkpoints? (kind of a hack, but it might stabilize training / fine-tuning?)
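For what it's worth, a minimal sketch of that option; note that a `LayerNorm` with default affine parameters (weight = 1, bias = 0) still normalizes its input, so a true identity would need an explicit bypass rather than just parameter values:

```python
import torch
from torch import nn

# Hypothetical illustration of the "always-present but optionally identity"
# layer norm idea discussed above; not something this PR implements.
class OptionalLayerNorm(nn.Module):
    def __init__(self, hidden_size: int, enabled: bool = True):
        super().__init__()
        self.enabled = enabled
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # When disabled, behave exactly like the identity so older checkpoints
        # are unaffected; when enabled, apply the ViP-LLaVA-style layer norm.
        return self.norm(hidden_states) if self.enabled else hidden_states
```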

Contributor:

Agree, Sylvain would have never allowed this, so let's keep it that way.

New paper = new model = no new config attributes, but rather a separate model with `# Copied from`. I know that philosophy is sometimes a bit extreme, but it has always worked out in the past.

Comment on lines 396 to 399
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in [-2, -5, -8, -11, 6]]
image_features = torch.cat(image_features, dim=-1)
younesbelkada (author):

Here is an important change with respect to Llava.
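To make the shapes concrete, here is a toy illustration of that selection and concatenation (all sizes below are made up for illustration):

```python
import torch

# Pretend the vision tower returned hidden states for every layer, each of
# shape (batch, 1 + num_patches, hidden); all sizes here are made up.
batch, num_patches, hidden = 2, 576, 1024
hidden_states = [torch.randn(batch, 1 + num_patches, hidden) for _ in range(25)]

layers = [-2, -5, -8, -11, 6]
# Drop the CLS token at index 0 and keep the patch tokens of each selected layer...
features = [hidden_states[index][:, 1:] for index in layers]
# ...then concatenate along the last (feature) dimension before the projector.
image_features = torch.cat(features, dim=-1)
print(image_features.shape)  # torch.Size([2, 576, 5120]) i.e. (batch, num_patches, 5 * hidden)
```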

README.md Outdated
@@ -504,6 +504,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
Contributor:

Suggested change
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
1. **[ViP-LLaVA](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.

README_zh-hans.md Outdated
README_ko.md Outdated
README_ja.md Outdated
README_hd.md Outdated
README_es.md Outdated
README_zh-hant.md Outdated
younesbelkada marked this pull request as ready for review on December 11, 2023, 09:16
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker (Collaborator) left a comment:

LGTM, just needs a few nits.

Comment on lines +47 to +48
The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).

Collaborator:

Missing the usual "This model was contributed by ..." line here.

younesbelkada (author):

OK, done!


The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).

Collaborator:

Let's also mention the only diffs with Llava here, e.g. as a tip.

younesbelkada (author):

Makes sense, done!

        new_state_dict[key] = value
    return new_state_dict


Collaborator:

Missing `# Copied from` here.
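For context, the `# Copied from` marker is what the repo's consistency check (`make fix-copies`) uses to keep duplicated code in sync. A sketch of what it looks like, where the source path is an assumption and not verified against this PR:

```python
# Sketch of the `# Copied from` convention; the source path below is an
# assumption for illustration, not necessarily the one this PR needed.
# Copied from transformers.models.llava.convert_llava_weights_to_hf.convert_state_dict_to_hf
def convert_state_dict_to_hf(state_dict):
    new_state_dict = {}
    for key, value in state_dict.items():
        new_state_dict[key] = value
    return new_state_dict
```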

younesbelkada (author):

Done!

Collaborator:

Missing `# Copied from` here as well, wherever possible.

younesbelkada (author):

Done!

        return model_embeds

    # Ignore copy
    def _merge_input_ids_with_image_features(
Collaborator:

Can you highlight the diff (where it is different from Llava), just to help me review? We might also need it in the code as a comment!

younesbelkada (author) left a comment:

Thanks @ArthurZucker for the review, I left one open question

Comment on lines +404 to +407
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in vision_feature_layers]
image_features = torch.cat(image_features, dim=-1)
younesbelkada (author):

the main diff is here @ArthurZucker

ArthurZucker (Collaborator) left a comment:

Just left a nit.
Not 100% sure we need to have it in the library this fast as it does not seem that popular yet.
Thanks otherwise 🤗

    @slow
    @require_bitsandbytes
    def test_small_model_integration_test(self):
        from transformers import pipeline
Collaborator:

If you want to add a pipeline test, put it in the image-to-text pipeline tests; here, let's leverage `generate` instead!
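Something along these lines could work; this is only a rough sketch of a generate-based test, where the prompt reuses the one from the PR description and the checkpoint id, dtype handling and assertion are assumptions rather than the final merged test:

```python
import unittest

import requests
import torch
from PIL import Image

from transformers import AutoProcessor, VipLlavaForConditionalGeneration
from transformers.testing_utils import require_bitsandbytes, slow, torch_device


class VipLlavaIntegrationTestSketch(unittest.TestCase):
    @slow
    @require_bitsandbytes
    def test_small_model_integration_test(self):
        # Rough sketch: checkpoint id, prompt and the final assertion are
        # illustrative assumptions, not the values used in the merged test.
        model_id = "ybelkada/vip-llava-7b"
        model = VipLlavaForConditionalGeneration.from_pretrained(model_id, load_in_4bit=True)
        processor = AutoProcessor.from_pretrained(model_id)

        url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
        image = Image.open(requests.get(url, stream=True).raw)
        prompt = "USER: <image>\nCan you please describe this image?\nASSISTANT:"

        inputs = processor(prompt, image, return_tensors="pt").to(torch_device, torch.float16)
        output = model.generate(**inputs, max_new_tokens=100)
        decoded = processor.decode(output[0], skip_special_tokens=True)

        # A real test would compare `decoded` against a recorded expected string.
        self.assertIn("ASSISTANT:", decoded)
```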

younesbelkada merged commit c7f076a into huggingface:main on Dec 13, 2023
22 checks passed
younesbelkada deleted the add-vip-llava-model branch on December 13, 2023, 09:42
iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request on Dec 16, 2023:

* v1
* add-new-model-like
* revert
* fix forward and conversion script
* revert
* fix copies
* fixup
* fix
* Update docs/source/en/index.md
* Apply suggestions from code review
* push
* fix
* fixes here and there
* up
* fixup and fix tests
* Apply suggestions from code review
* add docs
* fixup
* fixes
* docstring
* add docstring
* fixup
* docstring
* fixup
* nit
* docs
* more copies
* fix copies
* nit
* update test
staghado pushed a commit to staghado/transformers that referenced this pull request on Jan 15, 2024, with the same commit message as above.