Adds VIP-llava to transformers #27932
Conversation
@@ -92,12 +92,19 @@ class LlavaCausalLMOutputWithPast(ModelOutput):
class LlavaMultiModalProjector(nn.Module):
    def __init__(self, config: LlavaConfig):
        super().__init__()
        if config.projector_layernorm:
not really transformers philosophy :|
I was thinking we could always add a layer norm and just set its parameters so it acts as the identity for the other checkpoints? (Kind of a hack, but it might stabilize training / fine-tuning?)
Agree, Sylvain would have never allowed this, so let's keep it that way.
New paper = new model = no new config attributes, but rather a separate model with copied from. I know that philosophy is sometimes a bit extreme, but it has always worked out in the past.
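To make that suggestion concrete, here is a minimal sketch of what a separate ViP-LLaVA projector could look like, with the layer norm baked in rather than gated behind a new Llava config attribute. The class name and the config fields used here (`vision_feature_layers`, `projector_layernorm_eps`, the nested `vision_config` / `text_config` sizes) are assumptions for illustration, not necessarily the final API:

```python
import torch.nn as nn


class VipLlavaMultiModalProjector(nn.Module):
    # Sketch only: config attribute names below are assumed, and the activation
    # is hard-coded to GELU instead of being read from the config.
    def __init__(self, config):
        super().__init__()
        num_feature_layers = len(config.vision_feature_layers)
        # Layer norm over the concatenated vision hidden states
        self.projector_layernorm = nn.LayerNorm(
            num_feature_layers * config.vision_config.hidden_size,
            eps=config.projector_layernorm_eps,
        )
        self.linear_1 = nn.Linear(
            num_feature_layers * config.vision_config.hidden_size,
            config.text_config.hidden_size,
        )
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size)

    def forward(self, hidden_states):
        hidden_states = self.projector_layernorm(hidden_states)
        hidden_states = self.linear_1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states
```

With this approach, `LlavaMultiModalProjector` stays untouched and the new behaviour lives entirely in the copied model.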
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in [-2, -5, -8, -11, 6]]
image_features = torch.cat(image_features, dim=-1)
Here is an important change with respect to llava
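For reference, a standalone sketch of that feature selection with dummy tensors standing in for the vision tower outputs (the shapes and number of layers below are illustrative assumptions):

```python
import torch

# Dummy stand-ins for the vision tower's per-layer hidden states:
# 25 entries of shape (batch, 1 CLS token + 576 patches, hidden_size).
hidden_states = [torch.randn(1, 577, 1024) for _ in range(25)]
vision_feature_layers = [-2, -5, -8, -11, 6]

# Drop the CLS token (index 0) from each selected layer, then concatenate the
# selected layers along the feature dimension.
image_features = [hidden_states[index][:, 1:] for index in vision_feature_layers]
image_features = torch.cat(image_features, dim=-1)
print(image_features.shape)  # torch.Size([1, 576, 5120]) -> 5 layers * 1024
```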
README.md
Outdated
@@ -504,6 +504,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
1. **[VAN](https://huggingface.co/docs/transformers/model_doc/van)** (from Tsinghua University and Nankai University) released with the paper [Visual Attention Network](https://arxiv.org/abs/2202.09741) by Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu.
1. **[VideoMAE](https://huggingface.co/docs/transformers/model_doc/videomae)** (from Multimedia Computing Group, Nanjing University) released with the paper [VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training](https://arxiv.org/abs/2203.12602) by Zhan Tong, Yibing Song, Jue Wang, Limin Wang.
1. **[ViLT](https://huggingface.co/docs/transformers/model_doc/vilt)** (from NAVER AI Lab/Kakao Enterprise/Kakao Brain) released with the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Wonjae Kim, Bokyung Son, Ildoo Kim.
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
1. **[VipVipLlava](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from 1University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
1. **[ViP-LLaVA](https://huggingface.co/docs/transformers/main/model_doc/vipllava)** (from University of Wisconsin–Madison) released with the paper [Making Large Multimodal Models Understand Arbitrary Visual Prompts](https://arxiv.org/abs/2312.00784) by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM just needs a few nits
The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
This model was contributed by etcx
OK, done!
```

The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
let's mention the only diffs with Llava here as well, as a tip or whatever
Makes sense, done!
        new_state_dict[key] = value
    return new_state_dict
missing copied from
Done!
missing copied from here as well whenever possible
Done!
        return model_embeds

    # Ignore copy
    def _merge_input_ids_with_image_features(
can you highlight the diff (where it is different from Llava) just to help me review + might need it in the code as a comment!
Thanks @ArthurZucker for the review, I left one open question
# For VIP-llava, the image features are computed this way
# We select the features from index 1: for the layers -2, -5, -8, -11 and 6
image_features = [image_outputs.hidden_states[index][:, 1:] for index in vision_feature_layers]
image_features = torch.cat(image_features, dim=-1)
the main diff is here @ArthurZucker
Just left a nit.
Not 100% sure we need to have it in the library this fast as it does not seem that popular yet.
Thanks otherwise 🤗
@slow
@require_bitsandbytes
def test_small_model_integration_test(self):
    from transformers import pipeline
if you want to add the pipeline test, put it in the image-to-text pipeline tests; here let's leverage generate!
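A rough sketch of what a `generate`-driven integration test could look like instead; the checkpoint id, prompt, dummy image and assertion below are placeholders rather than values from this PR:

```python
import unittest

from PIL import Image

from transformers import AutoProcessor, VipLlavaForConditionalGeneration
from transformers.testing_utils import require_bitsandbytes, slow


class VipLlavaIntegrationTest(unittest.TestCase):
    @slow
    @require_bitsandbytes
    def test_small_model_integration_test(self):
        # Placeholder checkpoint id; a real test would point at the converted ViP-LLaVA weights.
        model_id = "llava-hf/vip-llava-7b-hf"
        model = VipLlavaForConditionalGeneration.from_pretrained(model_id, load_in_4bit=True)
        processor = AutoProcessor.from_pretrained(model_id)

        prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
        image = Image.new("RGB", (336, 336), color="red")  # dummy image instead of a downloaded one
        inputs = processor(text=prompt, images=image, return_tensors="pt")

        output = model.generate(**inputs, max_new_tokens=20)
        decoded = processor.decode(output[0], skip_special_tokens=True)
        # Placeholder assertion: a real test would compare against a pinned reference string.
        self.assertIn("ASSISTANT:", decoded)
```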
* v1
* add-new-model-like
* revert
* fix forward and conversion script
* revert
* fix copies
* fixup
* fix
* Update docs/source/en/index.md
* Apply suggestions from code review
* push
* fix
* fixes here and there
* up
* fixup and fix tests
* Apply suggestions from code review
* add docs
* fixup
* fixes
* docstring
* add docstring
* fixup
* docstring
* fixup
* nit
* docs
* more copies
* fix copies
* nit
* update test
What does this PR do?
VIP-llava is a new Llava variant. The only differences between Llava and VIP-Llava appear to be that VIP-llava applies a layernorm before feeding the hidden states into the multi-modal projector, and that it concatenates the hidden states from several image encoder layers before passing them to that projector.
Also compatible with Flash Attention 2.
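For context, a minimal loading sketch with Flash Attention 2 enabled, assuming the final class name and a hypothetical Hub checkpoint id:

```python
import torch

from transformers import VipLlavaForConditionalGeneration

# Assumed checkpoint id; loads in half precision with Flash Attention 2.
# device_map="auto" requires `accelerate` to be installed.
model = VipLlavaForConditionalGeneration.from_pretrained(
    "llava-hf/vip-llava-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```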
https://github.com/mu-cai/ViP-LLaVA
cc @ArthurZucker @NielsRogge @mu-cai @haotian-liu