[Core] Support image processor #4197
Conversation
- Also add docs for basic VLM usage
- Other data types may need a dtype different from that of the model
The LLaVA test passes on my end (with both outputs matching the HF output shown in CI). Does anyone have a clue what might cause it to fail in CI? Perhaps a case of floating-point error in GPU computation?
Per offline discussion - waiting for #5118 to be merged first.
@DarkLight1337 Could you resolve the merge conflicts? Once that's done I think this PR is ready to merge.
I did a final pass and left a note, but everything LGTM! Thank you for the hard work on this. @DarkLight1337
I have implemented a plugin architecture (`MultiModelPlugin`) over `MultiModalData` to define how each modality type should be preprocessed before being passed to the model as keyword arguments. This preserves the contract between the output of the HuggingFace processor and the input into the HuggingFace model. As long as those keyword arguments do not conflict with the ones we already have in vLLM, I think this is a good way to make the framework flexible enough to support other multi-modal architectures.

FIX #4054 (the data is now automatically moved to the model's device)
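For readers skimming the conversation, the following is a minimal, hypothetical sketch of the plugin idea described above, not the actual vLLM implementation: the registry, helper function, and device/dtype handling are illustrative only, and only the names `MultiModelPlugin`, `MultiModalData`, and `ImagePixelData` come from this description.

```python
# Hypothetical sketch of the plugin architecture described above (not vLLM code).
from dataclasses import dataclass
from typing import Callable, Dict, Type

import torch


@dataclass
class MultiModalData:
    """Base class for per-modality input data."""


@dataclass
class ImagePixelData(MultiModalData):
    # In the real PR this wraps a PIL.Image that an image processor turns into
    # pixel values; a tensor is used here to keep the sketch self-contained.
    image: torch.Tensor


class MultiModelPlugin:
    """Maps each MultiModalData subclass to the keyword arguments the HF model expects."""

    def __init__(self) -> None:
        self._processors: Dict[Type[MultiModalData], Callable[[MultiModalData], dict]] = {}

    def register(self, data_cls: Type[MultiModalData],
                 fn: Callable[[MultiModalData], dict]) -> None:
        self._processors[data_cls] = fn

    def process(self, data: MultiModalData, *, dtype: torch.dtype,
                device: torch.device) -> dict:
        kwargs = self._processors[type(data)](data)
        # Mirrors the FIX #4054 behaviour: tensors are moved to the model's
        # device (and dtype) before the forward pass.
        return {k: v.to(device=device, dtype=dtype) for k, v in kwargs.items()}


def _process_image_pixels(data: ImagePixelData) -> dict:
    # The HF LLaVA model consumes the processed image under the "pixel_values" kwarg.
    return {"pixel_values": data.image}


registry = MultiModelPlugin()
registry.register(ImagePixelData, _process_image_pixels)

if __name__ == "__main__":
    dummy = ImagePixelData(image=torch.rand(1, 3, 336, 336))
    kwargs = registry.process(dummy, dtype=torch.float16, device=torch.device("cpu"))
    print(kwargs["pixel_values"].shape, kwargs["pixel_values"].dtype)
```

A real implementation would additionally need to check that these keyword arguments do not collide with the inputs vLLM already passes to the model, as noted above.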
Related Contributions
This PR is part of #3978.
This PR also implements Proposals 1 and 3 of #4194.
Features
- `MultiModalData`
  - `ImageFeatureData` represents the image features of LLaVA after being passed through the vision tower, but before the multi-modal projection is applied.
  - `ImagePixelData` represents the raw image (using the `PIL.Image` class). `AutoImageProcessor` from HuggingFace is loaded from `config.json` to pre-process input images before they are passed to the model as `pixel_values`. As with the tokenizer, you can override the default and specify the version of the image processor via `EngineConfig`; you can even disable image preprocessing altogether, which is useful if you want to pass in images that have already been preprocessed. (A standalone preprocessing sketch follows this list.)
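As context for the `pixel_values` preprocessing described above, here is a small standalone example that calls the HuggingFace `AutoImageProcessor` directly; the model repository and image path are placeholders, and this uses the plain `transformers` API rather than the vLLM code path added in this PR.

```python
# Standalone example of the HuggingFace image preprocessing step (not vLLM code).
from PIL import Image
from transformers import AutoImageProcessor

# Placeholder model repo and image path.
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
# The resulting tensor is what gets passed to the model as `pixel_values`,
# e.g. shape (1, 3, 336, 336) for LLaVA-1.5.
print(inputs["pixel_values"].shape)
```

Disabling preprocessing in the engine, as described above, would correspond to supplying a tensor like this `pixel_values` output yourself instead of a raw `PIL.Image`.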
Compatibility Changes
- `pillow` will be upgraded to a `common` dependency (from `dev`) to process the images.