Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raspawar/add vlm support #16751

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

raspawar
Copy link
Contributor

Support for vision-language models, those that can accept images and text as input and produce text.

these are akin to https://platform.openai.com/docs/guides/vision with three notable differences -

  • images can be passed with img tags in the regular text content
  • images can be passed as NVCF asset ids
    not all model endpoints support all features, e.g. server-side download of images not available with
    adept/fuyu-8b, google/deplot, microsoft/kosmos-2, google/paligemma; some models endpoints restrict image size; some models support one and only image; some models do not support gif or webp; kosmos-2 does not support streaming

cc: @sumitkbh @mattf @dglogo

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Oct 30, 2024
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Oct 30, 2024
Copy link
Contributor

@mattf mattf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address comments.

failing tests -

FAILED tests/test_multi_modal_nvidia.py::test_vlm_asset_id[invoke-content0-microsoft/phi-3-vision-128k-instruct] - TypeError: sequence item 0: expected str instance, function found
FAILED tests/test_multi_modal_nvidia.py::test_vlm_asset_id[stream-content0-microsoft/phi-3-vision-128k-instruct] - TypeError: sequence item 0: expected str instance, function found

@raspawar raspawar requested a review from mattf November 5, 2024 11:39
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh the way cicd works, including extra files like this is a huge pain

I would just create an image in-memory -- like a black or white square. Or download the images at runtime

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it will do

@raspawar
Copy link
Contributor Author

@logan-markewich can u ptal. I donno why the coverage is considering the test cases also.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have an actual readme? Something with the install command and general usage (probably similar to the notebook)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Collaborator

@logan-markewich logan-markewich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants