Raspawar/add vlm support #16751
base: main
Conversation
Please address the comments. Failing tests:
FAILED tests/test_multi_modal_nvidia.py::test_vlm_asset_id[invoke-content0-microsoft/phi-3-vision-128k-instruct] - TypeError: sequence item 0: expected str instance, function found
FAILED tests/test_multi_modal_nvidia.py::test_vlm_asset_id[stream-content0-microsoft/phi-3-vision-128k-instruct] - TypeError: sequence item 0: expected str instance, function found
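For context, this TypeError is the generic failure mode of `str.join` receiving a function object instead of a string, e.g. when a callable is placed in a content list without being called. A minimal standalone reproduction (hypothetical; the actual failing code is in base.py):

```python
# str.join requires an iterable of strings, so an uncalled function
# in the sequence raises exactly the TypeError seen in the test logs.
def build_content():
    return "describe the image"

parts = [build_content, "additional text"]  # bug: function object, not its result
try:
    "\n".join(parts)
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, function found

result = "\n".join([build_content(), "additional text"])  # fix: call the function
print(result)
```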
...i_modal_llms/llama-index-multi-modal-llms-nvidia/llama_index/multi_modal_llms/nvidia/base.py
TBH, the way CI/CD works, including extra files like this is a huge pain.
I would just create an image in-memory, like a black or white square, or download the images at runtime.
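The suggestion above can be sketched with a stdlib-only helper that builds a tiny solid-color PNG in memory, so no image fixtures need to be committed; the helper name is illustrative, not from the PR:

```python
import base64
import struct
import zlib

def make_png(width=16, height=16, rgb=(255, 255, 255)) -> bytes:
    """Build a minimal solid-color RGB PNG using only the standard library."""
    def chunk(tag: bytes, data: bytes) -> bytes:
        # length + tag + data + CRC over (tag + data), all big-endian
        return (struct.pack(">I", len(data)) + tag + data
                + struct.pack(">I", zlib.crc32(tag + data)))

    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width  # filter byte 0 + raw RGB pixels
    idat = zlib.compress(row * height)
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat) + chunk(b"IEND", b""))

img_b64 = base64.b64encode(make_png()).decode("utf-8")
print(img_b64[:8])  # base64 of the PNG signature: iVBORw0K
```

A test can pass `img_b64` to the model as an inline base64 image instead of reading a file from the repo.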
Got it, will do.
@logan-markewich can you please take a look? I don't know why the coverage is also counting the test cases.
Can we have an actual readme? Something with the install command and general usage (probably similar to the notebook)
Done!
Just one small request
Support for vision-language models, i.e. models that accept images and text as input and produce text.
These are akin to https://platform.openai.com/docs/guides/vision, with some notable differences:
- not all model endpoints support all features; e.g. server-side download of images is not available with adept/fuyu-8b, google/deplot, microsoft/kosmos-2, or google/paligemma
- some model endpoints restrict image size
- some models support one and only one image
- some models do not support gif or webp
- kosmos-2 does not support streaming
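The OpenAI-style message shape the description alludes to can be sketched as follows; the field names (`image_url`, inline data-URI images) follow the public OpenAI vision API and the helper is hypothetical, not part of this PR's NVIDIA integration:

```python
import base64

def vision_message(text: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Combine text and an inline base64-encoded image into one chat message."""
    data_uri = f"data:{mime};base64,{base64.b64encode(image_bytes).decode('utf-8')}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }

msg = vision_message("What is in this image?", b"\x89PNG\r\n\x1a\n")
print(msg["content"][1]["image_url"]["url"][:15])  # data:image/png;
```

Per the limitations above, endpoints without server-side image download (e.g. adept/fuyu-8b, google/deplot) would need this inline data-URI form rather than a remote URL.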
cc: @sumitkbh @mattf @dglogo