[Bug]: vision chat completion output with odd Instruction/Output prompting. #5693
It's possible I don't understand these things:
It seems odd to have to specify these; they should be derived from the model, but vLLM won't start without them. I got these values from here: vllm/examples/phi3v_example.py (line 15 at afed90a).
Hey @pseudotensor, thank you for trying out the vision API and raising this issue.
Yeah - we're working to remove the need to specify these args as part of the next multi-modality refactoring milestone mentioned here.
Are the two images identical but in different formats? If not, can you try uploading the first image to a public registry and using a URL to load it instead, so I can get a better idea of where the bug might be?
It has to do with the byte-encoding aspect. If I just send the URL, there is no such issue.
gives:
But with OpenAI or any of my own systems, that byte-encoded version is fine. E.g.
gives:
I.e., the same encoded payload is not working with vLLM. Or rather, it runs, but the response oddly shows the structure of the prompts, is weak in terms of expected output, and is not close to the URL version of the output as it should be. Yet it kind of "sees" what the image is, so I guess there's some bug.
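For concreteness, here is a minimal sketch (my own, not code from this thread) of the two request shapes being contrasted, assuming an OpenAI-compatible vLLM server at http://localhost:8000/v1 and the Big Ben image that comes up later in the thread:

```python
import base64

import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"

# Variant 1: pass the image by URL - the case that works.
messages_url = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What do you see?"},
        {"type": "image_url", "image_url": {"url": url}},
    ],
}]

# Variant 2: pass the same image as a base64 data URL - the case that misbehaves.
image_b64 = base64.b64encode(requests.get(url).content).decode("utf-8")
messages_b64 = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What do you see?"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ],
}]

for messages in (messages_url, messages_b64):
    response = client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=messages,
        max_tokens=300,
    )
    print(response.choices[0].message.content)
```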
Hmm... how did you encode your image? Could you try encoding with this function and see if it gives the same string? (Line 58 at 4a30d7e)
Here's how I'm encoding: https://github.com/h2oai/h2ogpt/blob/main/src/vision/utils_vision.py#L86-L118. It works for lmdeploy, cogvlm2's FastAPI app, OpenAI, Anthropic, and Google. The encoding you pointed to has the same issue, and only with vLLM, not with OpenAI etc.
gives:
I see - I will assign this to myself and take a look later this week or next week. I suspect the two image payloads don't give the same image.
Hi, any progress here? Thanks. I'll stop bumping, but I'm quite interested in using Phi-3 vision with vLLM.
@pseudotensor I think I figured it out - when encoding the image to base64, we cannot do it on top of `Image.open()`:

```python
from PIL import Image
import base64
import requests
from io import BytesIO

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""
    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

# This is what we use in the API server to load the base64 string to image
def load_image_from_base64(image: str):
    """Load image from base64 format."""
    return Image.open(BytesIO(base64.b64decode(image)))

# load image from url
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# correct way to encode an image from url
response = requests.get(url)
base64_correct = base64.b64encode(response.content).decode('utf-8')
image_encoded_correct = load_image_from_base64(base64_correct)
assert image == image_encoded_correct, "images are not the same"

# incorrect way to encode an image from url
base64_wrong = encode_image_base64(image)
image_encoded_wrong = load_image_from_base64(base64_wrong)
assert image == image_encoded_wrong, "images are not the same"
```

Running the above should give you the following:
You can further use the processor from `transformers` to check whether this difference actually changes the model inputs:

```python
from transformers import AutoProcessor
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

prompt = "What's in the image?<|image_1|>"
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
inputs_encoded_wrong = processor(prompt, [image_encoded_wrong], return_tensors="pt").to("cuda:0")
assert torch.equal(inputs.pixel_values, inputs_encoded_wrong.pixel_values)
```

Could you try the correct way to encode the image, then send it through the server and see if the output is correct?
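As an aside, a plausible explanation for why the two payloads differ (an assumption on my part, not something stated above) is that `encode_image_base64` re-saves the pixels as a fresh JPEG, which is lossy and drops metadata, so the round-tripped image need not match the original. A quick check along those lines, reusing the variables from the snippet above:

```python
from PIL import ImageChops

# Compare the two base64 payloads directly (variables from the snippet above).
print(len(base64_correct), len(base64_wrong))  # lengths will usually differ
print(base64_correct == base64_wrong)          # expected: False

# Pixel-level difference after the JPEG round trip.
diff = ImageChops.difference(image.convert('RGB'), image_encoded_wrong.convert('RGB'))
print(diff.getbbox())  # None would mean the pixels are identical
```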
The encoding I'm using is compatible with all the usual providers: OpenAI, Anthropic, Google, lmdeploy, sglang, etc. So I don't think the issue is one of encoding, since I'm using the same encoding for all these cases. As for the Image-to-base64 path using vLLM's own encoding code, I only used that because you asked me to; it's not normally what I do. If what vLLM is doing is not generally compatible, I think that's a major issue. However, I can follow along and help you identify the issue.
If I just use what you did, it doesn't work, because the OpenAI API expects a valid image url like 'data:image' etc.:
gives:
If I add the correct prefix:
then I get:
So that's correct.
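Presumably the prefix in question is the standard data-URL header; a minimal sketch, reusing `base64_correct` from the earlier snippet:

```python
# Wrap a bare base64 string in the data-URL form the OpenAI-style API expects.
image_url = f"data:image/jpeg;base64,{base64_correct}"
content = [
    {"type": "text", "text": "What do you see?"},
    {"type": "image_url", "image_url": {"url": image_url}},
]
```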
Here's an example showing the way you recommended, but still failing. I only changed the prompt from "What is in this image" to "What do you see?"
gives:
@pseudotensor Thanks for trying out the examples! To clarify, I asked you to try these examples because we need to see where/at which layer exactly the bug is. At least for this case, we'd like to make sure the input images are identical when loaded as a `PIL.Image`. If the underlying model (in this case, Phi-3-vision running under `transformers`) behaves correctly on the same inputs, then the problem likely sits in how vLLM handles them. Perhaps a good way to debug this is to test these inputs with `transformers` directly.
cc @Isotr0py if you have any idea about this, since you worked on the PR to add this model.
OK, will compare to transformers.
I'm unable to make transformers fail. E.g.:
gives:
If vLLM really did the equivalent of transformers and (say) took the base64, converted it to an image, and passed it to the transformers processor, then it should all be good.
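For reference, a minimal sketch of that kind of transformers-only check - my own reconstruction, assuming the chat format from the Phi-3-vision model card and reusing `base64_wrong` from the earlier snippet as the client-side payload:

```python
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="cuda:0",
    _attn_implementation="eager",  # avoid requiring flash-attn
)

# Decode the base64 payload back into a PIL image, the way a server would.
image = Image.open(BytesIO(base64.b64decode(base64_wrong)))

# Chat format assumed from the Phi-3-vision model card.
prompt = "<|user|>\n<|image_1|>\nWhat do you see?<|end|>\n<|assistant|>\n"
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generate_ids = model.generate(
    **inputs, max_new_tokens=300, eos_token_id=processor.tokenizer.eos_token_id
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```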
Here I go over many prompts. Transformers is always stable with the "bad encoding" image version.
gives:
None of them have that odd "Instruction"/"Output" stuff.
@pseudotensor Thanks for getting back to me on this! I also went back and tried to repro the errors from the main branch, but I can't seem to do so:

```python
from PIL import Image
import base64
import requests
from io import BytesIO
from openai import OpenAI

# The encoding function I linked previously - but we actually don't use this function in the API server
def encode_image_base64(image: Image.Image, format: str = 'JPEG') -> str:
    """encode image to base64 format."""
    buffered = BytesIO()
    if format == 'JPEG':
        image = image.convert('RGB')
    image.save(buffered, format)
    return base64.b64encode(buffered.getvalue()).decode('utf-8')

client = OpenAI(base_url='http://localhost:8000/v1', api_key="EMPTY")

url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# encode with Image.open()
image_base64 = encode_image_base64(image)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{image_base64}"
                },
            },
        ],
    }
]
response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",
    messages=messages,
    max_tokens=300,
)
print(response.choices[0])
```

Can you try running from the main branch and see if you can still repro this?
Another thought I have - I wonder if this has something to do with line 70 at e9de9dd.
I can confirm: the same kinds of tests no longer fail on vLLM main. What fixed it?
I'm not sure, but checking the main branch commits, I can only think of #5772. If you don't mind, please feel free to close this issue after you test with more prompts!
I tried 100 random prompts with that same image, and no issues.
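A hedged sketch of that kind of sweep (the prompt list and the leakage check are placeholders of mine, not the actual prompts used):

```python
import random

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Same image as before, sent as a base64 data URL (`base64_correct` from the earlier snippet).
image_url = f"data:image/jpeg;base64,{base64_correct}"

# Placeholder prompts; the actual 100 random prompts are not listed in this thread.
prompts = ["What do you see?", "Describe this image.", "Where might this photo have been taken?"]

for prompt in (random.choice(prompts) for _ in range(100)):
    response = client.chat.completions.create(
        model="microsoft/Phi-3-vision-128k-instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=300,
    )
    text = response.choices[0].message.content
    # Flag the kind of prompt-structure leakage reported in this issue.
    if "Instruction:" in text or "Output:" in text:
        print("possible prompt leakage:", repr(text[:200]))
```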
Your current environment
latest main (afed90a)
run:
🐛 Describe the bug
While the latter "messages2" works, the former does not. It leads to:
So it sees the image, but the response is all messed up in terms of prompting.