[Bug]: Number of available GPU blocks drop significantly for Phi3-vision #6124

Closed
CatherineSue opened this issue Jul 4, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

CatherineSue (Contributor) commented Jul 4, 2024

Your current environment

Two Docker containers, based on images built from vLLM source at commits 3de6e6a and 3f3b6b2.

🐛 Describe the bug

I passed the same model, Phi-3-vision-128k-instruct, to each Docker container:

--tensor-parallel-size=1 \
--model=/models/Phi-3-vision-128k-instruct \

For the version that needs VLMConfig, here are the parameters:

--image-input-type="pixel_values" \
--image-feature-size=1921 \
--image-token-id=32044 \
--image-input-shape="1, 3, 1008, 1344" 
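For context, a sketch of how the flags above might be combined into a single launch command for the older, VLMConfig-based build; the api_server entrypoint and --trust-remote-code are assumptions on top of what the report shows:

python -m vllm.entrypoints.openai.api_server \
    --model=/models/Phi-3-vision-128k-instruct \
    --tensor-parallel-size=1 \
    --trust-remote-code \
    --image-input-type="pixel_values" \
    --image-feature-size=1921 \
    --image-token-id=32044 \
    --image-input-shape="1, 3, 1008, 1344"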

With the container based on the more recent commit 3de6e6a, it raises this error:

INFO 07-04 01:04:14 gpu_executor.py:84] # GPU blocks: 5970, # CPU blocks: 682
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (95520). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

But the container based on the older commit 3f3b6b2 starts without error:

INFO 07-04 01:40:03 gpu_executor.py:83] # GPU blocks: 8825, # CPU blocks: 682
INFO 07-04 01:40:05 model_runner.py:906] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
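The two block counts explain the different outcomes. Assuming vLLM's default KV-cache block size of 16 tokens per block (an assumption, not read from these logs), the usable KV-cache capacity is roughly the GPU block count times 16:

echo $((5970 * 16))   # 95520  < 131072 -> triggers the ValueError above
echo $((8825 * 16))   # 141200 > 131072 -> the 3f3b6b2 build starts normally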
CatherineSue added the bug label on Jul 4, 2024
CatherineSue (Contributor, Author) commented:

@ywang96 Can you share some insight? Does it have something to do with the recent changes in VLM support?

DarkLight1337 (Member) commented Jul 4, 2024

There used to be a bug in the model's memory profiling where it didn't actually pass in images, so the model's memory usage was underestimated. During inference, this underestimation could have caused OOM.

After the fix, the available block count is reduced significantly, which better reflects the true memory usage of the model. Regarding your problem, this is expected since the model has a 128k context length. If it can't fit in your GPU memory, try reducing the context length via max_model_len or the sequence count via max_num_seqs.
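To illustrate that advice, a hedged launch sketch with both knobs lowered; --max-model-len, --max-num-seqs, and --gpu-memory-utilization are standard vLLM CLI options, but the concrete values below are illustrative guesses rather than tuned recommendations:

python -m vllm.entrypoints.openai.api_server \
    --model=/models/Phi-3-vision-128k-instruct \
    --tensor-parallel-size=1 \
    --max-model-len=65536 \
    --max-num-seqs=16 \
    --gpu-memory-utilization=0.95   # default is 0.9; raising it is the other lever the error message mentions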

CatherineSue (Contributor, Author) commented:

Thanks for the explanation, @DarkLight1337!

ywang96 (Member) commented Jul 4, 2024

Just for future reference - the bug was discovered and fixed in #5888 and #5214.

We have also updated examples/phi3v_example.py. The current profiling strategy is rather conservative, but improving it is definitely part of the next milestone!

2U1 commented Jul 8, 2024

@ywang96 I get the same error even with max_num_seqs=1.

Is there some way to fix it?

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (4544). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

DarkLight1337 (Member) commented:

As stated in the error message, you may have to decrease max_model_len (e.g., 64k instead of 128k).

2U1 commented Jul 8, 2024

@DarkLight1337 Thanks, decreasing max_model_len solved the problem!
