
[Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) #5976

Closed
pseudotensor opened this issue Jun 28, 2024 · 5 comments · Fixed by #5981
Labels: bug (Something isn't working)

@pseudotensor

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Same launch setup as #5969.

The only difference is hash 2cd402e (latest main as of earlier today).

The GPU is totally free, so this is a new bug in vLLM introduced between the e9de9dd and 2cd402e hashes.

INFO 06-28 23:40:03 api_server.py:206] vLLM API server version 0.5.0.post1
INFO 06-28 23:40:03 api_server.py:207] args: Namespace(host='0.0.0.0', port=5063, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, respon>
INFO 06-28 23:40:03 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:05 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-28 23:40:06 model_runner.py:220] Loading model weights took 7.7732 GB
/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast>
  warnings.warn(
INFO 06-28 23:40:14 gpu_executor.py:83] # GPU blocks: 3184, # CPU blocks: 682
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 225, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 425, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 359, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 500, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 246, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 342, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 86, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 207, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 344, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

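For reference, the error message itself names two mitigations. Below is a minimal sketch of a launch command applying them; the flag values (a 50944-token context cap matching the reported KV-cache capacity, and a 0.95 memory utilization) are illustrative assumptions, not taken from this report.

```bash
# Sketch only: cap the context length so it fits in the available KV cache,
# and/or raise GPU memory utilization (default 0.9). Values are examples.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --max-model-len 50944 \
    --gpu-memory-utilization 0.95
```
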
@pseudotensor pseudotensor added the bug Something isn't working label Jun 28, 2024
@pseudotensor
Author

Basically, something is wrong now that was OK before. I can't even run phi-3-vision on an 80GB H100 now.

@DarkLight1337
Member

DarkLight1337 commented Jun 29, 2024

Hi, thanks for the report!

Can you try reverting to 96354d6 (right before 2061f0b)? I believe #5888 may be causing the issue.
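A hedged sketch of how one might check out that commit, assuming a local source checkout of vLLM; the editable reinstall step is the usual from-source workflow rather than something specified in this thread:

```bash
# Sketch only: pin the repo to the suggested commit and rebuild from source.
cd vllm
git checkout 96354d6
pip install -e .   # editable install; may take a while to compile
```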

@DarkLight1337 DarkLight1337 changed the title [Bug]: New bug in last few days for phi-3. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) [Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) Jun 29, 2024
@ywang96
Member

ywang96 commented Jun 29, 2024

Hi @pseudotensor! This is in fact not a bug, but a fix for a previous bug in the initial Phi-3 PR: during memory profiling, the image payload was always None instead of actual pixel values, leading to an over-estimation of the space available for KV blocks, which could OOM the server under max load. (The fixed profiling is itself conservative, but we would rather keep it that way for now than leave the possibility of crashing the server.)

If you limit --max-num-seqs to a lower number (I've tested on an H100 that it can go up to 17), you should still be able to launch the server with the full context length.
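A minimal sketch of that suggestion, reusing the model, host, and port visible in the log above; the value 16 is just an example at or below the reported limit:

```bash
# Sketch only: cap concurrent sequences so profiling fits the full 128k context.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --host 0.0.0.0 --port 5063 \
    --max-num-seqs 16
```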

@ywang96
Member

ywang96 commented Jun 29, 2024

I've also opened #5981 to avoid this confusion.

@pseudotensor
Author

OK, I've misunderstood max_num_seqs then. I thought it was a maximum, not a required limit. I would have expected the context length to take precedence over the number of sequences, with the number of sequences automatically reduced to accommodate my chosen context length.
