
[Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) #5976

Closed
pseudotensor opened this issue Jun 28, 2024 · 5 comments · Fixed by #5981
Labels: bug (Something isn't working)

@pseudotensor

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Same launch setup as #5969.

The only difference is hash 2cd402e (latest main as of earlier today).

The GPU is totally free, so this is a new bug in vLLM introduced between the e9de9dd and 2cd402e hashes.

INFO 06-28 23:40:03 api_server.py:206] vLLM API server version 0.5.0.post1
INFO 06-28 23:40:03 api_server.py:207] args: Namespace(host='0.0.0.0', port=5063, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, respon>
INFO 06-28 23:40:03 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:05 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-28 23:40:06 model_runner.py:220] Loading model weights took 7.7732 GB
/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast>
  warnings.warn(
INFO 06-28 23:40:14 gpu_executor.py:83] # GPU blocks: 3184, # CPU blocks: 682
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 225, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 425, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 359, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 500, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 246, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 342, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 86, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 207, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 344, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

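For reference, the error message itself names two mitigations. Below is a minimal sketch of a launch command applying them; the flag values (a 50944-token context cap matching the reported KV-cache capacity, and a 0.95 memory utilization) are illustrative assumptions, not taken from this report.

```bash
# Sketch only: cap the context length so it fits in the available KV cache,
# and/or raise GPU memory utilization (default 0.9). Values are examples.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --max-model-len 50944 \
    --gpu-memory-utilization 0.95
```
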
@pseudotensor pseudotensor added the bug Something isn't working label Jun 28, 2024
@pseudotensor
Author

Basically, something is wrong now that was OK before. I can't even run phi-3-vision on an 80GB H100 now.

@DarkLight1337
Member

DarkLight1337 commented Jun 29, 2024

Hi, thanks for the report!

Can you try reverting to 96354d6 (right before 2061f0b)? I believe #5888 may be causing the issue.
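A hedged sketch of how one might check out that commit, assuming a local source checkout of vLLM; the editable reinstall step is the usual from-source workflow rather than something specified in this thread:

```bash
# Sketch only: pin the repo to the suggested commit and rebuild from source.
cd vllm
git checkout 96354d6
pip install -e .   # editable install; may take a while to compile
```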

@DarkLight1337 DarkLight1337 changed the title [Bug]: New bug in last few days for phi-3. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) [Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) Jun 29, 2024
@ywang96
Member

ywang96 commented Jun 29, 2024

Hi @pseudotensor! This is in fact not a bug, but a fix for a previous bug in the initial Phi-3 PR: during memory profiling, the image payload was always None instead of actual pixel values, leading to an over-estimation of the space available for KV blocks, which could OOM the server under max load. (The fixed profiling is itself conservative, but we would rather keep it that way for now than leave the possibility of crashing the server.)

If you limit --max-num-seqs to a lower number (I've tested on an H100 that it can go up to 17), you should still be able to launch the server with the full context length.
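A minimal sketch of that suggestion, reusing the model, host, and port visible in the log above; the value 16 is just an example at or below the reported limit:

```bash
# Sketch only: cap concurrent sequences so profiling fits the full 128k context.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-vision-128k-instruct \
    --trust-remote-code \
    --host 0.0.0.0 --port 5063 \
    --max-num-seqs 16
```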

@ywang96
Member

ywang96 commented Jun 29, 2024

I've also opened #5981 to avoid this confusion.

@pseudotensor
Author

OK, I've misunderstood max_num_seqs then. I thought it was a maximum, not a required limit. I would have expected the context length to take precedence over the number of sequences, with the number of sequences automatically reduced to accommodate my chosen context length.
