
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel #8299

Merged (6 commits) on Sep 11, 2024

Conversation

@Isotr0py (Collaborator) commented Sep 9, 2024


FIX #8275

  • For InternVL2 with PP, we only need to process image inputs on the first rank.
  • This PR fixes the error raised by image input processing on the other ranks (see the sketch after this list).
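A minimal sketch of the idea, not the actual diff: the helper names (_parse_and_validate_image_input, _merge_vision_embeddings) are assumptions for illustration, while get_pp_group().is_first_rank is the real vLLM utility for identifying the first pipeline rank.

# Hedged sketch only: on non-first pipeline ranks the model receives hidden
# states from the previous stage, so multimodal parsing must be skipped there.
from typing import Optional

import torch
from vllm.distributed import get_pp_group
from vllm.sequence import IntermediateTensors


def forward(self,
            input_ids: torch.Tensor,
            intermediate_tensors: Optional[IntermediateTensors] = None,
            **kwargs) -> torch.Tensor:
    if get_pp_group().is_first_rank:
        # First rank only: parse pixel values and splice the vision
        # embeddings into the text embeddings (hypothetical helper names).
        image_input = self._parse_and_validate_image_input(**kwargs)
        inputs_embeds = self._merge_vision_embeddings(input_ids, image_input)
        hidden_states = self.language_model(inputs_embeds=inputs_embeds)
    else:
        # Later ranks: consume hidden states from the previous stage and
        # ignore any forwarded image kwargs, whose re-parsing is what
        # previously raised the error.
        hidden_states = self.language_model(
            intermediate_tensors=intermediate_tensors)
    return hidden_states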



PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.).
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and may not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

github-actions bot commented Sep 9, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR.
  • Enable auto-merge.

🚀

@Isotr0py (Collaborator, Author) commented Sep 9, 2024

@Manikandan-Thangaraj-ZS0321 Can you check whether this fix works on the 40B model?

I have verified that it works on the 4B model, but I don't have the environment to run the 40B one.

@Isotr0py Isotr0py changed the title [Bugfix] Fix InternVL2 pipeline parallel inference [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel Sep 9, 2024
@Manikandan-Thangaraj-ZS0321 (Contributor) commented

Hi @Isotr0py,
Thanks a lot, I was able to run the 40B model in a multi-node setup with these changes:

vllm serve OpenGVLab/InternVL2-40B --tensor-parallel-size 1 --pipeline-parallel-size 5 --dtype bfloat16 --gpu-memory-utilization 0.9 --max-model-len 100 --enforce-eager --trust-remote-code --tokenizer-mode "auto" --cpu-offload-gb 10

INFO 09-09 10:10:28 api_server.py:495] vLLM API server version 0.6.0 
INFO 09-09 10:10:28 api_server.py:496] args: Namespace(model_tag='OpenGVLab/InternVL2-40B', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='OpenGVLab/InternVL2-40B', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=100, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=5, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=10.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16,lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, 
model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7f44beb0e200>)  
INFO 09-09 10:10:29 api_server.py:162] Multiprocessing frontend to use ipc:///tmp/860dfe7a-9aa5-43a7-920a-a88191ca0b0a for RPC Path. 
INFO 09-09 10:10:29 api_server.py:178] Started engine process with PID 1192
INFO 09-09 10:10:35 config.py:896] Defaulting to use ray for distributed inference  
WARNING 09-09 10:10:35 config.py:364] Async output processing can not be enabled with pipeline parallel 
2024-09-09 10:10:35,825 INFO worker.py:1598 -- Connecting to existing Ray cluster at address: 172.18.10.239:6380...
2024-09-09 10:10:35,840 INFO worker.py:1783 -- Connected to Ray cluster.
INFO 09-09 10:10:36 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='OpenGVLab/InternVL2-40B', speculative_config=None, tokenizer='OpenGVLab/InternVL2-40B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=100, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=5, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False),seed=0, served_model_name=OpenGVLab/InternVL2-40B, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
INFO 09-09 10:10:37 ray_gpu_executor.py:134] use_ray_spmd_worker: False
INFO 09-09 10:11:13 utils.py:977] Found nccl from library libnccl.so.2
INFO 09-09 10:11:13 pynccl.py:63] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=339, ip=172.18.10.238) INFO 09-09 10:11:13 utils.py:977] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=339, ip=172.18.10.238) INFO 09-09 10:11:13 pynccl.py:63] vLLM is using nccl==2.20.5   
INFO 09-09 10:11:13 model_runner.py:915] Starting to load model OpenGVLab/InternVL2-40B...
(RayWorkerWrapper pid=339, ip=172.18.10.238) INFO 09-09 10:11:13 model_runner.py:915] Starting to load model OpenGVLab/InternVL2-40B...
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd") 
(RayWorkerWrapper pid=339, ip=172.18.10.238) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.  
(RayWorkerWrapper pid=339, ip=172.18.10.238) @torch.library.impl_abstract("xformers_flash::flash_fwd")
(RayWorkerWrapper pid=339, ip=172.18.10.238) /usr/local/lib/python3.12/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(RayWorkerWrapper pid=339, ip=172.18.10.238)  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(RayWorkerWrapper pid=339, ip=172.18.10.238) INFO 09-09 10:11:21 weight_utils.py:235] Using model weights format ['*.safetensors']
(RayWorkerWrapper pid=337, ip=172.18.10.240) INFO 09-09 10:11:13 utils.py:977] Found nccl from library libnccl.so.2 [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.) 
(RayWorkerWrapper pid=337, ip=172.18.10.240) INFO 09-09 10:11:13 pynccl.py:63] vLLM is using nccl==2.20.5 [repeated 3x across cluster]
(RayWorkerWrapper pid=337, ip=172.18.10.240) INFO 09-09 10:11:13 model_runner.py:915] Starting to load model OpenGVLab/InternVL2-40B... [repeated 3x across cluster]
INFO 09-09 10:11:21 weight_utils.py:235] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/17 [00:00<?, ?its]
Loading safetensors checkpoint shards:  24% Completed | 4/17 [00:01<00:03,  3.99it/s] 
Loading safetensors checkpoint shards:  47% Completed | 8/17 [00:01<00:02,  4.20it/s]
Loading safetensors checkpoint shards:  82% Completed | 14/17 [00:02<00:00,  7.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 17/17 [00:02<00:00,  7.94it/s] 
(RayWorkerWrapper pid=339, ip=172.18.10.238) INFO 09-09 10:11:28 model_runner.py:926] Loading model weights took 12.9584 GB
(RayWorkerWrapper pid=848, ip=172.18.10.241) INFO 09-09 10:11:26 weight_utils.py:235] Using model weights format ['*.safetensors'] [repeated 2x across cluster]
INFO 09-09 10:11:29 model_runner.py:926] Loading model weights took 13.8138 GB 
(RayWorkerWrapper pid=773, ip=172.18.10.241) INFO 09-09 10:11:34 weight_utils.py:235] Using model weights format ['*.safetensors']
(RayWorkerWrapper pid=337, ip=172.18.10.240) INFO 09-09 10:11:36 model_runner.py:926] Loading model weights took 12.9584 GB  
(RayWorkerWrapper pid=773, ip=172.18.10.241) INFO 09-09 10:11:44 model_runner.py:926] Loading model weights took 12.9584 GB [repeated 2x across cluster]
INFO 09-09 10:11:51 distributed_gpu_executor.py:57] # GPU blocks: 6502, # CPU blocks: 5461
INFO 09-09 10:11:58 api_server.py:226] vLLM to use /tmp/tmpcdxqbrea as PROMETHEUS_MULTIPROC_DIR
WARNING 09-09 10:11:58 serving_embedding.py:190] embedding_mode is False. Embedding API will not work.
INFO 09-09 10:11:58 launcher.py:20] Available routes are:
INFO 09-09 10:11:58 launcher.py:28] Route: /openapi.json, Methods: GET, HEAD
INFO 09-09 10:11:58 launcher.py:28] Route: /docs, Methods: GET, HEAD
INFO 09-09 10:11:58 launcher.py:28] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 09-09 10:11:58 launcher.py:28] Route: /redoc, Methods: GET, HEAD
INFO 09-09 10:11:58 launcher.py:28] Route: /health, Methods:GET
INFO 09-09 10:11:58 launcher.py:28] Route: /tokenize, Methods:POST
INFO 09-09 10:11:58 launcher.py:28] Route: /detokenize, Methods:POST
INFO 09-09 10:11:58 launcher.py:28] Route: /v1/models, Methods:GET
INFO 09-09 10:11:58 launcher.py:28] Route: /version, Methods:GET
INFO 09-09 10:11:58 launcher.py:28] Route: /v1/chat/completions, Methods:POST
INFO 09-09 10:11:58 launcher.py:28] Route: /v1/completions, Methods:POST
INFO 09-09 10:11:58 launcher.py:28] Route: /v1/embeddings, Methods:POST
INFO 09-09 10:11:58 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO:     Started server process [1172]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)  
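For reference, a request along these lines exercises the vision path that this PR fixes. This is a hypothetical client sketch, not from the original report: the image URL and prompt are placeholders, and it assumes the openai Python package pointed at the server started above.

# Hypothetical client sketch; placeholders marked in comments.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2-40B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Placeholder URL; any reachable image works.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)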

@Isotr0py (Collaborator, Author) commented

@DarkLight1337 This fix has been confirmed to work. PTAL :)

@DarkLight1337 (Member) commented

Can you merge from main to fix the failing CI?

@DarkLight1337 (Member) left a review comment

Otherwise LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 10, 2024 07:34
@github-actions github-actions bot added the "ready" label (ONLY add when PR is ready to merge / full CI is needed) Sep 10, 2024
@DarkLight1337 (Member) commented

The PP tests are consistently failing in CI.

auto-merge was automatically disabled September 10, 2024 13:55

Head branch was pushed to by a user without write access

@Isotr0py (Collaborator, Author) commented

The single-node PP test failure is caused by an OOM on the 4B model, which can be fixed by limiting max_model_len (see the sketch after the traceback below). However, I'm not sure what is causing the 2-node test to fail, because it raised a TimeoutError:

[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:960] Engine iteration timed out. This should never happen!
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] Engine background task failed
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     done, _ = await asyncio.wait(
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]               ^^^^^^^^^^^^^^^^^^^
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     return await _wait(fs, timeout, return_when, loop)
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     await waiter
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] asyncio.exceptions.CancelledError
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] The above exception was the direct cause of the following exception:
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     return_value = task.result()
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]                    ^^^^^^^^^^^^^
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63]     raise TimeoutError from exc_val
[2024-09-10T08:55:24Z] ERROR 09-10 01:55:24 async_llm_engine.py:63] TimeoutError
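As for the single-node OOM mentioned above, the fix amounts to capping max_model_len in the test arguments. A hypothetical sketch only; the actual test lives in tests/distributed/test_pipeline_parallel.py and the values here are illustrative:

# Hypothetical sketch; argument values are illustrative, not the real test's.
pp_args = [
    "--pipeline-parallel-size", "2",
    "--max-model-len", "2048",  # cap context length to shrink the KV cache
    "--trust-remote-code",
    "--enforce-eager",
]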

@DarkLight1337 (Member) commented

Hmm, based on the logs:

[2024-09-10T16:59:57Z] The above exception was the direct cause of the following exception:
[2024-09-10T16:59:57Z]
[2024-09-10T16:59:57Z] Traceback (most recent call last):
[2024-09-10T16:59:57Z]   File "/vllm-workspace/tests/utils.py", line 418, in wrapper
[2024-09-10T16:59:57Z]     f(*args, **kwargs)
[2024-09-10T16:59:57Z]   File "/vllm-workspace/tests/distributed/test_pipeline_parallel.py", line 101, in test_compare_tp
[2024-09-10T16:59:57Z]     compare_two_settings(MODEL_NAME, pp_args, tp_args, pp_env)
[2024-09-10T16:59:57Z]   File "/vllm-workspace/tests/utils.py", line 192, in compare_two_settings
[2024-09-10T16:59:57Z]     with RemoteOpenAIServer(model,
[2024-09-10T16:59:57Z]          ^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:59:57Z]   File "/vllm-workspace/tests/utils.py", line 111, in __init__
[2024-09-10T16:59:57Z]     self._wait_for_server(url=self.url_for("health"),
[2024-09-10T16:59:57Z]   File "/vllm-workspace/tests/utils.py", line 139, in _wait_for_server
[2024-09-10T16:59:57Z]     raise RuntimeError(
[2024-09-10T16:59:57Z] RuntimeError: Server failed to start in time.
[2024-09-10T16:59:57Z] *** SIGTERM received at time=1725987597 on cpu 35 ***
[2024-09-10T16:59:57Z] PC: @     0x7f7024a5ccd7  (unknown)  __pthread_clockjoin_ex
[2024-09-10T16:59:57Z]     @     0x7f7024ab9090  (unknown)  (unknown)
[2024-09-10T16:59:57Z] [2024-09-10 09:59:57,240 E 8980 8980] logging.cc:440: *** SIGTERM received at time=1725987597 on cpu 35 ***
[2024-09-10T16:59:57Z] [2024-09-10 09:59:57,240 E 8980 8980] logging.cc:440: PC: @     0x7f7024a5ccd7  (unknown)  __pthread_clockjoin_ex
[2024-09-10T16:59:57Z] [2024-09-10 09:59:57,240 E 8980 8980] logging.cc:440:     @     0x7f7024ab9090  (unknown)  (unknown)
[2024-09-10T16:59:57Z] Fork a new process to run a test 8872
[2024-09-10T16:59:57Z] FAILED

It looks like the process got terminated somehow. Any idea about this, @youkaichao?

@youkaichao (Member) commented

I think this might be the root cause:

[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:960] Engine iteration timed out. This should never happen!
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Engine background task failed
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] done, _ = await asyncio.wait(
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] return await _wait(fs, timeout, return_when, loop)
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] await waiter
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] asyncio.exceptions.CancelledError
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] The above exception was the direct cause of the following exception:
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] return_value = task.result()
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] ^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] raise TimeoutError from exc_val
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] TimeoutError
[2024-09-10T16:55:45Z] Exception in callback _log_task_completion(error_callback=>)(<Task finishe...imeoutError()>) at /usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py:43

@DarkLight1337 (Member) commented Sep 11, 2024

Looking earlier into the logs:

[2024-09-10T16:54:45Z] INFO 09-10 09:54:45 async_llm_engine.py:206] Added request cmpl-5feb07c5dbb04a22b82398a08211337e-0.
[2024-09-10T16:54:45Z] INFO 09-10 09:54:45 async_llm_engine.py:206] Added request cmpl-5feb07c5dbb04a22b82398a08211337e-1.
[2024-09-10T16:54:51Z] INFO 09-10 09:54:51 metrics.py:351] Avg prompt throughput: 5.4 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:01Z] INFO 09-10 09:55:01 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:11Z] INFO 09-10 09:55:11 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:21Z] INFO 09-10 09:55:21 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:31Z] INFO 09-10 09:55:31 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:41Z] INFO 09-10 09:55:41 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:960] Engine iteration timed out. This should never happen!
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Engine background task failed
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 933, in run_engine_loop
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     done, _ = await asyncio.wait(
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]               ^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/tasks.py", line 464, in wait
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     return await _wait(fs, timeout, return_when, loop)
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/tasks.py", line 550, in _wait
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     await waiter
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] asyncio.exceptions.CancelledError
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] The above exception was the direct cause of the following exception:
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] Traceback (most recent call last):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     return_value = task.result()
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]                    ^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 932, in run_engine_loop
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]   File "/usr/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63]     raise TimeoutError from exc_val
[2024-09-10T16:55:45Z] ERROR 09-10 09:55:45 async_llm_engine.py:63] TimeoutError

It seems that enabling PP in this case causes the engine to time out.
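For context, the traceback shows each engine iteration wrapped in an asyncio timeout, so a hang anywhere in the pipeline surfaces as this TimeoutError rather than a silent stall. A simplified sketch reconstructed from the traceback follows; the timeout value is illustrative, not vLLM's actual setting, and the real run_engine_loop does more than this.

# Simplified sketch of the timeout pattern seen in the traceback above.
import asyncio

ENGINE_ITERATION_TIMEOUT_S = 60  # illustrative value


async def run_engine_loop(engine_step):
    while True:
        async with asyncio.timeout(ENGINE_ITERATION_TIMEOUT_S):
            # If a step never completes (e.g., one PP rank is stuck waiting
            # on a peer), asyncio cancels it here and raises TimeoutError.
            await engine_step()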

@DarkLight1337 (Member) commented Sep 11, 2024

Hmm, it passes now. Maybe it was just a network issue in the communication between nodes.

@DarkLight1337 DarkLight1337 merged commit 1230263 into vllm-project:main Sep 11, 2024
51 checks passed
@Isotr0py Isotr0py deleted the fix-internvl-pp branch September 11, 2024 02:20
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024
MengqingCao pushed a commit to MengqingCao/vllm that referenced this pull request Sep 30, 2024
MengqingCao added a commit to MengqingCao/vllm that referenced this pull request Oct 10, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
4 participants