[Bug]: Gemma model fails with GPTQ marlin #5088

Closed
arunpatala opened this issue May 28, 2024 · 6 comments
Labels: bug (Something isn't working)

Comments


arunpatala commented May 28, 2024

🐛 Describe the bug

Using Docker and a fine-tuned Gemma model, serving with

--model /data/merged_model_GPTQ --max-model-len 8192 --max-num-seqs 1024 --served-model-name model --quantization gptq_marlin

fails with

RuntimeError: Some weights are not initialized from checkpoints: {'model.layers.3.mlp.gate_up_proj.g_idx_sort_indices', 'model.layers.8.self_attn.qkv_proj.g_idx_sort_indices', 'model.layers.9.mlp.gate_up_proj.g_idx_sort_indices .....

The same setup works with --quantization gptq.
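
A minimal offline sketch of the same load path, assuming the local merged checkpoint above (any GPTQ-quantized Gemma checkpoint should behave the same way):

from vllm import LLM

# Loading with quantization="gptq_marlin" is expected to hit the same
# RuntimeError about uninitialized g_idx_sort_indices tensors, while
# quantization="gptq" loads and generates normally.
llm = LLM(
    model="/data/merged_model_GPTQ",  # local merged GPTQ checkpoint from the report
    max_model_len=8192,
    quantization="gptq_marlin",       # switch to "gptq" as a workaround
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)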

arunpatala added the bug label May 28, 2024
robertgshaw2-neuralmagic (Sponsor Collaborator) commented:

(quoting the bug report above)

@alexm-neuralmagic

@arunpatala Can you share the model checkpoint so we can take a look?

arunpatala (Author) commented May 29, 2024

volume=$HF_HOME
docker run --runtime nvidia --gpus all \
    -v $volume:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model TechxGenus/gemma-1.1-2b-it-GPTQ \
    --max-model-len 8192 \
    --max-num-seqs 32 \
    --quantization gptq_marlin

You can reproduce this with the above public model. The engine throws an error when either gptq_marlin or marlin is used as the quantization method:
RuntimeError: Some weights are not initialized from checkpoints: {'model.layers.6.self_attn.qkv_proj.g_idx_sort_indices',

The same model works with gptq as the quantization method:

volume=$HF_HOME
docker run --runtime nvidia --gpus all \
    -v $volume:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model TechxGenus/gemma-1.1-2b-it-GPTQ \
    --max-model-len 8192 \
    --max-num-seqs 32 \
    --quantization gptq
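
Once the server is up with gptq, a quick sanity check against the OpenAI-compatible endpoint, assuming the default port 8000 and no --served-model-name override (the served name then defaults to the --model value):

from openai import OpenAI

# Query the OpenAI-compatible server started by the docker command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="TechxGenus/gemma-1.1-2b-it-GPTQ",
    prompt="Hello, my name is",
    max_tokens=16,
)
print(completion.choices[0].text)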

robertgshaw2-neuralmagic (Sponsor Collaborator) commented May 29, 2024

Thanks

@alexm-neuralmagic can you take a look?

alexm-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this issue May 29, 2024
alexm-neuralmagic (Contributor) commented:

@arunpatala Here is the fix #5108 (will land soon)

arunpatala (Author) commented May 29, 2024

thanks a lot.

alexm-neuralmagic (Contributor) commented:

no problem

alexm-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this issue May 30, 2024
alexm-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this issue May 30, 2024
mgoin closed this as completed Jun 4, 2024
kzawora-intel added a commit to HabanaAI/vllm-fork that referenced this issue Jul 2, 2024
(The commit message lists the upstream vLLM changes rebased into the fork, including "add gptq_marlin test for bug report vllm-project#5088 (vllm-project#5145)".)
xjpang pushed a commit to xjpang/vllm that referenced this issue Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this issue Jul 24, 2024
zhaoyinglia added a commit to zhaoyinglia/FlagScale that referenced this issue Aug 1, 2024
d9b34bae [CI/Build] Add unit testing for FlexibleArgumentParser (#5798)
c18ebfdd [doc][distributed] add both gloo and nccl tests (#5834)
67882dbb [Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)
7b993143 [Misc] Remove useless code in cpu_worker (#5824)
2ce5d668 [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)
f23871e9 [Doc] Add notice about breaking changes to VLMs (#5818)
e9de9dd5 [ci] Remove aws template (#5757)
ba991d5c [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (#5795)
1744cc99 [Doc] Add Phi-3-medium to list of supported models (#5788)
e72dc6cb [Doc] Add "Suggest edit" button to doc pages (#5789)
c2462129 [doc][faq] add warning to download models for every nodes (#5783)
edd5fe5f [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772)
5d4d9053 [Distributed] Add send and recv helpers (#5719)
6c916ac8 [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
832ea88f [core][distributed] improve shared memory broadcast (#5754)
8c00f9c1 [Docs][TPU] Add installation tip for TPU (#5761)
0cbc1d2b [Bugfix] Fix pin_lora error in TPU executor (#5760)
ff9ddbce [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py (#5756)
9c62db07 [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710)
cf90ae01 [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616)
f5dda63e [LoRA] Add support for pinning lora adapters in the LRU cache (#5603)
71875073 [ci][test] fix ca test in main (#5746)
f1e72cc1 [BugFix] exclude version 1.15.0 for modelscope (#5668)
5b15bde5 [Doc] Documentation on supported hardware for quantization methods (#5745)
bd620b01 [Kernel][CPU] Add Quick `gelu` to CPU (#5717)
d9a252bc [Core][Distributed] add shm broadcast (#5399)
67005a07 [Bugfix] Add  fully sharded layer for QKVParallelLinearWithLora (#5665)
c35e4a3d [BugFix] Fix test_phi3v.py (#5725)
1f567421 [Kernel] Add punica dimension for Qwen2 LoRA (#5441)
b12518d3 [Model] MLPSpeculator speculative decoding support (#4947)
6c5b7af1 [distributed][misc] use fork by default for mp (#5669)
8065a7e2 [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718)
3f3b6b21 [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715)
a7dcc620 [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
ad137cd1 [Model] Port over CLIPVisionModel for VLMs (#5591)
111af1fa [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
1b2eaac3 [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors (#5703)
3730a1c8 [Misc] Improve conftest (#5681)
949e49a6 [ci] Limit num gpus if specified for A100 (#5694)
4a30d7e3 [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650)
e83db9e7 [Doc] Update docker references (#5614)
78687504 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)
d571ca01 [ci][distributed] add tests for custom allreduce (#5689)
afed90a0 [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688)
3ee5c4bc [ci] Add A100 queue into AWS CI template (#5648)
e9c2732b [CI/Build] Add tqdm to dependencies (#5680)
d8714530 [Misc]Add param max-model-len in benchmark_latency.py (#5629)
7d46c8d3 [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684)
da971ec7 [Model] Add FP8 kv cache for Qwen2 (#5656)
3eea7488 [misc][distributed] use 127.0.0.1 for single-node (#5619)
f758aed0 [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices (#5641)
e5150f2c [Bugfix] Added test for sampling repetition penalty bug. (#5659)
59a1eb59 [Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628)
6820724e [Bugfix] Fix w8a8 benchmarks for int8 case (#5643)
b23ce920 [Bugfix] Fix CUDA version check for mma warning suppression (#5642)
2bd231a7 [Doc] Added cerebrium as Integration option (#5553)
8a173382 [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties  (#5639)
07feecde [Model] LoRA support added for command-r (#5178)
19091efc [ci] Setup Release pipeline and build release wheels with cache (#5610)
95db455e [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
7879f24d [Misc] Add OpenTelemetry support (#4687)
13db4369 [ci] Deprecate original CI template (#5624)
4ad7b53e [CI/Build][Misc] Update Pytest Marker for VLMs (#5623)
f0cc0e68 [Misc] Remove import from transformers logging (#5625)
db5ec52a [bugfix][distributed] improve p2p capability test (#5612)
114d7270 [CI] Avoid naming different metrics with the same name in performance benchmark (#5615)
32c86e49 [Misc] Fix typo (#5618)
8eadcf0b [misc][typo] fix typo (#5620)
5002175e [Kernel] Add punica dimensions for Granite 13b (#5559)
daef218b [Model] Initialize Phi-3-vision support (#4986)
fa9e3852 [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131)
26e1188e [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606)
a3e8a05d [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142)
e441bad6 [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584)
1b44aaf4 [bugfix][distributed] fix 16 gpus local rank arrangement (#5604)
9e4e6fe2 [CI] the readability of benchmarking and prepare for dashboard (#5571)
ab66536d [CI/BUILD] Support non-AVX512 vLLM building and testing (#5574)
728c4c8a [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814)
1f12122b [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588)
890d8d96 [Kernel] `compressed-tensors` marlin 24 support (#5435)
9e74d9d0 Correct alignment in the seq_len diagram. (#5592)
9333fb8e [Model] Rename Phi3 rope scaling type (#5595)
e2b85cf8 Fix w8a8 benchmark and add Llama-3-8B (#5562)
845a3f26 [Doc] add debugging tips for crash and multi-node debugging (#5581)
f07d5133 [build][misc] limit numpy version (#5582)
4a676905 [CI][BugFix] Flip is_quant_method_supported condition (#5577)
f31c1f90 Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518)
3ce2c050 [Fix] Correct OpenAI batch response format (#5554)
1c0afa13 [BugFix] Don't start a Ray cluster when not using Ray (#5570)
d919ecc7 add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145)
e691918e [misc] Do not allow to use lora with chunked prefill. (#5538)
81fbb365 [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568)
0e9164b4 [mypy] Enable type checking for test directory (#5017)
1b8a0d71 [Core][Bugfix]: fix prefix caching for blockv2 (#5364)
bd7efe95 Add ccache to amd (#5555)
f5bb85b4 [Core][Distributed] improve p2p cache generation (#5528)
28c145eb [Bugfix] Fix typo in Pallas backend (#5558)
e2afb03c [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models  (#5460)
6e2527a7 [Doc] Update documentation on Tensorizer (#5471)
cdab68dc [Docs] Add ZhenFund as a Sponsor (#5548)
d1c3d7d1 [misc][distributed] fix benign error in `is_in_the_same_node` (#5512)
77490c6f [Core] Remove duplicate processing in async engine (#5525)
48f589e1 [mis] fix flaky test of test_cuda_device_count_stateless (#5546)
348616ac [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401)
15985680 [ Misc ] Rs/compressed tensors cleanup (#5432)
d74674bb [Misc] Fix arg names (#5524)
703475f6 [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516)
d47af2bc [CI/Build] Disable LLaVA-NeXT CPU test (#5529)
319ad7f1 [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073)
0f0d8bc0 bump version to v0.5.0.post1 (#5522)
55d6361b [Misc] Fix arg names in quantizer script (#5507)
cd9c0d65 [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452)
50eed24d Add `cuda_device_count_stateless` (#5473)
e38042d4 [Kernel] Disable CUTLASS kernels for fp8 (#5505)
33e3b372 [CI/Build] Disable test_fp8.py (#5508)
1696efe6 [misc] fix format.sh (#5511)
6b0511a5 Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478)
a8fda4f6 Seperate dev requirements into lint and test (#5474)
30299a41 [MISC] Remove FP8 warning (#5472)
85657b56 [Kernel] Factor out epilogues from cutlass kernels (#5391)
0ce7b952 [Doc] Update LLaVA docs (#5437)
39873476 [CI/Build] Simplify OpenAI server setup in tests (#5100)
03dccc88 [Misc] Add vLLM version getter to utils (#5098)
a65634d3 [Docs] Add 4th meetup slides (#5509)
80aa7e91 [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971)
bd439735 [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497)
23ec72fa [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466)
c2637a61 [Kernel] `w4a16` support for `compressed-tensors` (#5385)
88407532 [Bugfix]if the content is started with ":"(response of ping), client should i… (#5303)
916d219d [ci] Use sccache to build images (#5419)
ea3890a5 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293)
2135cacb [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451)
7d19de2e [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425)
94a07bbd [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470)
b8d4dfff [Doc] Update debug docs (#5438)
622d4512 [misc] add hint for AttributeError (#5462)
51602eef [Frontend] [Core] Support for sharded tensorized models (#4990)
5cc50a53 [Bugfix] TYPE_CHECKING for MultiModalData (#5444)
5985e342 [Kernel] Vectorized FP8 quantize kernel (#5396)
8b82a899 [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464)
c3c2903e [Bugfix] Add device assertion to TorchSDPA (#5402)
1a8bfd92 [Hardware] Initial TPU integration (#5292)
847cdcca [CI] Upgrade codespell version. (#5381)
e3c12bf6 Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)
3dd6853b [CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253)
8f89d720 [Doc] add common case for long waiting time (#5430)
99dac099 [Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
c4bd03c7 [Core][Distributed] add same-node detection (#5369)
dcbf4286 [Frontend] Customizable RoPE theta (#5197)
00e6a2dc [Bugfix] fix lora_dtype value type in arg_utils.py (#5398)
2e02311a [Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1 (#5254)
89ec06c3 [Docs] [Spec decode] Fix docs error in code example (#5427)
9fde251b [Doc] Add an automatic prefix caching section in vllm documentation (#5324)
4c2ffb28 [Speculative decoding] Initial spec decode docs (#5400)
246598a6 [CI] docfix (#5410)
8bab4959 [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389)
3c4cebf7 [Doc][Typo] Fixing Missing Comma (#5403)
d8f31f2f [Doc] add debugging tips (#5409)
640052b0 [Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026)
351d5e7b [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312)
a0086298 [Misc] Various simplifications and typing fixes (#5368)
76477a93 [ci] Fix Buildkite agent path (#5392)
77c87beb [Doc] Add documentation for FP8 W8A8 (#5388)
114332b8 Bump version to v0.5.0 (#5384)
cb77ad83 [Docs] Alphabetically sort sponsors (#5386)
856c9900 [Docs] Add Docs on Limitations of VLM Support (#5383)
c5602f0b [ci] Mount buildkite agent on Docker container to upload benchmark results (#5330)
f7f9c5f9 [ci] Use small_cpu_queue for doc build (#5331)
2c0d9335 [Bugfix] Fix LLaVA-NeXT (#5380)
774d1035 [Feature][Frontend]:  Continued `stream_options` implementation also in CompletionRequest (#5319)
6b29d6fe [Model] Initial support for LLaVA-NeXT (#4199)
0bfa1c4f [Misc] Improve error message when LoRA parsing fails (#5194)
c81da5f5 [misc][typo] fix typo (#5372)
68bc8170 [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (#5374)
5884c2b4 [Misc] Update to comply with the new `compressed-tensors` config (#5350)
45f92c00 [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164)
5467ac31 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)
5d7e3d01 [mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361)
0373e183 [Core][CUDA Graph] add output buffer for cudagraph (#5074)
c09dade2 [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353)
8ea5e44a [CI/Test] improve robustness of test (vllm_runner) (#5357)
9fb900f9 [CI/Test] improve robustness of test (hf_runner) (#5347)
c96fc067 [ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965)
b3376e5c [Misc] Add args for selecting distributed executor to benchmarks (#5335)
e69ded7d [Bug Fix] Fix the support check for FP8 CUTLASS  (#5352)
767c727a fix DbrxFusedNormAttention missing cache_config (#5340)
6840a716 [Misc] Remove unused cuda_utils.h in CPU backend (#5345)
7a9cb294 [Frontend] Add OpenAI Vision API Support (#5237)
ca3ea51b [Kernel] Dynamic Per-Token Activation Quantization (#5037)
dc49fb89 Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296)
18a277b5 Remove Ray health check (#4693)
8d75fe48 [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)
388596c9 [Misc][Utils] allow get_open_port to be called for multiple times (#5333)
baa15a9e [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135)
15063741 [Misc] Missing error message for custom ops import (#5282)
ccdc490d [Core] Change LoRA embedding sharding to support loading methods (#5038)
a31cab75 [Core] Avoid copying prompt/output tokens if no penalties are used (#5289)
828da0d4 [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300)
abe855d6 [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294)
4efff036 Bugfix: fix broken of download models from modelscope (#5233)
89c92078 [CI/Build] Update vision tests (#5307)
7b0a0dfb [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109)
3a6ae1d3 [CI] Disable flash_attn backend for spec decode (#5286)
8f1729b8 [Docs] Add Ray Summit CFP (#5295)
6a7c7711 [Misc] Skip for logits_scale == 1.0 (#5291)
0f83ddd4 [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290)
065aff6c [Bugfix] Make EngineArgs use named arguments for config construction (#5285)
3d33e372 [BugFix] Fix log message about default max model length (#5284)
faf71bcd [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252)
f270a395 [Docs] Add Sequoia as sponsors (#5287)
51a08e7d [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238)
eb8fcd26 [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207)
5563a4de [Model] Correct Mixtral FP8 checkpoint loading (#5231)
ccd4f129 [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157)
02cc3b51 [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263)
d5b1eb08 [CI] Add nightly benchmarks (#5260)
f0a50054 [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) (#5278)
c65146e7 [Misc] Fix docstring of get_attn_backend (#5271)
41ca62cf [Misc] Add CustomOp interface for device portability (#5255)
974fc9b8 [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226)
fee4dcc3 [Misc] update collect env (#5261)
650a4cc5 [Misc] Add transformers version to collect_env.py (#5259)
9ca62d86 [CI] mark AMD test as softfail to prevent blockage (#5256)
45c35f0d [CI/Build] Reducing CPU CI execution time (#5241)
9ba093b4 [CI/Build] Simplify model loading for `HfRunner` (#5251)
27208be6 [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
87d5abef [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (#5249)
ec784b25 [CI/Build] Add inputs tests (#5215)
a58f24e5 [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229)
f42a006b [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210)
3a434b07 [Kernel] Enhance MoE benchmarking & tuning script (#4921)
bd0e7802 [Bugfix] Add warmup for prefix caching example (#5235)
06b2550c [Bugfix] Support `prompt_logprobs==0` (#5217)
f775a07e [FRONTEND] OpenAI `tools` support named functions (#5032)
4f0d17c0 New CI template on AWS stack (#5110)
10c38e3e [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834)
cafb8e06 [CI/BUILD] enable intel queue for longer CPU tests (#4113)
cbb2f59c [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159)
0ab278ca [Core] Remove unnecessary copies in flash attn backend (#5138)
7a64d24a [Core] Support image processor (#4197)
dfbe60dc [Misc] Simplify code and fix type annotations in `conftest.py` (#5118)
a66cf40b [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927)
f790ad3c [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643)
ed59a7ed Update test_ignore_eos (#4898)
044793d8 [BugFix] Prevent `LLM.encode` for non-generation Models  (#5184)
c2d6d2f9 [Bugfix]: Fix issues related to prefix caching example (#5177) (#5180)
8279078e [Bugfix] Remove deprecated @abstractproperty (#5174)
b9c0605a [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)
37464a0f [Bugfix] Fix call to init_logger in openai server (#4765)
c3540728 [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py  (#5151)
f081c3ce [Kernel] Update Cutlass fp8 configs (#5144)
260d119e [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137)
a360ff80 [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034)
1197e021 [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168)
65757911 [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171)
e9899fb7 [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039)
a377f0bd [Misc]: optimize eager mode host time (#4196)
e9d3aa04 Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149)
a22dea54 [Model] Support MAP-NEO model (#5081)
533c2177 Fix cutlass sm_90a vesrion in CMakeList
6d21fa1c [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136)
b35be540 [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120)
45a1a69b [Build] Disable sm_90a in cu11 (#5141)
87a658c8 Bump version to v0.4.3 (#5046)
429d8972 add doc about serving option on dstack (#3074)
a9bcc7af [Doc] Use intersphinx and update entrypoints docs (#5125)
d79d9eaa [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py (#5129)
f758505c [CI/Build] increase wheel size limit to 200 MB (#5130)
d910816c [Bugfix] Automatically Detect SparseML models (#5119)
87d41c84 [BUGFIX] [FRONTEND] Correct chat logprobs (#5029)
e07aff9e [CI/Build] Docker cleanup functionality for amd servers  (#5112)
5bf185a1 [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108)
4fbcb0f2 [Doc][Build] update after removing vllm-nccl (#5103)
7c3604fb [Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031)
b1c25563 [Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099)
eb6c50cd [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` (#5097)
eecd8643 [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096)
ae495c74 [Doc]Replace deprecated flag in readme (#4526)
4238bc82 [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837)
594392d2 [Core][Distributed] improve p2p access check (#4992)
18c1f16d [Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092)
5bd3c650 [Core][Optimization] remove vllm-nccl (#5091)
616e600e [Misc] add gpu_memory_utilization arg (#5079)
dfba529b [Bugfix] Remove the last EOS token unless explicitly specified (#5077)
5ae5ed1e [Core] Consolidate prompt arguments to LLM engines (#4328)
290f4ada [Docs] Add Dropbox as sponsors (#5089)
dd8de11f [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
9ba41558 [BugFix] Fix Embedding Models with TP>1 (#5075)
d4f39859 [Core] Sliding window for block manager v2 (#4545)
890aa93d [Model] Add support for falcon-11B (#5069)
fbdb7b3e [Core] Allow AQLM on Pascal (#5058)
1102bef2 [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)

git-subtree-dir: vllm
git-subtree-split: c8a7e93273ff4338d6f89f8a63ff16426ac240b8
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this issue Sep 6, 2024