[Roadmap] vLLM Roadmap Q3 2024 #5805

simon-mo · 2024-06-25T00:08:09Z

Anything you want to discuss about vllm.

This document includes the features in vLLM's roadmap for Q3 2024. Please feel free to discuss and contribute, as this roadmap is shaped by the vLLM community.

Themes.

As before, we categorized our roadmap into 6 broad themes:

Broad model support: vLLM should support a wide range of transformer based models. It should be kept up to date as much as possible. This includes new auto-regressive decoder models, encoder-decoder models, hybrid architectures, and models supporting multi-modal inputs.
Excellent hardware coverage: vLLM should run on a wide range of accelerators for production AI workload. This includes GPUs, tensor accelerators, and CPUs. We will work closely with hardware vendors to ensure vLLM utilizes the greatest performance out of the chip.
Performance optimization:vLLM should be kept up to date with the latest performance optimization techniques. Users of vLLM can trust its performance to be competitive and strong.
Production level engine: vLLM should be the go-to choice for production level serving engine with a suite of features bridging the gaps from single forward pass to 24/7 service.
Strong OSS product: vLLM is and will be a true community project. We want it to be a healthy project with regular release cadence, good documentation, and adding new reviewers to the codebase.
Extensible architectures: For vLLM to grow at an even faster pace, it needs good abstractions to support a wide range of scheduling policies, hardware backends, and inference optimizations. We will work on refactoring the codebase to support that.

Broad Model Support

Support Large Models (Arctic, Nemotron4, Llama3 400B+ when released)
- Via Pipeline Parallelism [Core] Pipeline Parallel Support #4412
- Via FP8
New Attention Mechanism (Jamba, Phi3-Small, etc)
Encoder Decoder ([Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) #4837, [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) #4888, [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) #4942)
Multi-Modal [RFC]: Multi-modality Support Refactoring #4194

Help wanted:

Whisper and the audio API
Arbitrary HF model
Chameleon ([Model] Initial Support for Chameleon #5770)
Multi token prediction
Reward model API
Embedding Model Expansion (Bert, XLMRoberta) ([Model] Bert Embedding Model #5447)

Hardware Support

A feature matrix for all the hardware that vLLM supports, and their maturity level
Enhanced performance benchmark across hardwares
Expanding features support on various hardwares
- PagedAttention and Chunked Prefill on Inferentia
- Chunked Prefill on Intel CPU/GPU
- PagedAttention on Intel Gaudi
- TP and INT8 on TPU
- Bug fixes and GEMM tuning on AMD GPUs

Performance Optimizations

Production Features

Help wanted

Support multiple models in the same server
[Feedback wanted] Disaggregated prefill: please discuss with us your use case and in what scenario it is preferred over chunked prefill.

OSS Community

Reproducible performance benchmark on realistic workload
CI enhancements
Release process: minimize breaking changes and include deprecations

Help wanted

Documentation enhancements in general (styling, UI, explainers, tutorials, examples, etc)

Extensible Architecture

KV cache transfer [RFC]: Implement disaggregated prefilling via KV cache transfer #5557
Distributed execution [RFC]: A Flexible Architecture for Distributed Inference #5775
Improvements to scheduler and memory manager supporting new attention mechanisms
Performance enhancement for multi-modal processing

If any of the item you wanted is not on the roadmap, your suggestion and contribution is still welcomed! Please feel free to comment in this thread, open feature request, or create an RFC.

Jeffwan · 2024-06-25T01:07:41Z

Support multiple models in the same server

Does vLLM need the multi-model support similar like what FastChat does or something else?

CSEEduanyu · 2024-06-25T02:11:15Z

#2809 hello,how about this？

jeejeelee · 2024-06-26T15:09:54Z

Hi, the issues were mentioned in #5036 and should be taken into account.

MeJerry215 · 2024-06-27T06:49:32Z

Will vLLM use Triton more to optimize operators' performance in future, or will it consider using the torch.compile mechanism more?

And are there any plans for this?

ashim-mahara · 2024-06-27T19:45:36Z

Hi! Is there or will there be support for the OpenAI Batch API ?

huseinzol05 · 2024-06-28T11:22:27Z

I am doing for Whisper, my fork at https://github.com/mesolitica/vllm-whisper, the frontend later should compatible with OpenAI API plus able to stream output tokens, few hiccups, still trying to figure out based on T5 branch,

vllm/vllm/model_executor/layers/enc_dec_attention.py

Line 83 in 9f20ccf

out = xops.memory_efficient_attention_forward(

still try to figure out kv cache for Encoder hidden state or else each steps will recompute Encoder hidden state.
No non causal attention for Encoder and Cross Attention in Decoder, seems like all attention implementation in VLLM is for causal
Reuse KV Cache Cross Attention from the first step for the next steps.

huseinzol05 · 2024-06-28T14:44:30Z

Able to load and infer, https://github.com/mesolitica/vllm-whisper/blob/main/examples/whisper_example.py, but the output is still trash, might be bugs related to weights or the attention, still debugging

jkl375 · 2024-07-01T10:19:29Z

Do you have plans to support Ascend 910B in the future?

hibukipanim · 2024-07-03T08:08:33Z

Please consider prioritizing dynamic / just-in-time 8-bit quantization like EETQ which don't require offline quantization step.
In example a current advantage of TGI is that you can load an original 16-bit hf model as int8 by just passing the --quantize eetq arg. AFAIK It's custom kernels handle outliers in higher precision during runtime, allowing it loose very little precision.

Previous mention in issues: #3261 (comment)
PR for it was opened but eventually closed: #3614

tutu329 · 2024-07-09T00:43:56Z

deepseek-v2 and deepseek-coder-v2 are supported now. but awq or gptq version are not supported so these model are still not usable due to their huge 236B.

also MLA(Multihead Latent Attention) of there model is not supported yet.

amritap-ef · 2024-07-11T08:17:29Z

Support for DoLa would be great!

robertgshaw2-neuralmagic · 2024-07-12T13:41:15Z

Please consider prioritizing dynamic / just-in-time 8-bit quantization like EETQ which don't require offline quantization step. In example a current advantage of TGI is that you can load an original 16-bit hf model as int8 by just passing the --quantize eetq arg. AFAIK It's custom kernels handle outliers in higher precision during runtime, allowing it loose very little precision.

Previous mention in issues: #3261 (comment) PR for it was opened but eventually closed: #3614

Have you tried fp8 marlin? Run with --quantization fp8 and we will quantize the weights to fp8 in place. This will be faster and more accurate than eetq [note: requires ampere +]

kaifronsdal · 2024-07-13T23:45:29Z

Please consider supporting transformer-based value models such as in the vllm fork https://github.com/MARIO-Math-Reasoning/vllm and the huggingface implementation https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead. The only thing that changes is adding a head to the end of the model to predict a value instead of logits. This would be a powerful addition to support very fast generation search and open up the possibility of more effective methods such as MCTS compared to traditional prompt based approaches such as self-consistency, CoT, ToT, etc.

haichuan1221 · 2024-07-14T01:42:30Z

Please consider supporting transformer-based value models such as in the vllm fork https://github.com/MARIO-Math-Reasoning/vllm and the huggingface implementation https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead. The only thing that changes is adding a head to the end of the model to predict a value instead of logits. This would be a powerful addition to support very fast generation search and open up the possibility of more effective methods such as MCTS compared to traditional prompt based approaches such as self-consistency, CoT, ToT, etc.

Thank you for your nice contribution! I wonder whether it is possible for you to fork a branch from vllm instead of creating new one so that anyone can see what changes in new contribution?

hibukipanim · 2024-07-14T13:04:54Z

Have you tried fp8 marlin? Run with --quantization fp8 and we will quantize the weights to fp8 in place. This will be faster and more accurate than eetq [note: requires ampere +]

yes thanks @robertgshaw2-neuralmagic, was trying it in recent days and it does look promising. happy to hear you believe it's more accurate than EETQ. I can confirm that Llama-70B-Instruct got almost same MMLU score with fp8 (80.56 vs 80.7).

Would be great if it could load and quant the layers iteratively, as now if the 16bit model can't fit in the GPU, we have to quant it offline first. But the fact there is an option to do "dynamic" quant without calibration data is great. thanks for this

robertgshaw2-neuralmagic · 2024-07-14T13:14:56Z

Have you tried fp8 marlin? Run with --quantization fp8 and we will quantize the weights to fp8 in place. This will be faster and more accurate than eetq [note: requires ampere +]

yes thanks @robertgshaw2-neuralmagic, was trying it in recent days and it does look promising. happy to hear you believe it's more accurate than EETQ. I can confirm that Llama-70B-Instruct got almost same MMLU score with fp8 (80.56 vs 80.7).

Would be great if it could load and quant the layers iteratively, as now if the 16bit model can't fit in the GPU, we have to quant it offline first. But the fact there is an option to do "dynamic" quant without calibration data is great. thanks for this

It should be more accurate and much much faster - so I think we will not prioritizing adding eetq ourselves (though we will of course accept a contribution)

Iterative quantization is on my list, ideally this week.

DarkLight1337 · 2024-07-17T07:55:33Z

Hi! Is there or will there be support for the OpenAI Batch API ?

vLLM currently has partial support for this (#4794).

w013nad · 2024-07-17T13:31:00Z

Hi! Is there or will there be support for the OpenAI Batch API ?

vLLM currently has partial support for this (#4794).

This requires a completely new instance of vLLM, It would be nice if we could just call an existing API with a batch request like you do with the OpenAI Batch API.

ashim-mahara · 2024-07-17T13:51:23Z

Hi! Is there or will there be support for the OpenAI Batch API ?

vLLM currently has partial support for this (#4794).

This requires a completely new instance of vLLM, It would be nice if we could just call an existing API with a batch request like you do with the OpenAI Batch API.

Exactly my thoughts. I could help with the build. I already have a nano-library that does interface with OpenAI directly at ashim-mahara/odbg.

The primary problem I have identified is with tracking the request origins in-case of dynamic batching by VLLM. The first one is easier if batches are executed sequentially but they would still need to be saved on the disk somewhere for retrieval later.

simon-mo · 2024-07-17T18:41:38Z

an existing API with a batch request like you do with the OpenAI Batch API.

@w013nad (or others), please feel free to open an RFC for this to discuss the ideal API. The main challenge is around file storage I believe.

warlockedward · 2024-07-23T06:48:39Z

Hopefully, the function_call and tool_choice features will be implemented faster and will additionally support models like Qwen2

akhilreddy0703 · 2024-07-30T18:42:17Z

Hi all,

CPU Optimizations to support GGUF models !!

My thoughts are, Adding CPU optimizations to the vLLM makes it more robust.

I know that ipex has already been added to the project
Project like Llamacpp has been a go to inference server when it comes to running models in lower precisions on CPU, even it is providing a http server to host a gguf model, but the problem with Llamacpp is it won't handle parallel requests like vLLM handles it.
I've tested Llamacpp server for Performance values for llama3-8b quantized model (with int4 precision), results are very promissing.
Adding the support for running quantized models (GGUF) on CPU using vLLM server would be a very considerable object for this roadmap

If anyone already looking into this please let me know, I want to work on this part, I'm open to help/contribute to this

Thanks

dongfangduoshou123 · 2024-07-31T09:38:15Z

Hopefully, the function_call and tool_choice features will be implemented faster and will additionally support models like Qwen2

ollama already support tool use in from version 0.3.0
see: https://ollama.com/blog/tool-support

fodevac33 · 2024-08-02T16:00:15Z

Any chance that you guys can implement Dry Repetition Penalty? I sorely miss it from backends like Oobabooga or Kobold.

yiakwy-xpu-ml-framework-team · 2024-08-06T07:51:41Z

We want to see more improvement on compiler since this is the major gap between vLLM and TRT-LLM (with meylin compiler) support.

B.t.w, what's your opinion with SGLang (they extensively use torch.compile to optimize the ML workload) and their released benchmark? @simon-mo

DarkLight1337 · 2024-08-06T10:44:24Z

Hi all,

CPU Optimizations to support GGUF models !!

My thoughts are, Adding CPU optimizations to the vLLM makes it more robust.

I know that ipex has already been added to the project

Project like Llamacpp has been a go to inference server when it comes to running models in lower precisions on CPU, even it is providing a http server to host a gguf model, but the problem with Llamacpp is it won't handle parallel requests like vLLM handles it.

I've tested Llamacpp server for Performance values for llama3-8b quantized model (with int4 precision), results are very promissing.

Adding the support for running quantized models (GGUF) on CPU using vLLM server would be a very considerable object for this roadmap

If anyone already looking into this please let me know, I want to work on this part, I'm open to help/contribute to this

Thanks

@akhilreddy0703 #5191 has just been merged, providing support for GGUF models.

gabrielmbmb · 2024-08-08T15:42:23Z

Hi, I would like to contribute to the Reward model API, do you have any suggestions or ideas in mind for this feature?

tsaoyu · 2024-08-09T16:41:12Z

Hi, I would like to contribute to the Reward model API, do you have any suggestions or ideas in mind for this feature?

A good start point might be some API similar to this https://github.com/OpenRLHF/OpenRLHF/pull/391/files

tsaoyu · 2024-08-09T16:48:00Z

Support multiple models in the same server

Does vLLM need the multi-model support similar like what FastChat does or something else?

Up for this, support multiple models or models at different version had good use case in the era of synthetic data. But I would suggest expose this feature in Engine level. My current recipe is using LangChain to abstract a layer on top of Ray, Ray is in charge of distributed model loading and inference.

amritap-ef · 2024-08-13T21:21:18Z

Is there a way to pass in custom decoding config in offline inference mode for different prompts i.e. use outlines to generate custom json output per prompt? It seems that currently, it is only possible to pass in a single decoding config to use for all prompts so would be great to have this feature!

yiakwy-xpu-ml-framework-team · 2024-08-14T07:36:30Z

Is there a way to pass in custom decoding config in offline inference mode for different prompts i.e. use outlines to generate custom json output per prompt? It seems that currently, it is only possible to pass in a single decoding config to use for all prompts so would be great to have this feature!

For offline inference mode will it be more efficient to organize data and create engine backend for each type of the prompts ?

I am more interested in online decision of the decoding config for different type of coming inputs. Instead of using a chain of inference , one to make such judgement one to do inference, it is worthy of trying to do it before prefill or with a few round of generations.

yiakwy-xpu-ml-framework-team · 2024-08-14T08:30:24Z

Hi, I would like to contribute to the Reward model API, do you have any suggestions or ideas in mind for this feature?

A good start point might be some API similar to this https://github.com/OpenRLHF/OpenRLHF/pull/391/files

Though you can accelerate generation of reward/critic from limited hands experiences with our MegatronPPOTrainerEngine, Reward model is exclusive to alignment of LLM, which is out of the scope of vLLM.

The challenge is huge memory required both for host cpu and its co-processor.

The memory pressure comes from the fact that shards of optimizers of actor (finetuned GPT head), critic model (initialized with reward model parameters) co-exist with the shards of model parameters (no DDP copies on other gpu parallel groups).

And in the last stage of pipeline of model, we need a full copy of an actor and a reward, which achieves the peak memory usage of whole PPO training PP stages.

It is very complex situation; you cannot simply tackle this by hosting the frozen model outside of training gpus. vLLM does provide serving mode and you can make use of it.

So my suggestion is, keep the relevant alignment features solely in the relevant repositories.

@gabrielmbmb

amritap-ef · 2024-08-14T13:12:19Z

Is there a way to pass in custom decoding config in offline inference mode for different prompts i.e. use outlines to generate custom json output per prompt? It seems that currently, it is only possible to pass in a single decoding config to use for all prompts so would be great to have this feature!

For offline inference mode will it be more efficient to organize data and create engine backend for each type of the prompts ?

I am more interested in online decision of the decoding config for different type of coming inputs. Instead of using a chain of inference , one to make such judgement one to do inference, it is worthy of trying to do it before prefill or with a few round of generations.

The trouble in my use case is that each prompt requires a slightly different schema for the json depending on input to the prompt. Would be great if this could be treated similar to online inference in that sense.

agm-eratosth · 2024-08-21T19:13:39Z

Hi what happened to "ARM aarch-64 support for AWS Graviton based instances and GH200" from the Q2 2024 roadmap? #3861

ayush9818 · 2024-08-24T15:23:26Z

Hi, I wanted to contribute to Multi token prediction feature. Is there any feature requirement or starting point for this ?

Here is what I have got: #5683. What kind of LLM Class can be a good starting point for this?

nivibilla · 2024-08-25T08:30:05Z

Hey can this be looked at please. I'm not able to run any mixture of experts models on L4 gpus (EC2 G6) instances due to the Triton issue mentioned

niuzheng168 · 2024-09-06T13:21:44Z

More and more speech model is using a LLM to predict non-text tokens. Like ChatTTS or FishTTS, they are all using a llama to predict speech tokens.
But unlike llama for text, the speech-llama will use a multiple lm_head to predict more than 1 tokens in parallel, and therefor sum the n-tokens embedding when processing the llm input embedding .
I am currently trying to make chattts running with vllm, see here, but lots code need to update and seems break some fundamental design. So maybe you can consider support it officially. It will definitely make more impact to the speech solutions.

ChengyuZhu6 · 2024-09-11T02:56:03Z

Support multiple models in the same server

Does vLLM need the multi-model support similar like what FastChat does or something else?

Up for this, support multiple models or models at different version had good use case in the era of synthetic data. But I would suggest expose this feature in Engine level. My current recipe is using LangChain to abstract a layer on top of Ray, Ray is in charge of distributed model loading and inference.

I think this is the difference in implementation at different granularities.

Shreyansh1311 · 2024-09-18T22:05:56Z

Any chance that you guys can implement Dry Repetition Penalty? I sorely miss it from backends like Oobabooga or Kobold.

Hi, it would be really great to have DRY implemented in vLLM, DRY has been a game changer for all the small models, since they tend to repeat much more. It's a really effective sampling method. It would be really useful to have it here as well

simon-mo added misc and removed misc labels Jun 25, 2024

simon-mo pinned this issue Jun 25, 2024

simon-mo mentioned this issue Jun 25, 2024

[Roadmap] vLLM Roadmap Q2 2024 #3861

Closed

65 tasks

youkaichao mentioned this issue Jul 19, 2024

[Performance]: GPU utilization is low when running large batches on H100 #6560

Open

CharlesRiggins mentioned this issue Aug 29, 2024

Do we have a roadmap? vllm-project/llm-compressor#128

Closed

ShangmingCai mentioned this issue Sep 12, 2024

[WIP][Spec Decode] Add multi-proposer support for variable and flexible speculative decoding #7947

Open

[Roadmap] vLLM Roadmap Q3 2024 #5805

[Roadmap] vLLM Roadmap Q3 2024 #5805

Comments

simon-mo commented Jun 25, 2024 • edited Loading

Anything you want to discuss about vllm.

Themes.

Broad Model Support

Hardware Support

Performance Optimizations

Production Features

OSS Community

Extensible Architecture

Jeffwan commented Jun 25, 2024

CSEEduanyu commented Jun 25, 2024

jeejeelee commented Jun 26, 2024

MeJerry215 commented Jun 27, 2024 • edited Loading

ashim-mahara commented Jun 27, 2024

huseinzol05 commented Jun 28, 2024 • edited Loading

huseinzol05 commented Jun 28, 2024

jkl375 commented Jul 1, 2024

hibukipanim commented Jul 3, 2024

tutu329 commented Jul 9, 2024

amritap-ef commented Jul 11, 2024

robertgshaw2-neuralmagic commented Jul 12, 2024 • edited Loading

kaifronsdal commented Jul 13, 2024

haichuan1221 commented Jul 14, 2024

hibukipanim commented Jul 14, 2024

robertgshaw2-neuralmagic commented Jul 14, 2024

DarkLight1337 commented Jul 17, 2024

w013nad commented Jul 17, 2024

ashim-mahara commented Jul 17, 2024

simon-mo commented Jul 17, 2024

warlockedward commented Jul 23, 2024

akhilreddy0703 commented Jul 30, 2024 • edited Loading

CPU Optimizations to support GGUF models !!

dongfangduoshou123 commented Jul 31, 2024 • edited Loading

fodevac33 commented Aug 2, 2024 • edited Loading

yiakwy-xpu-ml-framework-team commented Aug 6, 2024 • edited Loading

DarkLight1337 commented Aug 6, 2024 • edited Loading

CPU Optimizations to support GGUF models !!

gabrielmbmb commented Aug 8, 2024 • edited Loading

tsaoyu commented Aug 9, 2024

tsaoyu commented Aug 9, 2024

amritap-ef commented Aug 13, 2024

yiakwy-xpu-ml-framework-team commented Aug 14, 2024

yiakwy-xpu-ml-framework-team commented Aug 14, 2024 • edited Loading

amritap-ef commented Aug 14, 2024

agm-eratosth commented Aug 21, 2024

ayush9818 commented Aug 24, 2024 • edited Loading

nivibilla commented Aug 25, 2024

niuzheng168 commented Sep 6, 2024

ChengyuZhu6 commented Sep 11, 2024

Shreyansh1311 commented Sep 18, 2024 • edited Loading

simon-mo commented Jun 25, 2024 •

edited

Loading

MeJerry215 commented Jun 27, 2024 •

edited

Loading

huseinzol05 commented Jun 28, 2024 •

edited

Loading

robertgshaw2-neuralmagic commented Jul 12, 2024 •

edited

Loading

akhilreddy0703 commented Jul 30, 2024 •

edited

Loading

dongfangduoshou123 commented Jul 31, 2024 •

edited

Loading

fodevac33 commented Aug 2, 2024 •

edited

Loading

yiakwy-xpu-ml-framework-team commented Aug 6, 2024 •

edited

Loading

DarkLight1337 commented Aug 6, 2024 •

edited

Loading

gabrielmbmb commented Aug 8, 2024 •

edited

Loading

yiakwy-xpu-ml-framework-team commented Aug 14, 2024 •

edited

Loading

ayush9818 commented Aug 24, 2024 •

edited

Loading

Shreyansh1311 commented Sep 18, 2024 •

edited

Loading