
[Performance]: What can we learn from OctoAI #5167

Closed
hmellor opened this issue May 31, 2024 · 7 comments

Labels
performance Performance-related issues

Comments

@hmellor
Collaborator

hmellor commented May 31, 2024

OctoAI use vLLM as the baseline in their benchmarks to demonstrate how fast they are (https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost):

[Charts from the blog post: Single-User Throughput, Multi-User Throughput, Inter-Token Latency]

Their main optimisations appear to be:

  • FP8 quantisation of the model (currently we only support KV cache)
  • The CustomAllReduce kernel from Nvidia TRT LLM
  • CUDA graphs
  • Speculative decoding (which we have thanks to @cadedaniel!)
  • Dynamic SplitFuse (A.K.A. Chunked Prefill, which we have thanks to @rkooo567!)

My question is, what do we need to do to reach performance parity?

Some clear things are:

  • Make all of these features compatible with each other (see the configuration sketch below)
  • See what can be learned from the TRT LLM CustomAllReduce
  • Support executing models in FP8
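
For reference, a minimal sketch of what turning several of these on looks like with vLLM's offline LLM API. The flag names follow one recent release and may differ in other versions, and the model name is just a placeholder, so treat it as illustrative rather than authoritative:

```python
# Hedged sketch: flag names follow one vLLM release and may differ in yours.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",         # FP8 KV cache (FP8 model weights are the missing piece)
    enable_chunked_prefill=True,  # Dynamic SplitFuse / chunked prefill
    enforce_eager=False,          # keep CUDA graphs enabled (the default)
    tensor_parallel_size=2,       # multi-GPU, where the custom all-reduce kernel matters
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```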

Notable issues:

@hmellor hmellor added the performance Performance-related issues label May 31, 2024
@youkaichao
Member

@KuntaiDu is creating our own benchmarks on real-world models and high-end GPUs. We first need to know the current speed of vLLM. Companies may use an old version of vLLM or not know how to set some advanced flags, leading to poor vLLM performance in their benchmarks (and they are incentivized to do so :) ).

@hmellor
Collaborator Author

hmellor commented May 31, 2024

That is an excellent point, I've noticed that too in other comparisons.

Will these benchmarks be made available in https://github.com/vllm-project/vllm/tree/main/benchmarks? I'd love for that directory to be tidied up a bit and generalised so that the benchmarks can be used both offline (as most of them are today) and online (which would be more useful).
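
As a rough illustration of the distinction, the "offline" style most of those scripts use today boils down to something like the sketch below (the model and prompts are placeholders, not anything from the benchmarks directory); an "online" variant would instead drive a running server over HTTP.

```python
# Rough offline-throughput sketch; the real scripts in benchmarks/ are more thorough.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")      # small placeholder model
prompts = ["Hello, my name is"] * 256     # synthetic request batch
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s across {len(prompts)} requests")
```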

@youkaichao
Member

You can track it via #5073.

@rkooo567
Collaborator

rkooo567 commented Jun 4, 2024

+1. It is very easy to game this kind of benchmark, tbh. It is best if we compare it ourselves in a fair way.

@ywang96
Member

ywang96 commented Jun 5, 2024

ICYMI - they were using vLLM 0.3.3 for this benchmark.

@zhyncs
Contributor

zhyncs commented Jun 14, 2024

  • Make all of these features compatible with each other

Makes sense. Currently, the biggest issue with vLLM is that many features are not compatible for simultaneous use. For example, if the baseline (vanilla fp16) + automatic prefix caching + chunked prefill + int8 KV cache + AWQ + speculative decoding could all be enabled at the same time, there would be significant benefits compared to just using the baseline (vanilla fp16). ref #2614 (comment)
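
To make that concrete, a hypothetical "everything on" configuration would look something like the sketch below. Whether a given vLLM version actually accepts this combination (or quietly disables parts of it) is exactly the compatibility gap being described; the flags and model names here are assumptions, not a recommendation, and fp8 stands in for the int8 KV cache that isn't upstream.

```python
# Hypothetical "everything on" configuration; some combinations may be rejected or ignored.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",         # placeholder AWQ checkpoint
    quantization="awq",                      # AWQ weight quantization
    kv_cache_dtype="fp8",                    # quantized KV cache (fp8 shown in place of int8)
    enable_prefix_caching=True,              # automatic prefix caching
    enable_chunked_prefill=True,             # chunked prefill
    speculative_model="JackFram/llama-68m",  # placeholder draft model for speculative decoding
    num_speculative_tokens=4,
)
```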

The main issue here is that, when each feature was added, compatibility was not given enough attention, from design and implementation through review. ref InternLM/lmdeploy#1450 (comment)

At the same time, the advantages of vLLM are also very obvious: it is more like a higher-performance transformers. Its model support, support for different hardware backends, and community activity are all great.

@zhyncs
Contributor

zhyncs commented Jun 14, 2024

In fact, our team started using vLLM early last year, around July 2023. At that time, we also submitted PRs for W8A8 and KV Cache Int8 in September 2023 (#1112). Later, to facilitate review, the PR was split into two parts: #1507 and #1508. This year, we also submitted a PR for W4A8: #5218.

TensorRT LLM has some closed-source components, such as the batch manager and attention kernels, and its usability is average. LMDeploy TurboMind has excellent performance but supports fewer models; for example, it lacks support for MoE models. It can be said that each framework has its own advantages and disadvantages. At that time, we chose based on our own business needs: for example, we did not use MoE models in the short term because our algorithm colleagues found that, after applying SFT, MoE models did not perform as well as dense models (but that is another topic).

Currently, many startups write blogs to demonstrate that their LLM inference framework is better, such as FireWorks AI, FriendliAI, and the OctoAI you mentioned above. Naturally, they choose vLLM, the most popular framework in the community, and then construct scenarios favorable to themselves in terms of testing environments and software versions. I don't think these performance comparison blogs have much significance; it's more about public relations.

@hmellor hmellor closed this as completed Aug 6, 2024