
[Performance]: What can we learn from OctoAI #5167

Closed
hmellor opened this issue May 31, 2024 · 7 comments

Labels
performance Performance-related issues

Comments

@hmellor
Collaborator

hmellor commented May 31, 2024

OctoAI use vLLM as the baseline in their benchmarks to demonstrate how fast they are (https://octo.ai/blog/acceleration-is-all-you-need-techniques-powering-octostacks-10x-performance-boost):

[Charts from the blog post: Single-User Throughput, Multi-User Throughput, Inter-Token Latency]

Their main optimisations appear to be:

  • FP8 quantisation of the model (currently we only support KV cache)
  • The CustomAllReduce kernel from Nvidia TRT LLM
  • CUDA graphs
  • Speculative decoding (which we have thanks to @cadedaniel!)
  • Dynamic SplitFuse (A.K.A. Chunked Prefill, which we have thanks to @rkooo567!)

My question is, what do we need to do to reach performance parity?

Some clear things are:

  • Make all of these features compatible with each other (see the configuration sketch below)
  • See what can be learned from the TRT LLM CustomAllReduce
  • Support executing models in FP8
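
For reference, a minimal sketch of what turning several of these on looks like with vLLM's offline LLM API. The flag names follow one recent release and may differ in other versions, and the model name is just a placeholder, so treat it as illustrative rather than authoritative:

```python
# Hedged sketch: flag names follow one vLLM release and may differ in yours.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",         # FP8 KV cache (FP8 model weights are the missing piece)
    enable_chunked_prefill=True,  # Dynamic SplitFuse / chunked prefill
    enforce_eager=False,          # keep CUDA graphs enabled (the default)
    tensor_parallel_size=2,       # multi-GPU, where the custom all-reduce kernel matters
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```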

Notable issues:

@hmellor hmellor added the performance Performance-related issues label May 31, 2024
@youkaichao
Member

@KuntaiDu is creating our own benchmarks on real-world models and high-end GPUs. We first need to know the current speed of vLLM. Companies may use an old version of vLLM or not know how to set some advanced flags, leading to poor vLLM performance in their benchmarks (and they are incentivized to do so :) ).

@hmellor
Collaborator Author

hmellor commented May 31, 2024

That is an excellent point, I've noticed that too in other comparisons.

Will these benchmarks be made available in https://github.com/vllm-project/vllm/tree/main/benchmarks? I'd love for that directory to be tidied up a bit and generalised so that the benchmarks can be used both offline (as most of them are today) and online (which would be more useful).
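
As a rough illustration of the distinction, the "offline" style most of those scripts use today boils down to something like the sketch below (the model and prompts are placeholders, not anything from the benchmarks directory); an "online" variant would instead drive a running server over HTTP.

```python
# Rough offline-throughput sketch; the real scripts in benchmarks/ are more thorough.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")      # small placeholder model
prompts = ["Hello, my name is"] * 256     # synthetic request batch
params = SamplingParams(max_tokens=128, temperature=0.0)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s across {len(prompts)} requests")
```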

@youkaichao
Member

You can track it via #5073.

@rkooo567
Collaborator

rkooo567 commented Jun 4, 2024

+1. It is very easy to game this kind of benchmark, tbh. It is best if we compare it ourselves in a fair way.

@ywang96
Member

ywang96 commented Jun 5, 2024

ICYMI - they were using vLLM 0.3.3 for this benchmark.

@zhyncs
Contributor

zhyncs commented Jun 14, 2024

  • Make all of these features compatible with each other

Makes sense. Currently, the biggest issue with vLLM is that many features are not compatible for simultaneous use. For example, if the baseline (vanilla fp16) + automatic prefix caching + chunked prefill + int8 KV cache + AWQ + speculative decoding could all be enabled at the same time, there would be significant benefits compared to just using the baseline (vanilla fp16). ref #2614 (comment)
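
To make that concrete, a hypothetical "everything on" configuration would look something like the sketch below. Whether a given vLLM version actually accepts this combination (or quietly disables parts of it) is exactly the compatibility gap being described; the flags and model names here are assumptions, not a recommendation, and fp8 stands in for the int8 KV cache that isn't upstream.

```python
# Hypothetical "everything on" configuration; some combinations may be rejected or ignored.
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",         # placeholder AWQ checkpoint
    quantization="awq",                      # AWQ weight quantization
    kv_cache_dtype="fp8",                    # quantized KV cache (fp8 shown in place of int8)
    enable_prefix_caching=True,              # automatic prefix caching
    enable_chunked_prefill=True,             # chunked prefill
    speculative_model="JackFram/llama-68m",  # placeholder draft model for speculative decoding
    num_speculative_tokens=4,
)
```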

The main issue here is that, when each feature was added, compatibility was not given enough attention, from design and implementation through review. ref InternLM/lmdeploy#1450 (comment)

At the same time, the advantages of vLLM are also very obvious: it is more like a higher-performance transformers. Its model support, support for different hardware backends, and community activity are all great.

@zhyncs
Contributor

zhyncs commented Jun 14, 2024

In fact, our team started using vLLM early last year, around July 2023. At that time, we also submitted PRs for W8A8 and KV Cache Int8 in September 2023 (#1112). Later, to facilitate review, the PR was split into two parts: #1507 and #1508. This year, we also submitted a PR for W4A8: #5218.

TensorRT LLM has some closed-source components, such as the batch manager and attention kernels, and its usability is average. LMDeploy TurboMind has excellent performance but supports fewer models; for example, it lacks support for MoE models. It can be said that each framework has its own advantages and disadvantages. At that time, we chose based on our own business needs: for example, we did not use MoE models in the short term because our algorithm colleagues found that, after applying SFT, MoE models did not perform as well as dense models (but that is another topic).

Currently, many startups write blogs to demonstrate that their LLM inference framework is better, such as FireWorks AI, FriendliAI, and the OctoAI you mentioned above. Naturally, they choose vLLM, the most popular framework in the community, and then construct scenarios favorable to themselves in terms of testing environments and software versions. I don't think these performance comparison blogs have much significance; it's more about public relations.

@hmellor hmellor closed this as completed Aug 6, 2024