
[Feature Request] Adding EAGLE, Medusa, Lookahead Decoding (improvements of speculative decoding) #2791

Open
HamidShojanazeri opened this issue Feb 6, 2024 · 7 comments


HamidShojanazeri commented Feb 6, 2024

Thanks for the great work, team. I wonder if there is any plan to add newer improvements to speculative decoding such as EAGLE, Medusa, and lookahead decoding. These could yield cumulative speedups for vLLM.

cc: @WoosukKwon

Collaborator

simon-mo commented Feb 6, 2024

Yes. The plan is here: #2188.

@HamidShojanazeri
Author

Thanks for sharing, @simon-mo, that sounds great! I also wonder whether newer methods can improve on speculative decoding by removing the need for a draft model; we are exploring that path as well.

Collaborator

simon-mo commented Feb 8, 2024

The speculative decoding framework is designed to support a wide range of draft-model and draft-model-free algorithms. Once the immediate features are in place (by @cadedaniel), we welcome the community's contributions of more methods!
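For readers new to the technique, here is a toy sketch of the generic draft-then-verify loop that all of these methods share, under greedy decoding. This is not vLLM's actual API: `target_next` and `draft_next` are hypothetical stand-ins for model forward passes, and real implementations verify all draft positions in a single batched pass rather than one at a time.

```python
# Toy sketch of greedy speculative decoding (not vLLM's API).
# A cheap draft model proposes k tokens; the target model verifies them,
# keeping the longest prefix that matches its own greedy choices.

def greedy_speculative_step(target_next, draft_next, ctx, k=4):
    """target_next/draft_next: functions mapping a token list to the next token."""
    # 1. Draft phase: the cheap model proposes k tokens autoregressively.
    proposal = []
    draft_ctx = list(ctx)
    for _ in range(k):
        tok = draft_next(draft_ctx)
        proposal.append(tok)
        draft_ctx.append(tok)

    # 2. Verify phase: compare each draft token with the target model's
    #    greedy choice (simulated sequentially here for clarity; real
    #    systems score every position in one batched forward pass).
    accepted = []
    verify_ctx = list(ctx)
    for tok in proposal:
        expected = target_next(verify_ctx)
        if tok != expected:
            # First mismatch: emit the target's token instead and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        verify_ctx.append(tok)

    # All k drafts accepted; emit one bonus token from the target.
    accepted.append(target_next(verify_ctx))
    return accepted
```

The key property is that output is identical to running the target model alone; the draft only changes how many target-quality tokens each step yields (up to k + 1 when the draft agrees, 1 when it misses immediately), which is what the "#Mean Accepted Tokens" numbers in benchmarks like Spec-Bench measure.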

@cadedaniel
Collaborator

Correct! And yes, speculation methods without a draft model have benefits in both performance and usability. It is unclear right now which specific approach will end up being the best, but vLLM should support it.
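One of the simplest draft-model-free methods is prompt lookup decoding (PLD): instead of running a second model, the draft tokens are copied from wherever the most recent n-gram last occurred earlier in the context, which works well for tasks that reuse input text (summarization, RAG, code editing). A minimal illustrative sketch, with function name and defaults chosen here for illustration:

```python
# Illustrative sketch of prompt-lookup drafting (draft-model-free):
# propose the tokens that followed the most recent n-gram the last
# time it appeared earlier in the context. Verification then proceeds
# exactly as in ordinary speculative decoding.

def prompt_lookup_draft(tokens, ngram_size=2, num_draft=4):
    """Return up to num_draft proposed tokens, or [] if the trailing
    n-gram does not reappear earlier in the context."""
    if len(tokens) < ngram_size:
        return []
    ngram = tokens[-ngram_size:]
    # Scan earlier context, most recent match first (excluding the
    # trailing n-gram itself).
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == ngram:
            cont = tokens[start + ngram_size : start + ngram_size + num_draft]
            if cont:
                return cont
    return []
```

When no match is found the step degrades gracefully to ordinary decoding (an empty draft), so the method never hurts correctness, only the speedup.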

@caliber1313

I would like to suggest adding Hydra to your project alongside Medusa. You can find the Hydra repository here: https://github.com/zankner/Hydra.

Thank you for your consideration.

@josephrocca

josephrocca commented Jun 5, 2024

For those interested in ranking data for the different methods, below is a copy-paste from Spec-Bench, a neat project by @hemingkx. The ranking when running 33B models is similar. Please see the linked repo for the latest data; I'm pasting it here for those skimming this thread.

  • Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
  • Testing environment: PyTorch 2.0.1, under CUDA 11.8
  • Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1
| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
|---|---|---|---|---|---|---|---|---|
| EAGLE 🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS 🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra 🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Oct 30, 2024