Add support for prompt-lookup speculative decoding #2469

Closed
wasertech opened this issue Jan 17, 2024 · 7 comments
Labels: enhancement (New feature or request), performance (Performance-related issues)

Comments

@wasertech

So transformers has introduced support for prompt-lookup (n-gram) speculative decoding:

huggingface/transformers#27979

In newer versions of transformers, it's as simple as passing prompt_lookup_num_tokens=10 to model.generate. For example:
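A minimal sketch of what this looks like (the model and prompt here are just placeholders; this needs a transformers release that includes the PR above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; gpt2 is only a small placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tokenizer(prompt, return_tensors="pt")

# prompt_lookup_num_tokens enables prompt-lookup speculation: candidate
# continuations are copied from earlier occurrences of the current n-gram
# in the prompt and then verified by the model in a single forward pass.
outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```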

Why would this be useful?

For input-grounded tasks (e.g. summarization or code editing), where the output reuses long spans of the prompt, it can speed up inference by up to 3x!

I have not looked at the implementation details yet, but I think it wouldn't be too complicated to add a similar parameter to vLLM so that we can use prompt-lookup speculative decoding here too. The core lookup is tiny (rough sketch below), and at the very least the speedup should make the effort worthwhile.
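For reference, here is a rough sketch of the n-gram lookup idea in plain Python (an illustration only, not the actual transformers or vLLM implementation; the function name and defaults are made up):

```python
def find_candidate_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram of the sequence
    against earlier positions in the sequence itself."""
    for ngram_size in range(max_ngram_size, 0, -1):
        ngram = input_ids[-ngram_size:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                end = start + ngram_size
                return input_ids[end:end + num_pred_tokens]
    return []  # no match found: fall back to ordinary decoding
```

For example, find_candidate_tokens([5, 8, 2, 9, 4, 5, 8]) returns [2, 9, 4, 5, 8]: the trailing bigram (5, 8) also appears at the start of the sequence, so the tokens that followed it there are proposed as drafts for the model to verify in one pass.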

Let me know what you think.

@simon-mo
Collaborator

@cadedaniel is in charge of adding overall support for speculative decoding in #2188. I would imagine that once that PR lands, n-gram support should be very straightforward.

@wasertech
Author

@simon-mo Thanks for letting me know!

@wasertech
Author

wasertech commented Jan 17, 2024

> I have not looked at the implementation details yet, but I think it wouldn't be too complicated to add a similar parameter to vLLM so that we can use prompt-lookup speculative decoding here too.

As always, it's a bit more complicated than I initially anticipated, but I'm glad to see it's in the works.

I'll close this issue even though the feature isn't there yet, as the community already knows about it and is well on its way to achieving it: speculative decoding w/ vLLM. 🎉

@cadedaniel
Collaborator

Thanks for bringing this up @wasertech! We have an internal prototype for exactly this and it shows good results, but it's blocked on #2188 at the moment.

@wasertech
Author

Looking forward to testing it on my hardware. I'm training at the moment, but I'll give your branch a try later, @cadedaniel. Thanks for your amazing contribution! 🚀

@wasertech
Author

wasertech commented Jan 17, 2024

You know what, let's keep this issue open so that people who are wondering can see what's up. I (or someone with the right permissions) can close it once #2188 (and the follow-up PR that uses it to introduce n-gram speculation) are merged ^^

@wasertech wasertech reopened this Jan 17, 2024
@simon-mo simon-mo changed the title Add support for speculative decoding Add support for prompt-lookup speculative decoding Jan 17, 2024
@cadedaniel cadedaniel added enhancement New feature or request performance Performance-related issues labels Jan 17, 2024
@cadedaniel
Collaborator

Closing as duplicate, see #1802
