Add support for prompt-lookup speculative decoding #2469

Closed
wasertech opened this issue Jan 17, 2024 · 7 comments
Labels: enhancement (New feature or request), performance (Performance-related issues)

Comments

@wasertech

So transformers has introduced support for prompt-lookup (n-gram) speculative decoding:

huggingface/transformers#27979

In newer versions of transformers, it's as simple as passing prompt_lookup_num_tokens=10 to model.generate. For example:
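A minimal sketch of what this looks like (the model and prompt here are just placeholders; this needs a transformers release that includes the PR above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works; gpt2 is only a small placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tokenizer(prompt, return_tensors="pt")

# prompt_lookup_num_tokens enables prompt-lookup speculation: candidate
# continuations are copied from earlier occurrences of the current n-gram
# in the prompt and then verified by the model in a single forward pass.
outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```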

Why would this be useful?

For input-grounded tasks (e.g. summarization or code editing), where the output reuses long spans of the prompt, it can speed up inference by up to 3x!

I have not looked at the implementation details yet, but I think it wouldn't be too complicated to add a similar parameter to vLLM so that we can use prompt-lookup speculative decoding here too. The core lookup is tiny (rough sketch below), and at the very least the speedup should make the effort worthwhile.
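For reference, here is a rough sketch of the n-gram lookup idea in plain Python (an illustration only, not the actual transformers or vLLM implementation; the function name and defaults are made up):

```python
def find_candidate_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram of the sequence
    against earlier positions in the sequence itself."""
    for ngram_size in range(max_ngram_size, 0, -1):
        ngram = input_ids[-ngram_size:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                end = start + ngram_size
                return input_ids[end:end + num_pred_tokens]
    return []  # no match found: fall back to ordinary decoding
```

For example, find_candidate_tokens([5, 8, 2, 9, 4, 5, 8]) returns [2, 9, 4, 5, 8]: the trailing bigram (5, 8) also appears at the start of the sequence, so the tokens that followed it there are proposed as drafts for the model to verify in one pass.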

Let me know what you think.

@simon-mo
Collaborator

@cadedaniel is in charge of adding overall support for speculative decoding in #2188. I would imagine that once that PR lands, n-gram support should be very straightforward.

@wasertech
Author

@simon-mo Thanks for letting me know!

@wasertech
Author

wasertech commented Jan 17, 2024

> I have not looked at the implementation details yet, but I think it wouldn't be too complicated to add a similar parameter to vLLM so that we can use prompt-lookup speculative decoding here too.

As always, it's a bit more complicated than I initially anticipated, but I'm glad to see it's in the works.

I'll close this issue even though the feature isn't there yet, as the community already knows about it and is well on its way to achieving it: speculative decoding w/ vLLM. 🎉

@cadedaniel
Collaborator

Thanks for bringing this up @wasertech! We have an internal prototype for exactly this and it shows good results, but it's blocked on #2188 at the moment.

@wasertech
Author

Looking forward to testing it on my hardware. I'm training at the moment, but I'll give your branch a try later, @cadedaniel. Thanks for your amazing contribution! 🚀

@wasertech
Author

wasertech commented Jan 17, 2024

You know what, let's keep this issue open so that people who are wondering can see what's up. I (or someone with the right permissions) can close it once #2188 (and the follow-up PR that uses it to introduce n-gram speculation) are merged ^^

@wasertech wasertech reopened this Jan 17, 2024
@simon-mo simon-mo changed the title Add support for speculative decoding Add support for prompt-lookup speculative decoding Jan 17, 2024
@cadedaniel cadedaniel added enhancement New feature or request performance Performance-related issues labels Jan 17, 2024
@cadedaniel
Collaborator

Closing as duplicate, see #1802
