[Speculative decoding 1/9] Optimized rejection sampler #2336

cadedaniel · 2024-01-04T00:37:32Z

Speculative decoding

This PR is a part of a larger series of PRs implementing speculative decoding, contributed to open source vLLM by Anyscale. See #2188 and Speculative decoding open sourcing plan for more information.

Rejection sampling

This PR implements optimized rejection sampling, including the following features:

Implementation of modified rejection sampling as described in https://arxiv.org/pdf/2302.01318.pdf
All operations are batched on GPU, allowing non-blocking computation.
Efficient collection of metrics regarding acceptance rate and number of emitted tokens.

It also contributes tests which verify the rejection sampler's ability to approximate distributions, given enough samples.

The following people contributed to it: @cadedaniel @Yard1 @amogkam

Details

The basic idea behind rejection sampling is that one can sample from the target distribution (larger model) using samples from a proposal distribution (smaller draft model), while guaranteeing the output distribution is equivalent to the target distribution.

"Modified" rejection sampling is introduced in the paper. It ensures that at least one token will always be emitted from the rejection sampling routine, even if all proposal tokens are rejected.
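To make the acceptance rule concrete, here is a minimal single-token sketch in plain Python. It is not the batched GPU code from this PR. The function name `modified_rejection_sample` and the toy distributions are hypothetical, and the rule used (accept with probability min(1, q/p), otherwise resample from the normalized positive part of q - p) is my reading of the linked paper.

```python
import random


def modified_rejection_sample(target_probs, draft_probs, draft_token, rng):
    """Return (token, accepted) for a single position.

    target_probs / draft_probs are probability lists over the same vocabulary;
    draft_token is an index that was sampled from draft_probs.
    """
    q = target_probs[draft_token]
    p = draft_probs[draft_token]
    # Accept the draft token with probability min(1, q / p).
    if rng.random() < min(1.0, q / p):
        return draft_token, True
    # On rejection, resample from the positive part of (target - draft).
    # random.choices takes relative weights, so no explicit normalization here.
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


# Empirical check on a toy 3-token vocabulary (hypothetical numbers).
rng = random.Random(0)
target = [0.6, 0.3, 0.1]
draft = [0.2, 0.5, 0.3]
num_samples = 100_000
counts = [0, 0, 0]
for _ in range(num_samples):
    draft_token = rng.choices(range(3), weights=draft, k=1)[0]
    token, _ = modified_rejection_sample(target, draft, draft_token, rng)
    counts[token] += 1
empirical = [c / num_samples for c in counts]
```

A token is emitted on both branches, which is the "at least one token" property, and with enough samples `empirical` should land close to `target` even though every proposal came from `draft`.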

With LLMs, modified rejection sampling can reduce latency because the target model can evaluate several proposed tokens, across multiple sequences, in a single batched pass on the GPU.

Finally, the paper introduces the notion of a "bonus" token. In the case where all proposed tokens are accepted, an additional token can be emitted. This is possible by having the target model predict the next token given the entire proposed sequence as context.
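The per-sequence control flow described above, including the bonus token, can be sketched as follows. This is an unbatched illustration under stated assumptions rather than the PR's implementation: `speculative_step` and its arguments are hypothetical names, and the bonus token is assumed to have been sampled already by the target model at the position after the last proposed token.

```python
import random


def speculative_step(target_probs, draft_probs, draft_tokens, bonus_token, rng):
    """Return the tokens emitted for one sequence.

    target_probs / draft_probs: one probability list per proposed position.
    draft_tokens: the k tokens proposed by the draft model.
    bonus_token: target-model token for the position after the proposal.
    """
    emitted = []
    for q_dist, p_dist, tok in zip(target_probs, draft_probs, draft_tokens):
        if rng.random() < min(1.0, q_dist[tok] / p_dist[tok]):
            emitted.append(tok)  # proposal accepted, move to the next position
            continue
        # First rejection: emit one recovered token and stop.
        residual = [max(0.0, q - p) for q, p in zip(q_dist, p_dist)]
        recovered = rng.choices(range(len(residual)), weights=residual, k=1)[0]
        emitted.append(recovered)
        return emitted
    # Every proposal was accepted, so the bonus token is emitted as well.
    emitted.append(bonus_token)
    return emitted


rng = random.Random(0)
same = [[0.5, 0.5]] * 3
all_accepted = speculative_step(same, same, [0, 1, 0], 1, rng)

target = [[0.5, 0.5], [0.0, 1.0], [0.5, 0.5]]
draft = [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5]]
rejected_at_second = speculative_step(target, draft, [1, 0, 1], 0, rng)
```

In the first call the draft and target distributions match, so all three proposals are accepted and four tokens come back. In the second call the target assigns zero probability to the second proposal, so the step returns two tokens: the accepted first proposal and one recovered token.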

Visual confirmation that modified rejection sampling approximates the target distribution:

code for visualizations: https://gist.github.com/cadedaniel/07c1cd4ac003f51140b205580ac02613

cadedaniel · 2024-01-04T00:42:12Z

cc @LiuXiaoxuanPKU @WoosukKwon @zhuohan123 @Yard1

cadedaniel · 2024-01-06T00:00:04Z

The next PR will be cadedaniel#1; I will create it once this one is merged.

LiuXiaoxuanPKU · 2024-01-06T07:17:46Z

vllm/model_executor/layers/rejection_sampler.py

+
+        # Create masks using the indices.
+        indices = torch.arange(k, device=accepted.device).unsqueeze(0)
+        accepted_mask = indices < limits.unsqueeze(1)


what's the difference between accepted and accepted_mask?

accepted is the result of the rejection sampling condition. accepted_mask is True up until the first position rejected by the rejection sampling condition.

Example for k=3, bs=5:

>>> accepted
tensor([[ True, False,  True],
        [False, False, False],
        [ True,  True, False],
        [ True,  True,  True],
        [False,  True, False]])
>>> accepted_mask
tensor([[ True, False, False],
        [False, False, False],
        [ True,  True, False],
        [ True,  True,  True],
        [False, False, False]])
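The relationship between the two tensors can be reproduced with a small plain-Python sketch. The helper name `prefix_accept_mask` is hypothetical, and nested lists stand in for the batched tensors; `limit` plays the role of `limits` in the diff, assuming it holds the index of the first rejected position in each row (or k when nothing is rejected).

```python
def prefix_accept_mask(accepted):
    """For each row, keep True only for positions before the first False."""
    mask = []
    for row in accepted:
        # Index of the first rejection, or the row length if none was rejected.
        limit = row.index(False) if False in row else len(row)
        mask.append([i < limit for i in range(len(row))])
    return mask


# Same k=3, bs=5 example as in the reply above.
accepted = [
    [True, False, True],
    [False, False, False],
    [True, True, False],
    [True, True, True],
    [False, True, False],
]
accepted_mask = prefix_accept_mask(accepted)
for row in accepted_mask:
    print(row)
```

Rows 0 and 4 show the difference: a position that passes the rejection sampling condition is still masked out once an earlier position in the same row has been rejected.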

LiuXiaoxuanPKU · 2024-01-06T07:26:06Z

vllm/model_executor/layers/rejection_sampler.py

+        super().__init__()
+        self.probs_dtype = torch.float32
+        self.token_id_dtype = torch.int64
+        self._num_bonus_tokens = 1


when can num_bonus_tokens > 1? Is it the last generated token by the target model iff all drafted tokens are accepted?

when can num_bonus_tokens > 1?

It is always 1. This variable is for readability only. I'll add a comment.

Is it the last generated token by the target model iff all drafted tokens are accepted?

Yep!

zhaoyang-star · 2024-01-09T00:57:40Z

Very exciting work! I hope this feature can be merged soon, as many other frameworks such as TGI, TRT-LLM, llama.cpp, and gpt-fast already support speculative sampling.

LiuXiaoxuanPKU · 2024-01-09T01:41:18Z

vllm/model_executor/layers/rejection_sampler.py

+        f = torch.clamp(difference, min=self._smallest_positive_value)
+
+        # shape [batch_size, k, vocab_size]
+        recovered_probs = f / torch.sum(f, dim=-1).reshape(-1, k, 1)


Nit: torch.multinomial does not require the probability to be normalized.

I will leave this in to keep the maths consistent with https://arxiv.org/pdf/2302.01318.pdf. This operation is not the compute or scheduling bottleneck.
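For readers following this thread, the clamp-then-normalize step can be sketched for a single position in plain Python. The name `recovered_distribution` and the floor value `1e-30` are hypothetical; the floor stands in for `self._smallest_positive_value`, and my assumption is that it exists so the sum stays nonzero when the target and draft distributions coincide.

```python
import random


def recovered_distribution(target_probs, draft_probs, smallest_positive=1e-30):
    """Normalized recovered distribution for one position."""
    # Clamp (target - draft) from below with a tiny positive floor.
    f = [max(t - d, smallest_positive) for t, d in zip(target_probs, draft_probs)]
    total = sum(f)
    # Explicit normalization, matching the formula in the paper.
    return [x / total for x in f]


recovered = recovered_distribution([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
identical = recovered_distribution([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])

# Samplers that take relative weights accept the normalized form directly.
rng = random.Random(0)
sample = rng.choices(range(3), weights=recovered, k=1)[0]
```

As the review comment notes, a sampler that takes relative weights would also accept the unnormalized values, so the division only keeps the code aligned with the paper's notation.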

pcmoritz reviewed Jan 4, 2024

cadedaniel added 7 commits January 5, 2024 15:36

rejection sampler

8719da2

remove header

eb882ee

moving util func to end of file

a82a7af

moving test runner to .github

26a12a9

device check fix

9bb7962

lint

876bf2e

updating link

434c525

cadedaniel force-pushed the rejection-sampler branch from d032887 to 434c525 Compare January 5, 2024 23:39

LiuXiaoxuanPKU reviewed Jan 6, 2024

cadedaniel added 2 commits January 5, 2024 23:58

pr feedback

0a10a80

fix

a96266a

LiuXiaoxuanPKU reviewed Jan 9, 2024

Removing test bash script

853180f

LiuXiaoxuanPKU approved these changes Jan 9, 2024

LiuXiaoxuanPKU merged commit 79d64c4 into vllm-project:main Jan 9, 2024
2 of 4 checks passed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Jan 18, 2024

[Speculative decoding 1/9] Optimized rejection sampler (vllm-project#…

eea3155

…2336)

sighingnow mentioned this pull request Jan 30, 2024

Fixes the error in num_accepted_tokens calculation in reject sampler #2658

Closed

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

[Speculative decoding 1/9] Optimized rejection sampler (vllm-project#…

e55427f

…2336)

sighingnow mentioned this pull request Feb 25, 2024

Introduce speculative decoding with draft models to vLLM #3029

Closed

cadedaniel mentioned this pull request Aug 19, 2024

[Documentation request]: Add documentation on lossless guarantees of speculative decoding in vLLM #7627

Closed
