Turbomind prefix caching #1450
Conversation
TODO: need to add compatibility testing with AWQ, online KV Cache Int4 and Int8 @ispobock
Also need to test the case when
Benchmark with method mentioned in #1407 (comment).
Use LMDeploy benchmark script (used in #1429 (comment)):
with prefix caching:
Use vLLM benchmark script (used in #1407 (comment)):
with prefix caching:
We can see almost 15% throughput improvement when enabling prefix caching for the Turbomind engine. Actually, the token_id length of the system prompts added in #1407 (comment) is
Evaluation result for Internlm2-7b with prefix caching:
Fantastic job! Thanks so much.
ok
@lvhan028 Are there any planned features for the Turbomind engine in the next month? Hopefully there won't be too many code conflicts.
especially in
There are definitely conflicts due to #1458
There should be almost no impact; as long as the refactoring of LlamaBatch (decoupling batch and model) comes after v0.5.0, there will not be any major impact.
Got it. It seems there is no big conflict with this feature.
The evaluation result for
The evaluation result for
The result diff is mainly caused by the sampling settings in the evaluation code.
We need to regularly merge the main branch before approving to avoid conflicts.
if (input_length) {
    // update tokens in sequence
    seq.tokens.resize(history_length + input_length);
    std::copy_n(input_ids, input_length, &seq.tokens[history_length]);
Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.
Could you point out what sequence state would be corrupted? And any suggestions to handle this?
Could you point out what sequence state would be corrupted?
seq.tokens will be inconsistent with what's filled in the KV cache.
And any suggestions to handle this?
We may add a prompt field for Sequence to hold a copy of the input ids of a newly created sequence (start_flag == true). It can be cleared after CacheIfEnabled. This way both stage and interact_count can be eliminated.
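A minimal sketch of this idea (the member names and call sites below are illustrative assumptions, not the actual TurboMind definitions):

#include <cstdint>
#include <vector>

// Sketch only: a `prompt` field on Sequence holding the input ids of a newly
// created sequence, so `tokens` is not touched before the KV cache is filled.
struct Sequence {
    uint64_t         id{};
    std::vector<int> tokens;  // tokens whose KV cache entries are already valid
    std::vector<int> prompt;  // copy of input ids when start_flag == true
    // ... blocks, cache_len, etc. omitted
};

// Hypothetical flow in ProcessInferRequests:
//     if (start_flag) { seq.prompt.assign(input_ids, input_ids + input_length); }
// In CacheIfEnabled, after the prompt blocks have been matched/inserted into the trie:
//     seq.prompt.clear();  // the `stage` / `interact_count` bookkeeping becomes unnecessary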
Also, be careful when input_embeddings exist. Prefix matching must avoid matching embedding tokens, as the supplied token ids are duplicates of a model-specific dummy id. So the copying of prompt tokens should stop at the first embedding token.
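For illustration, a hedged sketch of stopping the prompt copy at the first embedding token, assuming embedding positions are available as (begin, end) offset ranges; the real interface around input_embeddings may differ:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: copy prompt tokens only up to the first embedding token, so the
// prefix trie never matches on the duplicated dummy ids.
std::vector<int> CopyPromptPrefix(const int* input_ids,
                                  size_t     input_length,
                                  const std::vector<std::pair<size_t, size_t>>& embedding_ranges)
{
    size_t end = input_length;
    for (const auto& range : embedding_ranges) {
        end = std::min(end, range.first);  // stop before the earliest embedding token
    }
    return std::vector<int>(input_ids, input_ids + end);
}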
@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?
@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?
We could enable Prefix Cache only for LLMs and not use it for VLMs.
Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.
Do you mean stop requests? If the sequence gets canceled before finishing, the sequence will be interrupted and erased.
A stop request only stops the current request; the sequence won't be erased. It can be resumed from any step in its history.
-    freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
+    // if prefix cache enabled, blocks will be shared by sequences, cannot be freed immediately
+    if (!block_trie_->enabled()) {
+        freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
This still needs to be handled, otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used. We used to have a Block::ref_count for counting inactive references (Block::use_count for active references); this is the case where it's useful.
This still needs to be handled, otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used.
What's the on-demand allocation mode? Currently the allocation seems to only happen in Materialize.
When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.
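As a rough illustration of that on-demand mode (a sketch with assumed names, not the real BlockManager code):

#include <cstddef>
#include <deque>

// Sketch: grow the block pool one chunk at a time, only when the free list
// cannot satisfy a request. Failing to recycle dead sequences' cached blocks
// would force extra chunks to be allocated here.
class BlockPoolSketch {
public:
    explicit BlockPoolSketch(size_t cache_chunk_size): chunk_size_(cache_chunk_size) {}

    // Make sure at least `count` free blocks are available.
    void Reserve(size_t count)
    {
        while (free_.size() < count) {
            for (size_t i = 0; i < chunk_size_; ++i) {
                free_.push_back(next_block_id_++);  // stand-in for a real device-memory chunk
            }
        }
    }

private:
    size_t             chunk_size_;
    size_t             next_block_id_{0};
    std::deque<size_t> free_;
};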
But we cannot move these just-cached blocks to free_ here. It's better to use the already-allocated free blocks first and then evict the cached blocks if GPU memory is not enough.
When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.
@lzhangzz The Prefix Cache has been modified largely according to your suggestions. We just need to discuss this part; when would it be convenient for you to take a look? Thanks.
@lzhangzz When Prefix Cache is enabled, we are unable to add the sequence blocks to freed_ here. Even though we do not explicitly perform this action, the status of freed_ will be updated by the next Materialize. I think it's acceptable and just a tradeoff.
Preemption won't work because
LGTM cc @lzhangzz
We still have some unclosed issues.
@lzhangzz The original intention of this feature is to address system-prompt prefix caching. Especially in search scenarios at Internet companies, the current common practice is to run SFT on top of a SOTA chat model and use prompt engineering to turn it into a question-answering assistant for a vertical domain. This usually means that requests share one or more common, similar system prefixes. The current implementation is compatible with the existing designs and implementations of LMDeploy and meets the requirements mentioned above, so we believe it meets expectations. What you are describing is a more general cache, and we could discuss your suggestion further. Regarding the first point, I also agree with @ispobock that it does not have a significant impact on performance.
Hi @irexyc, could you help review the PR? Thanks.
@@ -336,13 +336,15 @@ def parse_args():
    cache_count_act = ArgumentHelper.cache_max_entry_count(pt_group)
    cache_block_seq_len_act = ArgumentHelper.cache_block_seq_len(pt_group)
    session_len_act = ArgumentHelper.session_len(pt_group, default=2048)
    prefix_caching_act = ArgumentHelper.enable_prefix_caching(pt_group)
Adding the argument for the PyTorch engine?
Yes, for the other benchmark scripts we also add the argument for both the PyTorch engine and the Turbomind engine. @grimoire please help check.
@irexyc Please help check that this will not break VLM when embedding inputs are present.
Please add
Motivation
#1407
Modification