
Turbomind prefix caching #1450

Merged: 17 commits, merged on May 15, 2024

Conversation

@ispobock (Contributor) commented Apr 18, 2024

Motivation

#1407

Modification


@zhyncs (Collaborator) commented Apr 18, 2024

TODO: need to add compatibility testing with AWQ, online KV Cache Int4 and Int8 @ispobock

Also need to test the case when TP is turned on.

Review thread on lmdeploy/messages.py (outdated, resolved)
@ispobock (Contributor, Author) commented Apr 19, 2024

Benchmark with the method mentioned in #1407 (comment).
Settings:

engine: Turbomind
model: llama2-13B-chat
num prompts: 1000

Using the LMDeploy benchmark script (as in #1429 (comment)):
w/o prefix caching:

concurrency: 128
elapsed_time: 168.270s

number of prompt tokens: 332115
number of completion tokens: 241536
token throughput (completion token): 1435.405 token/s
token throughput (prompt + completion token): 3409.104 token/s
RPS (request per second): 5.943 req/s
RPM (request per minute): 356.569 req/min

with prefix caching:

concurrency: 128
elapsed_time: 146.064s

number of prompt tokens: 332115
number of completion tokens: 241536
token throughput (completion token): 1653.630 token/s
token throughput (prompt + completion token): 3927.392 token/s
RPS (request per second): 6.846 req/s
RPM (request per minute): 410.779 req/min

Using the vLLM benchmark script (as in #1407 (comment)):
w/o prefix caching:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  112.31
Total input tokens:                      336509
Total generated tokens:                  160192
Request throughput (req/s):              8.90
Input token throughput (tok/s):          2996.25
Output token throughput (tok/s):         1426.34
---------------Time to First Token----------------
Mean TTFT (ms):                          39691.85
Median TTFT (ms):                        35011.06
P99 TTFT (ms):                           101250.62

with prefix caching:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  97.21
Total input tokens:                      336509
Total generated tokens:                  160178
Request throughput (req/s):              10.29
Input token throughput (tok/s):          3461.66
Output token throughput (tok/s):         1647.75
---------------Time to First Token----------------
Mean TTFT (ms):                          33815.43
Median TTFT (ms):                        31043.85
P99 TTFT (ms):                           86314.32

We can see an almost 15% throughput improvement when enabling prefix caching for the Turbomind engine. Note that the token_id length of the system prompts added in #1407 (comment) is 116, which means only 1 block will be reused. The improvement will be more significant with longer system prompts.
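As a rough illustration (a standalone sketch, not code from this PR, and assuming the default cache_block_seq_len of 64), only complete blocks of the shared prompt can be matched and reused:

#include <cstdio>

int main()
{
    const int block_seq_len   = 64;   // assumed default tokens per KV-cache block
    const int prompt_len      = 116;  // token_id length of the shared system prompt
    const int reusable_blocks = prompt_len / block_seq_len;  // only full blocks are cached and matched

    // prints: reusable blocks: 1 (64 of 116 prompt tokens)
    std::printf("reusable blocks: %d (%d of %d prompt tokens)\n",
                reusable_blocks, reusable_blocks * block_seq_len, prompt_len);
    return 0;
}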

@ispobock (Contributor, Author) commented:

Evaluation result for Internlm2-7b with prefix caching:

dataset                                 version    metric         mode    internlm2-7b-turbomind
--------------------------------------  ---------  -------------  ------  ------------------------
--------- 考试 Exam ---------           -          -              -       -
ceval                                   -          naive_average  gen     64.29
agieval                                 -          -              -       -
mmlu                                    -          naive_average  gen     62.46
GaokaoBench                             -          -              -       -
ARC-c                                   -          -              -       -
--------- 语言 Language ---------       -          -              -       -
WiC                                     d06864     accuracy       gen     56.43
summedits                               -          -              -       -
chid-dev                                -          -              -       -
afqmc-dev                               -          -              -       -
bustm-dev                               -          -              -       -
cluewsc-dev                             -          -              -       -
WSC                                     7902a7     accuracy       gen     53.85
winogrande                              -          -              -       -
flores_100                              -          -              -       -
--------- 知识 Knowledge ---------      -          -              -       -
BoolQ                                   -          -              -       -
commonsense_qa                          -          -              -       -
nq                                      -          -              -       -
triviaqa                                2121ce     score          gen     61.45
--------- 推理 Reasoning ---------      -          -              -       -
cmnli                                   -          -              -       -
ocnli                                   -          -              -       -
ocnli_fc-dev                            -          -              -       -
AX_b                                    -          -              -       -
AX_g                                    -          -              -       -
CB                                      -          -              -       -
RTE                                     -          -              -       -
story_cloze                             -          -              -       -
COPA                                    -          -              -       -
ReCoRD                                  -          -              -       -
hellaswag                               -          -              -       -
piqa                                    -          -              -       -
siqa                                    -          -              -       -
strategyqa                              -          -              -       -
math                                    -          -              -       -
gsm8k                                   1d7fe4     accuracy       gen     71.19
TheoremQA                               -          -              -       -
openai_humaneval                        -          -              -       -
mbpp                                    -          -              -       -
bbh                                     -          -              -       -
--------- 理解 Understanding ---------  -          -              -       -
C3                                      -          -              -       -
CMRC_dev                                -          -              -       -
DRCD_dev                                -          -              -       -
MultiRC                                 -          -              -       -
race-middle                             9a54b6     accuracy       gen     22.77
race-high                               9a54b6     accuracy       gen     22.53
openbookqa_fact                         -          -              -       -
csl_dev                                 -          -              -       -
lcsts                                   -          -              -       -
Xsum                                    -          -              -       -
eprstmt-dev                             -          -              -       -
lambada                                 -          -              -       -
tnews-dev                               -          -              -       -

@lvhan028 (Collaborator) commented:

Fantastic job! Thanks so much.
We plan to release v0.4.0 next Tuesday, mainly focusing on new VLM support and kv4/8 quantization and inference.
Regarding the prefix caching of both engines, I would like to highlight it in v0.5.0, which is planned to be published around May 20th.

@zhyncs (Collaborator) commented Apr 19, 2024

Fantastic job! Thanks so much. We plan to release v0.4.0 next Tuesday, mainly focusing on new VLM support and kv4/8 quantization and inference. Regarding the prefix caching of both engines, I would like to highlight it in v0.5.0, which is planned to be published around May 20th.

ok

@zhyncs (Collaborator) commented Apr 19, 2024

And could you help review the code and give some suggestions? Thanks. @lvhan028 @lzhangzz @grimoire

@ispobock (Contributor, Author) commented:

We plan to release v0.4.0 next Tuesday, mainly focusing on new VLM support and kv4/8 quantization and inference.
Regarding the prefix caching of both engines, I would like to highlight it in v0.5.0, which is planned to be published around May 20th.

@lvhan028 Are there any planned features for the Turbomind engine in the next month? Hopefully there won't be too many code conflicts.

@zhyncs (Collaborator) commented Apr 19, 2024

We plan to release v0.4.0 next Tuesday, mainly focusing on new VLM support and kv4/8 quantization and inference.
Regarding the prefix caching of both engines, I would like to highlight it in v0.5.0, which is planned to be published around May 20th.

@lvhan028 Are there any planned features for the Turbomind engine in the next month? Hopefully there won't be too many code conflicts.

especially in LlamaBatch

@lvhan028 (Collaborator) commented:

There are definitely conflicts due to #1458

@zhyncs (Collaborator) commented Apr 19, 2024

There are definitely conflicts due to #1458

There is almost no impact. As long as the refactoring of LlamaBatch (decoupling batch and model) happens after v0.5.0, there will not be any major impact.

@ispobock (Contributor, Author) commented:

There are definitely conflicts due to #1458

Got it. It seems there is no big conflict with this feature.

@ispobock (Contributor, Author) commented Apr 22, 2024

The evaluation result for Turbomind prefix caching + AWQ + online kv cache int4 + tp2:

dataset                                 version    metric         mode    internlm2-chat-7b-4bits-turbomind
--------------------------------------  ---------  -------------  ------  ----------------------------------
--------- 考试 Exam ---------           -          -              -       -
ceval                                   -          naive_average  gen     51.35
agieval                                 -          -              -       -
mmlu                                    -          naive_average  gen     53.39
GaokaoBench                             -          -              -       -
ARC-c                                   -          -              -       -
--------- 语言 Language ---------       -          -              -       -
WiC                                     d06864     accuracy       gen     52.19
summedits                               -          -              -       -
chid-dev                                -          -              -       -
afqmc-dev                               -          -              -       -
bustm-dev                               -          -              -       -
cluewsc-dev                             -          -              -       -
WSC                                     7902a7     accuracy       gen     63.46
winogrande                              -          -              -       -
flores_100                              -          -              -       -
--------- 知识 Knowledge ---------      -          -              -       -
BoolQ                                   -          -              -       -
commonsense_qa                          -          -              -       -
nq                                      -          -              -       -
triviaqa                                2121ce     score          gen     40.64
--------- 推理 Reasoning ---------      -          -              -       -
cmnli                                   -          -              -       -
ocnli                                   -          -              -       -
ocnli_fc-dev                            -          -              -       -
AX_b                                    -          -              -       -
AX_g                                    -          -              -       -
CB                                      -          -              -       -
RTE                                     -          -              -       -
story_cloze                             -          -              -       -
COPA                                    -          -              -       -
ReCoRD                                  -          -              -       -
hellaswag                               -          -              -       -
piqa                                    -          -              -       -
siqa                                    -          -              -       -
strategyqa                              -          -              -       -
math                                    -          -              -       -
gsm8k                                   1d7fe4     accuracy       gen     39.73
TheoremQA                               -          -              -       -
openai_humaneval                        -          -              -       -
mbpp                                    -          -              -       -
bbh                                     -          -              -       -
--------- 理解 Understanding ---------  -          -              -       -
C3                                      -          -              -       -
CMRC_dev                                -          -              -       -
DRCD_dev                                -          -              -       -
MultiRC                                 -          -              -       -
race-middle                             9a54b6     accuracy       gen     74.16
race-high                               9a54b6     accuracy       gen     67.87
openbookqa_fact                         -          -              -       -
csl_dev                                 -          -              -       -
lcsts                                   -          -              -       -
Xsum                                    -          -              -       -
eprstmt-dev                             -          -              -       -
lambada                                 -          -              -       -
tnews-dev                               -          -              -       -

The evaluation result for AWQ + online kv cache int4 + tp2, without Turbomind prefix caching:

dataset                                 version    metric         mode    internlm2-chat-7b-4bits-turbomind
--------------------------------------  ---------  -------------  ------  -----------------------------------
--------- 考试 Exam ---------           -          -              -       -
ceval                                   -          naive_average  gen     50.92
agieval                                 -          -              -       -
mmlu                                    -          naive_average  gen     53.68
GaokaoBench                             -          -              -       -
ARC-c                                   -          -              -       -
--------- 语言 Language ---------       -          -              -       -
WiC                                     d06864     accuracy       gen     53.29
summedits                               -          -              -       -
chid-dev                                -          -              -       -
afqmc-dev                               -          -              -       -
bustm-dev                               -          -              -       -
cluewsc-dev                             -          -              -       -
WSC                                     7902a7     accuracy       gen     67.31
winogrande                              -          -              -       -
flores_100                              -          -              -       -
--------- 知识 Knowledge ---------      -          -              -       -
BoolQ                                   -          -              -       -
commonsense_qa                          -          -              -       -
nq                                      -          -              -       -
triviaqa                                2121ce     score          gen     40.48
--------- 推理 Reasoning ---------      -          -              -       -
cmnli                                   -          -              -       -
ocnli                                   -          -              -       -
ocnli_fc-dev                            -          -              -       -
AX_b                                    -          -              -       -
AX_g                                    -          -              -       -
CB                                      -          -              -       -
RTE                                     -          -              -       -
story_cloze                             -          -              -       -
COPA                                    -          -              -       -
ReCoRD                                  -          -              -       -
hellaswag                               -          -              -       -
piqa                                    -          -              -       -
siqa                                    -          -              -       -
strategyqa                              -          -              -       -
math                                    -          -              -       -
gsm8k                                   1d7fe4     accuracy       gen     40.03
TheoremQA                               -          -              -       -
openai_humaneval                        -          -              -       -
mbpp                                    -          -              -       -
bbh                                     -          -              -       -
--------- 理解 Understanding ---------  -          -              -       -
C3                                      -          -              -       -
CMRC_dev                                -          -              -       -
DRCD_dev                                -          -              -       -
MultiRC                                 -          -              -       -
race-middle                             9a54b6     accuracy       gen     74.30
race-high                               9a54b6     accuracy       gen     67.52
openbookqa_fact                         -          -              -       -
csl_dev                                 -          -              -       -
lcsts                                   -          -              -       -
Xsum                                    -          -              -       -
eprstmt-dev                             -          -              -       -
lambada                                 -          -              -       -
tnews-dev                               -          -              -       -

The difference in results is mainly caused by the sampling settings in the evaluation code.
The results are close with and without prefix caching, which indicates these features are compatible.

@zhyncs (Collaborator) commented Apr 22, 2024

Fantastic job! Thanks so much. We plan to release v0.4.0 next Tuesday, mainly focusing on new VLM support and kv4/8 quantization and inference. Regarding the prefix caching of both engines, I would like to highlight it in v0.5.0, which is planned to be published around May 20th.

We need to regularly merge the main branch before approving to avoid conflicts.

@lvhan028 added the enhancement (New feature or request) label on Apr 22, 2024
Review thread on src/turbomind/models/llama/SequenceManager.h (outdated, resolved)
Two review threads on src/turbomind/models/llama/SequenceManager.cc (outdated, resolved)
if (input_length) {
    // update tokens in sequence
    seq.tokens.resize(history_length + input_length);
    std::copy_n(input_ids, input_length, &seq.tokens[history_length]);
Collaborator:

Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.

Contributor (Author):

Could you point out what sequence state would be corrupted? And do you have any suggestions to handle this?

Collaborator:

Could you point out what sequence state would be corrupted?

seq.tokens will be inconsistent with what's filled in the KV cache.

And any suggestions to handle this?

We may add a prompt field to Sequence to hold a copy of the input ids for a newly created sequence (start_flag == true). It can be cleared after CacheIfEnabled. This way both stage and interact_count can be eliminated.

Also, be careful when input_embeddings are present. Prefix matching must not match embedding tokens, since the supplied token ids are duplicates of a model-specific dummy id. So the copying of prompt tokens should stop at the first embedding token.
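A minimal sketch of this suggestion (the prompt field and the FillPrompt helper are illustrative names, not the actual TurboMind definitions):

#include <cstdint>
#include <vector>

struct Sequence {
    uint64_t         id;
    std::vector<int> tokens;  // tokens materialized in the KV cache
    std::vector<int> prompt;  // proposed: copy of the input ids, cleared after CacheIfEnabled
};

// Copy the prompt ids of a newly created sequence (start_flag == true), stopping
// at the first embedding placeholder so prefix matching never covers dummy ids.
void FillPrompt(Sequence& seq, const int* input_ids, int input_length, int dummy_embedding_id)
{
    seq.prompt.clear();
    for (int i = 0; i < input_length; ++i) {
        if (input_ids[i] == dummy_embedding_id) {
            break;
        }
        seq.prompt.push_back(input_ids[i]);
    }
}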

@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?

Collaborator:

@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?

We could enable the Prefix Cache only for LLMs and not use it for VLMs.

Contributor (Author):

Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.

Do you mean stop requests? If the sequence gets canceled before finishing, the sequence will be interrupted and erased.

Collaborator:

A stop request only stops the current request; the sequence won't be erased. It can be resumed from any step in its history.

Review thread on src/turbomind/models/llama/SequenceManager.h (outdated, resolved)
-    freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
+    // if prefix cache enabled, blocks will be shared by sequences, cannot be freed immediately
+    if (!block_trie_->enabled()) {
+        freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
Collaborator:

This still needs to be handled; otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used. We used to have a Block::ref_count for counting inactive references (Block::use_count is for active references); this is the case where it's useful.

Contributor (Author):

This still needs to be handled; otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used.

What is the on-demand allocation mode? Currently, allocation seems to happen only in Materialize.

Collaborator:

When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.
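A rough standalone sketch of the on-demand behavior described above (illustrative, not the actual BlockManager code): if the cached blocks of dead sequences never return to the free list, the pool keeps growing one chunk at a time.

#include <vector>

struct BlockPool {
    int              chunk_size;        // corresponds to a positive cache_chunk_size
    int              total_blocks = 0;
    std::vector<int> free_ids;

    explicit BlockPool(int chunk): chunk_size(chunk) {}

    int Allocate()
    {
        if (free_ids.empty()) {
            // grow by one chunk only when the free list is exhausted; cached blocks of
            // dead sequences that are never released force extra growth here
            for (int i = 0; i < chunk_size; ++i) {
                free_ids.push_back(total_blocks++);
            }
        }
        const int id = free_ids.back();
        free_ids.pop_back();
        return id;
    }

    void Free(int id)
    {
        free_ids.push_back(id);  // the cached -> free transition for dead sequences
    }
};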

Contributor (Author):

But we cannot move these just-cached blocks to free_ here. It's better to use the allocated free blocks first and then evict the cached blocks if GPU memory is not enough.

Collaborator:

When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.

@lzhangzz The Prefix Cache has been basically modified according to your suggestions. We just need to discuss this part; when would it be convenient for you to take a look? Thanks.

Collaborator:

@lzhangzz When the Prefix Cache is enabled, we are unable to add the sequence blocks to freed_ here. Even though we do not explicitly perform this action, the status of freed_ will still be updated by the next Materialize. I think it's an acceptable tradeoff.

Review thread on src/turbomind/models/llama/SequenceManager.cc (outdated, resolved)
@lzhangzz (Collaborator) commented:

preempt logic can be applied to solve the starvation problem. C with higher priority, so the blocks in later A_x will be preempted.

Preemption won't work because BlockTrie holds a dummy reference and BlockTrie::evict operates directly on the actual use_count while preemption logic operates on a snapshot of use_count.

@zhyncs (Collaborator) left a review comment:

LGTM cc @lzhangzz

@lzhangzz (Collaborator) commented:

We still have some unresolved issues:

  1. With a batch of sequences sharing previously unseen (or evicted) prefixes, neither computation nor cache blocks are shared.
  2. When something in the prefix cache gets evicted, it can't get into the cache again until a request arrives with the same prefix in its prompt.

@ispobock (Contributor, Author) commented:

@lzhangzz

  • The first one is expected. Currently we only cache and reuse computed blocks to avoid write conflicts. In this design, block reuse may have a one-iteration delay: if a new request wants to match prefix blocks, those blocks must have been cached in previous iterations. I don't think it will affect overall performance too much.
  • For the second one, I am not sure I fully understand. Blocks in the prefix cache will not be evicted when use_count=0. They are only evicted when they are re-allocated (indicated by a unique_id mismatch during verification). Before the re-allocation, they can still be reused. If a block is re-allocated, we don't need to get it back into the cache (see the sketch below).
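A minimal sketch of the unique_id verification described above (names are illustrative, not the actual BlockTrie definitions):

#include <cstdint>

struct Block {
    int      id;
    uint64_t unique_id;  // bumped each time the block is (re-)allocated
};

struct TrieEntry {
    const Block* block;
    uint64_t     cached_unique_id;  // snapshot taken when the block was inserted into the trie
};

// During prefix matching, an entry whose snapshot no longer matches the block's
// current unique_id refers to a re-allocated block and must be discarded.
bool IsReusable(const TrieEntry& entry)
{
    return entry.block->unique_id == entry.cached_unique_id;
}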

@zhyncs (Collaborator) commented Apr 30, 2024

We still have some unresolved issues:

  1. With a batch of sequences sharing previously unseen (or evicted) prefixes, neither computation nor cache blocks are shared.

  2. When something in the prefix cache gets evicted, it can't get into the cache again until a request arrives with the same prefix in its prompt.

@lzhangzz The original intention of this feature design is to solve the problem of caching a shared system prefix. Especially in search scenarios at Internet companies, the classic practice is to run SFT on a SOTA chat model and use prompt engineering to turn it into a question-and-answer assistant for a vertical domain. This usually means that requests share one or more common, similar system prefixes. The current implementation is compatible with the existing designs and implementations of LMDeploy and meets the requirements mentioned above, so we currently believe it meets expectations. What you are discussing is a more universal cache, and we could discuss your suggestion further. Regarding the first point, I also agree with what @ispobock said: it does not have a significant impact on performance.

@zhyncs mentioned this pull request on May 8, 2024
@zhyncs (Collaborator) commented May 9, 2024

Hi @irexyc, could you help review the PR? Thanks.

@@ -336,13 +336,15 @@ def parse_args():
    cache_count_act = ArgumentHelper.cache_max_entry_count(pt_group)
    cache_block_seq_len_act = ArgumentHelper.cache_block_seq_len(pt_group)
    session_len_act = ArgumentHelper.session_len(pt_group, default=2048)
    prefix_caching_act = ArgumentHelper.enable_prefix_caching(pt_group)
Collaborator:

Adding the argument for the pytorch engine?

Contributor (Author):

Yes, for the other benchmark scripts we also add the argument for both the PyTorch engine and the Turbomind engine. @grimoire please help check.

@lzhangzz (Collaborator) commented:

@irexyc Please help check that this will not break VLMs when embedding inputs are present.

@irexyc (Collaborator) commented May 15, 2024

Please add cc to the extensions and fix the formatting issue:

docker run -it --rm --workdir /src -v $(pwd):/src clang-format-lint --clang-format-executable /clang-format/clang-format11 -r  --inplace True --style=file --extensions "h,c,cpp,hpp,cu,cuh,cc" src
modified:   src/turbomind/models/llama/BlockTrie.cc
modified:   src/turbomind/models/llama/LlamaBatch.cc
modified:   src/turbomind/models/llama/SequenceManager.cc

Labels: enhancement (New feature or request)
Projects: None yet
6 participants