Turbomind prefix caching #1450
Conversation
TODO: need to add compatibility testing with AWQ, online KV Cache Int4 and Int8 @ispobock
Also need to test the case when
Benchmark with method mentioned in #1407 (comment).
Use LMDeploy benchmark script (used in #1429 (comment)):
with prefix caching:
Use vLLM benchmark script (used in #1407 (comment)):
with prefix caching:
We can see almost 15% throughput improvement when enabling prefix caching for the Turbomind engine. Actually, the token_id length of the system prompts added in #1407 (comment) is
Evaluation result for Internlm2-7b with prefix caching:
Fantastic job! Thanks so much.
ok
@lvhan028 Are there any planned features for the Turbomind engine in the next month? Hopefully there won't be too many code conflicts.
especially in
There are definitely conflicts due to #1458
There should be almost no impact; as long as the refactoring of LlamaBatch (decoupling batch and model) comes after v0.5.0, there will not be any major impact.
Got it. It seems there is no big conflict with this feature.
The evaluation result for
The evaluation result for
The result diff is mainly caused by the sampling settings in the evaluation code.
We need to regularly merge the main branch before approving to avoid conflicts.
if (input_length) {
    // update tokens in sequence
    seq.tokens.resize(history_length + input_length);
    std::copy_n(input_ids, input_length, &seq.tokens[history_length]);
Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.
Could you point out what sequence state would be corrupted? And any suggestions to handle this?
Could you point out what sequence state would be corrupted?
seq.tokens will be inconsistent with what's filled in the KV cache.
And any suggestions to handle this?
We may add a prompt field for Sequence to hold a copy of the input ids of a newly created sequence (start_flag == true). It can be cleared after CacheIfEnabled. This way both stage and interact_count can be eliminated.
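A minimal sketch of this idea (the member names and call sites below are illustrative assumptions, not the actual TurboMind definitions):

#include <cstdint>
#include <vector>

// Sketch only: a `prompt` field on Sequence holding the input ids of a newly
// created sequence, so `tokens` is not touched before the KV cache is filled.
struct Sequence {
    uint64_t         id{};
    std::vector<int> tokens;  // tokens whose KV cache entries are already valid
    std::vector<int> prompt;  // copy of input ids when start_flag == true
    // ... blocks, cache_len, etc. omitted
};

// Hypothetical flow in ProcessInferRequests:
//     if (start_flag) { seq.prompt.assign(input_ids, input_ids + input_length); }
// In CacheIfEnabled, after the prompt blocks have been matched/inserted into the trie:
//     seq.prompt.clear();  // the `stage` / `interact_count` bookkeeping becomes unnecessary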
Also, be careful when input_embeddings exist. Prefix matching must avoid matching embedding tokens, as the supplied token ids are duplicates of a model-specific dummy id. So the copying of prompt tokens should stop at the first embedding token.
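For illustration, a hedged sketch of stopping the prompt copy at the first embedding token, assuming embedding positions are available as (begin, end) offset ranges; the real interface around input_embeddings may differ:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: copy prompt tokens only up to the first embedding token, so the
// prefix trie never matches on the duplicated dummy ids.
std::vector<int> CopyPromptPrefix(const int* input_ids,
                                  size_t     input_length,
                                  const std::vector<std::pair<size_t, size_t>>& embedding_ranges)
{
    size_t end = input_length;
    for (const auto& range : embedding_ranges) {
        end = std::min(end, range.first);  // stop before the earliest embedding token
    }
    return std::vector<int>(input_ids, input_ids + end);
}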
@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?
@irexyc What's the best way to get the first occurrence of embedding tokens in ProcessInferRequests?
We could enable Prefix Cache only for LLMs and not use it for VLMs.
Updating seq.tokens here will corrupt the sequence state if it gets canceled before finishing.
Do you mean stop requests? If the sequence gets canceled before finishing, the sequence will be interrupted and erased.
A stop request only stops the current request; the sequence won't be erased. It can be resumed from any step in its history.
-    freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
+    // if prefix cache enabled, blocks will be shared by sequences, cannot be freed immediately
+    if (!block_trie_->enabled()) {
+        freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
This still needs to be handled, otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used. We used to have a Block::ref_count for counting inactive references (Block::use_count for active references); this is the case where it's useful.
This still needs to be handled, otherwise many more blocks than needed will be allocated when the on-demand allocation mode is used.
What's the on-demand allocation mode? Currently the allocation seems to only happen in Materialize.
When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.
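As a rough illustration of that on-demand mode (a sketch with assumed names, not the real BlockManager code):

#include <cstddef>
#include <deque>

// Sketch: grow the block pool one chunk at a time, only when the free list
// cannot satisfy a request. Failing to recycle dead sequences' cached blocks
// would force extra chunks to be allocated here.
class BlockPoolSketch {
public:
    explicit BlockPoolSketch(size_t cache_chunk_size): chunk_size_(cache_chunk_size) {}

    // Make sure at least `count` free blocks are available.
    void Reserve(size_t count)
    {
        while (free_.size() < count) {
            for (size_t i = 0; i < chunk_size_; ++i) {
                free_.push_back(next_block_id_++);  // stand-in for a real device-memory chunk
            }
        }
    }

private:
    size_t             chunk_size_;
    size_t             next_block_id_{0};
    std::deque<size_t> free_;
};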
But we cannot move these just-cached blocks to free_ here. It's better to use the already-allocated free blocks first and then evict the cached blocks if GPU memory is not enough.
When cache_chunk_size is set to a positive value, instead of pre-allocating all blocks, the block manager only allocates another chunk when there are not enough free blocks. Failing to release the cached blocks (cached -> free) of dead sequences will make the manager allocate more blocks than needed.
@lzhangzz The Prefix Cache has been modified largely according to your suggestions. We just need to discuss this part; when would it be convenient for you to take a look? Thanks.
@lzhangzz When Prefix Cache is enabled, we are unable to add the sequence blocks to freed_ here. Even though we do not explicitly perform this action, the status of freed_ will be updated by the next Materialize. I think it's acceptable and just a tradeoff.
Preemption won't work because
LGTM cc @lzhangzz
We still have some unclosed issues.
@lzhangzz The original intention of this feature is to address system-prompt prefix caching. Especially in search scenarios at Internet companies, the current common practice is to run SFT on top of a SOTA chat model and use prompt engineering to turn it into a question-answering assistant for a vertical domain. This usually means that requests share one or more common, similar system prefixes. The current implementation is compatible with the existing designs and implementations of LMDeploy and meets the requirements mentioned above, so we believe it meets expectations. What you are describing is a more general cache, and we could discuss your suggestion further. Regarding the first point, I also agree with @ispobock that it does not have a significant impact on performance.
Hi @irexyc, could you help review the PR? Thanks.
@@ -336,13 +336,15 @@ def parse_args():
    cache_count_act = ArgumentHelper.cache_max_entry_count(pt_group)
    cache_block_seq_len_act = ArgumentHelper.cache_block_seq_len(pt_group)
    session_len_act = ArgumentHelper.session_len(pt_group, default=2048)
    prefix_caching_act = ArgumentHelper.enable_prefix_caching(pt_group)
Adding the argument for the PyTorch engine?
Yes, for the other benchmark scripts we also add the argument for both the PyTorch engine and the Turbomind engine. @grimoire please help check.
@irexyc Please help check that this will not break VLM when embedding inputs are present.
Please add
Motivation
#1407
Modification