server : reuse context chunks #9866
Conversation
Does this work similarly to Koboldcpp's context shift?
Yes, it's the same idea as proposed in #5793. I've been experimenting today with context reuse for code completion and the results seem promising.
Btw @ggerganov, I remember a while ago there was a discussion about storing token IDs in the KV cache. I'm wondering if it would be complicated to add an API like
(The branch was force-pushed from a6b048e to 27addf5, commit message: ggml-ci.)
We should extend the API to support that. Maybe
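One hypothetical shape such an extension could take is sketched below. The function name and signature here are invented purely for illustration; they are not part of the llama.cpp API. Only the referenced types (`struct llama_context`, `llama_seq_id`, `llama_token`) are existing llama.cpp types.

```c
// HYPOTHETICAL sketch, for illustration only -- this function does not
// exist in llama.cpp; the name and signature are invented here.
//
// Copy the token ids currently stored in the KV cache for sequence
// `seq_id` into `tokens` (up to `n_max` entries) and return the number
// written, so a caller such as the server could reconstruct exactly
// what is cached instead of tracking it separately.
int32_t llama_kv_cache_seq_get_tokens(
        struct llama_context * ctx,
        llama_seq_id           seq_id,
        llama_token          * tokens,
        int32_t                n_max);
```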
I have a small question regarding the illustration in the description:
AFAIU we only skip the
It's skipped mainly to simplify the batch construction: with the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the
The alternative you suggest is to reuse the
There is no longer the concept of

I'm very interested in trying this approach and seeing if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
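The greedy strategy described above (stop at the first prompt token that cannot be reused, so the remaining tokens form one contiguous batch) can be sketched roughly as follows. This is a hypothetical simplification for illustration, not the actual server code; the function name and return shape are invented here.

```python
def find_reusable_chunks(cached, prompt, min_chunk):
    """Greedily match chunks of the new prompt against the cached tokens.

    `min_chunk` (>= 1) plays the role of the --cache-reuse value.
    Returns (reused, chunks): `reused` is the number of leading prompt
    tokens whose KV entries can be kept, and `chunks` is a list of
    (cache_start, prompt_start, length) tuples describing how each reused
    cache chunk would need to be position-shifted (cf. the role of
    llama_kv_cache_seq_add). Matching stops at the first prompt token
    that cannot be reused, so the rest is processed as one batch.
    """
    chunks = []
    i = j = 0  # i -> position in prompt, j -> position in cache
    while i < len(prompt) and j < len(cached):
        best = None
        # find the next cache position where a long-enough match starts
        for k in range(j, len(cached)):
            n = 0
            while (i + n < len(prompt) and k + n < len(cached)
                   and prompt[i + n] == cached[k + n]):
                n += 1
            if n >= min_chunk:
                best = (k, i, n)
                break
        if best is None:
            break  # first non-reusable token: stop and batch the rest
        chunks.append(best)
        j = best[0] + best[2]
        i += best[2]
    return i, chunks
```

Using the illustration from the description (cached tokens `aaaaa..ccccccc..eeeeee..ff..hhhhhhh` with non-matching material in between, new prompt `aaaaaccccccceeeeeeffhhhhhhh`), a `min_chunk` of 3 reuses `aaaaa`, `ccccccc` and `eeeeee` but stops at `ff`, while a `min_chunk` of 1 reuses the entire prompt.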
ref #5793

Overview

Using a positive `--cache-reuse` argument with `llama-server` will attempt to reuse KV chunks with size equal to or larger than the specified value. The KV cache of reused chunks will be shifted (see `llama_kv_cache_seq_add()`) to the respective position and processing for these tokens will be skipped. Only chunks without control/special tokens will be reused.

Here is an illustration. Upon submitting `prompt 1` for processing, after `prompt 0` has been processed and cached:

- `--cache-reuse 0`: only the `aaaaa` prefix will be reused
- `--cache-reuse 1`: the entire `aaaaaccccccceeeeeeffhhhhhhh` will be reused
- `--cache-reuse 3`: only the `aaaaaccccccceeeeee` part will be reused

The cache reuse will be done only for requests with `"cache_prompt": true`.

Example
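A minimal invocation might look like the following. The model path, port, and request body are placeholders chosen for illustration; `--cache-reuse` and `"cache_prompt"` are the options described above.

```shell
# start the server, reusing cached KV chunks of 256 or more tokens
./llama-server -m model.gguf --port 8080 --cache-reuse 256

# requests must opt in to prompt caching with "cache_prompt": true
curl http://localhost:8080/completion -d '{
  "prompt": "...",
  "n_predict": 64,
  "cache_prompt": true
}'
```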