Server: reuse cached tokens for shifted prompt #5793

Closed

ngxson opened this issue Feb 29, 2024 · 2 comments
Labels: enhancement (New feature or request), stale

Comments

ngxson (Collaborator) commented Feb 29, 2024

Motivation

Currently, cached tokens are reused in the server by computing common_part(new_tokens, cached_tokens) (see the sketch after the example below).

This works well when all incoming requests share the same prefix:

cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
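
For reference, common_part boils down to a longest-common-prefix scan over the two token lists. A minimal sketch (the actual helper in the server code may differ in details):

```cpp
#include <cstddef>
#include <vector>

#include "llama.h" // for llama_token

// Length of the longest common prefix of the cached tokens and the
// newly tokenized prompt; these first `i` cache entries can be reused as-is.
static size_t common_part(const std::vector<llama_token> & a,
                          const std::vector<llama_token> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```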

However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:

cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x

Proposal

My proposal is to detect such cases and use llama_kv_cache_seq_rm + llama_kv_cache_seq_add to shift the tokens in the cache accordingly, as sketched after the diagram below.

cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
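
A minimal sketch of the cache manipulation, assuming the llama.cpp C API (llama_kv_cache_seq_rm removes positions [p0, p1) from a sequence, llama_kv_cache_seq_add shifts positions [p0, p1) by delta). The helper and the names n_keep, n_discard, and n_cached are hypothetical, standing for the common-prefix length, the number of dropped tokens, and the current cached length of the sequence:

```cpp
#include "llama.h"

// Hypothetical helper: make a left-shifted suffix of the cache reusable.
// For the example above: n_keep = 3 ("a b c"), n_discard = 3 ("d e f"),
// n_cached = 9 ("a b c d e f g h i").
static void shift_cached_suffix(llama_context * ctx, llama_seq_id seq_id,
                                int n_keep, int n_discard, int n_cached) {
    // Drop the removed middle chunk [n_keep, n_keep + n_discard).
    llama_kv_cache_seq_rm (ctx, seq_id, n_keep, n_keep + n_discard);

    // Shift the surviving suffix [n_keep + n_discard, n_cached) left by
    // n_discard, so "g h i" lands at positions 3..5 and lines up with the
    // new prompt; only "k l m" then needs to be evaluated.
    llama_kv_cache_seq_add(ctx, seq_id, n_keep + n_discard, n_cached, -n_discard);
}
```

This mirrors the pair of calls the server's context-shift logic already uses when the context overflows; the new part here would be detecting the shifted suffix in the incoming prompt.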

I already tested this kind of behavior on my side. It works well, but the catch is that it only works with a single "conversation". Also, I have no idea whether it has negative impacts if done frequently (i.e. fragmenting the cache?) @ggerganov

ngxson added the enhancement (New feature or request) label Feb 29, 2024
ggerganov (Owner) commented:

It's possible to do that, but we should do it at a later stage. KV cache management is tricky, and this will add some extra complexity.

github-actions bot added the stale label Apr 1, 2024
github-actions bot commented:

This issue was closed because it has been inactive for 14 days since being marked as stale.
