Motivation
Currently, cached tokens are reused in the server by computing common_part(new_tokens, cached_tokens). This works well when all incoming requests share the same prefix (a sketch of the matching step follows the example):
cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
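For reference, the matching step is conceptually just a longest-common-prefix scan. Here is a minimal sketch; the actual common_part helper in the server may differ in its details:

```cpp
#include <vector>

#include "llama.h" // for llama_token

// Return the number of leading tokens shared by both sequences;
// only this common prefix can currently be reused from the KV cache.
static size_t common_part(const std::vector<llama_token> & a,
                          const std::vector<llama_token> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```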
However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:
cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x
Proposal
My proposal is to detect such cases and use llama_kv_cache_seq_rm + llama_kv_cache_seq_add to shift the tokens in the cache accordingly (a sketch follows the example):
cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
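A minimal sketch of the cache manipulation for the example above, assuming the matching offsets have already been detected. The helper name and the offset parameters are illustrative, not an actual server implementation:

```cpp
#include "llama.h"

// Hypothetical helper: reuse a shifted segment of the KV cache for one sequence.
// For the example above: keep_end = 3 (right after "a b c"), seg_begin = 6 and
// seg_end = 9 ("g h i" in the old cache), so the tail is shifted left by 3.
static void shift_cache_segment(llama_context * ctx, llama_seq_id seq_id,
                                llama_pos keep_end,   // end of the common prefix
                                llama_pos seg_begin,  // start of the reusable tail in the old cache
                                llama_pos seg_end) {  // end of the reusable tail in the old cache
    // Remove the cells corresponding to the dropped messages: [keep_end, seg_begin)
    llama_kv_cache_seq_rm (ctx, seq_id, keep_end, seg_begin);
    // Shift the surviving tail left so it lines up right after the kept prefix
    llama_kv_cache_seq_add(ctx, seq_id, seg_begin, seg_end, keep_end - seg_begin);
    // After this, only the genuinely new tokens ("k l m") need to be decoded
}
```

Detecting seg_begin/seg_end robustly (e.g. via a longest-matching-segment search after the common prefix) is the part that would need the most care in the server.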
I already tested this kind of behavior on my side. It works well, but the catch is that it only works with one single "conversation". Also, I have no idea whether it has negative impacts when done frequently (i.e. fragmenting the cache?) @ggerganov