Motivation
Currently, cached tokens are reused in the server by computing common_part(new_tokens, cached_tokens). This works well when all incoming requests share the same prefix (a sketch of the matching step follows the example):
cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
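For reference, the matching step is conceptually just a longest-common-prefix scan. Here is a minimal sketch; the actual common_part helper in the server may differ in its details:

```cpp
#include <vector>

#include "llama.h" // for llama_token

// Return the number of leading tokens shared by both sequences;
// only this common prefix can currently be reused from the KV cache.
static size_t common_part(const std::vector<llama_token> & a,
                          const std::vector<llama_token> & b) {
    size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        i++;
    }
    return i;
}
```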
However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:
cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x
Proposal
My proposal is to detect such cases and use llama_kv_cache_seq_rm + llama_kv_cache_seq_add to shift the tokens in the cache accordingly (a sketch follows the example):
cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
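A minimal sketch of the cache manipulation for the example above, assuming the matching offsets have already been detected. The helper name and the offset parameters are illustrative, not an actual server implementation:

```cpp
#include "llama.h"

// Hypothetical helper: reuse a shifted segment of the KV cache for one sequence.
// For the example above: keep_end = 3 (right after "a b c"), seg_begin = 6 and
// seg_end = 9 ("g h i" in the old cache), so the tail is shifted left by 3.
static void shift_cache_segment(llama_context * ctx, llama_seq_id seq_id,
                                llama_pos keep_end,   // end of the common prefix
                                llama_pos seg_begin,  // start of the reusable tail in the old cache
                                llama_pos seg_end) {  // end of the reusable tail in the old cache
    // Remove the cells corresponding to the dropped messages: [keep_end, seg_begin)
    llama_kv_cache_seq_rm (ctx, seq_id, keep_end, seg_begin);
    // Shift the surviving tail left so it lines up right after the kept prefix
    llama_kv_cache_seq_add(ctx, seq_id, seg_begin, seg_end, keep_end - seg_begin);
    // After this, only the genuinely new tokens ("k l m") need to be decoded
}
```

Detecting seg_begin/seg_end robustly (e.g. via a longest-matching-segment search after the common prefix) is the part that would need the most care in the server.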
I already tested this kind of behavior on my side. It works well, but the catch is that it only works with one single "conversation". Also, I have no idea whether it has negative impacts when done frequently (i.e. fragmenting the cache?) @ggerganov