server : reuse context chunks #9866

Merged
merged 1 commit into master from gg/server-reuse-context on Oct 13, 2024

Conversation

@ggerganov (Owner) commented on Oct 12, 2024

ref #5793

Overview

Passing a positive --cache-reuse value to llama-server makes it attempt to reuse KV cache chunks whose size is equal to or larger than the specified value. The KV cache of reused chunks is shifted (see llama_kv_cache_seq_add()) to the respective new positions, and processing for these tokens is skipped. Only chunks without control/special tokens are reused. Here is an illustration:

# here each letter generally corresponds to a different token
# same letters represent groups of tokens that are the same in both requests, but are located in different positions

# prompt 0 (cached)
aaaaabbbbbcccccccdddddeeeeeexffggggghhhhhhhxxxxxxxxx

# prompt 1
aaaaaccccccceeeeeeffhhhhhhhyyyyyyyy

Upon submitting prompt 1 for processing, after prompt 0 has been processed and cached:

  • --cache-reuse 0: only the aaaaa prefix will be reused
  • --cache-reuse 1: the entire aaaaaccccccceeeeeeffhhhhhhh will be reused
  • --cache-reuse 3: only the aaaaaccccccceeeeee part will be reused

Cache reuse is performed only for requests with "cache_prompt": true.
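Roughly, the reuse step boils down to the loop sketched below. This is a simplified illustration, not the exact server code: the function name `reuse_cached_chunks` and the variable names are made up for this example, the control/special-token check mentioned above is omitted for brevity, and only the two `llama.h` calls (`llama_kv_cache_seq_rm`, `llama_kv_cache_seq_add`) are real public API.

```cpp
#include <cstddef>
#include <vector>

#include "llama.h"

// Simplified sketch of the chunk-reuse idea (not the actual server implementation).
// cache_tokens  : tokens currently stored in the KV cache for this sequence
// prompt_tokens : tokens of the new prompt
// n_past        : length of the common prefix that is already reused
// n_reuse       : value of --cache-reuse
static int reuse_cached_chunks(
        llama_context * ctx, llama_seq_id seq_id,
        const std::vector<llama_token> & cache_tokens,
        const std::vector<llama_token> & prompt_tokens,
        int n_past, int n_reuse) {
    if (n_reuse <= 0) {
        return n_past; // feature disabled - only the common prefix is reused
    }

    size_t head_c = n_past; // read position in the cached tokens
    size_t head_p = n_past; // position in the new prompt

    while (head_c < cache_tokens.size() && head_p < prompt_tokens.size()) {
        // length of the identical run starting at the two heads
        size_t n_match = 0;
        while (head_c + n_match < cache_tokens.size() &&
               head_p + n_match < prompt_tokens.size() &&
               cache_tokens[head_c + n_match] == prompt_tokens[head_p + n_match]) {
            n_match++;
        }

        if (n_match >= (size_t) n_reuse) {
            // drop the cached tokens the chunk slides over, then shift the
            // matching chunk to its position in the new prompt
            const llama_pos shift = (llama_pos) head_p - (llama_pos) head_c;

            llama_kv_cache_seq_rm (ctx, seq_id, (llama_pos) head_p, (llama_pos) head_c);
            llama_kv_cache_seq_add(ctx, seq_id, (llama_pos) head_c, (llama_pos) (head_c + n_match), shift);

            n_past += (int) n_match;
            head_c += n_match;
            head_p += n_match;
        } else {
            // chunk too small to reuse - look further ahead in the cache
            head_c += 1;
        }
    }

    return n_past; // everything past n_past is evaluated as usual
}
```

Note that once a run is too short to reuse, the prompt head does not advance, so nothing after it gets reused either - this is the behavior discussed later in the conversation for the hhhhhhh chunk.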

Example

# start a server with cache reusing enabled
./llama-server -m ${model.gguf} --port 8012 --cache-reuse 512

# long request with the word "hello" repeated 512 times
chunk=$(printf 'hello %.0s' {1..512})
curl \
    --request POST --url http://127.0.0.1:8012/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Some prefix. Reuse: '"${chunk}"'", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# ... computes 519 tokens ...

# submit new request with the prefix removed. note the leading space before "Reuse"
curl \
    --request POST --url http://127.0.0.1:8012/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": " Reuse: '"${chunk}"'", "n_predict": 1, "cache_prompt": true, "temperature": 0.0}' | jq

# ... reuses 516 tokens and computes just 1 token ...

@wooooyeahhhh

Does this work similarly to Koboldcpp's context shift?

@ngxson (Collaborator) commented on Oct 12, 2024

Does this work similarly to Koboldcpp's context shift?

If I understand correctly from this post, then yes, it does.

I previously opened a similar feature request in #5793, which will now be possible thanks to the current PR.

@ggerganov (Owner, Author)

Yes, it's the same idea as proposed in #5793. I've been experimenting with context reuse for code completion today and the results seem promising.

@ngxson (Collaborator) commented on Oct 12, 2024

Btw @ggerganov, I remember that a while ago there was a discussion about storing token IDs in the KV cache. I'm wondering whether it would be complicated to add an API like llama_kv_get_tokens(int seq_id) and use it instead of having to keep the actual KV cache and slot.cache_tokens in sync. What do you think?

ggerganov marked this pull request as ready for review on October 13, 2024 at 10:24
@ggerganov (Owner, Author)

We should extend the API to support that. Maybe llama_token id = llama_kv_cache_seq_get_token(ctx, seq_id, pos);
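For illustration, a minimal sketch of how the server could use such an accessor instead of mirroring the tokens in slot.cache_tokens. Note that llama_kv_cache_seq_get_token is only the API shape proposed in this comment and does not exist in llama.h; the helper get_cached_tokens is likewise hypothetical.

```cpp
#include <vector>

#include "llama.h"

// Hypothetical: llama_kv_cache_seq_get_token is the accessor proposed above,
// not an existing llama.h function.
std::vector<llama_token> get_cached_tokens(llama_context * ctx, llama_seq_id seq_id, int n_past) {
    std::vector<llama_token> tokens;
    tokens.reserve(n_past);

    for (llama_pos pos = 0; pos < n_past; ++pos) {
        // query the token id stored in the KV cache at (seq_id, pos)
        tokens.push_back(llama_kv_cache_seq_get_token(ctx, seq_id, pos));
    }

    return tokens; // would replace the manually maintained slot.cache_tokens
}
```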

ggerganov mentioned this pull request on Oct 13, 2024
ggerganov merged commit c7181bd into master on Oct 13, 2024
58 checks passed
ggerganov deleted the gg/server-reuse-context branch on October 13, 2024 at 15:52
@ngxson (Collaborator) commented on Nov 1, 2024

I have a small question regarding the illustration in the description:

--cache-reuse 3: only the aaaaaccccccceeeeee part will be reused

AFAIU we only skip the ff part because its length is less than 3. But in this case, why is the next part hhhhhhh also skipped?

@ggerganov (Owner, Author)

It's skipped mainly to simplify the batch construction:

With the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the llama_batch for the new prompt, we start from n_past and add all remaining tokens with increasing positions:

n_past:   f
n_past+1: f
n_past+2: h
n_past+3: h
...
n_past+2+H+Y: y

The alternative you suggest is to also reuse the h chunk. In that case, the new batch would have to look like this:

pos_f:    f
pos_f+1:  f
pos_y:    y
pos_y+1:  y
...
pos_y+Y:  y

There is no longer the concept of n_past. Instead, we would have to maintain more complicated information about the token positions.

I'm very interested in trying this approach and seeing if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
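For context, here is a rough sketch of the "single n_past" batch construction described above, i.e. the first of the two layouts. It is illustrative only: the helper name build_prompt_batch is made up, while llama_batch_init/llama_batch_free and the llama_batch fields are the public llama.h API. The real server code also makes sure at least one token is left to evaluate so that logits are produced.

```cpp
#include <vector>

#include "llama.h"

// Sketch: append every token that was not reused with strictly increasing
// positions starting at n_past (illustrative, not the actual server code).
llama_batch build_prompt_batch(
        const std::vector<llama_token> & prompt_tokens,
        int n_past,              // tokens already in the KV cache (prefix + reused chunks)
        llama_seq_id seq_id) {
    const int n_eval = (int) prompt_tokens.size() - n_past;

    llama_batch batch = llama_batch_init(n_eval, /*embd =*/ 0, /*n_seq_max =*/ 1);

    for (int i = 0; i < n_eval; ++i) {
        batch.token   [i]    = prompt_tokens[n_past + i];
        batch.pos     [i]    = n_past + i;          // positions simply continue from n_past
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = seq_id;
        batch.logits  [i]    = (i == n_eval - 1);   // request logits only for the last token
    }
    batch.n_tokens = n_eval;

    return batch; // the caller frees it with llama_batch_free()
}
```

Reusing the h chunk as in the second layout would mean the positions in batch.pos could no longer be derived from a single n_past counter.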
