[RFC]: Support sparse KV cache framework #5751

chizhang118 · 2024-06-21T20:21:39Z

Motivation

For current large model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Several recent papers have approached this issue from different angles (see the table for a detailed comparison), including:

FastDecode: This method offloads all KV cache computation to the CPU, so both the computation and the storage of the KV cache happen on the CPU.
Compression methods based on quantization (GEAR, Mixed Precision): Various quantization techniques reduce the size of each token's KV cache entry without reducing the number of tokens stored in the KV cache. These methods may also produce residual and outlier matrices, which must be stored in memory but not in the KV cache. They may also quantize the KV cache entries of unimportant tokens to reduce the memory footprint of the KV cache.
Partial KV cache eviction (H2O, SnapKV, LESS, Adaptive Compression, Scissorhands, Dynamic Memory Compression, StreamingLLM): Removing relatively unimportant KV cache entries reduces the memory footprint of the KV cache. Essentially, this reduces the number of tokens stored in the KV cache without reducing the size of each token's entry.
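To make the difference between the quantization and eviction families concrete, here is a rough back-of-the-envelope sketch of per-sequence KV cache size. The model configuration below is a hypothetical 7B-class shape chosen for illustration and does not come from the RFC; only the formula (one key and one value vector per token, per layer, per head) is standard.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem):
    """Bytes needed to hold keys and values for num_tokens tokens of one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer, per head.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape (illustrative numbers, not taken from the RFC).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

# FP16 cache holding all 4096 tokens.
baseline = kv_cache_bytes(num_tokens=4096, bytes_per_elem=2, **cfg)
# Quantization: same number of tokens, fewer bytes per element (8-bit here).
quantized = kv_cache_bytes(num_tokens=4096, bytes_per_elem=1, **cfg)
# Eviction: same bytes per element, but only 20% of the tokens are kept.
evicted = kv_cache_bytes(num_tokens=int(4096 * 0.2), bytes_per_elem=2, **cfg)

for name, size in [("baseline", baseline), ("quantized", quantized), ("evicted", evicted)]:
    print(f"{name}: {size / 2**20:.1f} MiB")
```

Quantization shrinks the `bytes_per_elem` factor, while eviction shrinks `num_tokens`; the two factors are independent, which is why the approaches can in principle be combined.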

When addressing the sparse KV cache issue, we previously considered three options: supporting quantization (vLLM has already implemented this); implementing quantization + outlier + residual like GEAR (not widely applicable, as it requires generating the outlier and residual for each token generation, which is costly); and implementing KV cache accumulation + appendix (not widely applicable, as it requires models to be trained using the same method). In the end, we chose to implement partial KV cache eviction, aiming primarily for generality and abstraction rather than support for one or two specific approaches. Since six of the sparse KV cache methods we found are based on evicting cache entries, this approach is also well suited to being built as a framework and integrated into vLLM.

Sparse KV Cache Workflow

First, let's clarify the required parameters, including:

An optional flag "--sparse-kv-cache-type" that selects the sparse KV cache type. The default is 'auto', which does not use any sparse KV cache type; other values select a specific method, such as attention scores for H2O.
A compression ratio for evicting KV cache entries: 20% if we want to achieve an 80% reduction in KV cache usage. From the compression ratio we can calculate the value of 'n', where the KV cache is recreated every 'n' steps.
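The two parameters above could be wired up roughly as follows. This is only a sketch under stated assumptions: the flag name "--sparse-kv-cache-type" and its 'auto' default come from the text, but the "h2o" choice value, the "--sparse-kv-cache-ratio" flag, the `token_budget` parameter, and the formula for 'n' are all hypothetical. The RFC does not say how 'n' is derived, so the sketch assumes compression is triggered whenever the cache has refilled to a fixed token budget.

```python
import argparse
import math


def build_parser():
    parser = argparse.ArgumentParser()
    # Flag name and 'auto' default are from the RFC; the "h2o" value is hypothetical.
    parser.add_argument("--sparse-kv-cache-type", default="auto", choices=["auto", "h2o"])
    # Hypothetical flag: fraction of KV cache entries kept after each compression.
    parser.add_argument("--sparse-kv-cache-ratio", type=float, default=0.2)
    return parser


def steps_between_compressions(token_budget, ratio):
    """Assumed rule: compress when the cache refills from ratio * budget to budget."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    kept = math.ceil(token_budget * ratio)
    # Each decoding step appends one token, so refilling takes budget - kept steps.
    return max(1, token_budget - kept)


args = build_parser().parse_args(
    ["--sparse-kv-cache-type", "h2o", "--sparse-kv-cache-ratio", "0.25"]
)
n = steps_between_compressions(token_budget=1024, ratio=args.sparse_kv_cache_ratio)
print(args.sparse_kv_cache_type, n)
```

Other rules for 'n' are equally plausible (for example, a fixed interval independent of the budget); the point is only that 'n' can be a pure function of the ratio and a size limit.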

The entire workflow includes:

During the first decoding pass, besides computing the KV values for all input tokens, we also need to calculate and retain the information used for the priority ranking of all token pairs, such as attention scores in H2O.
During each vLLM scheduling step, we check whether 'n' steps have been completed, which indicates that the KV cache needs to be compressed. If so, one or more new KV cache blocks are allocated based on the priority ranking of tokens, and the position information of the input positions is modified accordingly. The block manager then manages the transfer of the corresponding KV blocks from the original sequence group to the new KV blocks. Finally, the reference count of the original KV blocks is decremented, and the original KV blocks may be released.
New KV values are then added to the KV cache until the next compression after another 'n' steps, and this cycle repeats until generation is complete.
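As an illustration of the ranking step in this workflow, the following is a minimal sketch of an H2O-style selection: accumulate the attention each cached token receives, then keep the highest-scoring tokens plus a window of the most recent ones. It is a simplified single-head, pure-Python version; the function names and the `num_recent` parameter are assumptions for illustration and are not taken from the PR.

```python
def accumulate(acc_scores, attn_row):
    """Add one decoding step's attention weights to the running per-token scores.

    attn_row has one weight per cached token plus one for the newly added token.
    """
    updated = [s + a for s, a in zip(acc_scores, attn_row)]
    updated.append(attn_row[-1])  # the new token starts with its own weight
    return updated


def select_tokens_to_keep(acc_scores, keep_ratio, num_recent):
    """Return sorted indices of cached tokens that survive compression."""
    num_tokens = len(acc_scores)
    budget = max(1, int(num_tokens * keep_ratio))
    num_recent = min(num_recent, budget)
    # Always keep the most recent tokens, regardless of their scores.
    recent = list(range(num_tokens - num_recent, num_tokens))
    # Fill the rest of the budget with the highest-scoring older tokens.
    older = range(num_tokens - num_recent)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[: budget - num_recent]
    return sorted(heavy + recent)


scores = [0.9, 0.1, 0.8, 0.05, 0.3, 0.2, 0.1, 0.4, 0.2, 0.6]
kept = select_tokens_to_keep(scores, keep_ratio=0.5, num_recent=2)
print(kept)
```

In the proposed workflow this selection would run once every 'n' scheduling steps, and the returned indices would determine which KV entries are copied into the newly allocated blocks.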

Proposed Change

The main files to modify are:

vllm/core/scheduler.py: Add the logic for checking whether sparse KV cache actions should be taken.
vllm/core/block_manager_v1.py: Add the logic for updating the block table mapping and managing the related allocated and freed blocks.
vllm/worker/model_runner.py: Update the position-related code after sparse KV cache compression and pass blocks_to_sparse_copy to the corresponding models.
Models, such as vllm/model_executor/models/opt.py: Indicate which KV entries should be filtered out.
csrc/attention/attention_kernels.cu, csrc/cache_kernels.cu: Calculate attention scores for selecting "important" tokens' KV and support sparse_cache_copy for copying "important" tokens' KV.
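To give a feel for what the block manager and the sparse copy kernel need to agree on, here is a hypothetical sketch of a copy plan: each kept token is mapped from its slot in the old block table to a compacted slot in newly allocated blocks. The actual layout of blocks_to_sparse_copy in the PR is not described here, so the (block, offset) pair format, the function names, and the block numbers below are assumptions.

```python
import math


def num_blocks_needed(num_kept_tokens, block_size):
    """Number of new physical blocks required to hold the kept tokens."""
    return math.ceil(num_kept_tokens / block_size)


def build_sparse_copy_plan(kept_positions, old_block_table, new_block_table, block_size):
    """Map each kept token from its old (block, offset) slot to a compacted new slot.

    kept_positions: sorted logical token positions that survive compression.
    old_block_table / new_block_table: logical block index -> physical block number.
    """
    plan = []
    for new_pos, old_pos in enumerate(kept_positions):
        src = (old_block_table[old_pos // block_size], old_pos % block_size)
        dst = (new_block_table[new_pos // block_size], new_pos % block_size)
        plan.append((src, dst))
    return plan


kept = [0, 2, 7, 8, 9]         # logical positions chosen by the ranking step
old_table = [5, 9, 3]          # hypothetical physical blocks of the original sequence
new_table = [11, 12]           # hypothetical freshly allocated physical blocks
assert num_blocks_needed(len(kept), block_size=4) == len(new_table)

for src, dst in build_sparse_copy_plan(kept, old_table, new_table, block_size=4):
    print(src, "->", dst)
```

After the copy, the kept tokens occupy logical positions 0 through len(kept) - 1 in the new blocks, which is why the position-related code in the model runner has to be updated, and the old blocks can have their reference counts decremented.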

PR

PR link: #5752

Design doc

https://docs.google.com/document/d/13_cpb31P9VOmPGa_tZ70s7z1vXGP_UenXf1WVuIppCk/

Feedback Period.

No response

CC List.

@simon-mo @youkaichao @zhuohan123 @cadedaniel @ywang96 @WoosukKwon @LiuXiaoxuanPKU

Any Other Things.

No response

robertgshaw2-neuralmagic · 2024-06-21T21:07:30Z

Very exciting!

thesues · 2024-06-21T21:08:24Z

How much GPU memory can be saved? Do you have any benchmark data?

chizhang118 · 2024-06-21T21:12:35Z

> How much GPU memory can be saved? Do you have any benchmark data?

This depends on the sparse KV cache compression ratio. Based on current papers, a 20% compression ratio is a rough number, which means an 80% reduction. We are still waiting for feedback from the community, so there is no benchmark data yet.

Zefan-Cai · 2024-06-23T07:59:42Z

Would you mind adding newly proposed KV cache compression methods other than SnapKV and H2O (e.g., PyramidKV)?

chizhang118 · 2024-06-24T02:13:24Z

> Would you mind adding newly proposed KV cache compression methods other than SnapKV and H2O (e.g., PyramidKV)?

Sure, it should not be difficult to add based on the current framework. Will be on my radar. Thanks!

Zefan-Cai · 2024-06-24T05:11:09Z

> > Would you mind adding newly proposed KV cache compression methods other than SnapKV and H2O (e.g., PyramidKV)?
>
> Sure, it should not be difficult to add based on the current framework. Will be on my radar. Thanks!

Super cool! Thank you so much for your efforts!

simon-mo · 2024-06-24T22:27:23Z

This is exciting indeed. A few things:

I think this can help the [Misc] Add attention sinks #3515 (streaming LLM) in particular with the block manager changes.
@cadedaniel please help review the block manager changes.
@WoosukKwon please help review the paged attention kernel interface change.

Zefan-Cai · 2024-06-27T09:16:56Z

> > Would you mind adding newly proposed KV cache compression methods other than SnapKV and H2O (e.g., PyramidKV)?
>
> Sure, it should not be difficult to add based on the current framework. Will be on my radar. Thanks!

Would you mind @ me when the new method is added? can't wait to have a try with vLLM!

dongxiaolong · 2024-07-04T07:15:05Z

https://github.com/microsoft/MInference
Is there a combination of dynamic sparse attention and sparse KV cache?
The vllm implementation is provided here

Zefan-Cai · 2024-07-04T15:12:49Z

> https://github.com/microsoft/MInference
> Is there a combination of dynamic sparse attention and sparse KV cache?
> The vllm implementation is provided here

This repo does not provide sparse KV cache implementation in vLLM. They only provide HF ones.

dongxiaolong · 2024-07-04T15:31:14Z

> > https://github.com/microsoft/MInference
> > Is there a combination of dynamic sparse attention and sparse KV cache?
> > The vllm implementation is provided here
>
> This repo does not provide sparse KV cache implementation in vLLM. They only provide HF ones.

For vLLM:

    from vllm import LLM, SamplingParams
    from minference import MInference

    # model_name, prompts, and sampling_params are assumed to be defined by the caller.
    llm = LLM(model_name, max_num_seqs=1, enforce_eager=True, max_model_len=128000)

    # Patch MInference Module
    minference_patch = MInference("vllm", model_name)
    llm = minference_patch(llm)

    outputs = llm.generate(prompts, sampling_params)

Using only the kernel:

    from minference import vertical_slash_sparse_attention, block_sparse_attention, streaming_forward

    attn_output = vertical_slash_sparse_attention(q, k, v, vertical_topk, slash)
    attn_output = block_sparse_attention(q, k, v, topk)
    attn_output = streaming_forward(q, k, v, init_num, local_window_num)

For more details, please refer to our Examples and Experiments. You can find more information about the dynamic compiler PIT in this paper and on GitHub.

Zefan-Cai · 2024-07-04T15:43:27Z

Are you an author of this repo? The code you attached does not seem to contain a sparse KV cache implementation, and neither does the Examples folder. Am I missing something?

dongxiaolong · 2024-07-05T02:59:37Z

I am not the author of this repo. It's not sparse kv cache, it's sparse attention. Isn't there something in common?

PatchouliTIS · 2024-07-11T12:51:26Z

Great work! However, I noticed that your implementation currently only adapts the memory-efficient attention backend for xformers. Do you think it would take a lot of work to adapt it for FlashAttention 2 with the current architecture? Or do you plan to support FlashAttention 2 in the future?
https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flash_attn.py

PatchouliTIS · 2024-07-12T14:25:42Z

By the way, I tried a long prompt in your framework and found that in the long-prompt scenario (approximately 3k tokens) the outputs make no sense: they just repeat some tokens until the output limit is reached. Could this be related to the sparse KV implementation?

github-actions · 2024-10-25T02:04:36Z

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

chizhang118 added the RFC label Jun 21, 2024

This was referenced Jun 21, 2024

[Core] Support sparse KV cache framework #5752

Open

[Feature]: Sparse KV cache implementation bd-iaas-us/vllm#11

Open

chizhang118 changed the title ~~[RFC]: Support sparse KV cache~~ [RFC]: Support sparse KV cache framework Jun 21, 2024

Zefan-Cai mentioned this issue Jul 8, 2024

Merge into vLLM, is it possible? Zefan-Cai/PyramidKV#14

Open

simon-mo mentioned this issue Oct 1, 2024

[Roadmap] vLLM Roadmap Q4 2024 #9006

Open

39 tasks

github-actions bot added the stale label Oct 25, 2024

simon-mo added keep-open and removed stale labels Oct 25, 2024
