KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492
Comments
Noteworthy is the fact that llama.cpp only supports an 8-bit K cache; an 8-bit V cache is not implemented yet.
Not true: Q4_0 and Q4_1 K cache quantization works for me and is documented in this PR:
This issue is stale because it has been open for 30 days with no activity.
Is anyone else still interested in this feature? It would be incredibly helpful for running long contexts on systems with limited VRAM.
@ikawrakow Is there anything you could help with to implement this in the project? We have made a lot of progress on weight quants, but we are still using an FP16 KV cache :)
I have been using q8_0 for the K part of the cache for a long time now without any issues.
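For anyone who wants to set this up from the C API rather than the command line, here is a minimal sketch. The `type_k`/`type_v` fields of `llama_context_params` and the function names below reflect my understanding of the API around the time of this issue and may differ in other versions, so treat it as an illustration rather than a reference.

```cpp
// Sketch: requesting a quantized K cache through llama.cpp's C API.
// Assumption: llama_context_params exposes type_k/type_v for the cache
// tensor types, as it did around the time of this issue; names may have
// changed since.
#include "llama.h"

int main(void) {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = GGML_TYPE_Q8_0; // quantized K cache (Q4_0/Q4_1 also reported to work above)
    cparams.type_v = GGML_TYPE_F16;  // V cache left in f16; quantized V reportedly needs flash attention

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        llama_free_model(model);
        return 1;
    }

    // ... tokenize and decode as usual ...

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```

From the command line, the equivalent should be the `--cache-type-k` / `--cache-type-v` (`-ctk` / `-ctv`) options, if I remember the flag names correctly.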
To me it looks like the topic of the quantized KV cache needs more attention from the project maintainers rather than quantization improvements:
This issue was closed because it has been inactive for 14 days since being marked as stale.
@ggerganov With FA merged, is there any chance to improve the speed of the KV quants so they become useful?
Feature Description
KIVI quantizes the KV cache to 2 bits. According to the paper, this brings 2.6× less peak memory on the Llama/Mistral/Falcon models evaluated while enabling a 4× larger batch size, resulting in a 2.35×-3.47× throughput improvement.
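For context, the core of the paper is that the K cache is quantized per-channel (outliers concentrate in a few channels) while the V cache is quantized per-token, both with plain asymmetric (zero-point) quantization, and a small window of recent tokens is kept in full precision. Below is a rough, self-contained sketch of just the 2-bit asymmetric quant/dequant primitive; it is not the authors' code, and the group size of 32 and the helper names are arbitrary choices for illustration.

```cpp
// Illustrative sketch of asymmetric (zero-point) 2-bit quantization, the
// primitive that KIVI applies per-channel to keys and per-token to values.
// Not the paper's code: the group size and the lack of bit-packing are
// simplifications for readability.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize one group of floats to 2-bit codes sharing a scale and zero-point.
static void quantize_2bit(const float * x, int n, uint8_t * codes, float & scale, float & zero) {
    float mn = x[0], mx = x[0];
    for (int i = 1; i < n; ++i) {
        mn = std::min(mn, x[i]);
        mx = std::max(mx, x[i]);
    }
    scale = (mx - mn) / 3.0f;        // 2 bits -> 4 levels (0..3)
    if (scale == 0.0f) scale = 1.0f; // constant group: avoid division by zero
    zero = mn;
    for (int i = 0; i < n; ++i) {
        int q = (int) std::lround((x[i] - zero) / scale);
        codes[i] = (uint8_t) std::clamp(q, 0, 3);
    }
}

static float dequantize_2bit(uint8_t code, float scale, float zero) {
    return zero + scale * (float) code;
}

int main(void) {
    const int group = 32; // group size is an assumption, not taken from the paper
    std::vector<float>   k(group);
    std::vector<uint8_t> codes(group);
    for (int i = 0; i < group; ++i) {
        k[i] = 0.1f * (float) (i % 7) - 0.3f; // stand-in for one key-cache channel
    }

    float scale = 0.0f, zero = 0.0f;
    quantize_2bit(k.data(), group, codes.data(), scale, zero);

    float max_err = 0.0f;
    for (int i = 0; i < group; ++i) {
        max_err = std::max(max_err, std::abs(k[i] - dequantize_2bit(codes[i], scale, zero)));
    }
    printf("scale=%.4f zero=%.4f max_abs_err=%.4f\n", scale, zero, max_err);
    return 0;
}
```

If I read the paper right, what makes 2 bits workable is not the quantization formula itself but the grouping direction (per-channel for keys, per-token for values) plus the full-precision residual window of recent tokens.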
Motivation
Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI
It was posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/
Possible Implementation
https://github.com/jy-yuan/KIVI
I find it quite interesting; it might help VRAM-poor users a lot even without large batches or long contexts.