DeepSeek-V2-Chat-0628 currently uses excessive VRAM, possibly due to running as MHA instead of MLA.
Discussion here:
https://old.reddit.com/r/LocalLLaMA/comments/1e6ba6a/deepseekv2chat0628_weight_release_1_open_weight/ldtybpo/

Replies: 2 comments 1 reply

-
I think we already support MLA. What makes you think that we use excessive VRAM?
-
KV cache sizes are extremely large. I'm running Q3_K_S (101.7 GB) on an M3 Max with 128 GB of memory and 122 GB allocated as VRAM, and it swaps even at small context sizes (<=2k, and for some reason even at 256). 4k/8k are unusable.
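For a sense of scale, here is a rough back-of-the-envelope sketch of the cache footprint, assuming the hyperparameters from the published DeepSeek-V2 config (60 layers, 128 heads, 128+64-dim keys, 128-dim values, 512-dim compressed KV latent) and an fp16 cache; the exact figures for the 0628 checkpoint and for llama.cpp's actual cache layout may differ:

```python
# Rough KV-cache size estimate for DeepSeek-V2, comparing a naive
# MHA-style cache (full per-head K/V) against MLA's compressed cache.
# Hyperparameters are taken from the published DeepSeek-V2 config;
# adjust them if the 0628 checkpoint differs.

N_LAYERS = 60        # num_hidden_layers
N_HEADS = 128        # num_attention_heads
QK_NOPE_DIM = 128    # qk_nope_head_dim
QK_ROPE_DIM = 64     # qk_rope_head_dim
V_HEAD_DIM = 128     # v_head_dim
KV_LORA_RANK = 512   # dimension of MLA's compressed KV latent
BYTES = 2            # fp16 cache entries

def mha_cache_gib(n_ctx: int) -> float:
    """Full K and V stored per head per layer, i.e. the cost when the
    latent is decompressed into regular K/V heads before caching."""
    per_token = N_LAYERS * N_HEADS * ((QK_NOPE_DIM + QK_ROPE_DIM) + V_HEAD_DIM)
    return per_token * n_ctx * BYTES / 2**30

def mla_cache_mib(n_ctx: int) -> float:
    """Only the compressed latent plus the shared RoPE key per layer."""
    per_token = N_LAYERS * (KV_LORA_RANK + QK_ROPE_DIM)
    return per_token * n_ctx * BYTES / 2**20

for n_ctx in (2048, 4096, 8192):
    print(f"n_ctx={n_ctx:5d}: MHA-style {mha_cache_gib(n_ctx):5.1f} GiB, "
          f"MLA {mla_cache_mib(n_ctx):6.1f} MiB")
```

Under these assumptions, a 4k context costs roughly 19 GiB for the MHA-style cache alone, which on a 128 GB machine already holding a ~102 GB model is enough to push it into swap, whereas the MLA-compressed cache would stay in the hundreds of MiB.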