Bug: n_ctx defaults to n_ctx_train when --ctx_size is not set, causing deepseek-v2 models to crash with out-of-memory errors even for small output lengths. #8817
Comments
@characharm Your issue from #8483 is caused by the default n_ctx loaded from the deepseek-v2 model, which is 163840 and leads to a KV buffer allocation of about 43 GB, exceeding the GPU memory limit. You can use "-c 2048" to set the context length, and it works well on the newest commit on master: c8a0090
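For reference, a back-of-the-envelope estimate explains the ~43 GB figure. The layer count, KV head count, and per-head dimensions below are assumptions about how the DeepSeek-V2-Lite KV cache is laid out, not values reported in this issue:

```cpp
#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main() {
    // Assumed DeepSeek-V2-Lite cache layout: 27 layers, 16 KV heads,
    // 192-dim K heads, 128-dim V heads, f16 (2-byte) cells.
    const uint64_t n_layer        = 27;
    const uint64_t n_head_kv      = 16;
    const uint64_t head_dim_k     = 192;
    const uint64_t head_dim_v     = 128;
    const uint64_t bytes_per_cell = 2;

    const uint64_t bytes_per_token =
        n_layer * n_head_kv * (head_dim_k + head_dim_v) * bytes_per_cell;

    // Compare the model's n_ctx_train default against an explicit -c 2048.
    for (uint64_t n_ctx : {163840ull, 2048ull}) {
        const double gib = (double)(bytes_per_token * n_ctx) / (1024.0 * 1024.0 * 1024.0);
        printf("n_ctx = %6llu -> KV buffer ~ %.1f GiB\n", (unsigned long long) n_ctx, gib);
    }
    return 0;
}
```

With these assumed numbers, 163840 tokens come out to roughly 42 GiB of f16 KV cells, while -c 2048 needs only about 0.5 GiB.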
@slaren Do you consider this behavior a bug? When the user does not set --ctx_size, llama.cpp reuses the model's n_ctx_train as n_ctx. In the deepseek-v2 case, n_ctx_train is 160K, so even if the user's actual input and output are small, it still allocates a very large KV buffer (about 43 GB here). Should we calculate the real n_ctx from the user input instead of reusing n_ctx_train?
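The behaviour in question boils down to a fallback along these lines; this is a simplified sketch of the idea, not verbatim llama.cpp code, and the function name is made up:

```cpp
#include <cstdint>

// Simplified sketch of the fallback being discussed (not verbatim llama.cpp
// code): when no --ctx_size is passed, n_ctx falls back to the model's
// n_ctx_train, which is 163840 for deepseek-v2.
static uint32_t resolve_n_ctx(uint32_t n_ctx_from_cli, uint32_t n_ctx_train) {
    // 0 means "not set on the command line"
    return n_ctx_from_cli == 0 ? n_ctx_train : n_ctx_from_cli;
}
```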
I think this is not expected, since the latest LLMs will keep introducing longer and longer training context lengths.
I would classify this as an enhancement. I understand the idea as follows:
reduce the context size so that it is just enough for the generation. Note that condition 2 is not met in many cases, such as in interactive mode and in llama-server. If we want to avoid the OOM condition in such cases, we will need some other ideas.
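A minimal sketch of that idea, assuming the prompt has already been tokenized and that -n gives a finite bound (the helper name is hypothetical):

```cpp
#include <cstdint>

// Hypothetical helper illustrating the idea: derive the context size from the
// actual request. This only works when the prompt length is known up front and
// generation is bounded; interactive mode and llama-server break that.
static uint32_t required_n_ctx(uint32_t n_prompt_tokens, int32_t n_predict) {
    if (n_predict < 0) {
        return 0; // unbounded generation (-n -1): no finite context can be derived
    }
    return n_prompt_tokens + (uint32_t) n_predict;
}
```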
It is certainly not a bug, and the value of
@shibe2 I agree it is more of a feature enhancement. I think it would be quite useful if llama.cpp could calculate the appropriate n_ctx, especially for serving. Any plans for it?
I agree with @slaren that we should revert to the default hard-coded value instead of using n_ctx_train.
Tossing another idea around: set some value at compile time with some default, then use that value to limit the default n_ctx.
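A rough sketch of that suggestion; the macro name and its 8192 default are made up for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical compile-time cap; the macro name and its 8192 default are
// illustrative, not existing llama.cpp symbols.
#ifndef LLAMA_MAX_DEFAULT_N_CTX
#define LLAMA_MAX_DEFAULT_N_CTX 8192
#endif

// The default n_ctx still follows n_ctx_train, but never exceeds the build-time cap.
static uint32_t default_n_ctx(uint32_t n_ctx_train) {
    return std::min<uint32_t>(n_ctx_train, LLAMA_MAX_DEFAULT_N_CTX);
}
```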
The main reason to use
When I most recently hit the OOM because of a large context, it said:
and
And that's on a system that doesn't even have CUDA installed.
On my CUDA machine I get the following error:
GGML_CUDA=1 make -j && ./llama-cli -m models/llama-7b-v2/ggml-model-q4_0.gguf \
    -ngl 99 -p "I believe the meaning of life is" -c 1000000
Huh, that does not seem right. Try a clean rebuild.
My initial idea was to add a hint message if
However, on my machine (Mac M3), running with
What happened?
The deepseek-v2 model hits an out-of-memory issue because the KV buffer allocation is about 43 GB with the 160K context length taken from the model. But when you set -c or --ctx_size to 2048, inference works normally.
Name and Version
./build/bin/llama-cli -m deepseek-v2-lite-chat-q4_0.gguf -p "how to build a website?" -n 32 -e -ngl 29 -sm none
Linux build on master branch: c8a0090922bad576623de4aae227717085249262
What operating system are you seeing the problem on?
No response
Relevant log output
No response