third party applications are overwhelmingly slow for subsequent prompt evaluation compared to examples/main and examples/server #7185
digging a bit deeper into the reason for the speed of the examples/server frontend: it looks like this frontend opts into prompt caching via the `cache_prompt` parameter. i tested with some manual curl'ing and it seems like plain API requests do not get this behaviour by default.
i'm testing with the following patch:

```diff
diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index ff0814b2..0464280e 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -98,7 +98,7 @@ struct server_task_multi {
 struct slot_params {
     bool stream = true;
-    bool cache_prompt = false; // remember the prompt to avoid reprocessing all prompt
+    bool cache_prompt = true; // remember the prompt to avoid reprocessing all prompt
     uint32_t seed = -1; // RNG seed
     int32_t n_keep = 0; // number of tokens to keep from initial prompt
@@ -834,7 +834,7 @@ struct server_context {
         }
         slot.params.stream = json_value(data, "stream", false);
-        slot.params.cache_prompt = json_value(data, "cache_prompt", false);
+        slot.params.cache_prompt = json_value(data, "cache_prompt", true);
         slot.params.n_predict = json_value(data, "n_predict", default_params.n_predict);
         slot.sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
         slot.sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
```

this didn't help any of the clients that i tested. moving on to some manual testing with curl/hurl. i'm sending what should be a purely additive sequence of requests (using a static seed), which seems like it should pull from the cache. the first request:
response body:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer is 4! This is a very basic addition problem. Are you looking for help with any other simple math questions? \n",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288809,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 28,
    "prompt_tokens": 25,
    "total_tokens": 53
  },
  "id": "chatcmpl-Ikv3dt0Z3FerIQdxR0Kl99RaG2cVqCG5"
}
```

and the server log for that request:

```
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2043,"msg":"we have to evaluate at least 1 token to generate logits","id_slot":0,"id_task":255}
{"tid":"140586246731584","timestamp":1715288785,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":255,"p0":24}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time = 806.35 ms / 1 tokens ( 806.35 ms per token, 1.24 tokens per second)","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"n_prompt_tokens_processed":1,"t_token":806.349,"n_tokens_second":1.2401577976781766}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time = 23563.84 ms / 28 runs ( 841.57 ms per token, 1.19 tokens per second)","id_slot":0,"id_task":255,"t_token_generation":23563.84,"n_decoded":28,"t_token":841.5657142857143,"n_tokens_second":1.1882613360131455}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"print_timings","line":340,"msg":" total time = 24370.19 ms","id_slot":0,"id_task":255,"t_prompt_processing":806.349,"t_token_generation":23563.84,"t_total":24370.189}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":255,"n_ctx":4096,"n_past":52,"n_system_tokens":0,"n_cache_tokens":52,"truncated":false}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288809,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}} the subsequent request:
response body:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The answer to this one is also 4! It's another straightforward addition problem: the two fours add up to make eight, and then half of eight is four. Easy peasy!",
        "role": "assistant"
      }
    }
  ],
  "created": 1715288852,
  "model": "gpt-3.5-turbo",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 39,
    "prompt_tokens": 66,
    "total_tokens": 105
  },
  "id": "chatcmpl-2Thkbuwu0V4w4TgfqI14kEjnaKQeC129"
}
```

and the server log:

```
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"launch_slot_with_task","line":1036,"msg":"slot is processing task","id_slot":0,"id_task":284}
{"tid":"140586246731584","timestamp":1715288809,"level":"INFO","function":"update_slots","line":2087,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":284,"p0":51}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":313,"msg":"prompt eval time = 7871.37 ms / 15 tokens ( 524.76 ms per token, 1.91 tokens per second)","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"n_prompt_tokens_processed":15,"t_token":524.7580666666667,"n_tokens_second":1.9056400721043387}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":329,"msg":"generation eval time = 34750.44 ms / 39 runs ( 891.04 ms per token, 1.12 tokens per second)","id_slot":0,"id_task":284,"t_token_generation":34750.439,"n_decoded":39,"t_token":891.0368974358973,"n_tokens_second":1.122287980304364}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"print_timings","line":340,"msg":" total time = 42621.81 ms","id_slot":0,"id_task":284,"t_prompt_processing":7871.371,"t_token_generation":34750.439,"t_total":42621.81}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1781,"msg":"slot released","id_slot":0,"id_task":284,"n_ctx":4096,"n_past":104,"n_system_tokens":0,"n_cache_tokens":104,"truncated":false}
{"tid":"140586246731584","timestamp":1715288852,"level":"INFO","function":"update_slots","line":1807,"msg":"all slots are idle"}
{"tid":"140551452063424","timestamp":1715288852,"level":"INFO","function":"log_server_request","line":2862,"msg":"request","remote_addr":"127.0.0.1","remote_port":33458,"status":200,"method":"POST","path":"/v1/chat/completions","params":{}} it's clear from the server log of the second request that the full prompt is being evaluated. it takes much longer than just the incremental prompt. |
This issue was closed because it has been inactive for 14 days since being marked as stale.
Motivation
third-party applications are overwhelmingly slow for subsequent prompt evaluation. whereas a subsequent prompt in the examples/server web interface can be evaluated in seconds, longer chats in these applications can take several minutes just to begin generating additional text.
i believe there are two separate issues:
Description
N.B. it is possible that this is only a documentation issue.
Request: provide a well-lit path for consumers of the llama.cpp API and the OpenAI-compatible examples/server endpoint to avoid reprocessing the full chat history on each subsequent prompt evaluation.
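as far as i can tell, the closest thing to such a path today is the per-request cache_prompt flag on the native /completion endpoint, which the examples/server web UI appears to rely on. a minimal sketch of opting in there, assuming the default host/port and with placeholder prompt text:

```sh
# sketch: opt into prompt reuse on the native /completion endpoint by resending
# the growing prompt verbatim each turn with cache_prompt enabled
# (host/port, prompt text, and n_predict are placeholder values)
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "User: what is 2 + 2?\nAssistant:",
        "n_predict": 64,
        "cache_prompt": true
      }'
```

it isn't obvious to an API consumer that this field exists, or whether the OpenAI-compatible route honors it, which is exactly the kind of gap described below.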
i suspect there is a usability or discoverability issue with the llama.cpp APIs which is leading to inefficient use of llama.cpp. i've tested many llama.cpp-based apps on Linux and Android (many listed in the README) and all of them struggle with this problem.
in the case of text-generation-webui and KoboldCpp, i tested both the built-in (llama-cpp-python based) inference and using them as API clients for the examples/server endpoint. Both suffer from this problem.
examples/main and examples/server are the only two pieces of software i've tested that handle this well, which makes these two simple examples the most performant way to interact with LLMs.
the high-level llama-cpp-python API seems to be perpetuating this mistake, which has follow-on effects for other consumers such as oobabooga-webui: abetlen/llama-cpp-python#181 (don't be fooled by the closed status, the issue persists)