LocalAI version:
1.25.0
Environment, CPU architecture, OS, and Version:
Linux REDACTED 4.18.0-147.5.1.6.h541.eulerosv2r9.x86_64 #1 SMP Wed Aug 4 02:30:13 UTC 2021 x86_64 GNU/Linux
Describe the bug
I ran inference on an orca model, then on airoboros 13b, both GGML models using the llama-stable backend. Here is the output:
3:52PM DBG Loading model llama-stable from airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
3:52PM DBG Model already loaded in memory: airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: load time = 1828.58 ms
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: sample time = 157.75 ms / 197 runs ( 0.80 ms per token, 1248.85 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: prompt eval time = 32782.90 ms / 550 tokens ( 59.61 ms per token, 16.78 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: eval time = 1120927.58 ms / 196 runs ( 5719.02 ms per token, 0.17 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: total time = 1153945.53 ms
4:10PM DBG Response: {"object":"chat.completion","model":"orca","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"REDACTED"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
We can see that airoboros was loaded, but the model in the DBG Response is orca. I did generate an orca response, but much earlier, and the total time taken matches the model load time.
To Reproduce
Run inference with one model, then with another via curl (a sketch of the two requests follows).
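For illustration, a minimal sketch of such a pair of requests against LocalAI's OpenAI-compatible chat completions endpoint (the response object above is "chat.completion"); the host, port, prompt, and the "airoboros" model name are assumptions, not taken from the report:

# First request; "orca" matches the model name in the DBG Response above
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "orca", "messages": [{"role": "user", "content": "Hello"}]}'

# Second request naming a different model; its response should come from that model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "airoboros", "messages": [{"role": "user", "content": "Hello"}]}'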
Expected behavior
The second inference should use the model specified in the request.
The error seems to be only in the logs: the result was shown after I ran inference on the other model, not at the same time I received my HTTP request's result. I'm unsure whether there really is an issue, but 20 minutes for an orca inference is much longer than what I got previously, and more than the time I think it actually took.