
Alternating two models: inference with the first one only #1061

Closed
flotos opened this issue Sep 14, 2023 · 2 comments

flotos commented Sep 14, 2023

LocalAI version:

1.25.0

Environment, CPU architecture, OS, and Version:

Linux REDACTED 4.18.0-147.5.1.6.h541.eulerosv2r9.x86_64 #1 SMP Wed Aug 4 02:30:13 UTC 2021 x86_64 GNU/Linux

Describe the bug

I ran inference on an orca model, then on airoboros 13b, both GGML models using the llama-stable backend. Here is the output:

3:52PM DBG Loading model llama-stable from airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
3:52PM DBG Model already loaded in memory: airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:        load time =  1828.58 ms
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:      sample time =   157.75 ms /   197 runs   (    0.80 ms per token,  1248.85 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: prompt eval time = 32782.90 ms /   550 tokens (   59.61 ms per token,    16.78 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:        eval time = 1120927.58 ms /   196 runs   ( 5719.02 ms per token,     0.17 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:       total time = 1153945.53 ms
4:10PM DBG Response: {"object":"chat.completion","model":"orca","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"REDACTED"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

We can see that airoboros was loaded, but the model in the DBG Response is orca. I did generate an orca response, but much earlier; also, the total time reported matches the model load time.

To Reproduce

Run inference with one model, then with another one, via curl (see the example below).
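
For reference, a minimal sketch of such a reproduction against LocalAI's OpenAI-compatible endpoint, assuming it listens on the default port 8080 and that the two models are configured under the names "orca" and "airoboros" (hypothetical names here; adjust to whatever your model configs define):

# first request, served by the orca model (model name is an assumption)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "orca", "messages": [{"role": "user", "content": "Hello"}]}'

# second request, which should be served by the airoboros model (model name is an assumption)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "airoboros", "messages": [{"role": "user", "content": "Hello"}]}'

In the logs above, the second inference loads airoboros but the DBG Response reports orca as the model.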

Expected behavior

The second inference should use the model specified in the request.

@flotos flotos added the bug Something isn't working label Sep 14, 2023

flotos commented Sep 14, 2023

The error seems to be only in the logs: the result was shown after I ran inference with the other model, not at the time I received my HTTP request's result. I'm unsure whether there really is an issue, but 20 minutes for an orca inference is much longer than what I observed previously, and more than the time I think it actually took.


flotos commented Sep 14, 2023

The issue was on my end, sorry for the inconvenience.

@flotos flotos closed this as completed Sep 14, 2023