
Alternating two models: inference with the first one only #1061

Closed
flotos opened this issue Sep 14, 2023 · 2 comments

flotos commented Sep 14, 2023

LocalAI version:

1.25.0

Environment, CPU architecture, OS, and Version:

Linux REDACTED 4.18.0-147.5.1.6.h541.eulerosv2r9.x86_64 #1 SMP Wed Aug 4 02:30:13 UTC 2021 x86_64 GNU/Linux

Describe the bug

I ran inference on an orca model, then on airoboros 13b, both GGML models using the llama-stable backend. Here is the output:

3:52PM DBG Loading model llama-stable from airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
3:52PM DBG Model already loaded in memory: airoboros-l2-13b-2.1.ggmlv3.Q4_0.bin
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:        load time =  1828.58 ms
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:      sample time =   157.75 ms /   197 runs   (    0.80 ms per token,  1248.85 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings: prompt eval time = 32782.90 ms /   550 tokens (   59.61 ms per token,    16.78 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:        eval time = 1120927.58 ms /   196 runs   ( 5719.02 ms per token,     0.17 tokens per second)
4:10PM DBG GRPC(orca-mini-3b.ggmlv3.q8_0.bin-127.0.0.1:39103): stderr llama_print_timings:       total time = 1153945.53 ms
4:10PM DBG Response: {"object":"chat.completion","model":"orca","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"REDACTED"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

We can see that airoboros was loaded, but the model in the DBG Response is orca. I did generate an orca response, but much earlier; also, the total time reported matches the model load time.

To Reproduce

Run inference with one model, then with another one, via curl (see the example below).
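
For reference, a minimal sketch of such a reproduction against LocalAI's OpenAI-compatible endpoint, assuming it listens on the default port 8080 and that the two models are configured under the names "orca" and "airoboros" (hypothetical names here; adjust to whatever your model configs define):

# first request, served by the orca model (model name is an assumption)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "orca", "messages": [{"role": "user", "content": "Hello"}]}'

# second request, which should be served by the airoboros model (model name is an assumption)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "airoboros", "messages": [{"role": "user", "content": "Hello"}]}'

In the logs above, the second inference loads airoboros but the DBG Response reports orca as the model.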

Expected behavior

The second inference should use the model specified in the request.

@flotos flotos added the bug Something isn't working label Sep 14, 2023

flotos commented Sep 14, 2023

The error seems to be only in the logs: the result was shown after I ran inference with the other model, not at the time I received my HTTP request's result. I'm unsure whether there really is an issue, but 20 minutes for an orca inference is much longer than what I observed previously, and more than the time I think it actually took.


flotos commented Sep 14, 2023

The issue was on my end, sorry for the inconvenience.

@flotos flotos closed this as completed Sep 14, 2023