First of all: Thank you very much for making LocalAI available. This is a giant leap for the community!
LocalAI version:
1.24.1 - built an hour ago
Environment, CPU architecture, OS, and Version:
Linux localAI 6.1.0-11-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) x86_64 GNU/Linux
This is a VM with 8 Xeon cores, 32 GB of RAM, and an NVIDIA RTX 3090.
Describe the bug
It works, but it is extremely slow: I asked "I want to make cheesecake" and the response took 21 minutes (!). That can't be right.
To Reproduce
Run it in an environment like mine and ask for cheesecake.
Expected behavior
Difficult to say, but I would expect an answer within a minute or two? Or is that unrealistic?
Logs
```
11:32PM DBG Request received:
11:32PM DBG Configuration read: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Parameters: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Prompt (before templating): ### Instruction: I want to make cheesecake
11:32PM DBG Template found, input modified to: ### Instruction: I want to make cheesecake
Response:
11:32PM DBG Prompt (after templating): ### Instruction: I want to make cheesecake
Response:
11:32PM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:32PM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: load time = 6589.55 ms
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: sample time = 460.18 ms / 599 runs ( 0.77 ms per token, 1301.66 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: prompt eval time = 39340.21 ms / 20 tokens ( 1967.01 ms per token, 0.51 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: eval time = 1228961.92 ms / 598 runs ( 2055.12 ms per token, 0.49 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: total time = 1268915.94 ms
11:53PM DBG Response: {"object":"chat.completion","model":"lunademo","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To make a delicious cheesecake, you will need the following ingredients:\n\nFor the crust:\n- 1 1/2 cups graham cracker crumbs\n- 1/4 cup granulated sugar\n- 6 tablespoons (3/4 stick) unsalted butter, melted\n\nFor the filling:\n- 16 ounces cream cheese, softened\n- 1/2 cup granulated sugar\n- 4 large eggs, separated\n- 1 teaspoon vanilla extract\n- 1/2 cup sour cream\n\nOptional toppings:\n- Fresh berries (strawberries, blueberries, raspberries)\n- Whipped cream\n- Chocolate syrup or chocolate shavings\n\nInstructions:\n\n1. Preheat your oven to 325°F (160°C).\n2. Prepare the crust: In a medium bowl, mix together the graham cracker crumbs and sugar. Stir in the melted butter until the mixture is evenly moistened. Press the crust mixture into the bottom of a 9-inch (23 cm) springform pan. Bake for 10 minutes, then let it cool completely.\n3. Prepare the filling: In a large mixing bowl, beat the cream cheese until smooth. Add the granulated sugar and beat until well combined. Beat in the egg yolks one at a time, followed by the vanilla extract.\n4. In a separate bowl, beat the egg whites until stiff peaks form. Gently fold one-fourth of the beaten egg whites into the cream cheese mixture to lighten it, then gently fold in the remaining egg whites.\n5. Pour the filling over the prepared crust and smooth the top with a spatula. Bake for 1 hour and 15 minutes, or until the edges are set and the center is just slightly jiggly.\n6. Let the cheesecake cool in the oven with the door ajar for 30 minutes. Then, remove it from the oven and let it cool completely on a wire rack.\n7. Once cooled, refrigerate the cheesecake for at least 4 hours or overnight to set.\n8. To serve, release the springform pan sides and transfer the cheesecake to a serving plate. If desired, top with fresh berries, whipped cream, chocolate syrup, or shavings.\n\nEnjoy your delicious homemade cheesecake!"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[127.0.0.1]:43580 200 - POST /v1/chat/completions
```
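For what it's worth, the timings in the log are internally consistent: an eval time of 1,228,961.92 ms over 598 tokens is about 2,055 ms per token, i.e. roughly 0.49 tokens per second, and adding the ~39 s of prompt evaluation gives the reported total of 1,268,915.94 ms, which is about 21.1 minutes and matches what I observed. Half a token per second looks like CPU-only speed for a 13B q4_0 model on four threads, not anything an RTX 3090 would produce.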
Additional context
I tried one of the LLaMA models a while ago (I think I was using oobabooga) on a machine without a GPU. It responded slowly but steadily, maybe a couple of tokens per second, and that was without a GPU. With a GPU and a better model, I would expect it to be much, much quicker. There may be a perception issue at play here as well: in oobabooga you get the response token by token, whereas here you get it all in one go and have to wait for the last token to be generated, so the wait before you see anything feels far longer. But still: 21 minutes???
Could it be that my GPU wasn't actually used? Does it say anything about it in the debug info?
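Looking at the configuration dump above myself, I notice NGPULayers:0, F16:false and Threads:4, which, if I read it correctly, means the llama-stable backend offloaded zero layers to the GPU and ran entirely on the CPU. If that is the cause, the fix would presumably be to set gpu_layers (and optionally f16) in the model's YAML config. Below is a minimal sketch; the field names follow the LocalAI model-config docs, while the file name lunademo.yaml and the layer count of 43 (which should cover all layers of a 13B model) are my assumptions and depend on the actual setup and available VRAM:

```yaml
# lunademo.yaml - hypothetical model config; field names per the LocalAI docs,
# values are assumptions for this particular setup.
name: lunademo
backend: llama-stable
parameters:
  model: wizardlm-13b-v1.2.ggmlv3.q4_0.bin
f16: true        # enable 16-bit mode, recommended when offloading to the GPU
gpu_layers: 43   # number of layers to offload; lower this if VRAM runs out
threads: 4       # could go up to 8 on this VM for whatever still runs on CPU
```

As far as I understand the build docs, LocalAI also has to be built or run with GPU support (BUILD_TYPE=cublas, or one of the cublas container images) for gpu_layers to take effect, and watching nvidia-smi while a request is running should show whether the model actually landed on the card.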