
Ridiculously slow on RTX 3090!?! #957

Closed
gitwittidbit opened this issue Aug 25, 2023 · 1 comment

gitwittidbit commented Aug 25, 2023

First of all: Thank you very much for making LocalAI available. This is a giant leap for the community!

LocalAI version:
1.24.1 - built an hour ago

Environment, CPU architecture, OS, and Version:
Linux localAI 6.1.0-11-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08) x86_64 GNU/Linux
This is a VM with 8 Xeon cores, 32 GB RAM, and an NVIDIA RTX 3090

Describe the bug
It works but it is super slow: I asked "I want to make cheesecake" and the response took 21 minutes (!). That can't be right.

To Reproduce
Run it in an environment like mine and ask for cheesecake.

Expected behavior
Difficult to say but I would expect an answer within a minute or two? Or is that unrealistic?

Logs
```
11:32PM DBG Request received:
11:32PM DBG Configuration read: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Parameters: &{PredictionOptions:{Model:wizardlm-13b-v1.2.ggmlv3.q4_0.bin Language: N:0 TopP:0.65 TopK:40 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:lunademo F16:false Threads:4 Debug:true Roles:map[assistant:### Response: system:### System: user:### Instruction:] Embeddings:false Backend:llama-stable TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:2000 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0}}
11:32PM DBG Prompt (before templating): ### Instruction: I want to make cheesecake
11:32PM DBG Template found, input modified to: ### Instruction: I want to make cheesecake

Response:

11:32PM DBG Prompt (after templating): ### Instruction: I want to make cheesecake

Response:

11:32PM DBG Loading model llama-stable from wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:32PM DBG Model already loaded in memory: wizardlm-13b-v1.2.ggmlv3.q4_0.bin
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: load time = 6589.55 ms
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: sample time = 460.18 ms / 599 runs ( 0.77 ms per token, 1301.66 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: prompt eval time = 39340.21 ms / 20 tokens ( 1967.01 ms per token, 0.51 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: eval time = 1228961.92 ms / 598 runs ( 2055.12 ms per token, 0.49 tokens per second)
11:53PM DBG GRPC(wizardlm-13b-v1.2.ggmlv3.q4_0.bin-127.0.0.1:39987): stderr llama_print_timings: total time = 1268915.94 ms
11:53PM DBG Response: {"object":"chat.completion","model":"lunademo","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"To make a delicious cheesecake, you will need the following ingredients:\n\nFor the crust:\n- 1 1/2 cups graham cracker crumbs\n- 1/4 cup granulated sugar\n- 6 tablespoons (3/4 stick) unsalted butter, melted\n\nFor the filling:\n- 16 ounces cream cheese, softened\n- 1/2 cup granulated sugar\n- 4 large eggs, separated\n- 1 teaspoon vanilla extract\n- 1/2 cup sour cream\n\nOptional toppings:\n- Fresh berries (strawberries, blueberries, raspberries)\n- Whipped cream\n- Chocolate syrup or chocolate shavings\n\nInstructions:\n\n1. Preheat your oven to 325°F (160°C).\n2. Prepare the crust: In a medium bowl, mix together the graham cracker crumbs and sugar. Stir in the melted butter until the mixture is evenly moistened. Press the crust mixture into the bottom of a 9-inch (23 cm) springform pan. Bake for 10 minutes, then let it cool completely.\n3. Prepare the filling: In a large mixing bowl, beat the cream cheese until smooth. Add the granulated sugar and beat until well combined. Beat in the egg yolks one at a time, followed by the vanilla extract.\n4. In a separate bowl, beat the egg whites until stiff peaks form. Gently fold one-fourth of the beaten egg whites into the cream cheese mixture to lighten it, then gently fold in the remaining egg whites.\n5. Pour the filling over the prepared crust and smooth the top with a spatula. Bake for 1 hour and 15 minutes, or until the edges are set and the center is just slightly jiggly.\n6. Let the cheesecake cool in the oven with the door ajar for 30 minutes. Then, remove it from the oven and let it cool completely on a wire rack.\n7. Once cooled, refrigerate the cheesecake for at least 4 hours or overnight to set.\n8. To serve, release the springform pan sides and transfer the cheesecake to a serving plate. If desired, top with fresh berries, whipped cream, chocolate syrup, or shavings.\n\nEnjoy your delicious homemade cheesecake!"}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
[127.0.0.1]:43580 200 - POST /v1/chat/completions
```

Additional context
I tried one of the llamas out a while ago (I think I was using oobabooga) on a machine without a GPU. It responded slowly but steadily, maybe a couple of tokens per second. Mind you, that was without a GPU. With a GPU and a better model, I would expect it to be much, much quicker. There may be a perception issue at play here as well: in oobabooga you get the response token by token, whereas here you get it all in one go and have to wait for the last token to be generated, so the wait before you see anything is far longer. But still - 21 minutes???

Could it be that my GPU wasn't actually used? Does it say anything about it in the debug info?

@gitwittidbit gitwittidbit added the bug Something isn't working label Aug 25, 2023
mudler (Owner) commented Aug 25, 2023

@gitwittidbit it looks like you didn't configure the model to run on the GPU. Did you have a look at https://localai.io/basics/getting_started/#cuda ? You need to set the number of gpu_layers to be offloaded.
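
For reference, a minimal sketch of what such a model config could look like. The name, backend, model file, sampling values, threads, context size, and templates are taken from the debug dump above (which also shows NGPULayers:0, i.e. nothing offloaded); the gpu_layers value and f16: true are illustrative assumptions, not values from the issue:

```yaml
# lunademo.yaml - sketch only; gpu_layers: 35 and f16: true are example values
name: lunademo
backend: llama-stable
parameters:
  model: wizardlm-13b-v1.2.ggmlv3.q4_0.bin
  temperature: 0.9
  top_p: 0.65
  top_k: 40
context_size: 2000
threads: 4
f16: true        # half precision, commonly enabled when offloading to a GPU (assumption)
gpu_layers: 35   # number of layers to offload to the RTX 3090 (example value)
template:
  chat: lunademo-chat
  completion: lunademo-completion
```

With layers actually offloaded, the debug dump should show a non-zero NGPULayers, the llama_print_timings eval rate should be far above the ~0.5 tokens per second seen in the log, and nvidia-smi should show the model occupying GPU memory while a request is running.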

Repository owner locked and limited conversation to collaborators Aug 25, 2023
mudler converted this issue into discussion #958 Aug 25, 2023
