rpc error: code = Unknown desc = unimplemented instead of chat completion #1946

Open
splitbrain opened this issue Apr 2, 2024 · 7 comments
Labels: bug (Something isn't working), unconfirmed

@splitbrain

LocalAI version:

docker run -p 8080:8080 --gpus all --name local-ai -ti localai/localai:v2.11.0-aio-gpu-nvidia-cuda-12

Environment, CPU architecture, OS, and Version:

Linux rumpel 6.8.2-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 28 Mar 2024 17:06:35 +0000 x86_64 GNU/Linux

NVIDIA GPU detected
Tue Apr  2 12:23:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0  On |                  N/A |
|  0%   58C    P8             13W /  120W |     608MiB /   6144MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
NVIDIA GPU detected. Attempting to find memory size...
Total GPU Memory: 6144 MiB

Describe the bug

I am trying to run the example, but instead of an answer I get an error:

$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}] }'
{"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}

Expected behavior

An answer instead of an error.

Logs

In the docker console I see this:

12:26PM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/vall-e-x/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/exllama2/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/sentencetransformers/run.sh
12:26PM INF [llama-cpp] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
12:26PM INF [llama-cpp] Fails: could not load model: rpc error: code = Canceled desc = 
12:26PM INF [llama-ggml] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-ggml
12:26PM INF [llama-ggml] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:26PM INF [gpt4all] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend gpt4all
12:26PM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:26PM INF [bert-embeddings] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend bert-embeddings
12:26PM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
12:26PM INF [rwkv] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend rwkv
12:26PM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
12:26PM INF [whisper] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend whisper
12:26PM INF [whisper] Fails: could not load model: rpc error: code = Unknown desc = unable to load model
12:26PM INF [stablediffusion] Attempting to load
12:26PM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend stablediffusion
12:26PM INF [stablediffusion] Loads OK

Additional context

It seems like it can't load the model. But I have no idea why.

Some side notes:

splitbrain added the bug (Something isn't working) and unconfirmed labels on Apr 2, 2024
@mudler (Owner) commented Apr 2, 2024

NVIDIA GPU detected
Tue Apr  2 12:23:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off |   00000000:01:00.0  On |                  N/A |
|  0%   58C    P8             13W /  120W |     608MiB /   6144MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
NVIDIA GPU detected. Attempting to find memory size...
Total GPU Memory: 6144 MiB

Ouch, most likely the model is too big for the available GPU RAM, as there isn't a smaller profile available (yet).

For the moment you can try to enforce a different profile by setting, for instance, SIZE=cpu.

Also, set the DEBUG=true environment variable to have more logging output so we can confirm that's the issue.
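
For example, a minimal sketch reusing the command from the report (assuming this image honors the SIZE and DEBUG environment variables):

# Sketch: same image as above, with DEBUG enabled and the suggested SIZE profile set
docker run -p 8080:8080 --gpus all --name local-ai -ti \
  -e DEBUG=true \
  -e SIZE=cpu \
  localai/localai:v2.11.0-aio-gpu-nvidia-cuda-12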

Some side notes:

* Your docs should be updated: the latest tags don't work (see [ci: latest image tags #1906](https://github.com/mudler/LocalAI/issues/1906) and [Dockerhub images referenced in the documentation don't exist #1898](https://github.com/mudler/LocalAI/issues/1898)) and I am not sure if the tag I used above is actually the one I should use. I also tried master but had the same errors.

I can pull the images in https://localai.io/docs/reference/aio-images/, e.g.:

docker pull localai/localai:latest-aio-gpu-nvidia-cuda-11
latest-aio-gpu-nvidia-cuda-11: Pulling from localai/localai 
bccd10f490ab: Pulling fs layer 
28390858c725: Pulling fs layer 
0e658d52bbec: Pulling fs layer 
1079083b2863: Waiting 
d99da6cbef70: Waiting 
4f4fb700ef54: Waiting 
19d09b01367a: Waiting 
d409b4e53d9d: Waiting 
d51919462d43: Waiting 
b17033fc5ece: Waiting 
53cc1c8ae629: Waiting 
* I find it odd that you map open source model names to proprietary names like gpt-4. Or maybe that's the actual issue and I need to specify a different model name? Which one?

This is on purpose, so that LocalAI automatically works with existing tools that expect model names like the ones available from OpenAI. The AIO image is an opinionated image which pre-configures OSS models to appear under the proprietary names; however, you are free to configure your LocalAI instance with any model you like: https://localai.io/docs/getting-started/run-other-models/
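
For instance, once a model is configured under a name of your choosing, you point the request at that name instead of gpt-4 (the model name below is a hypothetical placeholder):

# "my-own-model" is a placeholder for whatever name your model config defines
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "my-own-model", "messages": [{"role": "user", "content": "How are you doing?"}] }'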

@chino-lu commented Apr 4, 2024

I have the same issue. Below is the debug log:

8:02AM DBG Request received: {"model":"gpt-4","language":"","n":0,"top_p":null,"top_k":null,"temperature":null,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":0,"typical_p":0,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"How are you doing?"}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
8:02AM DBG Configuration read: &{PredictionOptions:{Model:5c7cd056ecf9a4bb5b527410b97f48cb Language: N:0 TopP:0xc00032c550 TopK:0xc00032c558 Temperature:0xc00032c560 Maxtokens:0xc00032c568 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0 TypicalP:0 Seed:0xc00032c598 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4 F16:0xc00032c530 Threads:0xc00032c540 Debug:0xc000735918 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:{{.Input}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
{{ if .FunctionCall }}<tool_call>{{end}}
{{ if eq .RoleName "tool" }}<tool_result>{{end}}
{{if .Content}}{{.Content}}{{end}}
{{if .FunctionCall}}{{toJson .FunctionCall}}{{end}}
{{ if .FunctionCall }}</tool_call>{{end}}
{{ if eq .RoleName "tool" }}</tool_result>{{end}}
<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools>
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
</tools>
Use the following pydantic model json schema for each tool call you will make:
{'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call><|im_end|>
{{.Input}}
<|im_start|>assistant
<tool_call>
} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc00032c580 MirostatTAU:0xc00032c578 Mirostat:0xc00032c570 NGPULayers:0xc00032c588 MMap:0xc00032c4ad MMlock:0xc00032c591 LowVRAM:0xc00032c591 Grammar: StopWords:[<|im_end|> <dummy32000> 
</tool_call> 


] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc00032c520 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
}
8:02AM DBG Parameters: &{PredictionOptions:{Model:5c7cd056ecf9a4bb5b527410b97f48cb Language: N:0 TopP:0xc00032c550 TopK:0xc00032c558 Temperature:0xc00032c560 Maxtokens:0xc00032c568 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0 TypicalP:0 Seed:0xc00032c598 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:gpt-4 F16:0xc00032c530 Threads:0xc00032c540 Debug:0xc000735918 Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat:{{.Input}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
{{ if .FunctionCall }}<tool_call>{{end}}
{{ if eq .RoleName "tool" }}<tool_result>{{end}}
{{if .Content}}{{.Content}}{{end}}
{{if .FunctionCall}}{{toJson .FunctionCall}}{{end}}
{{ if .FunctionCall }}</tool_call>{{end}}
{{ if eq .RoleName "tool" }}</tool_result>{{end}}
<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
<tools>
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
</tools>
Use the following pydantic model json schema for each tool call you will make:
{'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}
For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
<tool_call>
{'arguments': <args-dict>, 'name': <function-name>}
</tool_call><|im_end|>
{{.Input}}
<|im_start|>assistant
<tool_call>
} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc00032c580 MirostatTAU:0xc00032c578 Mirostat:0xc00032c570 NGPULayers:0xc00032c588 MMap:0xc00032c4ad MMlock:0xc00032c591 LowVRAM:0xc00032c591 Grammar: StopWords:[<|im_end|> <dummy32000> 
</tool_call> 


] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc00032c520 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
}'
}
8:02AM DBG templated message for chat: <|im_start|>user


How are you doing?



<|im_end|>

8:02AM DBG Prompt (before templating): <|im_start|>user


How are you doing?



<|im_end|>

8:02AM DBG Template found, input modified to: <|im_start|>user


How are you doing?



<|im_end|>

<|im_start|>assistant

8:02AM DBG Prompt (after templating): <|im_start|>user


How are you doing?



<|im_end|>

<|im_start|>assistant

8:02AM INF Trying to load the model '5c7cd056ecf9a4bb5b527410b97f48cb' with all the available backends: llama-cpp, llama-ggml, gpt4all, bert-embeddings, rwkv, whisper, stablediffusion, tinydream, piper, /build/backend/python/exllama2/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/bark/run.sh, /build/backend/python/exllama/run.sh, /build/backend/python/vllm/run.sh, /build/backend/python/coqui/run.sh, /build/backend/python/mamba/run.sh, /build/backend/python/diffusers/run.sh, /build/backend/python/sentencetransformers/run.sh, /build/backend/python/transformers-musicgen/run.sh, /build/backend/python/petals/run.sh, /build/backend/python/autogptq/run.sh, /build/backend/python/transformers/run.sh, /build/backend/python/vall-e-x/run.sh
8:02AM INF [llama-cpp] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-cpp
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-cpp): {backendString:llama-cpp model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:39595'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager3123708696
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stdout Server listening on 127.0.0.1:39595
[127.0.0.1]:33296 200 - GET /readyz
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest))
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   0:                       general.architecture str              = llama
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   1:                               general.name str              = jeffq
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   4:                          llama.block_count u32              = 32
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  11:                          general.file_type u32              = 18
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32032]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32032]   = [0.000000, 0.000000, 0.000000, 0.0000...
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32032]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - kv  21:               general.quantization_version u32              = 2
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - type  f32:   65 tensors
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_loader: - type q6_K:  226 tensors
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ).
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: format           = GGUF V3 (latest)
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: arch             = llama
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: vocab type       = SPM
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_vocab          = 32032
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_merges         = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_ctx_train      = 32768
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_embd           = 4096
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_head           = 32
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_head_kv        = 8
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_layer          = 32
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_rot            = 128
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_embd_head_k    = 128
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_embd_head_v    = 128
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_gqa            = 4
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_embd_k_gqa     = 1024
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_embd_v_gqa     = 1024
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: f_norm_eps       = 0.0e+00
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: f_clamp_kqv      = 0.0e+00
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: f_logit_scale    = 0.0e+00
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_ff             = 14336
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_expert         = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_expert_used    = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: causal attn      = 1
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: pooling type     = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: rope type        = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: rope scaling     = linear
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: freq_base_train  = 10000.0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: freq_scale_train = 1
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: n_yarn_orig_ctx  = 32768
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: rope_finetuned   = unknown
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: ssm_d_conv       = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: ssm_d_inner      = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: ssm_d_state      = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: ssm_dt_rank      = 0
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: model type       = 7B
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: model ftype      = Q6_K
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: model params     = 7.24 B
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: model size       = 5.53 GiB (6.56 BPW) 
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: general.name     = jeffq
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: BOS token        = 1 '<s>'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: EOS token        = 32000 '<|im_end|>'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: UNK token        = 0 '<unk>'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_print_meta: LF token         = 13 '<0x0A>'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr ggml_cuda_init: found 1 CUDA devices:
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr   Device 0: NVIDIA T400 4GB, compute capability 7.5, VMM: yes
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_tensors: ggml ctx size =    0.22 MiB
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5563.66 MiB on device 0: cudaMalloc failed: out of memory
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stdout {"timestamp":1712210556,"level":"ERROR","function":"load_model","line":464,"message":"unable to load model","model":"/build/models/5c7cd056ecf9a4bb5b527410b97f48cb"}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_model_load: error loading model: unable to allocate backend buffer
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_load_model_from_file: failed to load model
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llama_init_from_gpt_params: error: failed to load model '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
8:02AM INF [llama-cpp] Fails: could not load model: rpc error: code = Canceled desc = 
8:02AM INF [llama-ggml] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend llama-ggml
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: llama-ggml): {backendString:llama-ggml model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-ggml
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:38629'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager1166941467
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr 2024/04/04 08:02:36 gRPC Server listening at 127.0.0.1:38629
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr create_gpt_params: loading model /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr ggml_init_cublas: found 1 CUDA devices:
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr   Device 0: NVIDIA T400 4GB, compute capability 7.5
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr llama.cpp: loading model from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr error loading model: unknown (magic, version) combination: 46554747, 00000003; is this really a GGML file?
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr llama_load_model_from_file: failed to load model
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr llama_init_from_gpt_params: error: failed to load model '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:38629): stderr load_binding_model: error: unable to load model
8:02AM INF [llama-ggml] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
8:02AM INF [gpt4all] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend gpt4all
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: gpt4all): {backendString:gpt4all model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/gpt4all
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:42541'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager399407558
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42541): stderr 2024/04/04 08:02:47 gRPC Server listening at 127.0.0.1:42541
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42541): stderr load_model: error 'Model format not supported (no matching implementation found)'
8:02AM INF [gpt4all] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
8:02AM INF [bert-embeddings] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend bert-embeddings
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: bert-embeddings): {backendString:bert-embeddings model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/bert-embeddings
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:36869'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager1561849468
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:36869): stderr 2024/04/04 08:02:49 gRPC Server listening at 127.0.0.1:36869
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:36869): stderr bert_load_from_file: invalid model file '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb' (bad magic)
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:36869): stderr bert_bootstrap: failed to load model from '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
8:02AM INF [bert-embeddings] Fails: could not load model: rpc error: code = Unknown desc = failed loading model
8:02AM INF [rwkv] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend rwkv
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: rwkv): {backendString:rwkv model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/rwkv
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:44135'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager643369116
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr 2024/04/04 08:02:52 gRPC Server listening at 127.0.0.1:44135
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr 
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv_file_format.inc:93: header.magic == 0x67676d66
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr Invalid file header
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv_model_loading.inc:158: rwkv_fread_file_header(file.file, model.header)
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr 
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr /build/sources/go-rwkv/rwkv.cpp/rwkv.cpp:63: rwkv_load_model_from_file(file_path, *ctx->model)
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:44135): stderr 2024/04/04 08:02:53 InitFromFile /build/models/5c7cd056ecf9a4bb5b527410b97f48cb failed
8:02AM INF [rwkv] Fails: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
8:02AM INF [whisper] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend whisper
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: whisper): {backendString:whisper model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/whisper
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:42537'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager272901310
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42537): stderr 2024/04/04 08:02:53 gRPC Server listening at 127.0.0.1:42537
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42537): stderr whisper_init_from_file_with_params_no_state: loading model from '/build/models/5c7cd056ecf9a4bb5b527410b97f48cb'
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42537): stderr whisper_model_load: loading model
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42537): stderr whisper_model_load: invalid model data (bad magic)
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:42537): stderr whisper_init_with_params_no_state: failed to load model
8:02AM INF [whisper] Fails: could not load model: rpc error: code = Unknown desc = unable to load model
8:02AM INF [stablediffusion] Attempting to load
8:02AM INF Loading model '5c7cd056ecf9a4bb5b527410b97f48cb' with backend stablediffusion
8:02AM DBG Loading model in memory from file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb
8:02AM DBG Loading Model 5c7cd056ecf9a4bb5b527410b97f48cb with gRPC (file: /build/models/5c7cd056ecf9a4bb5b527410b97f48cb) (backend: stablediffusion): {backendString:stablediffusion model:5c7cd056ecf9a4bb5b527410b97f48cb threads:4 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000552000 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:02AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/stablediffusion
8:02AM DBG GRPC Service for 5c7cd056ecf9a4bb5b527410b97f48cb will be running at: '127.0.0.1:34825'
8:02AM DBG GRPC Service state dir: /tmp/go-processmanager4245184256
8:02AM DBG GRPC Service Started
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:34825): stderr 2024/04/04 08:02:56 gRPC Server listening at 127.0.0.1:34825
8:02AM DBG GRPC Service Ready
8:02AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:5c7cd056ecf9a4bb5b527410b97f48cb ContextSize:4096 Seed:805641383 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:/tmp/localai/backend_data/backend-assets/gpt4all RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/5c7cd056ecf9a4bb5b527410b97f48cb Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
8:02AM INF [stablediffusion] Loads OK
[192.168.0.75]:57672 500 - POST /v1/chat/completions
[127.0.0.1]:50052 200 - GET /readyz
[127.0.0.1]:43440 200 - GET /readyz

@DavidGOrtega

Related to #1909?

@fakezeta (Collaborator)

Hi @chino-lu, regarding your issue, the error is here:

8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr   Device 0: NVIDIA T400 4GB, compute capability 7.5, VMM: yes
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr llm_load_tensors: ggml ctx size =    0.22 MiB
8:02AM DBG GRPC(5c7cd056ecf9a4bb5b527410b97f48cb-127.0.0.1:39595): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5563.66 MiB on device 0: cudaMalloc failed: out of memory

It is trying to allocate more than 5 GB on a board with only 4 GB of VRAM.

Reduce the GPU offload by playing with the gpu_layers: parameter.
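
As a rough sketch, the override lives in the model's YAML definition (file name, model file and the layer count below are placeholders; tune gpu_layers down until the model fits in your VRAM):

# models/gpt-4.yaml -- hypothetical example, values are placeholders
name: gpt-4
parameters:
  model: your-model-file.gguf
context_size: 4096
f16: true
gpu_layers: 20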

@localai-bot (Contributor)

Thank you for providing the error details. To resolve this issue, you can indeed reduce the gpu_layers parameter to allocate less than 4GB of memory for the GPU. The gpu_layers parameter specifies the number of layers to be executed on the GPU. By reducing this value, you can decrease the memory usage on the GPU.

For example, you can try reducing the gpu_layers value to 256:

{
  "model": "stablediffusion::stablediffusion-v1-5c7cd056ecf9a4bb5b527410b97f48cb",
  "settings": {
    ...
    "gpu_layers": 256,
    ...
  }
}

Remember to adjust this value based on your specific requirements and available GPU memory. This should help avoid the out of memory error.

@chino-lu

I got it working, but not the way I would like.
When using the AIO container, the YAML files are overwritten with each restart.

By moving from the AIO container to the normal one, the settings are not overwritten anymore and it is working (but not with 256; with 11 I now have working settings. I'm not sure what the upper limit would be, but I have to admit that other things are running on that graphics card as well).

@mudler (Owner) commented Apr 16, 2024

I got it working, but not the way I would like. When using the AIO container, the YAML files are overwritten with each restart.

By moving from the AIO container to the normal one, the settings are not overwritten anymore and it is working (but not with 256; with 11 I now have working settings. I'm not sure what the upper limit would be, but I have to admit that other things are running on that graphics card as well).

@chino-lu the AIO image contains pre-configured, opinionated models. In the case of the GPU images, it defaults to automatically offloading all layers to the GPU. To customize the model settings you can follow https://localai.io/docs/getting-started/customize-model/
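
A sketch of that approach with the standard (non-AIO) image: keep your per-model YAML (such as the gpu_layers override above) in a local directory and mount it so it survives restarts. The /build/models path comes from the logs above; the image tag is only an example.

# Hypothetical example: mount ./models (containing your model file and its YAML) into the container
docker run -p 8080:8080 --gpus all -ti \
  -v $PWD/models:/build/models \
  localai/localai:v2.11.0-cublas-cuda12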
