Text generation is not working: unable to load model. #3448

aef5748 · 2024-09-02T09:30:50Z


I can run text-to-image generation on the GPU.
But when I ask a question in LocalAI, I get the error message below.

8:18AM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:35639): stderr ggml_backend_cuda_buffer_type_alloc_buffer: allocating 560.00 MiB on device 0: cudaMalloc failed: out of memory
8:18AM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:35639): stderr ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 587206656
8:18AM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:35639): stderr llama_new_context_with_model: failed to allocate compute buffers
8:18AM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:35639): stderr llama_init_from_gpt_params: error: failed to create context with model '/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf'
8:18AM DBG GRPC(Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf-127.0.0.1:35639): stdout {"timestamp":1725265090,"level":"ERROR","function":"load_model","line":466,"message":"unable to load model","model":"/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf"}
8:18AM INF [llama-cpp] Fails: could not load model: rpc error: code = Canceled desc =
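
For reference, the byte count in the second log line can be converted to MiB to confirm it matches the 560.00 MiB reported by `cudaMalloc` in the first line. The snippet below is a minimal sketch; the helper name `failed_alloc_mib` and the regular expression are illustrative assumptions, not part of LocalAI or llama.cpp.

```python
import re
from typing import Optional

# Message copied from the log above (prefix stripped).
LOG_LINE = "ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 587206656"


def failed_alloc_mib(line: str) -> Optional[float]:
    """Return the size in MiB of a failed buffer allocation, or None if no match.

    Hypothetical helper: the pattern is tailored to the message format seen
    in this particular log and may not match other llama.cpp versions.
    """
    match = re.search(r"failed to allocate \w+ buffer of size (\d+)", line)
    if match is None:
        return None
    return int(match.group(1)) / 2**20  # bytes -> MiB


print(f"{failed_alloc_mib(LOG_LINE):.2f} MiB")  # prints "560.00 MiB"
```

So the request that fails is the roughly 560 MiB compute buffer, which suggests the GPU had less than that amount free at the moment the context was created.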

full log:
localai_chat_gpu_setGpuLayers_20240902.txt

My environment:

OS: Ubuntu 20.04
GPU: NVIDIA GeForce GTX 1060 6GB

docker-compose.yaml:

version: '3.7'
networks:
  nextcloud:
    name: nextcloud
    driver: bridge
    
services:
  localai:
    image: localai/localai:v2.20.1-aio-gpu-nvidia-cuda-12
    container_name: localai
    restart: always
    environment:
      - DEBUG=true
      - LOCALAI_MODELS_PATH=/models
      - LOCALAI_IMAGE_PATH=/tmp/generated/images
      - HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

    volumes:
      - /home/sw/Nextcloud/container/localai/models:/models:cached
      - /home/sw/Nextcloud/container/localai/images:/tmp/generated/images/
      - /home/sw/Nextcloud/container/localai/huggingface:/usr/local/huggingface
    ports:
      - 3008:8080
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

model.yaml:

name: gpt-4
mmap: true
parameters:
  model: huggingface://NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
context_size: 2048
gpu_layers: 10

stopwords:
- "<|im_end|>"
- "<dummy32000>"
- "</tool_call>"
- "<|eot_id|>"
- "<|end_of_text|>"

function:
  # disable injecting the "answer" tool
  disable_no_action: true

  grammar:
    # This allows the grammar to also return messages
    mixed_mode: true
    # Suffix to add to the grammar
    #prefix: '<tool_call>\n'
    # Force parallel calls in the grammar
    # parallel_calls: true

  return_name_in_function_response: true
  # Without grammar uncomment the lines below
  # Warning: this is relying only on the capability of the
  # LLM model to generate the correct function call.
  json_regex_match: 
   - "(?s)<tool_call>(.*?)</tool_call>"
   - "(?s)<tool_call>(.*?)"
  replace_llm_results:
  # Drop the scratchpad content from responses
  - key: "(?s)<scratchpad>.*</scratchpad>"
    value: ""
  replace_function_results: 
  # Replace everything that is not JSON array or object
  # 
  - key: '(?s)^[^{\[]*'
    value: ""
  - key: '(?s)[^}\]]*$'
    value: ""
  - key: "'([^']*?)'"
    value: "_DQUOTE_${1}_DQUOTE_"
  - key: '\\"'
    value: "__TEMP_QUOTE__"
  - key: "\'"
    value: "'"
  - key: "_DQUOTE_"
    value: '"'
  - key: "__TEMP_QUOTE__"
    value: '"'
  # Drop the scratchpad content from responses
  - key: "(?s)<scratchpad>.*</scratchpad>"
    value: ""

template:
  chat: |
    {{.Input -}}
    <|im_start|>assistant
  chat_message: |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
    {{- if .FunctionCall }}
    <tool_call>
    {{- else if eq .RoleName "tool" }}
    <tool_response>
    {{- end }}
    {{- if .Content}}
    {{.Content }}
    {{- end }}
    {{- if .FunctionCall}}
    {{toJson .FunctionCall}}
    {{- end }}
    {{- if .FunctionCall }}
    </tool_call>
    {{- else if eq .RoleName "tool" }}
    </tool_response>
    {{- end }}<|im_end|>
  completion: |
    {{.Input}}
  function: |-
    <|im_start|>system
    You are a function calling AI model.
    Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    You should call the tools provided to you sequentially
    Please use <scratchpad> XML tags to record your reasoning and planning before you call the functions as follows:
    <scratchpad>
    {step-by-step reasoning and plan in bullet points}
    </scratchpad>
    For each function call return a json object with function name and arguments within <tool_call> XML tags as follows:
    <tool_call>
    {"arguments": <args-dict>, "name": <function-name>}
    </tool_call><|im_end|>
    {{.Input -}}
    <|im_start|>assistant

I tried adding gpu_layers: 10 to the model's YAML, but it did not help. Other models fail with the same error.
How can I reduce GPU memory usage so that the chat model can run?
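
One way to narrow this down would be to check how much GPU memory is already in use before the chat model loads, for example by other backends running in the same container. The sketch below assumes `nvidia-smi` is available on the host or inside the container; the function names are hypothetical, and only the standard `--query-gpu` and `--format` options of `nvidia-smi` are used.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (values in MiB) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


# Usage on a machine with an NVIDIA GPU:
#     for index, gpu in enumerate(query_gpu_memory()):
#         print(index, gpu)
```

If the free amount reported there is already below the 560 MiB that the failed allocation asked for, lowering `gpu_layers` alone may not be enough, because the layers are not the only thing allocated on the device.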
