Support Qwen Model by Alibaba #1110

Closed
thiner opened this issue Sep 27, 2023 · 14 comments
Labels
enhancement New feature or request roadmap

Comments

@thiner
Contributor

thiner commented Sep 27, 2023

Is your feature request related to a problem? Please describe.

No. I just want to ask whether LocalAI can support the Qwen model: https://github.com/QwenLM/Qwen

Describe the solution you'd like

Support the Qwen model as a backend. They already provide an OpenAI-style API (https://github.com/QwenLM/Qwen#api), so maybe that would be easy to integrate.
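
For reference, a request against Qwen's OpenAI-style server would look roughly like the sketch below; the port and model name here are assumptions, so check the Qwen README linked above for the actual startup command and defaults.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "Qwen",
     "messages": [{"role": "user", "content": "Hello!"}]
   }'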

Describe alternatives you've considered

Additional context

@thiner thiner added the enhancement New feature or request label Sep 27, 2023
@mudler mudler added the roadmap label Sep 27, 2023
@localai-bot
Contributor

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful 😸 and, based on what you wrote in the issue, I can generally provide good tips or point to places to look in the documentation or in the code.

Don't engage in conversation with me,
I don't support (yet) replying!

⚠️⚠️⚠️⚠️⚠️

LocalAI can support Qwen model as a backend by adding it to the list of supported models in the "Features" section of the documentation. The integration process would involve setting up the Qwen API according to the OpenAI style API specification and configuring LocalAI to use it. This could be done through a PR to the LocalAI repository or by submitting a feature request on the issue tracker.
Alternatively, users can also use the Qwen model with LocalAI by installing the Qwen package and setting up the model in their own environment. However, this would require some technical knowledge and may not be as straightforward as using a pre-integrated backend.

Sources:

@mudler mudler removed their assignment Oct 1, 2023
@Aisuko
Collaborator

Aisuko commented Oct 14, 2023

I think it should be compatible via the new C++ backend, according to the description in their README: https://github.com/QwenLM/qwen.cpp#qwencpp
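
For anyone who wants to try that route, here is a rough sketch of building qwen.cpp and converting a checkpoint, paraphrased from its README; double-check the exact script paths and flags there before running.

# Build qwen.cpp and convert a Qwen checkpoint to GGML (sketch, per the qwen.cpp README)
git clone --recursive https://github.com/QwenLM/qwen.cpp && cd qwen.cpp
cmake -B build && cmake --build build -j --config Release
python3 qwen_cpp/convert.py -i Qwen/Qwen-7B-Chat -t q4_0 -o qwen7b-ggml.bin
# Chat with the converted model; the tiktoken vocabulary comes from the original checkpoint
./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -p "Hello"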

@fishfree

It still does not work.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "qwen7b-ggml.bin",
     "messages": [{"role": "user", "content": "你好!"}],
     "temperature": 0.9
   }'
{"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}

@snowyu

snowyu commented Jan 20, 2024

Qwen has now been merged into llama.cpp, @mudler:

@thiner
Contributor Author

thiner commented Feb 26, 2024

Tested with model qwen/Qwen1.5-1.8B-Chat-GGUF running in LocalAI:v2.9.0-cublas-cuda12-core, and it's working.

@thiner thiner closed this as completed Feb 26, 2024
@thiner
Contributor Author

thiner commented Feb 26, 2024

The model config file for reference.

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo
# Default model parameters.
# These options can also be specified in the API calls
parameters:
  model: qwen1_5-1_8b-chat-q5_k_m.gguf
  temperature: 0.75
  top_k: 85
  top_p: 0.7

# Default context size
context_size: 8192
# Default number of threads
threads: 16
backend: llama-cpp

# define chat roles
roles:
  user: "user:"
  assistant: "assistant:"
  system: "system:"
template:
  chat_message: &template |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}
    <|im_end|>
  chat: &template |
    {{.Input}}
    <|im_start|>assistant
  # Modify the prompt template here ^^^ as per your requirements
  completion: &template |
    {{.Input}}
stopwords:
- "<|im_end|>"
# Enable F16 if the backend supports it
f16: false
embeddings: false
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 25
# Enable memory lock
mmlock: false
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: false

# Disable mulmatq (CUDA)
no_mulmatq: true

# Diffusers/transformers
cuda: true
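
With this config the model is exposed under the name gpt-3.5-turbo, so API calls should use that name rather than the gguf filename, e.g.:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "Hello!"}],
     "temperature": 0.9
   }'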

@fishfree

fishfree commented Mar 8, 2024

@thiner Thank you for sharing. I created your yaml under the LocalAI/models folder, ran docker compose restart, and then ran:

ang@ubuntugpu:~/anythingllm$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "qwen1_5-1_8b-chat-q5_k_m.gguf",
     "messages": [{"role": "user", "content": "你是谁?"}],
     "temperature": 0.9 
   }'

Hours have passed with no progress. Why?
Or can we use this model instead? If so, how should the LocalAI model yaml file be written?

@thiner
Contributor Author

thiner commented Mar 8, 2024

  1. Please post your environment setup and the container log for diagnosis.
  2. Of course you can use a GPTQ-quantized model, but don't forget to set the right backend value in the config file.

I suggest minimizing your config file as a starting point, and setting the context length to a small value if you don't have a powerful GPU.

@fishfree

fishfree commented Mar 15, 2024

@thiner Thank you!

docker-compose logs -f
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to decode the batch, n_batch = 1, ret = 1

My environment:
docker-compose.yaml file:

version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:v2.9.0-cublas-cuda12-ffmpeg
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]
(base) mememe@ubuntugpu:~/LocalAI/models$ ls
all-minilm-l6-v2.yaml  c0c3c83d0ec33ffe925657a56b06771b            phi-2.yaml                     whisper-base.yaml
baichuan-7b.yaml       ggml-whisper-base.bin                       qwen1_5-1_8b-chat-q5_k_m.gguf
bakllava-mmproj.gguf   llava.yaml                                  qwen1_5-1_8b-chat-q5_k_m.yaml
bakllava.gguf          paraphrase-multilingual-MiniLM-L12-v2.yaml  tinydream.yaml

The content of the qwen1_5-1_8b-chat-q5_k_m.yaml file is copied from your post.

The /v1/models endpoint returns:
{"object":"list","data":[{"id":"paraphrase-multilingual-MiniLM-L12-v2","object":"model"},{"id":"phi-2","object":"model"},{"id":"tinydream","object":"model"},{"id":"whisper","object":"model"},{"id":"all-minilm-l6-v2","object":"model"},{"id":"baichuan-7b","object":"model"},{"id":"llava","object":"model"},{"id":"bakllava-mmproj.gguf","object":"model"},{"id":"qwen1_5-1_8b-chat-q5_k_m.gguf","object":"model"}]}

@thiner
Contributor Author

thiner commented Mar 15, 2024

You didn't load the config file correctly, because there is no gpt-3.5-turbo in the model list. Maybe you should remove the "cached" flag from the volume mount.
If it still fails to load the model, check the user guide: https://localai.io/docs/getting-started/customize-model/
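
A quick way to verify is to list the models and check that the name from the yaml shows up, for example:

curl http://localhost:8080/v1/models
# Once the yaml is picked up, the response should include {"id":"gpt-3.5-turbo","object":"model"}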

@thiner
Contributor Author

thiner commented Mar 15, 2024

I just found a typo in my previous config file and fixed it.
You may try the simplified configuration below to load your model file.

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo
# Default model parameters.
# These options can also be specified in the API calls
parameters:
  model: qwen1_5-1_8b-chat-q5_k_m.gguf
  temperature: 0.75
  top_k: 85
  top_p: 0.7

# Default context size
context_size: 512
# Default number of threads
threads: 16
backend: llama-cpp

# define chat roles
roles:
  user: "user:"
  assistant: "assistant:"
  system: "system:"
template:
  chat_message: &template |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}
    <|im_end|>
  chat: &template |
    {{.Input}}
    <|im_start|>assistant
  # Modify the prompt template here ^^^ as per your requirements
  completion: &template |
    {{.Input}}
stopwords:
- "<|im_end|>"
# Enable F16 if the backend supports it
f16: false
embeddings: false
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: -1
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Diffusers/transformers
cuda: true

@fishfree

@thiner Thank you! I replaced the yaml file with the new one and ran docker-compose restart. However, the problem remains the same. My disk is not full.

@thiner
Contributor Author

thiner commented Mar 15, 2024

If no error pops up and there is just no response, check this out: https://localai.io/faq/#everything-is-slow-how-is-it-possible.
If you still can't find "gpt-3.5-turbo" in the models list, check the LocalAI documentation and try to fix your startup command and settings.
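
It may also help to recreate the container after editing docker-compose.yaml (for example after dropping the ":cached" flag from the models volume, as suggested above), because a plain restart does not apply compose file changes; a sketch:

# Recreate the container so compose file changes take effect, then follow the logs
docker-compose up -d --force-recreate api
docker-compose logs -f api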

@luoweb

luoweb commented Apr 9, 2024

This happens to me too, with the same "failed to find free space in the KV cache" errors.
