Support Qwen Model by Alibaba #1110

Closed
thiner opened this issue Sep 27, 2023 · 14 comments
Labels
enhancement New feature or request roadmap

Comments

@thiner
Contributor

thiner commented Sep 27, 2023

Is your feature request related to a problem? Please describe.

No. I just want to ask whether LocalAI can support the Qwen model: https://github.com/QwenLM/Qwen

Describe the solution you'd like

Support the Qwen model as a backend. They already provide an OpenAI-style API (https://github.com/QwenLM/Qwen#api), so maybe that would be easy to integrate.
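
For reference, a request against Qwen's OpenAI-style server would look roughly like the sketch below; the port and model name here are assumptions, so check the Qwen README linked above for the actual startup command and defaults.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "Qwen",
     "messages": [{"role": "user", "content": "Hello!"}]
   }'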

Describe alternatives you've considered

Additional context

@thiner thiner added the enhancement New feature or request label Sep 27, 2023
@mudler mudler added the roadmap label Sep 27, 2023
@localai-bot
Contributor

⚠️⚠️⚠️⚠️⚠️

Hi! I'm a bot running with LocalAI ( a crazy experiment of @mudler ) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful 😸 and, based on what you wrote in the issue, I can generally provide good tips or point to places to look in the documentation or in the code.

Don't engage in conversation with me,
I don't support (yet) replying!

⚠️⚠️⚠️⚠️⚠️

LocalAI can support Qwen model as a backend by adding it to the list of supported models in the "Features" section of the documentation. The integration process would involve setting up the Qwen API according to the OpenAI style API specification and configuring LocalAI to use it. This could be done through a PR to the LocalAI repository or by submitting a feature request on the issue tracker.
Alternatively, users can also use the Qwen model with LocalAI by installing the Qwen package and setting up the model in their own environment. However, this would require some technical knowledge and may not be as straightforward as using a pre-integrated backend.

Sources:

@mudler mudler removed their assignment Oct 1, 2023
@Aisuko
Collaborator

Aisuko commented Oct 14, 2023

I think it should be compatible via the new C++ backend, according to the description in their README: https://github.com/QwenLM/qwen.cpp#qwencpp
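
For anyone who wants to try that route, here is a rough sketch of building qwen.cpp and converting a checkpoint, paraphrased from its README; double-check the exact script paths and flags there before running.

# Build qwen.cpp and convert a Qwen checkpoint to GGML (sketch, per the qwen.cpp README)
git clone --recursive https://github.com/QwenLM/qwen.cpp && cd qwen.cpp
cmake -B build && cmake --build build -j --config Release
python3 qwen_cpp/convert.py -i Qwen/Qwen-7B-Chat -t q4_0 -o qwen7b-ggml.bin
# Chat with the converted model; the tiktoken vocabulary comes from the original checkpoint
./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -p "Hello"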

@fishfree

It still does not work.

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "qwen7b-ggml.bin",
     "messages": [{"role": "user", "content": "你好!"}],
     "temperature": 0.9
   }'
{"error":{"code":500,"message":"rpc error: code = Unknown desc = unimplemented","type":""}}

@snowyu

snowyu commented Jan 20, 2024

Qwen has now been merged into llama.cpp, @mudler:

@thiner
Contributor Author

thiner commented Feb 26, 2024

Tested with model qwen/Qwen1.5-1.8B-Chat-GGUF running in LocalAI:v2.9.0-cublas-cuda12-core, and it's working.

@thiner thiner closed this as completed Feb 26, 2024
@thiner
Contributor Author

thiner commented Feb 26, 2024

The model config file for reference.

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo
# Default model parameters.
# These options can also be specified in the API calls
parameters:
  model: qwen1_5-1_8b-chat-q5_k_m.gguf
  temperature: 0.75
  top_k: 85
  top_p: 0.7

# Default context size
context_size: 8192
# Default number of threads
threads: 16
backend: llama-cpp

# define chat roles
roles:
  user: "user:"
  assistant: "assistant:"
  system: "system:"
template:
  chat_message: &template |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}
    <|im_end|>
  chat: &template |
    {{.Input}}
    <|im_start|>assistant
  # Modify the prompt template here ^^^ as per your requirements
  completion: &template |
    {{.Input}}
stopwords:
- "<|im_end|>"
# Enable F16 if the backend supports it
f16: false
embeddings: false
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: 25
# Enable memory lock
mmlock: false
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Enable mmap
mmap: true
# Enable low vram mode (GPU only)
low_vram: false

# Disable mulmatq (CUDA)
no_mulmatq: true

# Diffusers/transformers
cuda: true
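
With this config the model is exposed under the name gpt-3.5-turbo, so API calls should use that name rather than the gguf filename, e.g.:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "gpt-3.5-turbo",
     "messages": [{"role": "user", "content": "Hello!"}],
     "temperature": 0.9
   }'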

@fishfree

fishfree commented Mar 8, 2024

@thiner Thank you for sharing. I created your yaml under the LocalAI/models folder, ran docker compose restart, and then ran:

ang@ubuntugpu:~/anythingllm$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "qwen1_5-1_8b-chat-q5_k_m.gguf",
     "messages": [{"role": "user", "content": "你是谁?"}],
     "temperature": 0.9 
   }'

Hours have passed with no progress. Why?
Or can we use this model instead? If so, how should the LocalAI model yaml file be written?

@thiner
Contributor Author

thiner commented Mar 8, 2024

  1. Please post your environment setup and the container log for diagnosis.
  2. Of course you can use a GPTQ-quantized model, but don't forget to set the right backend value in the config file.

I suggest minimizing your config file as a starting point, and setting the context length to a small value if you don't have a powerful GPU.

@fishfree

fishfree commented Mar 15, 2024

@thiner Thank you!

docker-compose logs -f
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
api_1  | 1:06AM DBG GRPC(qwen1_5-1_8b-chat-q5_k_m.gguf-127.0.0.1:42109): stderr update_slots : failed to decode the batch, n_batch = 1, ret = 1

My environment:
docker-compose.yaml file:

version: '3.6'

services:
  api:
    image: quay.io/go-skynet/local-ai:v2.9.0-cublas-cuda12-ffmpeg
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]
(base) mememe@ubuntugpu:~/LocalAI/models$ ls
all-minilm-l6-v2.yaml  c0c3c83d0ec33ffe925657a56b06771b            phi-2.yaml                     whisper-base.yaml
baichuan-7b.yaml       ggml-whisper-base.bin                       qwen1_5-1_8b-chat-q5_k_m.gguf
bakllava-mmproj.gguf   llava.yaml                                  qwen1_5-1_8b-chat-q5_k_m.yaml
bakllava.gguf          paraphrase-multilingual-MiniLM-L12-v2.yaml  tinydream.yaml

The content of the qwen1_5-1_8b-chat-q5_k_m.yaml file is copied from your post.

The /v1/models endpoint returns:
{"object":"list","data":[{"id":"paraphrase-multilingual-MiniLM-L12-v2","object":"model"},{"id":"phi-2","object":"model"},{"id":"tinydream","object":"model"},{"id":"whisper","object":"model"},{"id":"all-minilm-l6-v2","object":"model"},{"id":"baichuan-7b","object":"model"},{"id":"llava","object":"model"},{"id":"bakllava-mmproj.gguf","object":"model"},{"id":"qwen1_5-1_8b-chat-q5_k_m.gguf","object":"model"}]}

@thiner
Contributor Author

thiner commented Mar 15, 2024

You didn't load the config file correctly, because there is no gpt-3.5-turbo in the model list. Maybe you should remove the "cached" flag from the volume mount.
If it still fails to load the model, check the user guide: https://localai.io/docs/getting-started/customize-model/
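
A quick way to verify is to list the models and check that the name from the yaml shows up, for example:

curl http://localhost:8080/v1/models
# Once the yaml is picked up, the response should include {"id":"gpt-3.5-turbo","object":"model"}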

@thiner
Contributor Author

thiner commented Mar 15, 2024

I just found a typo in my previous config file and fixed it.
You may try the simplified configuration below to load your model file.

# Model name.
# The model name is used to identify the model in the API calls.
name: gpt-3.5-turbo
# Default model parameters.
# These options can also be specified in the API calls
parameters:
  model: qwen1_5-1_8b-chat-q5_k_m.gguf
  temperature: 0.75
  top_k: 85
  top_p: 0.7

# Default context size
context_size: 512
# Default number of threads
threads: 16
backend: llama-cpp

# define chat roles
roles:
  user: "user:"
  assistant: "assistant:"
  system: "system:"
template:
  chat_message: &template |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
    {{if .Content}}{{.Content}}{{end}}
    <|im_end|>
  chat: &template |
    {{.Input}}
    <|im_start|>assistant
  # Modify the prompt template here ^^^ as per your requirements
  completion: &template |
    {{.Input}}
stopwords:
- "<|im_end|>"
# Enable F16 if the backend supports it
f16: false
embeddings: false
# Mirostat configuration (llama.cpp only)
mirostat_eta: 0.8
mirostat_tau: 0.9
mirostat: 1
# GPU Layers (only used when built with cublas)
gpu_layers: -1
# Define a prompt cache path (relative to the models)
prompt_cache_path: "prompt-cache"
# Cache all the prompts
prompt_cache_all: true
# Read only
prompt_cache_ro: false
# Diffusers/transformers
cuda: true

@fishfree

@thiner Thank you! I replaced the yaml file with the new one and ran docker-compose restart. However, the problem remains the same. My disk is not full.

@thiner
Contributor Author

thiner commented Mar 15, 2024

If no error pops up and there is just no response, check this out: https://localai.io/faq/#everything-is-slow-how-is-it-possible.
If you still can't find "gpt-3.5-turbo" in the models list, check the LocalAI documentation and try to fix your startup command and settings.
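
It may also help to recreate the container after editing docker-compose.yaml (for example after dropping the ":cached" flag from the models volume, as suggested above), because a plain restart does not apply compose file changes; a sketch:

# Recreate the container so compose file changes take effect, then follow the logs
docker-compose up -d --force-recreate api
docker-compose logs -f api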

@luoweb

luoweb commented Apr 9, 2024

This happens to me too, with the same "failed to find free space in the KV cache" errors.
