
Set up Local Model API Serving

AgentScope supports developers in building local model API serving with different inference engines/libraries. This document introduces how to quickly build local API serving with the provided scripts.

Table of Contents

  • Local Model API Serving
    • ollama
    • Flask-based Model API Serving
      • With Transformers Library
      • With ModelScope Library
    • FastChat
    • vllm
  • Model Inference API

ollama

ollama is a CPU inference engine for LLMs. With ollama, developers can build local model API serving without GPU requirements.

Install Libraries and Set up Serving

  • First, install ollama by following the instructions in its official repository for your system (e.g. macOS, Windows, or Linux).

  • Follow ollama's guidance to pull or create a model and start serving it. Taking llama2 as an example, you can run the following command to pull the model files.

ollama pull llama2
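
By default, the ollama server listens on http://localhost:11434. Before wiring it into AgentScope, you can confirm the serving works with a quick request. The sketch below assumes the default port and the llama2 model pulled above.

# Quick sanity check against a locally running ollama server.
# Assumes the default endpoint http://localhost:11434 and the "llama2" model.
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama2",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,  # return a single JSON response instead of a stream
    },
)
response.raise_for_status()
print(response.json()["message"]["content"])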

How to use in AgentScope

In AgentScope, you can use the following model configurations to load the model.

  • For ollama Chat API:
{
    "config_name": "my_ollama_chat_config",
    "model_type": "ollama_chat",

    # Required parameters
    "model_name": "{model_name}",                    # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                                # Parameters passed to the model when calling
        # e.g. "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",                         # Controls how long the model will stay loaded into memory
}
  • For ollama generate API:
{
    "config_name": "my_ollama_generate_config",
    "model_type": "ollama_generate",

    # Required parameters
    "model_name": "{model_name}",                    # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                                # Parameters passed to the model when calling
        # "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",                         # Controls how long the model will stay loaded into memory
}
  • For ollama embedding API:
{
    "config_name": "my_ollama_embedding_config",
    "model_type": "ollama_embedding",

    # Required parameters
    "model_name": "{model_name}",                    # The model name used in ollama API, e.g. llama2

    # Optional parameters
    "options": {                                # Parameters passed to the model when calling
        # "temperature": 0., "seed": 123,
    },
    "keep_alive": "5m",                         # Controls how long the model will stay loaded into memory
}
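
With one of the configurations above registered, loading and calling the model in AgentScope roughly follows the pattern below. This is a minimal sketch rather than a definitive recipe: agentscope.init, DialogAgent, and Msg are AgentScope APIs, but argument names can vary slightly across versions, so adjust to the version you have installed.

# Minimal sketch of using the ollama chat config in AgentScope.
# Exact argument names may vary between AgentScope versions.
import agentscope
from agentscope.agents import DialogAgent
from agentscope.message import Msg

# Register the model configuration (a path to a JSON config file also works).
agentscope.init(
    model_configs=[
        {
            "config_name": "my_ollama_chat_config",
            "model_type": "ollama_chat",
            "model_name": "llama2",
        }
    ]
)

# An agent that generates replies with the local ollama model.
agent = DialogAgent(
    name="assistant",
    sys_prompt="You are a helpful assistant.",
    model_config_name="my_ollama_chat_config",
)

reply = agent(Msg(name="user", content="Hi!", role="user"))
print(reply.content)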

Flask-based Model API Serving

Flask is a lightweight web application framework. It is easy to build a local model API serving with Flask.

Here we provide two Flask examples, with the Transformers and ModelScope libraries respectively. You can build your own model API serving with a few modifications.

With Transformers Library

Install Libraries and Set up Serving

Install Flask and Transformers with the following command.

pip install flask torch transformers accelerate

Taking model meta-llama/Llama-2-7b-chat-hf and port 8000 as an example, set up the model API serving by running the following command.

python flask_transformers/setup_hf_service.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --device "cuda:0" \
    --port 8000

You can replace meta-llama/Llama-2-7b-chat-hf with any model card in the Hugging Face model hub.

How to use in AgentScope

In AgentScope, you can load the model with the following model config: ./flask_transformers/model_config.json.

{
    "model_type": "post_api_chat",
    "config_name": "flask_llama2-7b-chat-hf",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
Note

In this model serving, the messages in POST requests should be in STRING format. You can use the chat templates in transformers with a little modification to ./flask_transformers/setup_hf_service.py.
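
For reference, the core of such a Flask service is small. The sketch below is not the provided setup_hf_service.py but a simplified illustration: the /llm/ route matches the api_url above, while the "prompt" and "response" field names and the generation defaults are assumptions made for demonstration and may differ from the actual script.

# Illustrative sketch of a Flask service wrapping a Hugging Face chat model.
# NOT the provided setup_hf_service.py; the "prompt"/"response" field names
# and the defaults below are assumptions for demonstration only.
from flask import Flask, jsonify, request
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

app = Flask(__name__)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="cuda:0")

@app.route("/llm/", methods=["POST"])
def generate():
    data = request.get_json()
    prompt = data["prompt"]  # a plain STRING prompt, as noted above
    # A chat template could be applied here, e.g. tokenizer.apply_chat_template(...)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=data.get("max_length", 4096),
        temperature=data.get("temperature", 0.5),
        do_sample=True,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return jsonify({"response": text})

if __name__ == "__main__":
    app.run(port=8000)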

With ModelScope Library

Install Libraries and Set up Serving

Install Flask and ModelScope with the following command.

pip install flask torch modelscope

Taking model modelscope/Llama-2-7b-chat-ms and port 8000 as an example, to set up the model API serving, run the following command.

python flask_modelscope/setup_ms_service.py \
    --model_name_or_path modelscope/Llama-2-7b-chat-ms \
    --device "cuda:0" \
    --port 8000

You can replace modelscope/Llama-2-7b-chat-ms with any model card in the ModelScope model hub.

How to use in AgentScope

In AgentScope, you can load the model with the following model config: flask_modelscope/model_config.json.

{
    "model_type": "post_api_chat",
    "config_name": "flask_llama2-7b-chat-ms",
    "api_url": "http://127.0.0.1:8000/llm/",
    "json_args": {
        "max_length": 4096,
        "temperature": 0.5
    }
}
Note

Similar to the Transformers example, the messages in POST requests should be in STRING format.
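
To test either Flask service outside AgentScope, you can post a plain string directly to the endpoint. The payload keys below ("prompt", "max_length") are the same hypothetical names used in the Transformers sketch above; check the actual setup scripts for the exact fields they expect.

# Quick client-side check against the local Flask service.
# The "prompt"/"response" keys are assumptions; see the setup scripts for
# the exact payload format your service expects.
import requests

response = requests.post(
    "http://127.0.0.1:8000/llm/",
    json={"prompt": "What is the capital of France?", "max_length": 256},
)
response.raise_for_status()
print(response.json()["response"])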

FastChat

FastChat is an open platform that provides quick setup for model serving with OpenAI-compatible RESTful APIs.

Install Libraries and Set up Serving

To install FastChat, run

pip install "fschat[model_worker,webui]"

Taking model meta-llama/Llama-2-7b-chat-hf and port 8000 as an example, run the following command to set up the model API serving.

bash fastchat/fastchat_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000

Supported Models

Refer to the supported model list of FastChat.

How to use in AgentScope

Now you can load the model in AgentScope with the following model config: fastchat/model_config.json.

{
    "model_type": "openai_chat",
    "config_name": "fastchat_llama2-7b-chat-hf",
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}
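
Because FastChat exposes an OpenAI-compatible API, you can also verify the endpoint directly with the official openai Python client (1.x interface shown) before going through AgentScope; the base_url and api_key below mirror the config above.

# Verify the FastChat endpoint with the openai client (openai>=1.0).
# Assumes the server started by the command above is listening on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1/", api_key="EMPTY")
completion = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.5,
)
print(completion.choices[0].message.content)

The same check works for the vllm server in the next section, since it also exposes an OpenAI-compatible endpoint.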

vllm

vllm is a high-throughput inference and serving engine for LLMs.

Install Libraries and Set up Serving

To install vllm, run

pip install vllm

Taking model meta-llama/Llama-2-7b-chat-hf and port 8000 as an example, run the following command to set up the model API serving.

./vllm/vllm_setup.sh -m meta-llama/Llama-2-7b-chat-hf -p 8000

Supported models

Please refer to the supported models list of vllm.

How to use in AgentScope

Now you can load the model in AgentScope with the following model config: vllm/model_config.json.

{
    "model_type": "openai_chat",
    "config_name": "vllm_llama2-7b-chat-hf",
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "api_key": "EMPTY",
    "client_args": {
        "base_url": "http://127.0.0.1:8000/v1/"
    },
    "generate_args": {
        "temperature": 0.5
    }
}

Model Inference API

Both Hugging Face and ModelScope provide model inference APIs, which can be used with the AgentScope post API model wrapper. Taking gpt2 in the Hugging Face inference API as an example, you can use the following model config in AgentScope.

{
    "model_type": "post_api_chat",
    "config_name": "gpt2",
    "headers": {
        "Authorization": "Bearer {YOUR_API_TOKEN}"
    },
    "api_url": "https://api-inference.huggingface.co/models/gpt2"
}
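
To confirm the token and endpoint before plugging this config into AgentScope, you can call the hosted inference API directly. The sketch below uses the standard {"inputs": ...} payload of the Hugging Face inference API; replace {YOUR_API_TOKEN} with a valid access token.

# Quick check of the Hugging Face hosted inference API for gpt2.
# Replace {YOUR_API_TOKEN} with a valid Hugging Face access token.
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": "Bearer {YOUR_API_TOKEN}"}

response = requests.post(API_URL, headers=headers, json={"inputs": "Hello, I am"})
response.raise_for_status()
print(response.json())  # typically a list like [{"generated_text": "..."}]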