openvino support in vllm (#65)
Signed-off-by: Zahidul Haque <[email protected]>
zahidulhaque committed Jun 7, 2024
1 parent 3d134d2 commit 7dbad07
Showing 3 changed files with 128 additions and 0 deletions.
73 changes: 73 additions & 0 deletions comps/llms/text-generation/vllm-openvino/README.md
@@ -0,0 +1,73 @@
# Use vLLM with OpenVINO

## Build Docker Image

To build the Docker image, run the following command:

```bash
bash ./build_vllm_openvino.sh
```

Once the build succeeds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with an OpenAI-compatible API endpoint, or you can work with it interactively via a bash shell.
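
For an interactive session, a minimal sketch (assuming the image's default entrypoint can be overridden) is:

```bash
# Open a bash shell inside the image instead of starting the server.
docker run -it --rm --entrypoint bash vllm:openvino
```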

## Use vLLM serving with OpenAI API

### Start The Server:

For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` with a valid Hugging Face Hub read token to the `docker run` command.

Please follow this [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token, then export it via the `HUGGINGFACEHUB_API_TOKEN` environment variable:

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

To start the model server:

```bash
bash launch_model_server.sh
```
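
If you are serving a gated model, the exported token also has to reach the container. The launch script does not forward it by default, so below is a minimal sketch; the extra `-e` flag is an assumption, modeled on the `docker run` command inside `launch_model_server.sh`:

```bash
# Sketch: forward the Hugging Face token into the serving container.
docker run --rm --name="vllm-openvino-server" \
    -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    vllm:openvino --model meta-llama/Llama-2-7b-hf --port 8000
```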

### Request Completion With Curl:

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "What is the key advantage of Openvino framework?",
"max_tokens": 300,
"temperature": 0.7
}'
```
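
To confirm the server is up and see which model it is serving, vLLM's OpenAI-compatible server also exposes a model listing endpoint; a quick check (assuming the default port) is:

```bash
curl http://localhost:8000/v1/models
```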

#### Customize vLLM-OpenVINO Service

The `launch_model_server.sh` script accepts two parameters:

- port: The port number of the vLLM CPU endpoint (default: 8000).
- model: The model name used for the LLM (default: `meta-llama/Llama-2-7b-hf`).

You can customize both parameters to suit your needs. For example, to specify a different model and port number:

`bash launch_model_server.sh -m meta-llama/Llama-2-7b-chat-hf -p 8123`

Additionally, you can set the vLLM CPU endpoint by exporting the environment variable `vLLM_LLM_ENDPOINT`:

```bash
export vLLM_LLM_ENDPOINT="http://xxx.xxx.xxx.xxx:8000"
export LLM_MODEL=<model_name> # example: export LLM_MODEL="meta-llama/Llama-2-7b-hf"
```
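
As an illustration, a request that reuses these variables (assuming both are exported as above) could look like:

```bash
# Sketch: reuse the exported endpoint and model name in a completion request.
curl "${vLLM_LLM_ENDPOINT}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${LLM_MODEL}\", \"prompt\": \"What is OpenVINO?\", \"max_tokens\": 32}"
```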

## Use Int-8 Weights Compression

Int-8 weights compression is disabled by default. For better performance and lower memory consumption, it can be enabled by setting the environment variable `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1`.
To pass the variable in Docker, add `-e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1` as an additional argument to the `docker run` command in the examples above.

The variable enables the weights compression logic described in [optimum-intel 8-bit weights quantization](https://huggingface.co/docs/optimum/intel/optimization_ov#8-bit).
Even when the variable is set, compression is applied only to models above a certain size; smaller models are not compressed, because the accuracy drop would be significant.

## Use UInt-8 KV Cache Compression

UInt-8 KV cache compression is disabled by default. For better performance and lower memory consumption, it can be enabled by setting the environment variable `VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`.
To pass the variable in Docker, add `-e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8` as an additional argument to the `docker run` command in the examples above.
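
For reference, here is a sketch that enables both compression options in one run; the command is modeled on the `docker run` invocation in `launch_model_server.sh`, so the exact arguments are an assumption:

```bash
# Sketch: run the server with int-8 weights compression and u8 KV cache compression enabled.
docker run --rm --name="vllm-openvino-server" \
    -p 8000:8000 \
    -e VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=1 \
    -e VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    vllm:openvino --model meta-llama/Llama-2-7b-hf --port 8000
```
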
9 changes: 9 additions & 0 deletions comps/llms/text-generation/vllm-openvino/build_vllm_openvino.sh
@@ -0,0 +1,9 @@
#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


# Clone the vLLM fork that provides the OpenVINO model executor branch
git clone --branch openvino-model-executor https://github.com/ilya-lavrenov/vllm.git
cd ./vllm/

# Build the vllm:openvino image, forwarding the host proxy settings
docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
46 changes: 46 additions & 0 deletions comps/llms/text-generation/vllm-openvino/launch_model_server.sh
@@ -0,0 +1,46 @@
#!/bin/bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


# Set default values


default_port=8000
default_model="meta-llama/Llama-2-7b-hf"
swap_space=50

while getopts ":hm:p:" opt; do
case $opt in
h)
echo "Usage: $0 [-h] [-m model] [-p port]"
echo "Options:"
echo " -h Display this help message"
echo " -m model Model (default: meta-llama/Llama-2-7b-hf)"
echo " -p port Port (default: 8000)"
exit 0
;;
m)
model=$OPTARG
;;
p)
port=$OPTARG
;;
\?)
echo "Invalid option: -$OPTARG" >&2
exit 1
;;
esac
done

# Assign arguments to variables
model_name=${model:-$default_model}
port_number=${port:-$default_port}


# Set the Huggingface cache directory variable
HF_CACHE_DIR=$HOME/.cache/huggingface

# Start the model server using Openvino as the backend inference engine. Provide the container name that is unique and meaningful, typically one that includes the model name.
docker run --rm --name="vllm-openvino-server" \
    -p "$port_number:$port_number" \
    -v "$HF_CACHE_DIR:/root/.cache/huggingface" \
    vllm:openvino \
    --model "$model_name" \
    --port "$port_number" \
    --disable-log-requests \
    --swap-space "$swap_space"
