Add vLLM ARC support with OpenVINO backend #641

Merged: 8 commits, Nov 8, 2024
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/llms-compose-cd.yaml
@@ -15,6 +15,10 @@ services:
context: vllm-openvino
dockerfile: Dockerfile.openvino
image: ${REGISTRY:-opea}/vllm-openvino:${TAG:-latest}
vllm-arc:
build:
dockerfile: comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
image: ${REGISTRY:-opea}/vllm-arc:${TAG:-latest}
llm-eval:
build:
dockerfile: comps/llms/utils/lm-eval/Dockerfile
43 changes: 39 additions & 4 deletions comps/llms/text-generation/vllm/langchain/README.md
@@ -98,23 +98,31 @@ For example, if we run `meta-llama/Meta-Llama-3-70b` with 8 cards, we can use the following command:
bash ./launch_vllm_service.sh 8008 meta-llama/Meta-Llama-3-70b hpu 8
```

### 2.3 vLLM with OpenVINO
### 2.3 vLLM with OpenVINO (on Intel GPU and CPU)

vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support. OpenVINO vLLM backend supports the following advanced vLLM features:
vLLM powered by OpenVINO supports all LLM models from the [vLLM supported models list](https://github.com/vllm-project/vllm/blob/main/docs/source/models/supported_models.rst) and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs (starting from the Intel® UHD Graphics generation). The OpenVINO vLLM backend supports the following advanced vLLM features:

- Prefix caching (`--enable-prefix-caching`)
- Chunked prefill (`--enable-chunked-prefill`)

#### Build Docker Image

To build the docker image, run the command
To build the docker image for Intel CPU, run the command

```bash
bash ./build_docker_vllm_openvino.sh
```

Once it successfully builds, you will have the `vllm:openvino` image. It can be used to spawn a serving container with OpenAI API endpoint or you can work with it interactively via bash shell.

To build the docker image for Intel GPU, run the command

```bash
bash ./build_docker_vllm_openvino.sh gpu
```

Once it successfully builds, you will have the `opea/vllm-arc:latest` image. It can be used to spawn a serving container with OpenAI API endpoint or you can work with it interactively via bash shell.
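
A quick sanity check that the build produced the expected tag (a sketch; assumes the default `opea/vllm-arc:latest` tag used above and a local Docker daemon):

```bash
# Verify the ARC image exists locally; suggest a rebuild if it does not.
if docker images --format '{{.Repository}}:{{.Tag}}' | grep -q '^opea/vllm-arc:latest$'; then
  echo "opea/vllm-arc:latest is available"
else
  echo "Image not found; re-run: bash ./build_docker_vllm_openvino.sh gpu"
fi
```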

#### Launch vLLM service

For gated models, such as `LLAMA-2`, you will have to pass -e HUGGING_FACE_HUB_TOKEN=\<token\> to the docker run command above with a valid Hugging Face Hub read token.
@@ -125,14 +133,30 @@ Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens)
export HUGGINGFACEHUB_API_TOKEN=<token>
```

To start the model server:
To start the model server for Intel CPU:

```bash
bash launch_vllm_service_openvino.sh
```

To start the model server for Intel GPU:

```bash
bash launch_vllm_service_openvino.sh -d gpu
```
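
The container can take a while to download weights and load the model before the endpoint starts answering. A small readiness probe helps in scripts (a sketch; the port 8008 and the 5-minute budget are assumptions, not part of the scripts above):

```bash
# Poll the OpenAI-compatible endpoint until the model list is served.
wait_for_vllm() {
  local port="${1:-8008}"
  local url="http://localhost:${port}/v1/models"
  local i
  for i in $(seq 1 30); do
    if curl -sf "$url" > /dev/null; then
      echo "vLLM service is ready at $url"
      return 0
    fi
    echo "Waiting for vLLM service ($i/30)..."
    sleep 10
  done
  echo "Timed out waiting for $url" >&2
  return 1
}

# Usage: wait_for_vllm 8008
```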

#### Performance tips

---

The vLLM OpenVINO backend recognizes the following environment variables:

- `VLLM_OPENVINO_DEVICE` to specify which device to utilize for inference. If there are multiple GPUs in the system, additional indexes can be used to choose the proper one (e.g., `VLLM_OPENVINO_DEVICE=GPU.1`). If the value is not specified, the CPU device is used by default.

- `VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON` enables U8 weight compression during the model loading stage. By default, compression is turned off. You can also export the model with different compression techniques using `optimum-cli` and pass the exported folder as `<model_id>`.
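
As an illustration of the `optimum-cli` route (a sketch: the model id, the `./llama-2-7b-ov-int4` output folder, and the `int4` weight format are example values; `optimum-cli export openvino` is provided by Optimum Intel):

```bash
# Export a model with 4-bit weight compression, then pass the exported
# folder to vLLM in place of the Hub id. Example values only.
model_id="meta-llama/Llama-2-7b-chat-hf"
out_dir="./llama-2-7b-ov-int4"

if command -v optimum-cli > /dev/null; then
  optimum-cli export openvino --model "$model_id" --weight-format int4 "$out_dir"
else
  echo "optimum-cli not found; install it with: pip install optimum[openvino]"
fi
```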

##### CPU performance tips

vLLM OpenVINO backend uses the following environment variables to control behavior:

- `VLLM_OPENVINO_KVCACHE_SPACE` to specify the KV cache size (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=40` means 40 GB of space for the KV cache); a larger setting allows vLLM to run more requests in parallel. This parameter should be set based on the hardware configuration and the user's memory management pattern.
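
To pick a sensible `VLLM_OPENVINO_KVCACHE_SPACE` value, a rough back-of-envelope estimate of the per-token KV cache cost can help (a sketch; the Llama-2-7B shape values below come from its public config, and u8 cache precision of 1 byte per element is assumed):

```bash
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
num_layers=32; num_kv_heads=32; head_dim=128; bytes_per_elem=1; tokens=1024
total_bytes=$((2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * tokens))
echo "KV cache per ${tokens} tokens: $((total_bytes / 1024 / 1024)) MiB"
```

With these numbers each 1024-token sequence costs 256 MiB, so a 40 GB `VLLM_OPENVINO_KVCACHE_SPACE` holds roughly 160 such sequences.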
@@ -148,6 +172,17 @@ OpenVINO best known configuration is:
$ VLLM_OPENVINO_KVCACHE_SPACE=100 VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8 VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --enable-chunked-prefill --max-num-batched-tokens 256

##### GPU performance tips

The GPU device implements logic for automatic detection of available GPU memory and, by default, tries to reserve as much memory as possible for the KV cache (taking the `gpu_memory_utilization` option into account). However, this behavior can be overridden by explicitly specifying the desired amount of memory for the KV cache via the `VLLM_OPENVINO_KVCACHE_SPACE` environment variable (e.g., `VLLM_OPENVINO_KVCACHE_SPACE=8` means 8 GB of space for the KV cache).

Currently, the best GPU performance is achieved with the default vLLM execution parameters for models with quantized weights (8- and 4-bit integer data types are supported) and `preemption-mode=swap`.

The OpenVINO best-known configuration for GPU is:

$ VLLM_OPENVINO_DEVICE=GPU VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON \
python3 vllm/benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-chat-hf --dataset vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json

### 2.4 Query the service

And then you can make requests like below to check the service status:
43 changes: 43 additions & 0 deletions comps/llms/text-generation/vllm/langchain/dependency/Dockerfile.intel_gpu
@@ -0,0 +1,43 @@
# The vLLM Dockerfile is used to construct vLLM image that can be directly used
# to run the OpenAI compatible server.
# Based on https://github.com/vllm-project/vllm/blob/main/Dockerfile.openvino
# add Intel ARC support package

FROM ubuntu:22.04 AS dev

RUN apt-get update -y && \
apt-get install -y \
git python3-pip \
ffmpeg libsm6 libxext6 libgl1 \
gpg-agent wget

RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg && \
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | \
    tee /etc/apt/sources.list.d/intel-gpu-jammy.list && \
    apt update -y && \
apt install -y \
intel-opencl-icd intel-level-zero-gpu level-zero \
intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 \
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo

WORKDIR /workspace

RUN git clone -b v0.6.3.post1 https://github.com/vllm-project/vllm.git

#ARG GIT_REPO_CHECK=0
#RUN --mount=type=bind,source=.git,target=.git \
# if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

# install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt
# build vLLM with OpenVINO backend
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/

#COPY examples/ /workspace/vllm/examples
#COPY benchmarks/ /workspace/vllm/benchmarks


CMD ["/bin/bash"]

@@ -3,8 +3,27 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
git clone https://github.com/vllm-project/vllm.git vllm
cd ./vllm/ && git checkout v0.6.1
docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
cd $BASEDIR && rm -rf vllm
# Set default values
default_hw_mode="cpu"

# Assign arguments to variable
hw_mode=${1:-$default_hw_mode}

# Check that at most one argument is provided
if [ "$#" -gt 1 ]; then
    echo "Usage: $0 [hw_mode]"
    echo "Please customize the arguments you want to use.
    - hw_mode: The hardware mode for the vLLM endpoint; the default is 'cpu', and the optional selections are 'cpu' and 'gpu'."
    exit 1
fi

# Build the docker image for vLLM based on the hardware mode
if [ "$hw_mode" = "gpu" ]; then
docker build -f Dockerfile.intel_gpu -t opea/vllm-arc:latest . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
else
BASEDIR="$( cd "$( dirname "$0" )" && pwd )"
    git clone https://github.com/vllm-project/vllm.git vllm
cd ./vllm/ && git checkout v0.6.1
docker build -t vllm:openvino -f Dockerfile.openvino . --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
cd $BASEDIR && rm -rf vllm
fi
@@ -9,16 +9,20 @@

default_port=8008
default_model="meta-llama/Llama-2-7b-hf"
default_device="cpu"
swap_space=50
image="vllm:openvino"

while getopts ":hm:p:" opt; do
while getopts ":hm:p:d:" opt; do
case $opt in
h)
echo "Usage: $0 [-h] [-m model] [-p port]"
echo "Usage: $0 [-h] [-m model] [-p port] [-d device]"
echo "Options:"
echo " -h Display this help message"
echo " -m model Model (default: meta-llama/Llama-2-7b-hf)"
echo " -m model Model (default: meta-llama/Llama-2-7b-hf for cpu,"
echo "                meta-llama/Llama-3.2-3B-Instruct for gpu)"
echo " -p port Port (default: 8008)"
echo " -d device Target Device (default: cpu; optional selections are 'cpu' and 'gpu')"
exit 0
;;
m)
@@ -27,6 +31,9 @@ while getopts ":hm:p:" opt; do
p)
port=$OPTARG
;;
d)
device=$OPTARG
;;
\?)
echo "Invalid option: -$OPTARG" >&2
exit 1
@@ -37,25 +44,33 @@ done
# Assign arguments to variables
model_name=${model:-$default_model}
port_number=${port:-$default_port}
device=${device:-$default_device}


# Set the Huggingface cache directory variable
HF_CACHE_DIR=$HOME/.cache/huggingface

if [ "$device" = "gpu" ]; then
    docker_args="-e VLLM_OPENVINO_DEVICE=GPU --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path"
    vllm_args="--max_model_len=1024"
    # Respect a user-supplied -m model; otherwise fall back to the GPU default
    model_name=${model:-"meta-llama/Llama-3.2-3B-Instruct"}
    image="opea/vllm-arc:latest"
fi
# Start the model server using Openvino as the backend inference engine.
# Provide the container name that is unique and meaningful, typically one that includes the model name.

docker run -d --rm --name="vllm-openvino-server" \
-p $port_number:80 \
--ipc=host \
$docker_args \
-e HTTPS_PROXY=$https_proxy \
  -e HTTP_PROXY=$http_proxy \
-e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
-v $HOME/.cache/huggingface:/home/user/.cache/huggingface \
vllm:openvino /bin/bash -c "\
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
$image /bin/bash -c "\
cd / && \
export VLLM_CPU_KVCACHE_SPACE=50 && \
python3 -m vllm.entrypoints.openai.api_server \
--model \"$model_name\" \
$vllm_args \
--host 0.0.0.0 \
--port 80"
3 changes: 2 additions & 1 deletion comps/llms/text-generation/vllm/langchain/query.sh
@@ -2,11 +2,12 @@
# SPDX-License-Identifier: Apache-2.0

your_ip="0.0.0.0"
model=$(curl http://localhost:8008/v1/models -s | jq -r '.data[].id')

curl http://${your_ip}:8008/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"model": "'$model'",
"prompt": "What is Deep Learning?",
"max_tokens": 32,
"temperature": 0