[HPU] [Serve] [experimental] Add vllm HPU support in vllm example (#45893)


## Why are these changes needed?
This PR adds vLLM HPU support to the vLLM example (#45430). The added code checks whether an HPU device exists before allocating resources to the vLLM actors. If it does, HPU resources are used; otherwise, GPU resources are used as before.
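
For reference, here is a minimal sketch of the resource-selection pattern this change introduces in `build_app`; the standalone `select_pg_resources` helper and its signature are illustrative, but the accelerator defaulting and the placement-group bundle shapes follow the example code in the diff below.

```python
from typing import Dict, List


def select_pg_resources(
    cli_args: Dict[str, str], tensor_parallel_size: int
) -> List[Dict[str, int]]:
    """Build placement group bundles for the Serve replica and the vLLM workers."""
    # Read the accelerator type from the CLI args, defaulting to GPU
    # (this mirrors the `accelerator` handling added in build_app below).
    if "accelerator" in cli_args:
        accelerator = cli_args.pop("accelerator")
    else:
        accelerator = "GPU"

    bundles = [{"CPU": 1}]  # for the deployment replica
    for _ in range(tensor_parallel_size):
        # Each vLLM actor gets one CPU and one unit of the chosen
        # accelerator resource ("GPU" or "HPU").
        bundles.append({"CPU": 1, accelerator: 1})
    return bundles


# Example: two tensor-parallel workers on HPU.
print(select_pg_resources({"accelerator": "HPU"}, tensor_parallel_size=2))
# -> [{'CPU': 1}, {'CPU': 1, 'HPU': 1}, {'CPU': 1, 'HPU': 1}]
```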


## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: KepingYan <[email protected]>
Co-authored-by: akshay-anyscale <[email protected]>
KepingYan and akshay-anyscale authored Aug 19, 2024
1 parent eff4726 commit c46c2e5
Showing 2 changed files with 35 additions and 8 deletions.
22 changes: 18 additions & 4 deletions doc/source/serve/doc_code/vllm_openai_example.py
@@ -17,8 +17,9 @@
     ErrorResponse,
 )
 from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
-from vllm.entrypoints.openai.serving_engine import LoRAModulePath
+from vllm.entrypoints.openai.serving_engine import LoRAModulePath, PromptAdapterPath
 from vllm.utils import FlexibleArgumentParser
+from vllm.entrypoints.logger import RequestLogger

 logger = logging.getLogger("ray.serve")

@@ -40,13 +41,17 @@ def __init__(
         engine_args: AsyncEngineArgs,
         response_role: str,
         lora_modules: Optional[List[LoRAModulePath]] = None,
+        prompt_adapters: Optional[List[PromptAdapterPath]] = None,
+        request_logger: Optional[RequestLogger] = None,
         chat_template: Optional[str] = None,
     ):
         logger.info(f"Starting with engine args: {engine_args}")
         self.openai_serving_chat = None
         self.engine_args = engine_args
         self.response_role = response_role
         self.lora_modules = lora_modules
+        self.prompt_adapters = prompt_adapters
+        self.request_logger = request_logger
         self.chat_template = chat_template
         self.engine = AsyncLLMEngine.from_engine_args(engine_args)

@@ -71,8 +76,10 @@ async def create_chat_completion(
                model_config,
                served_model_names,
                self.response_role,
-               self.lora_modules,
-               self.chat_template,
+               lora_modules=self.lora_modules,
+               prompt_adapters=self.prompt_adapters,
+               request_logger=self.request_logger,
+               chat_template=self.chat_template,
            )
        logger.info(f"Request: {request}")
        generator = await self.openai_serving_chat.create_chat_completion(
@@ -116,6 +123,10 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
    Supported engine arguments: https://docs.vllm.ai/en/latest/models/engine_args.html.
    """ # noqa: E501
+   if "accelerator" in cli_args.keys():
+       accelerator = cli_args.pop("accelerator")
+   else:
+       accelerator = "GPU"
    parsed_args = parse_vllm_args(cli_args)
    engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
    engine_args.worker_use_ray = True
@@ -125,7 +136,7 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
    pg_resources = []
    pg_resources.append({"CPU": 1}) # for the deployment replica
    for i in range(tp):
-       pg_resources.append({"CPU": 1, "GPU": 1}) # for the vLLM actors
+       pg_resources.append({"CPU": 1, accelerator: 1}) # for the vLLM actors

    # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
    # the same Ray node.
@@ -135,6 +146,8 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
        engine_args,
        parsed_args.response_role,
        parsed_args.lora_modules,
+       parsed_args.prompt_adapters,
+       cli_args.get("request_logger"),
        parsed_args.chat_template,
    )

@@ -171,6 +184,7 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
        ],
        temperature=0.01,
        stream=True,
+       max_tokens=100,
    )

    for chat in chat_completion:
21 changes: 17 additions & 4 deletions doc/source/serve/tutorials/vllm-example.md
@@ -5,12 +5,19 @@ orphan: true
 (serve-vllm-tutorial)=

 # Serve a Large Language Model with vLLM
-This example runs a large language model with Ray Serve using [vLLM](https://docs.vllm.ai/en/latest/), a popular open-source library for serving LLMs. It uses the [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api), which easily integrates with other LLM tools. The example also sets up multi-GPU serving with Ray Serve using placement groups. For more advanced features like multi-lora support with serve multiplexing, JSON mode function calling and further performance improvements, try LLM deployment solutions on [Anyscale](https://www.anyscale.com/).
+This example runs a large language model with Ray Serve using [vLLM](https://docs.vllm.ai/en/latest/), a popular open-source library for serving LLMs. It uses the [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api), which easily integrates with other LLM tools. The example also sets up multi-GPU or multi-HPU serving with Ray Serve using placement groups. For more advanced features like multi-lora support with serve multiplexing, JSON mode function calling and further performance improvements, try LLM deployment solutions on [Anyscale](https://www.anyscale.com/).

 To run this example, install the following:

 ```bash
-pip install "ray[serve]" requests vllm
+pip install "ray[serve]" requests
 ```
+Install vLLM according to your device:
+```bash
+# on GPU
+pip install vllm
+# on HPU
+pip install -v git+https://github.com/HabanaAI/vllm-fork.git@habana_main
+```

 This example uses the [NousResearch/Meta-Llama-3-8B-Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct) model. Save the following code to a file named `llm.py`.
@@ -23,10 +30,16 @@ The Serve code is as follows:
 :end-before: __serve_example_end__
 ```

-Use `serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2` to start the Serve app.
+Use the following command to start the Serve app:
+```bash
+# on GPU
+serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2 accelerator="GPU"
+# on HPU
+serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2 accelerator="HPU"
+```

 :::{note}
-This example uses Tensor Parallel size of 2, which means Ray Serve deploys the model to Ray Actors across 2 GPUs using placement groups.
+This example uses a Tensor Parallel size of 2, which means Ray Serve deploys the model to Ray Actors across 2 GPUs or HPUs (based on the accelerator type) using placement groups.
 :::

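
For context, querying the deployed app is unchanged by this PR. Below is a minimal client sketch consistent with the streaming query example in the changed `vllm_openai_example.py`; the local base URL, the placeholder API key, and the prompt text are assumptions.

```python
from openai import OpenAI

# Assumes the Serve app from the tutorial is running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="NOT A REAL KEY")

chat_completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name three uses of Ray Serve."}],
    temperature=0.01,
    stream=True,
    max_tokens=100,  # matches the cap added to the query example in this PR
)

# Stream the response chunks as they arrive.
for chat in chat_completion:
    chunk = chat.choices[0].delta.content
    if chunk:
        print(chunk, end="")
```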
