[HPU] [Serve] [experimental] Add vllm HPU support in vllm example (#45893)


## Why are these changes needed?
This PR adds vLLM HPU support to the vLLM example (#45430). The added code checks whether an HPU device exists before allocating resources to the vLLM actors. If it does, HPU resources are used; otherwise, GPU resources are used as before.
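
For reference, here is a minimal sketch of the resource-selection pattern this change introduces in `build_app`; the standalone `select_pg_resources` helper and its signature are illustrative, but the accelerator defaulting and the placement-group bundle shapes follow the example code in the diff below.

```python
from typing import Dict, List


def select_pg_resources(
    cli_args: Dict[str, str], tensor_parallel_size: int
) -> List[Dict[str, int]]:
    """Build placement group bundles for the Serve replica and the vLLM workers."""
    # Read the accelerator type from the CLI args, defaulting to GPU
    # (this mirrors the `accelerator` handling added in build_app below).
    if "accelerator" in cli_args:
        accelerator = cli_args.pop("accelerator")
    else:
        accelerator = "GPU"

    bundles = [{"CPU": 1}]  # for the deployment replica
    for _ in range(tensor_parallel_size):
        # Each vLLM actor gets one CPU and one unit of the chosen
        # accelerator resource ("GPU" or "HPU").
        bundles.append({"CPU": 1, accelerator: 1})
    return bundles


# Example: two tensor-parallel workers on HPU.
print(select_pg_resources({"accelerator": "HPU"}, tensor_parallel_size=2))
# -> [{'CPU': 1}, {'CPU': 1, 'HPU': 1}, {'CPU': 1, 'HPU': 1}]
```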


## Related issue number


## Checks

- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: KepingYan <[email protected]>
Co-authored-by: akshay-anyscale <[email protected]>
KepingYan and akshay-anyscale authored Aug 19, 2024
1 parent eff4726 commit c46c2e5
Showing 2 changed files with 35 additions and 8 deletions.
22 changes: 18 additions & 4 deletions doc/source/serve/doc_code/vllm_openai_example.py
@@ -17,8 +17,9 @@
     ErrorResponse,
 )
 from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
-from vllm.entrypoints.openai.serving_engine import LoRAModulePath
+from vllm.entrypoints.openai.serving_engine import LoRAModulePath, PromptAdapterPath
 from vllm.utils import FlexibleArgumentParser
+from vllm.entrypoints.logger import RequestLogger

 logger = logging.getLogger("ray.serve")

@@ -40,13 +41,17 @@ def __init__(
         engine_args: AsyncEngineArgs,
         response_role: str,
         lora_modules: Optional[List[LoRAModulePath]] = None,
+        prompt_adapters: Optional[List[PromptAdapterPath]] = None,
+        request_logger: Optional[RequestLogger] = None,
         chat_template: Optional[str] = None,
     ):
         logger.info(f"Starting with engine args: {engine_args}")
         self.openai_serving_chat = None
         self.engine_args = engine_args
         self.response_role = response_role
         self.lora_modules = lora_modules
+        self.prompt_adapters = prompt_adapters
+        self.request_logger = request_logger
         self.chat_template = chat_template
         self.engine = AsyncLLMEngine.from_engine_args(engine_args)

@@ -71,8 +76,10 @@ async def create_chat_completion(
                model_config,
                served_model_names,
                self.response_role,
-               self.lora_modules,
-               self.chat_template,
+               lora_modules=self.lora_modules,
+               prompt_adapters=self.prompt_adapters,
+               request_logger=self.request_logger,
+               chat_template=self.chat_template,
            )
        logger.info(f"Request: {request}")
        generator = await self.openai_serving_chat.create_chat_completion(
@@ -116,6 +123,10 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
    Supported engine arguments: https://docs.vllm.ai/en/latest/models/engine_args.html.
    """ # noqa: E501
+   if "accelerator" in cli_args.keys():
+       accelerator = cli_args.pop("accelerator")
+   else:
+       accelerator = "GPU"
    parsed_args = parse_vllm_args(cli_args)
    engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
    engine_args.worker_use_ray = True
@@ -125,7 +136,7 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
    pg_resources = []
    pg_resources.append({"CPU": 1}) # for the deployment replica
    for i in range(tp):
-       pg_resources.append({"CPU": 1, "GPU": 1}) # for the vLLM actors
+       pg_resources.append({"CPU": 1, accelerator: 1}) # for the vLLM actors

    # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
    # the same Ray node.
@@ -135,6 +146,8 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
        engine_args,
        parsed_args.response_role,
        parsed_args.lora_modules,
+       parsed_args.prompt_adapters,
+       cli_args.get("request_logger"),
        parsed_args.chat_template,
    )

@@ -171,6 +184,7 @@ def build_app(cli_args: Dict[str, str]) -> serve.Application:
        ],
        temperature=0.01,
        stream=True,
+       max_tokens=100,
    )

    for chat in chat_completion:
21 changes: 17 additions & 4 deletions doc/source/serve/tutorials/vllm-example.md
@@ -5,12 +5,19 @@ orphan: true
 (serve-vllm-tutorial)=

 # Serve a Large Language Model with vLLM
-This example runs a large language model with Ray Serve using [vLLM](https://docs.vllm.ai/en/latest/), a popular open-source library for serving LLMs. It uses the [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api), which easily integrates with other LLM tools. The example also sets up multi-GPU serving with Ray Serve using placement groups. For more advanced features like multi-lora support with serve multiplexing, JSON mode function calling and further performance improvements, try LLM deployment solutions on [Anyscale](https://www.anyscale.com/).
+This example runs a large language model with Ray Serve using [vLLM](https://docs.vllm.ai/en/latest/), a popular open-source library for serving LLMs. It uses the [OpenAI Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api), which easily integrates with other LLM tools. The example also sets up multi-GPU or multi-HPU serving with Ray Serve using placement groups. For more advanced features like multi-lora support with serve multiplexing, JSON mode function calling and further performance improvements, try LLM deployment solutions on [Anyscale](https://www.anyscale.com/).

 To run this example, install the following:

 ```bash
-pip install "ray[serve]" requests vllm
+pip install "ray[serve]" requests
 ```
+Install vLLM according to your device:
+```bash
+# on GPU
+pip install vllm
+# on HPU
+pip install -v git+https://github.com/HabanaAI/vllm-fork.git@habana_main
+```

 This example uses the [NousResearch/Meta-Llama-3-8B-Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct) model. Save the following code to a file named `llm.py`.
@@ -23,10 +30,16 @@ The Serve code is as follows:
 :end-before: __serve_example_end__
 ```

-Use `serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2` to start the Serve app.
+Use the following command to start the Serve app:
+```bash
+# on GPU
+serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2 accelerator="GPU"
+# on HPU
+serve run llm:build_app model="NousResearch/Meta-Llama-3-8B-Instruct" tensor-parallel-size=2 accelerator="HPU"
+```

 :::{note}
-This example uses Tensor Parallel size of 2, which means Ray Serve deploys the model to Ray Actors across 2 GPUs using placement groups.
+This example uses a Tensor Parallel size of 2, which means Ray Serve deploys the model to Ray Actors across 2 GPUs or HPUs (based on the accelerator type) using placement groups.
 :::

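
For context, querying the deployed app is unchanged by this PR. Below is a minimal client sketch consistent with the streaming query example in the changed `vllm_openai_example.py`; the local base URL, the placeholder API key, and the prompt text are assumptions.

```python
from openai import OpenAI

# Assumes the Serve app from the tutorial is running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="NOT A REAL KEY")

chat_completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Name three uses of Ray Serve."}],
    temperature=0.01,
    stream=True,
    max_tokens=100,  # matches the cap added to the query example in this PR
)

# Stream the response chunks as they arrive.
for chat in chat_completion:
    chunk = chat.choices[0].delta.content
    if chunk:
        print(chunk, end="")
```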
