
[Docs][KubeRay] add a guide for deploying vLLM with RayService #47038

Merged
merged 1 commit into ray-project:master on Sep 11, 2024

Conversation

andrewsykim (Contributor):

Why are these changes needed?

Based on sample in ray-project/kuberay#2289

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@andrewsykim (Contributor, Author):

@angelinalg @kevin85421 this is ready for review now

@kevin85421 self-assigned this on Aug 14, 2024
@kevin85421 added the kuberay label (Issues for the Ray/Kuberay integration that are tracked on the Ray side) on Sep 3, 2024
@andrewsykim (Contributor, Author):

@angelinalg can you review please?

@angelinalg (Contributor) left a comment:

Just some style nits. Please consider using Vale to find these issues in the future. Please excuse any inaccuracies I introduced in my suggestions and correct as needed. Happy to answer any questions you have about the suggestions. Thanks for your contribution!


## Prerequisites

This example downloads model weights from Hugging Face. You will need to complete the following
Contributor:

Suggested change:
- This example downloads model weights from Hugging Face. You will need to complete the following
+ This example downloads model weights from Hugging Face. You need to complete the following

Contributor Author:

done

prerequisites to successfully complete this guide:
* A [Hugging Face account](https://huggingface.co/)
* A Hugging Face [access token](https://huggingface.co/docs/hub/security-tokens) with read access to gated repos.
* Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
Contributor:

Suggested change:
- * Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
+ * Access to the Llama 3 8B model. Getting access usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.

Contributor Author:

done
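
For context on the prerequisites above: the secret-creation command later in this guide references ${HF_TOKEN}, so a minimal setup sketch is to export the access token into the shell first. The placeholder value is hypothetical:

```
# Export the Hugging Face access token so later commands can reference ${HF_TOKEN}.
export HF_TOKEN=<your-hugging-face-access-token>
```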

--accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```

This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
Contributor:

Suggested change:
- This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
+ This example uses L4 GPUs. Each model replica uses 2 L4 GPUs using vLLM's tensor parallelism.

Contributor Author:

done
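
For context, the --accelerator flag quoted above is only the tail of a GKE command. A minimal sketch of a full node-pool creation it could belong to, assuming GKE; the cluster name, pool name, zone, and machine type are assumptions, not taken from the guide:

```
# Sketch: a GKE node pool whose nodes each expose 2 L4 GPUs (names and zone are hypothetical).
gcloud container node-pools create gpu-pool \
  --cluster=my-ray-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-24 \
  --num-nodes=1 \
  --accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```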

kubectl create secret generic hf-secret --from-literal=hf_api_token=${HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
```

This secret will be referenced as an environment variable in the RayCluster used in the next steps.
Contributor:

Suggested change:
- This secret will be referenced as an environment variable in the RayCluster used in the next steps.
+ This guide references this secret as an environment variable in the RayCluster in the next steps.

Contributor Author:

done
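
For context, a sketch of how a container in the RayCluster spec could consume this secret. The secret name hf-secret and key hf_api_token come from the command above; the environment variable name HUGGING_FACE_HUB_TOKEN is an assumption (it is one of the variables the Hugging Face client libraries read), not quoted from the sample manifest:

```
# Sketch: exposing hf-secret to a container in the RayCluster pod template.
env:
  - name: HUGGING_FACE_HUB_TOKEN  # assumed variable name, not taken from the sample
    valueFrom:
      secretKeyRef:
        name: hf-secret      # matches the secret created above
        key: hf_api_token    # matches the --from-literal key above
```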

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml
```

The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
Contributor:

Suggested change:
- The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
+ This step configures RayService to deploy a Ray Serve app, running vLLM as the serving engine for the Llama 3 8B Instruct model. You can find the code for this example [on GitHub](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).

Contributor Author:

done
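
For context, after applying the manifest you can verify that the resources exist. The RayService name llama-3-8b is an assumption inferred from the Service name llama-3-8b-serve-svc used later in this guide:

```
# Sketch: confirm the RayService and its pods were created (resource name assumed).
kubectl get rayservice llama-3-8b
kubectl get pods  # head and worker pods appear while the cluster starts up
```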

TENSOR_PARALLELISM: "2"
```

Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
Contributor:

Suggested change:
- Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
+ Wait for the RayService resource to be ready. You can inspect its status by running the following command:

Contributor Author:

done
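
The status command itself isn't quoted in this hunk; one plausible sketch, with the resource name assumed as above:

```
# Sketch: inspect the RayService status; wait until the Serve application reports a healthy state.
kubectl get rayservice llama-3-8b -o yaml
```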


## Send a prompt

Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
Contributor:

Suggested change:
- Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
+ Confirm the Ray Serve deployment is healthy, then you can establish a port-forwarding session for the Serve app:

Contributor Author:

done

$ kubectl port-forward svc/llama-3-8b-serve-svc 8000
```

Note that this Kubernetes Service will be created after the Serve applications are ready and running.
Contributor:

Suggested change:
- Note that this Kubernetes Service will be created after the Serve applications are ready and running.
+ Note that KubeRay creates this Kubernetes Service after the Serve apps are ready and running.

Contributor Author:

done
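
Rounding out the "Send a prompt" section: a hedged example of a request that could follow the port-forward. The /v1/chat/completions path and model ID are assumptions based on vLLM's OpenAI-compatible API; they aren't quoted in this diff:

```
# Sketch: send a chat prompt through the port-forwarded Serve app (endpoint and model assumed).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Briefly describe the Ray project."}]
      }'
```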

@andrewsykim (Contributor, Author):

Thanks for the review @angelinalg, addressed your feedback!

@kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Sep 10, 2024
@jjyao merged commit b6ca703 into ray-project:master on Sep 11, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Labels: go (add ONLY when ready to merge, run all tests), kuberay (Issues for the Ray/Kuberay integration that are tracked on the Ray side)
4 participants