[Docs][KubeRay] add a guide for deploying vLLM with RayService #47038
Conversation
Force-pushed from fdcf9c4 to 10d4985
@angelinalg @kevin85421 this is ready for review now
Force-pushed from 10d4985 to bf42e34
Force-pushed from bf42e34 to 07dd54d
@angelinalg can you review please?
Just some style nits. Please consider using Vale to find these issues in the future. Please excuse any inaccuracies I introduced in my suggestions and correct as needed. Happy to answer any questions you have about the suggestions. Thanks for your contribution!
## Prerequisites

This example downloads model weights from Hugging Face. You will need to complete the following

Suggested change:
- This example downloads model weights from Hugging Face. You will need to complete the following
+ This example downloads model weights from Hugging Face. You need to complete the following
done
prerequisites to successfully complete this guide:
* A [Hugging Face account](https://huggingface.co/)
* A Hugging Face [access token](https://huggingface.co/docs/hub/security-tokens) with read access to gated repos.
* Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
Suggested change:
- * Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
+ * Access to the Llama 3 8B model. Getting access usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
done
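A minimal sketch of preparing the token locally before the secret-creation step later in this conversation; the token value is a placeholder, and the `huggingface-cli` check is optional (it assumes `huggingface_hub` is installed and that a recent version reads the token from the environment):

```bash
# Placeholder value; use a read-scoped access token from your Hugging Face account settings.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Optional sanity check that the token is valid (requires `pip install huggingface_hub`).
huggingface-cli whoami
```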
```
--accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```

This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
Suggested change:
- This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
+ This example uses L4 GPUs. Each model replica uses 2 L4 GPUs using vLLM's tensor parallelism.
done
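For context, the `--accelerator` flag quoted above is the tail of a GKE cluster-creation command. A hedged sketch of what the full command might look like; the cluster name, zone, machine type, and node count are illustrative assumptions, not values from the PR:

```bash
# Hypothetical GKE cluster with L4 GPUs; adjust the name, zone, machine type, and node count for your project.
gcloud container clusters create vllm-rayservice-demo \
  --zone=us-central1-a \
  --machine-type=g2-standard-24 \
  --num-nodes=1 \
  --accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```

G2 machine types come with L4 GPUs attached, which lines up with the two GPUs per model replica that vLLM's tensor parallelism uses in this guide.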
```
kubectl create secret generic hf-secret --from-literal=hf_api_token=${HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
```

This secret will be referenced as an environment variable in the RayCluster used in the next steps.
Suggested change:
- This secret will be referenced as an environment variable in the RayCluster used in the next steps.
+ This guide references this secret as an environment variable in the RayCluster in the next steps.
done
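If you want to verify the secret before wiring it into the RayCluster, a quick check (only the `hf-secret` name from the command above is assumed):

```bash
# Confirm the secret exists and exposes the hf_api_token key; the value is shown base64-encoded.
kubectl get secret hf-secret -o yaml
```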
```
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml
```

The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
Suggested change:
- The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
+ This step configures RayService to deploy a Ray Serve app, running vLLM as the serving engine for the Llama 3 8B Instruct model. You can find the code for this example [on GitHub](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
done
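Optionally, you can look at the manifest before applying it; this only reuses the URL from the command above:

```bash
# Download and inspect the sample RayService manifest before applying it.
curl -sL https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml | less
```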
```
TENSOR_PARALLELISM: "2"
```

Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
Suggested change:
- Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
+ Wait for the RayService resource to be ready. You can inspect its status by running the following command:
done
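A rough way to watch for readiness; the RayService name `llama-3-8b` is inferred from the `llama-3-8b-serve-svc` Service referenced later in the guide, so adjust it if the manifest uses a different name:

```bash
# List RayService resources and watch the pods; pulling the vLLM image and
# downloading the model weights can take several minutes.
kubectl get rayservice
kubectl describe rayservice llama-3-8b
kubectl get pods -w
```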
## Send a prompt

Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
Suggested change:
- Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
+ Confirm the Ray Serve deployment is healthy, then you can establish a port-forwarding session for the Serve app:
done
```
$ kubectl port-forward svc/llama-3-8b-serve-svc 8000
```

Note that this Kubernetes Service will be created after the Serve applications are ready and running.
Suggested change:
- Note that this Kubernetes Service will be created after the Serve applications are ready and running.
+ Note that KubeRay creates this Kubernetes Service after the Serve apps are ready and running.
done
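With the port-forward in place, a sketch of sending a prompt; the OpenAI-compatible chat completions route and the Llama 3 8B Instruct model ID are assumptions based on typical vLLM serving setups, so check `serve.py` for the exact route and model name:

```bash
# Send a chat completion request through the forwarded port (localhost:8000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
          {"role": "user", "content": "Describe the Ray open-source project in one sentence."}
        ],
        "temperature": 0.7
      }'
```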
Signed-off-by: Andrew Sy Kim <[email protected]>
Force-pushed from 07dd54d to f09d4a6
Thanks for the review @angelinalg, addressed your feedback!
…roject#47038) Signed-off-by: Andrew Sy Kim <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
Based on sample in ray-project/kuberay#2289
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.