
[Docs][KubeRay] add a guide for deploying vLLM with RayService #47038

Merged
merged 1 commit into ray-project:master on Sep 11, 2024

Conversation

andrewsykim (Contributor):

Why are these changes needed?

Based on sample in ray-project/kuberay#2289

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@andrewsykim (Contributor, Author):

@angelinalg @kevin85421 this is ready for review now

@kevin85421 self-assigned this on Aug 14, 2024
@kevin85421 added the kuberay label (Issues for the Ray/Kuberay integration that are tracked on the Ray side) on Sep 3, 2024
@andrewsykim (Contributor, Author):

@angelinalg can you review please?

@angelinalg (Contributor) left a comment:

Just some style nits. Please consider using Vale to find these issues in the future. Please excuse any inaccuracies I introduced in my suggestions and correct as needed. Happy to answer any questions you have about the suggestions. Thanks for your contribution!


## Prerequisites

This example downloads model weights from Hugging Face. You will need to complete the following
Contributor:

Suggested change:
- This example downloads model weights from Hugging Face. You will need to complete the following
+ This example downloads model weights from Hugging Face. You need to complete the following

Contributor Author:

done

prerequisites to successfully complete this guide:
* A [Hugging Face account](https://huggingface.co/)
* A Hugging Face [access token](https://huggingface.co/docs/hub/security-tokens) with read access to gated repos.
* Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
Contributor:

Suggested change:
- * Access to the Llama 3 8B model. This usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.
+ * Access to the Llama 3 8B model. Getting access usually requires signing an agreement on Hugging Face to access this model. Go to the [Llama 3 model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B) for more details.

Contributor Author:

done
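
For context on the prerequisites above: the secret-creation command later in this guide references ${HF_TOKEN}, so a minimal setup sketch is to export the access token into the shell first. The placeholder value is hypothetical:

```
# Export the Hugging Face access token so later commands can reference ${HF_TOKEN}.
export HF_TOKEN=<your-hugging-face-access-token>
```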

--accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```

This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
Contributor:

Suggested change:
- This example uses L4 GPUs. Each model replica will use 2 L4 GPUs using vLLM's tensor parallelism.
+ This example uses L4 GPUs. Each model replica uses 2 L4 GPUs using vLLM's tensor parallelism.

Contributor Author:

done
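
For context, the --accelerator flag quoted above is only the tail of a GKE command. A minimal sketch of a full node-pool creation it could belong to, assuming GKE; the cluster name, pool name, zone, and machine type are assumptions, not taken from the guide:

```
# Sketch: a GKE node pool whose nodes each expose 2 L4 GPUs (names and zone are hypothetical).
gcloud container node-pools create gpu-pool \
  --cluster=my-ray-cluster \
  --zone=us-central1-a \
  --machine-type=g2-standard-24 \
  --num-nodes=1 \
  --accelerator=type=nvidia-l4,count=2,gpu-driver-version=latest
```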

kubectl create secret generic hf-secret --from-literal=hf_api_token=${HF_TOKEN} --dry-run=client -o yaml | kubectl apply -f -
```

This secret will be referenced as an environment variable in the RayCluster used in the next steps.
Contributor:

Suggested change:
- This secret will be referenced as an environment variable in the RayCluster used in the next steps.
+ This guide references this secret as an environment variable in the RayCluster in the next steps.

Contributor Author:

done
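
For context, a sketch of how a container in the RayCluster spec could consume this secret. The secret name hf-secret and key hf_api_token come from the command above; the environment variable name HUGGING_FACE_HUB_TOKEN is an assumption (it is one of the variables the Hugging Face client libraries read), not quoted from the sample manifest:

```
# Sketch: exposing hf-secret to a container in the RayCluster pod template.
env:
  - name: HUGGING_FACE_HUB_TOKEN  # assumed variable name, not taken from the sample
    valueFrom:
      secretKeyRef:
        name: hf-secret      # matches the secret created above
        key: hf_api_token    # matches the --from-literal key above
```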

kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/vllm/ray-service.vllm.yaml
```

The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
Contributor:

Suggested change:
- The RayService is configured to deploy a Ray Serve application, running vLLM as the serving engine for the Llama 3 8B Instruct model. The code used in this example can be found [here](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).
+ This step configures RayService to deploy a Ray Serve app, running vLLM as the serving engine for the Llama 3 8B Instruct model. You can find the code for this example [on GitHub](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/vllm/serve.py).

Contributor Author:

done
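
For context, after applying the manifest you can verify that the resources exist. The RayService name llama-3-8b is an assumption inferred from the Service name llama-3-8b-serve-svc used later in this guide:

```
# Sketch: confirm the RayService and its pods were created (resource name assumed).
kubectl get rayservice llama-3-8b
kubectl get pods  # head and worker pods appear while the cluster starts up
```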

TENSOR_PARALLELISM: "2"
```

Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
Contributor:

Suggested change:
- Wait for the RayService resource to be ready. You can inspect it's status by running the following command:
+ Wait for the RayService resource to be ready. You can inspect its status by running the following command:

Contributor Author:

done
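
The status command itself isn't quoted in this hunk; one plausible sketch, with the resource name assumed as above:

```
# Sketch: inspect the RayService status; wait until the Serve application reports a healthy state.
kubectl get rayservice llama-3-8b -o yaml
```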


## Send a prompt

Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
Contributor:

Suggested change:
- Once you've confirmed the Ray Serve deployment is healthy, you can establish a port-forwarding session for the Serve application:
+ Confirm the Ray Serve deployment is healthy, then you can establish a port-forwarding session for the Serve app:

Contributor Author:

done

$ kubectl port-forward svc/llama-3-8b-serve-svc 8000
```

Note that this Kubernetes Service will be created after the Serve applications are ready and running.
Contributor:

Suggested change:
- Note that this Kubernetes Service will be created after the Serve applications are ready and running.
+ Note that KubeRay creates this Kubernetes Service after the Serve apps are ready and running.

Contributor Author:

done
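
Rounding out the "Send a prompt" section: a hedged example of a request that could follow the port-forward. The /v1/chat/completions path and model ID are assumptions based on vLLM's OpenAI-compatible API; they aren't quoted in this diff:

```
# Sketch: send a chat prompt through the port-forwarded Serve app (endpoint and model assumed).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Briefly describe the Ray project."}]
      }'
```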

@andrewsykim (Contributor, Author):

Thanks for the review @angelinalg, addressed your feedback!

@kevin85421 added the go label (add ONLY when ready to merge, run all tests) on Sep 10, 2024
@jjyao merged commit b6ca703 into ray-project:master on Sep 11, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
Labels: go (add ONLY when ready to merge, run all tests), kuberay (Issues for the Ray/Kuberay integration that are tracked on the Ray side)
4 participants