[Enhancement] GPU RayCluster doesn't work on GKE Autopilot #1470

kevin85421 · 2023-10-06T21:46:35Z

Why are these changes needed?

GKE Autopilot does not support init containers that request GPUs, so we explicitly specify the init container's resources instead of reusing the Ray container's resources. The init container's resource consumption should be constant, so it is OK to hard-code the values.
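As a rough illustration of the idea (not the actual code in `pod.go`), the sketch below contrasts a GPU worker's requests with fixed init container requests. `ResourceList` here is a simplified, hypothetical stand-in for `v1.ResourceList` so the example needs no Kubernetes dependencies, and the Ray container values are made up. Only the 200m CPU and 256Mi memory requests come from this PR's diff.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for v1.ResourceList
// (resource name -> quantity string).
type ResourceList map[string]string

// initContainerRequests returns fixed requests for the init container.
// The values mirror the diff in this PR (200m CPU, 256Mi memory). No GPU
// resource is requested, regardless of what the Ray container asks for.
func initContainerRequests() ResourceList {
	return ResourceList{
		"cpu":    "200m",
		"memory": "256Mi",
	}
}

func main() {
	// Hypothetical requests of a GPU worker's Ray container (assumed values).
	rayRequests := ResourceList{"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"}

	initRequests := initContainerRequests()

	_, rayHasGPU := rayRequests["nvidia.com/gpu"]
	_, initHasGPU := initRequests["nvidia.com/gpu"]
	fmt.Println("ray container requests GPU:", rayHasGPU)
	fmt.Println("init container requests GPU:", initHasGPU)
}
```

Reusing `rayRequests` for the init container would carry the `nvidia.com/gpu` entry along with it, which is the case Autopilot rejects. Returning a fresh, fixed list avoids that.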

Related issue number

Closes #1349

Checks

- I've made sure the tests are passing.
- Testing Strategy
  - Unit tests
  - Manual tests
  - This PR is not tested :(

# Step 1: Create a GKE Autopilot cluster
gcloud container clusters create-auto kuberay-gpu-cluster --region=us-west1

# Step 2: Install a KubeRay operator with this PR
# Step 3: Install a RayCluster https://gist.github.com/kevin85421/ef4c3a0c7493e76edd3d2cab5cea33e4
# (Note that you need to add nodeSelector)
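The note about `nodeSelector` does not say which selector to use. As a sketch only: on GKE, GPU nodes are typically selected with the `cloud.google.com/gke-accelerator` label, so the helper below assumes that key. The function name `withGPUNodeSelector` and the accelerator value are hypothetical and should be adapted to the GPU type you actually request.

```go
package main

import "fmt"

// withGPUNodeSelector returns a copy of selector with the GKE accelerator
// label added. The label key is an assumption based on GKE conventions and
// the accelerator value is supplied by the caller.
func withGPUNodeSelector(selector map[string]string, accelerator string) map[string]string {
	out := make(map[string]string, len(selector)+1)
	for k, v := range selector {
		out[k] = v
	}
	out["cloud.google.com/gke-accelerator"] = accelerator
	return out
}

func main() {
	// "nvidia-tesla-t4" is an example value, not something this PR specifies.
	sel := withGPUNodeSelector(nil, "nvidia-tesla-t4")
	fmt.Println(sel)
}
```

In the RayCluster manifest, the equivalent would be a `nodeSelector` entry on the GPU worker group's pod template.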

architkulkarni

Nice fix! Once this makes it into a KubeRay release, we can probably simplify some of the GKE instructions in our docs and just recommend that users use GKE Autopilot going forward (depending on how stable we believe GKE Autopilot is).

architkulkarni · 2023-10-06T22:15:05Z

ray-operator/controllers/ray/common/pod.go

+				},
+				Requests: v1.ResourceList{
+					v1.ResourceCPU:    resource.MustParse("200m"),
+					v1.ResourceMemory: resource.MustParse("256Mi"),


A comment here indicating why these numbers were chosen would be useful, perhaps with a brief reminder of what the init container does.

Updated 9039f24

update

12ea488

kevin85421 marked this pull request as ready for review October 6, 2023 22:10

kevin85421 requested a review from architkulkarni October 6, 2023 22:11

architkulkarni approved these changes Oct 6, 2023

architkulkarni reviewed Oct 6, 2023

update

9039f24

kevin85421 merged commit 0a56cd4 into ray-project:master Oct 6, 2023
23 checks passed

kevin85421 mentioned this pull request Oct 16, 2023

[Bug] Ray gpu worker cannot start up success #1414

Closed

2 tasks

kevin85421 added a commit to kevin85421/kuberay that referenced this pull request Oct 17, 2023

[Enhancement] GPU RayCluster doesn't work on GKE Autopilot (ray-proje…

24af077

…ct#1470) [Enhancement] GPU RayCluster doesn't work on GKE Autopilot

kevin85421 added a commit that referenced this pull request Oct 17, 2023

[Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470) (#…

0285b5e

…1520)

kevin85421 mentioned this pull request Oct 24, 2023

[Bug] GKE autopilot cluster - Have you run 'ray start' on this node? #1563

Closed

2 tasks
