
[autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider #274

Merged

Conversation

@DmitriGekhtman DmitriGekhtman commented May 24, 2022

Why are these changes needed?

Upstreams recent autoscaler changes from the Ray repo.

  • The autoscaler now runs in the Ray image, using the new entrypoint `ray kuberay-autoscaler`.

  • Redis passwords are removed from the autoscaler configuration, since they are not relevant for Ray >= 1.11.0 and autoscaling is not supported with older Ray versions.

  • The autoscaler and Ray container need to share a log volume so that Ray drivers can receive autoscaling events. This PR has the operator configure the relevant volume mounts when autoscaling is enabled. Existing code for configuring volumeMounts is generalized for this purpose.

  • Some unit test logic around volume mounts is added.

  • Documentation is updated.

    • Deployment steps for setting up autoscaling clusters now include a kustomize overlay that sets the "prioritizeWorkersToDelete" feature flag on the operator.
    • Commands involving CRD creation are updated from "apply" to "create", due to large CRD size ([Bug] Issues with RayCluster CRD and kubectl apply #271).
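The shared log volume described above can be pictured roughly as follows. This is a hand-written sketch, not the operator's actual output: the container names, image tag, and the `/tmp/ray` mount path are assumptions.

```yaml
# Both containers mount the same emptyDir, so autoscaling event logs
# written by the autoscaler sidecar are visible to Ray drivers.
spec:
  containers:
    - name: ray-head
      image: rayproject/ray:1.12.1
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
    - name: autoscaler
      image: rayproject/ray:1.12.1
      command: ["ray", "kuberay-autoscaler"]
      volumeMounts:
        - name: ray-logs
          mountPath: /tmp/ray
  volumes:
    - name: ray-logs
      emptyDir: {}
```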

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests: Deployed with the updated instructions, checked that autoscaling works.

@@ -8,7 +8,7 @@ metadata:
# A unique identifier for the head node and workers of this cluster.
name: raycluster-autoscaler
spec:
- rayVersion: 'nightly'
+ rayVersion: '1.12.1'
Collaborator Author

The latest Ray release is compatible with the pinned autoscaler image.

- func addEmptyDir(container *v1.Container, pod *v1.Pod) {
-     if checkIfVolumeMounted(container, pod) {
+ func addEmptyDir(container *v1.Container, pod *v1.Pod, volumeName string, volumeMountPath string, storageMedium v1.StorageMedium) {
+     if checkIfVolumeMounted(container, pod, volumeMountPath) {
Collaborator Author

Generalized for purposes of adding ray-log volume.

container.VolumeMounts = append(container.VolumeMounts, mountedVolume)
}
}

func checkIfVolumeMounted(container *v1.Container, pod *v1.Pod) bool {
//Format an emptyDir volume.
//When the storage medium is memory, set the size limit based on container resources.
Collaborator Author

The size limit calculation is kept as it was previously.

@@ -47,7 +49,7 @@ var instance = rayiov1alpha1.RayCluster{
Containers: []v1.Container{
{
Name: "ray-head",
Image: "rayproject/autoscaler",
Collaborator Author

rayproject/autoscaler is deprecated

},
Limits: v1.ResourceList{
v1.ResourceCPU: resource.MustParse("1"),
v1.ResourceMemory: testMemoryLimit,
Collaborator Author

Added to test /dev/shm volume size limit.

@DmitriGekhtman DmitriGekhtman marked this pull request as ready for review May 24, 2022 18:16

DmitriGekhtman commented May 27, 2022

Though actually, right now this wouldn't work: there's a bug with the autoscaler container entrypoint that prevents logs from being written to the right place -- I'm working on fixing that.

That bug fix is here.


> Note: For compatibility with the Ray autoscaler, the KubeRay Operator's entrypoint
> must include the flag `--prioritize-workers-to-delete`. The kustomization overlay
Collaborator

We should make sure to have a plan to remove this flag and make it the default behavior.

@pcmoritz

Small consistency ask before we merge: Can we always have a space after // in comments to be consistent?

@DmitriGekhtman

> Small consistency ask before we merge: Can we always have a space after // in comments to be consistent?

I think I got all of the ones from this PR.
This is sad: golang/go#30540

@@ -259,14 +359,6 @@ func TestDefaultHeadPodTemplate_WithAutoscalingEnabled(t *testing.T) {
}
}

func TestBuildAutoscalerContainer(t *testing.T) {
Collaborator Author

Logic moved into TestBuildPod_with_autoscaler_enabled.

@pcmoritz

Thanks for updating the comments :)

}

// not found, use second container
// (This branch shouldn't be accessed -- the autoscaler container should be present.)
Collaborator

Should we log an error or warning here, or possibly even exit the program if this happens?

Collaborator Author

If this were a Python program, I'd add an assertion.

The most correct thing would probably be to bubble an error a couple layers up the call stack, but that's a bit too much work for something that's not logically possible.

I will add a log statement.

Collaborator Author

Or panic, maybe.

Collaborator

yeah, why don't we add a warning log statement both here and in the function above that we use the first or second container respectively :)

Collaborator Author

Panic for autoscaler container, since it's supposed to be logically impossible.
Info statements for Ray container. IMO, using the first container should be the standard (also, that's the pattern suggested by the sample configs.)

}

// This should be unreachable.
panic("Autoscaler container not found!")
@DmitriGekhtman DmitriGekhtman May 28, 2022

I think it's standard to panic in a code branch that's not supposed to be accessible.

return i
}
}
}
// not found, use first container
log.Info("Head pod container with index 0 identified as Ray container.")
@DmitriGekhtman DmitriGekhtman May 28, 2022

I think it would make sense to document the requirement that the Ray container goes first.
To me, it seems odd to identify a container by an env variable -- also, none of the kuberay sample configs I've seen do that.

@pcmoritz

SGTM!

@DmitriGekhtman DmitriGekhtman merged commit eac60b3 into ray-project:master May 28, 2022
@DmitriGekhtman DmitriGekhtman deleted the dmitri/update-autoscaler-image branch May 29, 2022 01:32
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…mprovements to Kuberay NodeProvider (ray-project#274)

Upstreams recent autoscaler changes from the Ray repo.
4 participants