
[autoscaler] Better defaults and config options #414

Merged

Conversation

DmitriGekhtman (Collaborator) commented Jul 25, 2022

Consider reviewing the Ray PR ray-project/ray#26985 related to autoscaling defaults first.

Why are these changes needed?

This PR wraps up autoscaler-related work for the Ray 2.0.0 release. Here's a summary of the changes:

Autoscaler container config specification.

Exposes Env and EnvFrom fields for the autoscaler container.

Autoscaler default resources.

Sets the autoscaler's default resource requests and limits equal to each other, at 500m CPU and 512Mi memory.
See #417, referenced in the review comments below, for related discussion of these values.
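
For reference, here is a minimal sketch (not code from this PR) of what those defaults amount to when written out with the Kubernetes core/v1 API; the variable name is illustrative:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Requests and limits are set equal to each other, per the new defaults.
	autoscalerDefaults := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
	}
	fmt.Println(autoscalerDefaults.Requests.Cpu(), autoscalerDefaults.Limits.Memory())
}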

Aggressive scaling.

Makes the default scaling behavior more aggressive (#359).
Upscaling modes are now "Conservative", "Default", and "Aggressive", with Default and Aggressive being aliases of one another
(so upscaling is no longer rate-limited by default). A brief sketch of selecting the mode through the API follows below.

Completing this item will require merging the Ray PR ray-project/ray#26985.
In fact, it might make sense to merge that one first.
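
As a hedged illustration of selecting the upscaling mode through the CRD's Go API (not code from this PR): this assumes the AutoscalerOptions type exposes an UpscalingMode field taking these string values; the field name and import path are assumptions and may differ across KubeRay versions.

package main

import (
	"fmt"

	// Assumed import path for the v1alpha1 API; adjust to your KubeRay version.
	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/raycluster/v1alpha1"
)

func main() {
	// "Aggressive" is now an alias of "Default": upscaling is not rate-limited.
	mode := rayv1alpha1.UpscalingMode("Aggressive")
	options := rayv1alpha1.AutoscalerOptions{UpscalingMode: &mode}
	fmt.Println(*options.UpscalingMode)
}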

Revert volume mount change

This PR reverts the change to volume mounts from #391.
It turns out this change can cause the Ray head container to crash-loop with a file-permissions error.
Having the autoscaler container create the log file (ray-project/ray#26748) when the Ray container hasn't already done so is sufficient.

We also remove a redundant volume mount from raycluster-autoscaler.yaml, as it is already configured by the Ray operator.

Note

Post-approval, please leave merging to me -- I might need to run some additional manual tests.

Related issue number

Closes #358
Closes #359

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Dmitri Gekhtman <[email protected]>
sriram-anyscale (Collaborator) left a comment

I don't feel qualified to approve this - but LGTM

- // Resources specifies resource requests and limits for the autoscaler container.
- // Default values: 256m CPU request, 512m CPU limit, 256Mi memory request, 512Mi memory limit.
+ // Resources specifies optional resource request and limit overrides for the autoscaler container.
+ // Default values: 500m CPU request and limit. 512Mi memory request and limit.
Collaborator:

Is this per node? This is insanely low, typical Ray nodes should be 8-32GiB in size, or more than 10-40x this value. Ray is not designed to operate with tiny nodes.

DmitriGekhtman (Author):

These are defaults for the autoscaler container, not for a Ray node.

DmitriGekhtman (Author):

Important related discussion:
#417

// Optional list of environment variables to set in the autoscaler container.
Env []v1.EnvVar `json:"env,omitempty"`
// Optional list of sources to populate environment variables in the autoscaler container.
EnvFrom []v1.EnvFromSource `json:"envFrom,omitempty"`
Collaborator:

Any real-world examples of overriding env variables for the autoscaler image?

Collaborator:

BTW, the overall change looks good to me; I'd just like to know when we should use these fields, e.g. for autoscaling config tuning.

DmitriGekhtman (Author):

It's a good question:

There are debug flags for things like the following (a brief sketch follows the list):

  • exposing detailed resource messages from the GCS
  • modifying the autoscaler update interval
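
As a hedged illustration (not code from this PR), the new Env and EnvFrom fields could be populated roughly as follows. The variable name AUTOSCALER_UPDATE_INTERVAL_S comes from Ray's autoscaler constants, and the ConfigMap name autoscaler-debug-config is hypothetical.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative only: shorten the autoscaler's update interval and pull
	// additional debug settings from a (hypothetical) ConfigMap.
	env := []corev1.EnvVar{
		{Name: "AUTOSCALER_UPDATE_INTERVAL_S", Value: "1"},
	}
	envFrom := []corev1.EnvFromSource{
		{ConfigMapRef: &corev1.ConfigMapEnvSource{
			LocalObjectReference: corev1.LocalObjectReference{Name: "autoscaler-debug-config"},
		}},
	}
	fmt.Println(env, envFrom)
}

These slices would then be assigned to the Env and EnvFrom fields of AutoscalerOptions shown above.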

Jeffwan (Collaborator) commented Jul 26, 2022

Please run ./hack/update-codegen.sh to make sure the clientset is updated as well.

Signed-off-by: Dmitri Gekhtman <[email protected]>
DmitriGekhtman (Author):

./hack/update-codegen.sh

We should probably add pre-push hooks for things like this.

Signed-off-by: Dmitri Gekhtman <[email protected]>
DmitriGekhtman (Author):

Okidoki, thanks reviewers! Merging.

DmitriGekhtman merged commit 0ff8ff5 into ray-project:master on Jul 26, 2022.
DmitriGekhtman deleted the dmitri/remove-file-mount branch on July 26, 2022 at 22:16.
Successfully merging this pull request may close these issues.

  • [Feature] Set better autoscaling defaults
  • [Feature][Minor] Add env option to autoscaler config