
[autoscaler] Better defaults and config options #414

Merged

Conversation

DmitriGekhtman (Collaborator) commented Jul 25, 2022

Consider reviewing the Ray PR ray-project/ray#26985 related to autoscaling defaults first.

Why are these changes needed?

This PR wraps up autoscaler-related work for the Ray 2.0.0 release. Here's a summary of the changes:

Autoscaler container config specification.

Exposes Env and EnvFrom fields for the autoscaler container.

Autoscaler default resources.

Sets the autoscaler's default resource requests and limits equal to each other, at 500m CPU and 512Mi memory.
See #417, referenced in the review comments below, for related discussion of these values.
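
For reference, here is a minimal sketch (not code from this PR) of what those defaults amount to when written out with the Kubernetes core/v1 API; the variable name is illustrative:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Requests and limits are set equal to each other, per the new defaults.
	autoscalerDefaults := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
	}
	fmt.Println(autoscalerDefaults.Requests.Cpu(), autoscalerDefaults.Limits.Memory())
}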

Aggressive scaling.

Makes the default scaling behavior more aggressive (#359).
Upscaling modes are now "Conservative", "Default", and "Aggressive", with Default and Aggressive being aliases of one another
(so upscaling is no longer rate-limited by default). A brief sketch of selecting the mode through the API follows below.

Completing this item will require merging the Ray PR ray-project/ray#26985.
In fact, it might make sense to merge that one first.
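
As a hedged illustration of selecting the upscaling mode through the CRD's Go API (not code from this PR): this assumes the AutoscalerOptions type exposes an UpscalingMode field taking these string values; the field name and import path are assumptions and may differ across KubeRay versions.

package main

import (
	"fmt"

	// Assumed import path for the v1alpha1 API; adjust to your KubeRay version.
	rayv1alpha1 "github.com/ray-project/kuberay/ray-operator/apis/raycluster/v1alpha1"
)

func main() {
	// "Aggressive" is now an alias of "Default": upscaling is not rate-limited.
	mode := rayv1alpha1.UpscalingMode("Aggressive")
	options := rayv1alpha1.AutoscalerOptions{UpscalingMode: &mode}
	fmt.Println(*options.UpscalingMode)
}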

Revert volume mount change

This PR reverts the change to volume mounts from #391.
It turns out this change can cause the Ray head container to crash-loop with a file-permissions error.
Having the autoscaler container create the log file (ray-project/ray#26748) when the Ray container hasn't already done so is sufficient.

We also remove a redundant volume mount from raycluster-autoscaler.yaml, as it is already configured by the Ray operator.

Note

Post-approval, please leave merging to me -- I might need to run some additional manual tests.

Related issue number

Closes #358
Closes #359

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Dmitri Gekhtman <[email protected]>
sriram-anyscale (Collaborator) left a comment

I don't feel qualified to approve this - but LGTM

- // Resources specifies resource requests and limits for the autoscaler container.
- // Default values: 256m CPU request, 512m CPU limit, 256Mi memory request, 512Mi memory limit.
+ // Resources specifies optional resource request and limit overrides for the autoscaler container.
+ // Default values: 500m CPU request and limit. 512Mi memory request and limit.
Collaborator:

Is this per node? This is insanely low, typical Ray nodes should be 8-32GiB in size, or more than 10-40x this value. Ray is not designed to operate with tiny nodes.

DmitriGekhtman (Author):

These are defaults for the autoscaler container, not for a Ray node.

DmitriGekhtman (Author):

Important related discussion:
#417

// Optional list of environment variables to set in the autoscaler container.
Env []v1.EnvVar `json:"env,omitempty"`
// Optional list of sources to populate environment variables in the autoscaler container.
EnvFrom []v1.EnvFromSource `json:"envFrom,omitempty"`
Collaborator:

Any real-world examples of overriding env variables for the autoscaler image?

Collaborator:

BTW, the overall change looks good to me; I'd just like to know when we should use these fields, e.g. for autoscaling config tuning.

DmitriGekhtman (Author):

It's a good question:

There are debug flags for things like the following (a brief sketch follows the list):

  • exposing detailed resource messages from the GCS
  • modifying the autoscaler update interval
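
As a hedged illustration (not code from this PR), the new Env and EnvFrom fields could be populated roughly as follows. The variable name AUTOSCALER_UPDATE_INTERVAL_S comes from Ray's autoscaler constants, and the ConfigMap name autoscaler-debug-config is hypothetical.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative only: shorten the autoscaler's update interval and pull
	// additional debug settings from a (hypothetical) ConfigMap.
	env := []corev1.EnvVar{
		{Name: "AUTOSCALER_UPDATE_INTERVAL_S", Value: "1"},
	}
	envFrom := []corev1.EnvFromSource{
		{ConfigMapRef: &corev1.ConfigMapEnvSource{
			LocalObjectReference: corev1.LocalObjectReference{Name: "autoscaler-debug-config"},
		}},
	}
	fmt.Println(env, envFrom)
}

These slices would then be assigned to the Env and EnvFrom fields of AutoscalerOptions shown above.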

Jeffwan (Collaborator) commented Jul 26, 2022

Please run ./hack/update-codegen.sh to make sure the clientset is updated as well.

Signed-off-by: Dmitri Gekhtman <[email protected]>
DmitriGekhtman (Author):

./hack/update-codegen.sh

We should probably add pre-push hooks for things like this.

Signed-off-by: Dmitri Gekhtman <[email protected]>
DmitriGekhtman (Author):

Okidoki, thanks reviewers! Merging.

DmitriGekhtman merged commit 0ff8ff5 into ray-project:master on Jul 26, 2022.
DmitriGekhtman deleted the dmitri/remove-file-mount branch on July 26, 2022 at 22:16.
Successfully merging this pull request may close these issues.

  • [Feature] Set better autoscaling defaults
  • [Feature][Minor] Add env option to autoscaler config