[autoscaler] Better defaults and config options #414
Conversation
Signed-off-by: Dmitri Gekhtman <[email protected]>
I don't feel qualified to approve this - but LGTM
// Resources specifies resource requests and limits for the autoscaler container.
// Default values: 256m CPU request, 512m CPU limit, 256Mi memory request, 512Mi memory limit.
// Resources specifies optional resource request and limit overrides for the autoscaler container.
// Default values: 500m CPU request and limit. 512Mi memory request and limit.
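As a sketch of how these defaults might be overridden (field names are taken from the KubeRay RayCluster CRD as discussed in this PR; the exact schema may differ by KubeRay version):

```yaml
# Hypothetical RayCluster manifest overriding the autoscaler container's
# resources. Without this block, the defaults above apply:
# 500m CPU and 512Mi memory for both request and limit.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-autoscaler
spec:
  enableInTreeAutoscaling: true
  autoscalerOptions:
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```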
Is this per node? This is insanely low, typical Ray nodes should be 8-32GiB in size, or more than 10-40x this value. Ray is not designed to operate with tiny nodes.
These are defaults for the autoscaler container, not for a Ray node.
Important related discussion:
#417
// Optional list of environment variables to set in the autoscaler container.
Env []v1.EnvVar `json:"env,omitempty"`
// Optional list of sources to populate environment variables in the autoscaler container.
EnvFrom []v1.EnvFromSource `json:"envFrom,omitempty"`
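A sketch of what using the new EnvFrom field could look like in a RayCluster manifest (the ConfigMap name is hypothetical; field names follow the KubeRay CRD discussed here):

```yaml
# Hypothetical fragment of a RayCluster spec: populating autoscaler
# container environment variables from a ConfigMap via EnvFrom.
spec:
  autoscalerOptions:
    envFrom:
      - configMapRef:
          name: autoscaler-debug-config  # hypothetical ConfigMap
```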
Any real-world examples of overriding envs for the autoscaler image?
BTW, the overall change looks good to me. I'd just like to know when we should use these fields, e.g. for autoscaling config tuning.
It's a good question:
There are some debug flags for certain things like
- exposing detailed resource messages from the GCS
- modifying the autoscaler update interval
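For example, the update-interval debug flag mentioned above could be set through the new Env field. This is a sketch, assuming the Ray autoscaler's `AUTOSCALER_UPDATE_INTERVAL_S` environment variable (the value shown is illustrative):

```yaml
# Hypothetical fragment of a RayCluster spec: tuning the autoscaler
# update interval for debugging via the new Env field.
spec:
  autoscalerOptions:
    env:
      - name: AUTOSCALER_UPDATE_INTERVAL_S
        value: "5"  # illustrative value, in seconds
```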
please help run
Signed-off-by: Dmitri Gekhtman <[email protected]>
We should probably add pre-push hooks for things like this.
Signed-off-by: Dmitri Gekhtman <[email protected]>
Okidoki, thanks reviewers! Merging.
This PR wraps up autoscaler-related work for the Ray 2.0.0 release.
Consider reviewing the Ray PR ray-project/ray#26985 related to autoscaling defaults first.
Why are these changes needed?
This PR wraps up autoscaler-related work for the Ray 2.0.0 release. Here's a summary of the changes.
Autoscaler container config specification.
Exposes Env and EnvFrom fields for the autoscaler container.
Autoscaler default resources.
Sets the autoscaler default resource limits and requests equal to each other at 500m CPU and 512Mi memory.
Reasons for this
Aggressive scaling.
Makes default scaling options more aggressive (#359)
Upscaling options are now "Conservative, Default, Aggressive" with Default and Aggressive being aliases of one another
(so upscaling is not rate-limited by default any more).
Completing this item will require merging the Ray PR ray-project/ray#26985.
In fact, it might make sense to merge that one first.
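A sketch of how the upscaling mode described above could be selected (the `upscalingMode` field name follows the KubeRay CRD; values per this PR):

```yaml
# Hypothetical fragment of a RayCluster spec: choosing an upscaling mode.
# "Default" and "Aggressive" are aliases (upscaling is not rate-limited);
# "Conservative" rate-limits upscaling.
spec:
  autoscalerOptions:
    upscalingMode: Conservative
```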
Revert volume mount change
This PR reverts the change to volume mounts from #391.
It turns out that this change can cause the Ray head container to crashloop with a file permissions error...
Having the autoscaler container create the log file (ray-project/ray#26748) if the Ray container hasn't already done so is sufficient.
We also remove a redundant volume mount from
raycluster-autoscaler.yaml
as it is already configured by the Ray operator.
Note
Post-approval, please leave merging to me -- I might need to run some additional manual tests.
Related issue number
Closes #358
Closes #359
Checks