Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC][autoscaler] Add autoscaler container overrides and config options for scale behavior. #278

Merged

Conversation

DmitriGekhtman
Copy link
Collaborator

@DmitriGekhtman DmitriGekhtman commented May 31, 2022

Overview

Closes #246. Takes a simpler and flatter config approach than the one suggested in the RFC.

This PR is a draft. Once we agree on the config changes, I'll add a simple test.
Added a unit test.

Documentation will be added in a future PR -- after the autoscaler is modified to use the relevant new fields.

Details

This PR adds an autoscalerOptions field to the RayCluster CRD. autoscalerOptions has the following subfields.

upscalingMode

An upcoming Ray PR will have the autoscaler translate this field into upscaling_speed. The allowed values of Default and Aggressive will translate into an upscaling_speed of 1 and 1000, respectively.

idleTimeoutSeconds

An upcoming Ray PR will have the autoscaler translate this field into idle_timeout_minutes. Seconds are used because timeouts of under a minute are sometimes useful (especially in test environments). On the other hand, there are issues with using floats in CRDs.

resources

Override for the autoscaler container's resource requests. For large clusters, users should monitor the autoscaler's resource usage and set this as needed. (TODO: Run some benchmarks and document the results.)

image

Override for the image used by the autoscaler container. The main application I have in mind for this is testing autoscaler changes in the Ray repo's CI.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@DmitriGekhtman DmitriGekhtman changed the title [draft][autoscaler] Add autoscaler container overrides and config options for scale behavior. [RFC][autoscaler] Add autoscaler container overrides and config options for scale behavior. May 31, 2022
@Jeffwan
Copy link
Collaborator

Jeffwan commented Jun 1, 2022

I am thinking whether we should bump the version to v1alpha2 this time. Adding new fields to existing CRD won't bring breaking issue so v1alpha1 works as well. Bumping version introduce additional work like cr mutation etc. what do you guys think?

@pcmoritz
Copy link
Collaborator

pcmoritz commented Jun 1, 2022

On the question of bumping, why don't we bump it when we make incompatible changes (which I'm pretty sure will happen in the future before KubeRay becomes generally available)? This way we can avoid people having to do CR mutation multiple times :)

@pcmoritz
Copy link
Collaborator

pcmoritz commented Jun 1, 2022

This PR looks good to me to make it more convenient for people to use Ray's autoscaler. It won't help people who want to bring their own autoscaler, but they can do so in the classical way (by providing a pod or even an external component that KubeRay is not aware of at all).

description: UpscalineMode is "Default" or "Aggressive.
enum:
- Default
- Aggressive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just wondering, where is the upscaling_speed?

Copy link
Collaborator Author

@DmitriGekhtman DmitriGekhtman Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The allowed values of "Default" and "Aggressive" will translate into an upscaling_speed of 1 and 1000, respectively.

(1000 is a stand-in for infinity.)

The reason is that we've never seen anyone have a need for anything strictly between 1 and infinity :)

An upcoming Ray PR will have the autoscaler parse the new field.

@DmitriGekhtman
Copy link
Collaborator Author

DmitriGekhtman commented Jun 1, 2022

This PR looks good to me to make it more convenient for people to use Ray's autoscaler. It won't help people who want to bring their own autoscaler, but they can do so in the classical way (by providing a pod or even an external component that KubeRay is not aware of at all).

Yep, this is the idea. There's also the option of embedding a custom sidecar yourself.

Another possibility down the line is to expose additional fields container fields here, besides resource and image.

I considered inserting a whole container spec's worth of overrides into the CRD (some CRDs do that)...then I started thinking about how to merge with the default autoscaler container...then I found myself in a Go reflection rabbit hole and decided to go with a simpler approach.

@DmitriGekhtman
Copy link
Collaborator Author

Eh, yeah, +1 for the lazy bump policy
(bump when incompatible changes are desired).

…uberay/ray-operator/bin/controller-gen "crd:maxDescLen=100,trivialVersions=true,preserveUnknownFields=false,generateEmbeddedObjectMeta=true" rbac:roleName=manager-role webhook paths="./..." output:crd:artifacts:config=config/crd/bases

/Users/dmitrigekhtman/code/kuberay/ray-operator/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
mkdir -p /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin
test -f /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin/setup-envtest.sh || curl -sSLo /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin/setup-envtest.sh https://raw.githubusercontent.com/kubernetes-sigs/controller-runtime/v0.7.2/hack/setup-envtest.sh
source /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin/setup-envtest.sh; fetch_envtest_tools /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin; setup_envtest_env /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin; go test ./... -coverprofile cover.out
�[1;33mUsing cached envtest tools from /Users/dmitrigekhtman/code/kuberay/ray-operator/testbin�[0m
�[1;33msetting up env vars�[0m
?   	github.com/ray-project/kuberay/ray-operator	[no test files]
ok  	github.com/ray-project/kuberay/ray-operator/apis/ray/v1alpha1	0.012s	coverage: 0.7% of statements
ok  	github.com/ray-project/kuberay/ray-operator/controllers/ray	13.268s	coverage: 61.6% of statements
ok  	github.com/ray-project/kuberay/ray-operator/controllers/ray/common	0.015s	coverage: 75.6% of statements
ok  	github.com/ray-project/kuberay/ray-operator/controllers/ray/utils	0.012s	coverage: 32.8% of statements
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned/fake	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned/scheme	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned/typed/raycluster/v1alpha1	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/clientset/versioned/typed/raycluster/v1alpha1/fake	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/informers/externalversions	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/informers/externalversions/internalinterfaces	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/informers/externalversions/raycluster	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/informers/externalversions/raycluster/v1alpha1	[no test files]
?   	github.com/ray-project/kuberay/ray-operator/pkg/client/listers/raycluster/v1alpha1	[no test files].
@DmitriGekhtman DmitriGekhtman marked this pull request as ready for review June 1, 2022 06:45
@DmitriGekhtman
Copy link
Collaborator Author

DmitriGekhtman commented Jun 1, 2022

I manually tested to confirm the new fields work as expected.

@DmitriGekhtman DmitriGekhtman merged commit 0abb204 into ray-project:master Jun 1, 2022
@DmitriGekhtman DmitriGekhtman deleted the dmitri/add_autoscaler_fields branch June 1, 2022 22:21
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an explanation for this pattern? Same for line 69

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, these are the auto-generated internals of a K8s container's resource field.

The readable source of truth is in raycluster_types.go.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern is a regex encoding the rules for specifying resource quantities, as described in the K8s docs:
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-units-in-kubernetes

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ns for scale behavior. (ray-project#278)

This PR adds an autoscalerOptions field to the RayCluster CRD.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature][RFC] Expose Ray autoscaler configuration options
6 participants