Tidy custom resources #336
Conversation
```diff
@@ -329,8 +331,6 @@ github.com/mitchellh/go-homedir v1.0.0/go.mod h1:SfyaCUpYCn1Vlf4IUYiD9fPX4A5wJrk
 github.com/mitchellh/go-homedir v1.1.0/go.mod h1:SfyaCUpYCn1Vlf4IUYiD9fPX4A5wJrkLzIz1N1q0pr0=
 github.com/mitchellh/go-testing-interface v1.0.0/go.mod h1:kRemZodwjscx+RGhAo8eIhFbs2+BFgRtFPeD/KE+zxI=
 github.com/mitchellh/gox v0.4.0/go.mod h1:Sd9lOJ0+aimLBi73mGofS1ycjY8lL3uZM3JPS42BGNg=
-github.com/mitchellh/hashstructure/v2 v2.0.2 h1:vGKWl0YJqUNxE8d+h8f6NJLcCJrgbhC4NcD46KavDd4=
-github.com/mitchellh/hashstructure/v2 v2.0.2/go.mod h1:MG3aRVU/N29oo/V/IhBX8GR/zz4kQkprJgF2EVszyDE=
```
Looks like this one was unused and got removed by `go mod tidy`.
```go
// capacities of Ray pods in this group.
// Equivalent to RayStartParams.resources, which must be specified as a json string.
// Disallowed keys: "GPU", "CPU", "memory", "object_store_memory". Instead, use rayStartParams["num-cpus"] etc.
RayCustomResources map[string]int64 `json:"rayCustomResources,omitempty"`
```
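As a hedged illustration of how this new field might appear in a RayCluster manifest (the surrounding keys are assumptions modeled on the existing worker group spec, not taken from this PR):

```yaml
workerGroupSpecs:
  - groupName: custom-workers
    replicas: 2
    # New optional map field introduced by this PR; the values are
    # advertised to Ray as custom resource capacities for pods in this group.
    rayCustomResources:
      Custom1: 1
      Custom2: 5
```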
I am thinking that in real-world cases, we normally need to claim the resource twice (at the Kubernetes level and the Ray level). For example, if I need an inference chip like `aws.amazon.com/neuron: 1`:

Kubernetes-level configuration:

```yaml
resources:
  limits:
    aws.amazon.com/neuron: 1
```

Ray-level configuration:

```yaml
rayCustomResources:
  aws.amazon.com/neuron: 1
```

Is that expected? I'm not sure whether the resource names are always the same between Kubernetes and Ray. If so, we could probably use the Kubernetes resources as the source of truth to build the Ray resource configuration. If not, I think the separation is necessary. (I've somewhat forgotten our conclusion from the last community meeting.)
For now, I'd prefer to keep the configuration separate -- this PR isn't introducing new behavior, just cleaning up the existing interface. `rayCustomResources` is an alias for `rayStartParams["resources"]`; you use a map instead of a triply-quoted json string.

For the use-case described in #167 (multiple image types), there's no need to set up a Kubernetes extended resource. The Ray resource annotation is just used as a hint to the Ray scheduler to schedule certain tasks on workers using certain images. For use-cases like this, it's too heavyweight to require the user to patch nodes with extended resource capacities (or run a daemonset to do the same).

In the other direction, if your Ray pod already uses K8s extended resources, it could be useful for Ray to automatically detect those as "custom resources". Right now, we do that only with "/gpu" extended resources, but we might want to generalize the behavior. We could make this generalization later, if that's desired by users.
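The generalization mentioned above could be sketched roughly as follows. The function name and the plain-map signature are invented for illustration (a real implementation in the operator would likely work with `corev1.ResourceList`); this is not part of this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// extendedToCustomResources converts K8s extended-resource limits
// (e.g. "aws.amazon.com/neuron": 2) into Ray custom-resource capacities.
// cpu and memory are skipped (Ray infers those from the container spec),
// and "/gpu"-suffixed resources are skipped because they are already
// handled specially. Hypothetical helper, not the operator's actual code.
func extendedToCustomResources(limits map[string]int64) map[string]int64 {
	custom := map[string]int64{}
	for name, qty := range limits {
		switch {
		case name == "cpu" || name == "memory":
			// Covered by the num-cpus / memory ray start parameters.
		case strings.HasSuffix(name, "/gpu"):
			// Already advertised to Ray as GPU capacity.
		case strings.Contains(name, "/"):
			// Extended resource: advertise it to Ray under the same name.
			custom[name] = qty
		}
	}
	return custom
}

func main() {
	limits := map[string]int64{
		"cpu":                   4,
		"nvidia.com/gpu":        1,
		"aws.amazon.com/neuron": 2,
	}
	fmt.Println(extendedToCustomResources(limits)) // map[aws.amazon.com/neuron:2]
}
```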
Summarizing my opinion:

- If a pod uses a K8s extended resource, it may be useful to automatically advertise it to Ray as a "custom resource". (But we're not doing that in this PR.)
- Advertising a "custom resource" to Ray should not require labelling K8s nodes with extended resources.
The summary looks good to me. We can do it stage by stage.
The change looks good to me. Please help rebase onto master.
@DmitriGekhtman do we have a use-case where the RayCustomResource is not a physical HW on the node (like a TPU or NPU)?
Yeah, this one: #167.
Paraphrasing @pcmoritz from offline: I think this probably makes sense -- the configuration is a little awkward, but the functionality is already present.
Basically, any time you have multiple worker groups and you want certain tasks to be scheduled on a specific worker group.
Will close for now and consider reopening if there are further complaints about the format.
@akanso as a related item, we can consider automatically advertising K8s extended resources to Ray by inserting them into the relevant rayStartParam.
Why are these changes needed?
Background
Closes #167
Ray start has a `resources` option which allows users to specify custom resource capacities for a Ray node. Custom resources serve as hints to the Ray scheduler and autoscaler: if a Ray node has custom resource `{"BLAH": 10}`, you can schedule 10 tasks decorated with `@ray.remote(resources={"BLAH": 1})`. Attempting to schedule an 11th such task will cause the autoscaler to bring up another Ray node to accommodate the request. One way to use the `resources` field in the context of KubeRay is described in #167.

Problem
The value of `resources` must be supplied as a json string, which makes it awkward to specify correctly in a yaml file.

Solution
To provide a more convenient interface, this PR introduces an optional map field `rayCustomResources`.

Before:

After:
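The original before/after yaml snippets were not preserved in this transcript; the following is a hedged reconstruction of the shape of the change (the field names come from this PR, but the custom resource names and values are illustrative):

Before (json string embedded in yaml):

```yaml
rayStartParams:
  resources: '"{\"Custom1\": 1, \"Custom2\": 5}"'
```

After (plain yaml map):

```yaml
rayCustomResources:
  Custom1: 1
  Custom2: 5
```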
Properties of the change in this PR
An upcoming Ray PR would update the autoscaler to read the newly introduced field when determining node resource capacities.
Note on CPU/GPU/memory
The `resources` option for `ray start` does not support the keys `CPU`, `GPU`, or `memory`. These fields are inferred from the container spec and can be overridden with the optional `ray start` parameters `num-cpus`, `num-gpus`, and `memory`. Attempting to supply the keys `CPU`, `GPU`, or `memory` to `resources` will cause Ray start to fail with an error message telling the user not to do that.

I considered adding CRD validation to exclude these keys. However, Kubernetes does not support "patternProperties" in CRDs -- you can't use a regex to restrict the keys of a map/object. A potential workaround would be to use a list of `key`, `value` pairs and add validation for `key`. However, that would make the interface and the code more complicated.

We already do some validation of rayStartParams in the operator code.
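A minimal sketch of what that validation could look like for the new field (the function name is invented and this is not the operator's actual code; a real check would live alongside the existing rayStartParams validation):

```go
package main

import "fmt"

// validateRayCustomResources rejects the keys that `ray start` disallows
// in its resources option. Hypothetical helper for illustration only.
func validateRayCustomResources(custom map[string]int64) error {
	disallowed := []string{"CPU", "GPU", "memory", "object_store_memory"}
	for _, key := range disallowed {
		if _, ok := custom[key]; ok {
			return fmt.Errorf("rayCustomResources: key %q is not allowed; use the corresponding ray start parameter instead", key)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateRayCustomResources(map[string]int64{"Custom1": 1})) // <nil>
	fmt.Println(validateRayCustomResources(map[string]int64{"CPU": 4}))
}
```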
TODO in a follow-up PR: validate the absence of the `CPU`, `GPU`, and `memory` keys in the same spot, after #334 is merged.

Autoscaler
A follow-up Ray PR would have the autoscaler consider the `rayCustomResources` field when determining resource capacities.
Closes #167
Checks