[Feature] Users should be able to define custom resources in worker groups #167

Closed
1 of 2 tasks
ebr opened this issue Mar 2, 2022 · 9 comments
Labels
enhancement New feature or request operator

Comments

@ebr
Contributor

ebr commented Mar 2, 2022

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

When using the original ray-operator, the cluster.ray.io/v1 API included a spec for rayResource, which could be used for tagging worker groups as providers of custom, user-defined resources. This seems to be missing from the ray.io/v1alpha1 API, and it would be useful to have it back.

Use case

A use case for this might be to deploy a heterogeneous cluster with multiple worker groups, where each worker group uses a different image packaged with different third-party utilities. Tasks that require specific utilities could then be marked as requiring the corresponding resource, so that they only execute on the workers that provide it.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@ebr ebr added the enhancement New feature or request label Mar 2, 2022
@chenk008
Contributor

chenk008 commented Mar 3, 2022

@ebr Are you using an old ray-operator? The CRD changed last year, and rayResource doesn't exist any more.

@ebr
Contributor Author

ebr commented Mar 3, 2022

That is possible - we've been using ray-operator for a few months. I'm mainly asking whether there are any plans to bring back some kind of mechanism for defining custom resources. Also, we were using rayResources: {"CPU":0} on the head node to prevent the head from doing computational workloads. Wondering how we can achieve this now without rayResources.

@juangtato-ds

juangtato-ds commented Mar 8, 2022

@ebr the doc at https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md explains how to set 0 CPUs, by setting the num-cpus start parameter.

We have also been trying to configure custom resources for our worker groups, but we have not managed to launch the pod.

We configured the worker group with:

rayStartParams:
  redis-password: 'foobared'
  node-ip-address: $MY_POD_IP
  block: 'true'
  resources: '{ "only_cpu" : 9001 }'

But the pod creation fails with:

Error: Got unexpected extra arguments (only_cpu : 9001 })

The weird thing is that when we change it a little, a different error appears:

rayStartParams:
  redis-password: 'foobared'
  node-ip-address: $MY_POD_IP
  block: 'true'
  resources: '{"only_cpu":9001}'

Pod fails with:

2022-03-07 23:36:03,162 PANIC scripts.py:503 -- Valid values look like this: `--resources='{"CustomResource3": 1, "CustomResource2": 2}'`

Error:

Valid values look like this: `{}`
2022-03-07 23:36:03,162 ERR scripts.py:500 -- `--resources` is not a valid JSON string.
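Both failures are consistent with the value being pasted, unquoted, into a shell command line. The sketch below is not KubeRay code: it assumes the pod command is built as `ray start --resources=<value>` and uses Python's `shlex` to imitate POSIX shell word splitting. The helper names `ray_start_argv` and `check` are hypothetical.

```python
import json
import shlex


def ray_start_argv(resources_value):
    # Assumption: the operator pastes the value into the command line as-is,
    # so the shell (imitated here by shlex) splits it and strips the quotes.
    return shlex.split(f"ray start --resources={resources_value}")


def check(resources_value):
    argv = ray_start_argv(resources_value)
    option, extra = argv[2], argv[3:]
    if extra:
        # Spaces in the value produced additional positional arguments.
        return "extra arguments: " + " ".join(extra)
    try:
        json.loads(option.split("=", 1)[1])
    except json.JSONDecodeError:
        # The shell removed the double quotes, so the JSON is no longer valid.
        return "not valid JSON: " + option
    return "ok"


spaced = check('{ "only_cpu" : 9001 }')
compact = check('{"only_cpu":9001}')
print(spaced)   # extra arguments: only_cpu : 9001 }
print(compact)  # not valid JSON: --resources={only_cpu:9001}
```

Under that assumption, the value with spaces is split into several words (matching the "unexpected extra arguments" error), while the compact value stays in one word but loses its double quotes (matching the "not a valid JSON string" error).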

In the commit d54ea70 there is the following comment:

      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.

But there is no demonstration below it.

@ebr
Contributor Author

ebr commented Mar 8, 2022

@juangtato-ds Thank you for pointing me at this! I figured it out: the "unfortunate format" is that you must escape the double quotes. So when deploying Ray clusters using the Helm chart, this worked:

resources: "'{\"customRes\": 1, \"anotherOne\": 2}'"

resulting in the following command in the pod spec:

ray start --resources='{"customRes": 1, "anotherOne": 2}' --block ....
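Writing the escaped string by hand is error-prone, so it can help to generate it. The following is a minimal sketch using only the Python standard library; `resources_param` is a hypothetical helper, and the final check assumes the value is pasted into the shell command unquoted, as the resulting pod command above suggests.

```python
import json
import shlex


def resources_param(resources):
    # Wrap the JSON in single quotes so the shell passes it on as one argument.
    # Assumes resource names contain no single quotes.
    return "'" + json.dumps(resources) + "'"


value = resources_param({"customRes": 1, "anotherOne": 2})

# A JSON string literal is also a valid YAML double-quoted scalar, so
# json.dumps yields the escaped form used in the Helm values file.
yaml_line = "resources: " + json.dumps(value)
print(yaml_line)

# Round trip: split the command the way a POSIX shell would, then parse the JSON.
argv = shlex.split(f"ray start --resources={value} --block")
parsed = json.loads(argv[2].split("=", 1)[1])
print(parsed)
```

The printed `resources:` line has the same shape as the Helm value above, and the round trip recovers the original mapping.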

@juangtato-ds

@ebr thanks! Didn't try out that one. It also worked for us.

For this scenario, maybe the specification of the resources attribute should accept a map, something like:

rayStartParams:
  # ...
  resources: 
    customRes: 1
    anotherOne: 2
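One way such a map could be rendered into a command line is sketched below. This is purely illustrative and not how KubeRay works: `ray_start_command` is a hypothetical function, and it assumes every entry maps to a `--key=value` option. Quoting is delegated to `shlex.join` (Python 3.8+), so the user never has to escape anything.

```python
import json
import shlex


def ray_start_command(params):
    """Render a hypothetical rayStartParams mapping as a `ray start` command line."""
    args = ["ray", "start"]
    for key, value in params.items():
        if isinstance(value, dict):
            # Nested maps (such as resources) are serialized to JSON.
            value = json.dumps(value)
        args.append(f"--{key}={value}")
    # shlex.join quotes each argument so a POSIX shell keeps it intact.
    return shlex.join(args)


command = ray_start_command(
    {"num-cpus": "0", "resources": {"customRes": 1, "anotherOne": 2}}
)
print(command)

# Splitting the command again recovers the original arguments.
round_trip = shlex.split(command)
```

Here the shell quoting is applied to the whole `--resources=...` argument rather than only to the JSON part, which is equivalent once the shell has parsed the line.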

@DmitriGekhtman
Collaborator

The current way of specifying resources in Ray start params is pretty painful; this should definitely be fixed.

@Jeffwan Jeffwan added help wanted Extra attention is needed operator labels May 30, 2022
@DmitriGekhtman DmitriGekhtman removed the help wanted Extra attention is needed label May 31, 2022
@DmitriGekhtman DmitriGekhtman self-assigned this May 31, 2022
@DmitriGekhtman
Collaborator

Starting to work on this now.

@kevin85421
Member

Make head more stable: when creating the cluster, allocate sufficient amount of resources on head pod such that it tends to be stable and not easy to crash. You can also set {"num-cpus": "0"} in "rayStartParams" of "headGroupSpec" such that Ray scheduler will skip the head node when scheduling workloads. This also helps to maintain the stability of the head.

https://github.com/ray-project/kuberay/blob/master/docs/best-practice/worker-head-reconnection.md#best-practice

^ Ray scheduler will skip the head node when scheduling workloads.

@DmitriGekhtman
Collaborator

We decided to stick with rayStartParams["resources"] as the way to do this:
https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#id1
It could be possible to simplify the required format for the resource string, though.

6 participants