[RFE] Change num_gpus to a dict to support arbitrary accelerators #467

Xaenalt · 2024-02-22T05:24:41Z

Name of Feature or Improvement

I'd like to replace the hardcoded nvidia.com/gpu with a dict (or similar) of resources. There are other accelerators, and it would be nice to specify them with arbitrary key/value pairs rather than hardcoding nvidia.com/gpu.

Description of Problem the Feature Should Solve

Hardcoding nvidia.com/gpu is suboptimal because other accelerators exist, habana.ai/gaudi to name one, along with other potential resources, some possibly not even public. Being able to specify these additional resources without editing the template would improve usability.

Describe the Solution You Would Like to See

I'd like to see a constructor something like:

cluster = Cluster(ClusterConfiguration(
    name='raytest',
    namespace='ray-demo',
    num_workers=2,
    min_cpus=8,
    max_cpus=8,
    min_memory=12,
    max_memory=12,
    resources={"habana.ai/gaudi": 1},
    image="quay.io/spryor/ray:synapseai-1.13-torch",
    instascale=False
))

This would simply add the keys and values from the resources argument to the requests/limits section of the resource spec. An option to set requests and limits separately could come later, but for a first pass it's totally fine if requests == limits, since hardware devices require them to be equal anyway.
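To make the intended behaviour concrete, here is a minimal sketch of how the resources dict could be folded into a requests/limits section. The helper name build_container_resources and the "G" memory suffix are assumptions for illustration only, not the SDK's actual implementation.

```python
from typing import Dict


def build_container_resources(
    min_cpus: int,
    max_cpus: int,
    min_memory: int,
    max_memory: int,
    resources: Dict[str, int],
) -> Dict[str, Dict[str, object]]:
    """Hypothetical helper: build a Kubernetes-style resources section.

    Extended resources (e.g. "habana.ai/gaudi") are copied into both
    requests and limits, so requests == limits for every device.
    """
    # Assumption: memory values are gigabytes and rendered with a "G" suffix.
    requests: Dict[str, object] = {"cpu": min_cpus, "memory": f"{min_memory}G"}
    limits: Dict[str, object] = {"cpu": max_cpus, "memory": f"{max_memory}G"}

    for name, count in resources.items():
        if count < 0:
            raise ValueError(f"resource {name!r} must be non-negative, got {count}")
        # Same value on both sides: hardware devices need request == limit.
        requests[name] = count
        limits[name] = count

    return {"requests": requests, "limits": limits}


section = build_container_resources(8, 8, 12, 12, {"habana.ai/gaudi": 1})
print(section)
```

With this shape, nvidia.com/gpu becomes just another key in the dict rather than a special case.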

Describe Alternatives You Have Considered

Some alternative formats: separate min_resources and max_resources arguments, or a string format such as "someresource": "1/2" for request 1, limit 2.
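As a rough illustration of the string alternative, the sketch below parses values like "1/2" into a (request, limit) pair and treats a bare number as request == limit. The function names parse_resource_spec and split_resources are hypothetical and only show how such a format might be interpreted.

```python
from typing import Dict, Tuple, Union


def parse_resource_spec(value: Union[int, str]) -> Tuple[int, int]:
    """Hypothetical parser: return (request, limit) for one resource value."""
    if isinstance(value, int):
        # A plain integer means request == limit.
        return value, value

    request_text, separator, limit_text = value.partition("/")
    request = int(request_text)
    # Without a "/" the single number is used for both request and limit.
    limit = int(limit_text) if separator else request

    if request > limit:
        raise ValueError(f"request {request} exceeds limit {limit}")
    return request, limit


def split_resources(
    resources: Dict[str, Union[int, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split a mixed resources dict into separate requests and limits dicts."""
    requests: Dict[str, int] = {}
    limits: Dict[str, int] = {}
    for name, value in resources.items():
        requests[name], limits[name] = parse_resource_spec(value)
    return requests, limits


requests, limits = split_resources({"someresource": "1/2", "habana.ai/gaudi": 1})
print(requests, limits)
```

The min_resources/max_resources variant would skip the parsing step entirely, at the cost of two arguments that must be kept in sync for devices where request and limit have to match.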

Additional Context

In this case the request concerns Habana Gaudi devices, but the scope extends beyond that.

anishasthana · 2024-02-22T14:34:44Z

cc @Bobbins228

Bobbins228 · 2024-02-26T09:42:57Z

This sounds like a useful change 👍

KPostOffice · 2024-09-19T13:41:00Z

Solved with #531

KPostOffice closed this as completed Sep 19, 2024
