
Ray does not handle MIG devices #12413

Open
maxclaey opened this issue Nov 25, 2020 · 11 comments
Assignees
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
enhancement: Request for new feature and/or capability
P3: Issue moderate in impact or severity

Comments

@maxclaey

What is the problem?

Today we discovered an issue with a Ray deployment on a DGX A100. NVIDIA's new Ampere cards support MIG (Multi-Instance GPU), where a physical GPU is split into multiple virtual GPUs that can each be used as a normal CUDA device. However, only the physical GPUs show up in /proc/driver/nvidia/gpus. So if you have a single 40GB GPU split into 4 10GB MIG devices, you will only see 1 GPU, while in fact there are 4 (virtual) ones.
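
For illustration only, here is a minimal sketch of the same /proc lookup (not Ray's actual detection code):

```python
import os

# Each subdirectory of /proc/driver/nvidia/gpus corresponds to one *physical* GPU,
# so MIG slices never show up here.
physical_gpus = os.listdir("/proc/driver/nvidia/gpus")
print(f"Physical GPUs visible to the driver: {len(physical_gpus)}")
# On a single A100 split into 4 MIG devices this prints 1, not 4.
```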

Furthermore, these MIG devices can no longer be specified in CUDA_VISIBLE_DEVICES with a simple integer ID; their actual UUIDs have to be used instead.

I had a quick shot at writing a script that fetches all MIG devices as well as all regular devices not in MIG mode (see attachment). Maybe it would be interesting to incorporate something like this so that Ray can handle NVIDIA's latest hardware? Thanks in advance for considering it.

gpu_uuid_fetching.txt (File uploaded as .txt file instead of .py due to GitHub restrictions)
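
The attached script is not reproduced here; as a rough sketch of a similar approach (assuming the nvidia-ml-py / pynvml bindings and a MIG-capable driver, not the exact contents of the attachment), the UUID collection could look like this:

```python
import pynvml  # pip install nvidia-ml-py


def list_cuda_device_uuids():
    """Return UUIDs of all MIG devices plus all GPUs that are not in MIG mode."""
    pynvml.nvmlInit()
    uuids = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
            except pynvml.NVMLError:
                current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # pre-Ampere GPU
            if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
                # Enumerate the MIG instances carved out of this physical GPU.
                for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                    try:
                        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                    except pynvml.NVMLError:
                        continue  # MIG slot not populated
                    uuids.append(_to_str(pynvml.nvmlDeviceGetUUID(mig)))
            else:
                uuids.append(_to_str(pynvml.nvmlDeviceGetUUID(gpu)))
    finally:
        pynvml.nvmlShutdown()
    return uuids


def _to_str(value):
    # Older pynvml versions return bytes, newer ones return str.
    return value.decode() if isinstance(value, bytes) else value


if __name__ == "__main__":
    print(",".join(list_cuda_device_uuids()))
```

The comma-joined output is in exactly the form CUDA_VISIBLE_DEVICES expects.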

Reproduction (REQUIRED)

Check the available GPU resources on a system with Ampere GPUs and MIG enabled. The number of GPU resources will reflect the number of physical GPUs rather than the number of MIG devices.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.
@maxclaey maxclaey added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2020
@richardliaw richardliaw added enhancement Request for new feature and/or capability P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2020
@richardliaw
Contributor

This seems quite cool! How did you get access to the A100?

@maxclaey
Author

We have one available at Robovision :)

@simon-mo simon-mo self-assigned this Nov 25, 2020
@richardliaw
Contributor

@ericl @wuisawesome @simon-mo we probably want to have some way of plugging in accelerator detectors. They would set the correct environment variables and handle resources similarly to how GPUs are handled.

This will allow us to easily support TPUs, new hardware, etc.
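
Purely to illustrate the idea (all names below are hypothetical, not an existing Ray API), such a plugin could look roughly like this:

```python
from abc import ABC, abstractmethod
from typing import List


class AcceleratorDetector(ABC):
    """Hypothetical plugin interface for detecting schedulable accelerator devices."""

    @abstractmethod
    def detect(self) -> List[str]:
        """Return one opaque device ID (index or UUID) per schedulable device."""

    @abstractmethod
    def visibility_env_var(self) -> str:
        """Env var to set on workers so they only see their assigned devices."""


class NvidiaMigDetector(AcceleratorDetector):
    def detect(self) -> List[str]:
        # Would enumerate MIG UUIDs (e.g. via NVML) and fall back to plain GPU
        # indices for GPUs that are not in MIG mode.
        raise NotImplementedError

    def visibility_env_var(self) -> str:
        return "CUDA_VISIBLE_DEVICES"
```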

@PidgeyBE
Contributor

PidgeyBE commented Dec 1, 2020

FYI: If we set CUDA_VISIBLE_DEVICES before starting raylet, and set --num_gpus to the number of MIGs, it works!
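
For a single node, a minimal sketch of that workaround in Python (the MIG UUIDs below are placeholders; use the values printed by nvidia-smi -L):

```python
import os

# Placeholder MIG UUIDs; substitute the real ones from `nvidia-smi -L`.
mig_uuids = [
    "MIG-11111111-2222-3333-4444-555555555555",
    "MIG-66666666-7777-8888-9999-000000000000",
]
# Must be set before the raylet starts, i.e. before ray.init() or `ray start`.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(mig_uuids)

import ray

ray.init(num_gpus=len(mig_uuids))  # advertise one "GPU" resource per MIG slice
```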

@dominicshanshan

FYI: If we set CUDA_VISIBLE_DEVICES before starting raylet, and set --num_gpus to the number of MIGs, it works!

I tried starting 4 MIG devices in one Docker container and then starting the raylet; nvidia-smi can identify the 4 MIG devices, but Ray can only run on one MIG device.

@DuoblaK

DuoblaK commented Sep 27, 2022

What is the problem? … [quoting the original issue description above]

Hello, has this been solved for you?
I have met the same problem, and it still exists with Ray 2.0.0.
Also, if my MIG devices sit on different GPUs, Ray can see the different MIG devices, but if the MIG devices sit on the same GPU, Ray can see only one MIG device.

@ericl ericl added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed P2 Important issue, but not time-critical labels Sep 27, 2022
@ericl
Contributor

ericl commented Sep 27, 2022

As a workaround, I believe the following should work (referencing #12413 (comment)):

  1. Set CUDA_VISIBLE_DEVICES=uuid1,uuid2 prior to "ray start".
  2. Also set --num-gpus explicitly in "ray start".

e.g., CUDA_VISIBLE_DEVICES=uuid1,uuid2,uuid3 ray start --num-gpus=3.

Should we auto-detect MIG devices by default? @amogkam
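
As a quick sanity check of that workaround, here is a sketch (assuming the cluster was started with the two steps above and at least three MIG slices are exposed); each task should end up with its own UUID in CUDA_VISIBLE_DEVICES:

```python
import os

import ray

ray.init(address="auto")  # connect to the cluster started via `ray start` above


@ray.remote(num_gpus=1)
def visible_devices():
    # Ray assigns each worker a subset of the node's GPU IDs and rewrites
    # CUDA_VISIBLE_DEVICES accordingly.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


print(ray.get([visible_devices.remote() for _ in range(3)]))
```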

@XuehaiPan
Contributor

XuehaiPan commented Sep 27, 2022

I think we should use libcuda.so to detect CUDA devices; see also #17914 (comment).

Currently, a CUDA program can only enumerate a single MIG device; see https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices. Ray should read all available MIG devices and then set CUDA_VISIBLE_DEVICES for each remote actor.
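
A minimal sketch of what enumerating through libcuda.so could look like (assuming the driver library is on the loader path; error handling is elided):

```python
import ctypes

# Load the CUDA driver API directly instead of reading /proc/driver/nvidia/gpus.
libcuda = ctypes.CDLL("libcuda.so.1")

result = libcuda.cuInit(0)
assert result == 0, f"cuInit failed with CUDA error {result}"

count = ctypes.c_int(0)
result = libcuda.cuDeviceGetCount(ctypes.byref(count))
assert result == 0, f"cuDeviceGetCount failed with CUDA error {result}"

print(f"CUDA-visible devices in this process: {count.value}")
# Under MIG this is at most 1 per process, which is why Ray would have to
# enumerate MIG instances up front (e.g. via NVML) and hand each worker its own
# CUDA_VISIBLE_DEVICES value.
```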

@DuoblaK

DuoblaK commented Sep 28, 2022

As a workaround, I believe the following should work … [quoting @ericl's workaround above]

Thanks for your reply. I tried what you said and it really works when the pod runs on a single node in k8s; multi-node still has a problem and I am working on it, but it should work the same way.
Thanks again for your reply, it helps me a lot.

@richardliaw richardliaw added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 7, 2022
@huaiyizhao

Hello, any updates on this thread besides the workaround by @ericl?

@peterghaddad
Contributor

@XuehaiPan @amogkam Any updates on this?

@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024