
Ray does not handle MIG devices #12413

Open
maxclaey opened this issue Nov 25, 2020 · 11 comments
Assignees
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
enhancement: Request for new feature and/or capability
P3: Issue moderate in impact or severity

Comments

@maxclaey

What is the problem?

Today we discovered an issue with a Ray deployment on a DGX A100. NVIDIA's new Ampere cards support MIG (Multi-Instance GPU), where a physical GPU is split into multiple virtual GPUs that can each be used as a normal CUDA device. However, only the physical GPUs show up in /proc/driver/nvidia/gpus. So if you have a single 40GB GPU split into 4 10GB MIG devices, you will only see 1 GPU, while in fact there are 4 (virtual) ones.
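
For illustration only, here is a minimal sketch of the same /proc lookup (not Ray's actual detection code):

```python
import os

# Each subdirectory of /proc/driver/nvidia/gpus corresponds to one *physical* GPU,
# so MIG slices never show up here.
physical_gpus = os.listdir("/proc/driver/nvidia/gpus")
print(f"Physical GPUs visible to the driver: {len(physical_gpus)}")
# On a single A100 split into 4 MIG devices this prints 1, not 4.
```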

Furthermore, these MIG devices can no longer be specified in CUDA_VISIBLE_DEVICES with a simple integer ID; their actual UUIDs have to be used instead.

I had a quick shot at writing a script that fetches all MIG devices as well as all regular devices not in MIG mode (see attachment). Maybe it would be interesting to incorporate something like this so that Ray can handle NVIDIA's latest hardware? Thanks in advance for considering it.

gpu_uuid_fetching.txt (File uploaded as .txt file instead of .py due to GitHub restrictions)
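
The attached script is not reproduced here; as a rough sketch of a similar approach (assuming the nvidia-ml-py / pynvml bindings and a MIG-capable driver, not the exact contents of the attachment), the UUID collection could look like this:

```python
import pynvml  # pip install nvidia-ml-py


def list_cuda_device_uuids():
    """Return UUIDs of all MIG devices plus all GPUs that are not in MIG mode."""
    pynvml.nvmlInit()
    uuids = []
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
            except pynvml.NVMLError:
                current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # pre-Ampere GPU
            if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
                # Enumerate the MIG instances carved out of this physical GPU.
                for j in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
                    try:
                        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
                    except pynvml.NVMLError:
                        continue  # MIG slot not populated
                    uuids.append(_to_str(pynvml.nvmlDeviceGetUUID(mig)))
            else:
                uuids.append(_to_str(pynvml.nvmlDeviceGetUUID(gpu)))
    finally:
        pynvml.nvmlShutdown()
    return uuids


def _to_str(value):
    # Older pynvml versions return bytes, newer ones return str.
    return value.decode() if isinstance(value, bytes) else value


if __name__ == "__main__":
    print(",".join(list_cuda_device_uuids()))
```

The comma-joined output is in exactly the form CUDA_VISIBLE_DEVICES expects.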

Reproduction (REQUIRED)

Check the available GPU resources on a system with Ampere GPUs and MIG enabled. The number of GPU resources will reflect the number of physical GPUs rather than the number of MIG devices.

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.
@maxclaey maxclaey added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2020
@richardliaw richardliaw added enhancement Request for new feature and/or capability P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 25, 2020
@richardliaw
Contributor

This seems quite cool! How did you get access to the A100?

@maxclaey
Author

We have one available at Robovision :)

@simon-mo simon-mo self-assigned this Nov 25, 2020
@richardliaw
Contributor

@ericl @wuisawesome @simon-mo we probably want to have some way of plugging in accelerator detectors. They would set the correct environment variables and handle resources similarly to how GPUs are handled.

This will allow us to easily support TPUs, new hardware, etc.
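
Purely to illustrate the idea (all names below are hypothetical, not an existing Ray API), such a plugin could look roughly like this:

```python
from abc import ABC, abstractmethod
from typing import List


class AcceleratorDetector(ABC):
    """Hypothetical plugin interface for detecting schedulable accelerator devices."""

    @abstractmethod
    def detect(self) -> List[str]:
        """Return one opaque device ID (index or UUID) per schedulable device."""

    @abstractmethod
    def visibility_env_var(self) -> str:
        """Env var to set on workers so they only see their assigned devices."""


class NvidiaMigDetector(AcceleratorDetector):
    def detect(self) -> List[str]:
        # Would enumerate MIG UUIDs (e.g. via NVML) and fall back to plain GPU
        # indices for GPUs that are not in MIG mode.
        raise NotImplementedError

    def visibility_env_var(self) -> str:
        return "CUDA_VISIBLE_DEVICES"
```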

@PidgeyBE
Contributor

PidgeyBE commented Dec 1, 2020

FYI: If we set CUDA_VISIBLE_DEVICES before starting raylet, and set --num_gpus to the number of MIGs, it works!
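
For a single node, a minimal sketch of that workaround in Python (the MIG UUIDs below are placeholders; use the values printed by nvidia-smi -L):

```python
import os

# Placeholder MIG UUIDs; substitute the real ones from `nvidia-smi -L`.
mig_uuids = [
    "MIG-11111111-2222-3333-4444-555555555555",
    "MIG-66666666-7777-8888-9999-000000000000",
]
# Must be set before the raylet starts, i.e. before ray.init() or `ray start`.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(mig_uuids)

import ray

ray.init(num_gpus=len(mig_uuids))  # advertise one "GPU" resource per MIG slice
```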

@dominicshanshan

FYI: If we set CUDA_VISIBLE_DEVICES before starting raylet, and set --num_gpus to the number of MIGs, it works!

I tried starting 4 MIG devices in one Docker container and then starting the raylet; nvidia-smi can identify the 4 MIG devices, but Ray can only run on one MIG device.

@DuoblaK

DuoblaK commented Sep 27, 2022

What is the problem? … [quoting the original issue description above]

Hello, has this been solved for you?
I have met the same problem, and it still exists with Ray 2.0.0.
Also, if my MIG devices sit on different GPUs, Ray can see the different MIG devices, but if the MIG devices sit on the same GPU, Ray can see only one MIG device.

@ericl ericl added P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core and removed P2 Important issue, but not time-critical labels Sep 27, 2022
@ericl
Contributor

ericl commented Sep 27, 2022

As a workaround, I believe the following should work (referencing #12413 (comment)):

  1. Set CUDA_VISIBLE_DEVICES=uuid1,uuid2 prior to "ray start".
  2. Also set --num-gpus explicitly in "ray start".

e.g., CUDA_VISIBLE_DEVICES=uuid1,uuid2,uuid3 ray start --num-gpus=3.

Should we auto-detect MIG devices by default? @amogkam
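
As a quick sanity check of that workaround, here is a sketch (assuming the cluster was started with the two steps above and at least three MIG slices are exposed); each task should end up with its own UUID in CUDA_VISIBLE_DEVICES:

```python
import os

import ray

ray.init(address="auto")  # connect to the cluster started via `ray start` above


@ray.remote(num_gpus=1)
def visible_devices():
    # Ray assigns each worker a subset of the node's GPU IDs and rewrites
    # CUDA_VISIBLE_DEVICES accordingly.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


print(ray.get([visible_devices.remote() for _ in range(3)]))
```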

@XuehaiPan
Contributor

XuehaiPan commented Sep 27, 2022

I think we should use libcuda.so to detect CUDA devices; see also #17914 (comment).

Currently, a CUDA program can only enumerate a single MIG device; see https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices. Ray should read all available MIG devices and then set CUDA_VISIBLE_DEVICES for each remote actor.
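
A minimal sketch of what enumerating through libcuda.so could look like (assuming the driver library is on the loader path; error handling is elided):

```python
import ctypes

# Load the CUDA driver API directly instead of reading /proc/driver/nvidia/gpus.
libcuda = ctypes.CDLL("libcuda.so.1")

result = libcuda.cuInit(0)
assert result == 0, f"cuInit failed with CUDA error {result}"

count = ctypes.c_int(0)
result = libcuda.cuDeviceGetCount(ctypes.byref(count))
assert result == 0, f"cuDeviceGetCount failed with CUDA error {result}"

print(f"CUDA-visible devices in this process: {count.value}")
# Under MIG this is at most 1 per process, which is why Ray would have to
# enumerate MIG instances up front (e.g. via NVML) and hand each worker its own
# CUDA_VISIBLE_DEVICES value.
```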

@DuoblaK

DuoblaK commented Sep 28, 2022

As a workaround, I believe the following should work … [quoting @ericl's workaround above]

Thanks for your reply. I tried what you said and it really works when the pod runs on a single node in k8s; multi-node still has a problem and I am working on it, but it should work the same way.
Thanks again for your reply, it helps me a lot.

@richardliaw richardliaw added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 7, 2022
@huaiyizhao

Hello, any updates on this thread besides the workaround by @ericl?

@peterghaddad
Contributor

@XuehaiPan @amogkam Any updates on this?

@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024