Ray does not handle MIG devices #12413
Comments
This seems quite cool! How did you get access to the A100?
We have one available at Robovision :)
@ericl @wuisawesome @simon-mo we probably want to have some way of plugging in accelerator detectors. They would set the correct environment variables and handle resources similarly to how GPUs are handled. This would allow us to easily support TPUs, new hardware, etc.
FYI: If we set CUDA_VISIBLE_DEVICES before starting the raylet, and set
I tried starting 4 MIG instances in one Docker container and then starting the raylet; nvidia-smi can identify the 4 MIG devices, but Ray can only use one of them.
Hello, has this issue been solved for you?
As a workaround, I believe the following should work (referencing #12413 (comment)):
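The referenced snippet is not reproduced in this thread, but the kind of workaround being discussed can be sketched roughly as follows: advertise one custom "mig" resource per MIG instance, then pin CUDA_VISIBLE_DEVICES to a specific MIG UUID inside each task, since Ray's built-in GPU accounting only sees the physical card. This is an assumption about the approach, not the original code; `env_for_mig` and `launch_on_mig` are hypothetical names.

```python
import os

def env_for_mig(uuid: str) -> dict:
    """Environment a worker needs so CUDA sees exactly one MIG instance."""
    return {"CUDA_VISIBLE_DEVICES": uuid}

def launch_on_mig(mig_uuid_list):
    """Sketch only, not executed here; assumes `ray` is installed and MIG
    UUIDs were obtained via `nvidia-smi -L`."""
    import ray

    # Bypass Ray's GPU detection: advertise one "mig" unit per instance.
    ray.init(num_gpus=0, resources={"mig": len(mig_uuid_list)})

    @ray.remote(resources={"mig": 1})
    def work(uuid):
        # Pin this worker to a single MIG instance by UUID.
        os.environ.update(env_for_mig(uuid))
        # ... CUDA code here enumerates exactly one device ...

    return [work.remote(u) for u in mig_uuid_list]
```

The custom-resource trick avoids Ray rewriting CUDA_VISIBLE_DEVICES with integer indices, which do not address MIG instances.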
E.g., should we auto-detect MIG devices by default? @amogkam
Currently, a CUDA program can only enumerate one MIG device; see https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-visible-devices. I think Ray should read all available MIG devices and then set CUDA_VISIBLE_DEVICES for each worker accordingly.
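As a sketch of what such detection could look like (this is not Ray's actual implementation; `mig_uuids` and `worker_envs` are hypothetical helpers), the UUIDs can be scraped from `nvidia-smi -L` output and one assigned per worker:

```python
import re

# Hypothetical helper (not part of Ray): collect every MIG instance UUID
# reported by `nvidia-smi -L`, e.g. lines like
#   "  MIG 1g.5gb Device 0: (UUID: MIG-....)".
def mig_uuids(listing: str) -> list:
    """Return all MIG UUIDs found in `nvidia-smi -L`-style output."""
    return re.findall(r"\(UUID:\s*(MIG-[^)\s]+)\)", listing)

def worker_envs(listing: str) -> list:
    """One environment per worker: CUDA enumerates only a single MIG
    device per process, so each worker gets exactly one UUID."""
    return [{"CUDA_VISIBLE_DEVICES": u} for u in mig_uuids(listing)]
```

In practice the listing would come from running `nvidia-smi -L` as a subprocess; the sample here is parsed from a string so the logic stands alone.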
Thanks for your reply. I tried what you suggested, and it does work when the pod runs on a single node in Kubernetes; multi-node setups still have a problem and I am working on that, but it should work the same way.
Hello, any updates on this thread besides the workaround by @ericl? |
@XuehaiPan @amogkam Any updates on this? |
What is the problem?
Today we discovered an issue with a Ray deployment on a DGX A100. NVIDIA's new Ampere cards support MIG (Multi-Instance GPU), where a physical GPU is split into multiple virtual GPUs that can each be used as a normal CUDA device. However, only the physical GPUs show up in /proc/driver/nvidia/gpus. So if you have a single 40GB GPU split into 4 10GB MIG devices, you will only see 1 GPU, while in fact there are 4 (virtual) ones.
Furthermore, these MIG devices can no longer be selected in CUDA_VISIBLE_DEVICES with a simple integer ID; the actual UUIDs must be used instead.
I had a quick shot at writing a script that fetches all MIG devices plus all regular devices not in MIG mode (see attachment). Maybe it would be worth incorporating something like this so that Ray can handle NVIDIA's latest hardware? Thanks in advance for considering.
gpu_uuid_fetching.txt (File uploaded as .txt file instead of .py due to GitHub restrictions)
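The attached script is not reproduced here, but the behavior it describes (collect MIG instance UUIDs, plus the UUIDs of GPUs not in MIG mode) can be sketched by parsing `nvidia-smi -L` output. This is an assumption about the attachment's approach, not the original code:

```python
import re

_UUID_RE = re.compile(r"\(UUID:\s*([^)\s]+)\)")

def visible_devices(listing: str) -> list:
    """From `nvidia-smi -L`-style output, return the UUIDs that should be
    treated as schedulable devices: every MIG instance, and every physical
    GPU that has no MIG instances listed under it."""
    devices = []
    pending_gpu = None  # GPU seen but not yet known to be in MIG mode
    for line in listing.splitlines():
        m = _UUID_RE.search(line)
        if not m:
            continue
        uuid = m.group(1)
        if line.lstrip().startswith("MIG"):
            # Parent GPU is in MIG mode: expose its instances, not the GPU.
            pending_gpu = None
            devices.append(uuid)
        else:
            if pending_gpu is not None:
                devices.append(pending_gpu)  # previous GPU had no MIG children
            pending_gpu = uuid
    if pending_gpu is not None:
        devices.append(pending_gpu)
    return devices
```

A detector like this could feed Ray's GPU resource count and let it set per-worker CUDA_VISIBLE_DEVICES by UUID rather than by index.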
Reproduction (REQUIRED)
Check the available GPU resources on a system with Ampere GPUs and MIG enabled. The number of GPU resources reflects the number of physical GPUs rather than the number of MIG devices.