Initialization sometimes fails on multi-GPU nodes due to race condition #88

When using pytorch with the NCCL/RCCL backend on a system with eight GPUs/node, I get initialization failures of the following kind:

The reason is that rocm_smi_lib creates a mutex in /dev/shm whose name is independent of the process id, which creates a race condition.
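To make the failure mode concrete, here is a minimal C sketch of the pattern the report describes: a process-shared pthread mutex placed in a POSIX shared-memory object (which appears under /dev/shm) with a fixed, pid-independent name, so every process on the node attaches to the same mutex. This is not rocm_smi_lib's actual code; the object name and all error handling are invented for illustration.

```c
/* Sketch only: a fixed-name, process-shared mutex in POSIX shared memory.
 * Build with -lpthread (and -lrt on older glibc). */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Fixed name: every process on the node opens the SAME object
     * under /dev/shm, no matter what its pid is. */
    const char *name = "/demo_rsmi_mutex";          /* invented name */
    int created = 1;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0666);
    if (fd < 0) {                                   /* object already exists */
        created = 0;
        fd = shm_open(name, O_RDWR, 0666);
    }
    if (fd < 0) { perror("shm_open"); return 1; }
    if (created && ftruncate(fd, sizeof(pthread_mutex_t)) != 0) {
        perror("ftruncate"); return 1;
    }
    pthread_mutex_t *mtx = mmap(NULL, sizeof(pthread_mutex_t),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mtx == MAP_FAILED) { perror("mmap"); return 1; }

    if (created) {
        /* Only the creating process initializes the mutex. */
        pthread_mutexattr_t at;
        pthread_mutexattr_init(&at);
        pthread_mutexattr_setpshared(&at, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(mtx, &at);
    }

    pthread_mutex_lock(mtx);
    /* ... the work a library init routine would serialize ... */
    pthread_mutex_unlock(mtx);
    return 0;
}
```

Even with O_EXCL to identify the creator, there is a window in which a second process can map the object before the creator has finished sizing and initializing the mutex, which illustrates why a fixed, pid-independent name is fragile when many processes initialize simultaneously.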
Comments
Some functions in rocm_smi_lib allow only one process to access them at a time, and rocm_smi_lib uses an inter-process mutex to protect them. If another process is using such a function, the caller uses pthread_mutex_timedlock() to wait up to 5 seconds; error code 110 means the wait timed out. If the other process still has not released the mutex after 5 seconds, rsmi_init() fails with the errors above. Is the number in the error message a process id (e.g. 347 in the example below)? How many processes are trying to call rsmi_init(0) at the same time, and how many of them succeed? If you attach gdb to a successful process, is it blocked in some rocm_smi_lib function? Thanks.
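For reference, a minimal sketch of the timed-lock step described here, assuming a process-shared mutex obtained as in the sketch above. On Linux, ETIMEDOUT is errno 110, which matches the error code mentioned:

```c
#include <errno.h>    /* ETIMEDOUT (110 on Linux) */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Try to take an inter-process mutex, giving up after 5 seconds. */
static int lock_with_timeout(pthread_mutex_t *mtx) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline); /* timedlock expects CLOCK_REALTIME */
    deadline.tv_sec += 5;                     /* the 5-second wait described above */
    int rc = pthread_mutex_timedlock(mtx, &deadline);
    if (rc == ETIMEDOUT)
        fprintf(stderr, "mutex still held after 5 s (error %d)\n", rc);
    return rc;   /* 0 on success, ETIMEDOUT on timeout */
}
```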
Bill, could you specify which function requires the mutex? Yes, the number in front of the ":" is the global process rank. Eight processes per node are calling the rsmi_init(0) function concurrently. I'll have to attach the debugger and will let you know.
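As an aside, the contention described here can be imitated without pytorch. The hedged sketch below forks N processes that all call rsmi_init(0) at once; the harness itself is invented, while rsmi_init, rsmi_shut_down, and RSMI_STATUS_SUCCESS are documented rocm_smi_lib symbols (link with -lrocm_smi64):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include "rocm_smi/rocm_smi.h"

int main(int argc, char **argv) {
    int nproc = (argc > 1) ? atoi(argv[1]) : 8; /* 8 ranks/node, as in the report */
    for (int i = 0; i < nproc; ++i) {
        if (fork() == 0) {
            /* Every child races through library initialization. */
            rsmi_status_t st = rsmi_init(0);
            if (st != RSMI_STATUS_SUCCESS) {
                fprintf(stderr, "rank %d: rsmi_init failed (status %d)\n",
                        i, (int)st);
                _exit(1);
            }
            rsmi_shut_down();
            _exit(0);
        }
    }
    int status, failures = 0;
    while (wait(&status) > 0)
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            ++failures;
    printf("%d of %d processes failed to initialize\n", failures, nproc);
    return failures != 0;
}
```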
A lot of rocm_smi functions require the mutex; you can find most of them in the unit tests. I tried to reproduce this issue on my machine (I only have 1 GPU) with 1000 processes and had no luck. You said "beyond ~16 nodes there is always a high probability": do you mean you have 16 computers, each with 8 GPUs? Thanks.
I am also encountering the same issue. The line number points to static rsmi_status_t status = rsmi_init(0); I use the 4 ROCm APIs below for monitoring, and when I added the 4th one it started giving me this error. Below is the error message:
Yes, eight GPUs per node; see here: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html
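On the comment above that fails at static rsmi_status_t status = rsmi_init(0); here is a small sketch of checking that status and turning it into readable text with rsmi_status_string, which is a documented rocm_smi_lib call. The surrounding program is invented, and the four monitoring calls from that comment are deliberately not reproduced here:

```c
#include <stdio.h>
#include "rocm_smi/rocm_smi.h"

int main(void) {
    rsmi_status_t status = rsmi_init(0);
    if (status != RSMI_STATUS_SUCCESS) {
        /* rsmi_status_string maps an rsmi_status_t to a description. */
        const char *msg = NULL;
        rsmi_status_string(status, &msg);
        fprintf(stderr, "rsmi_init failed: %s (status %d)\n",
                msg ? msg : "unknown", (int)status);
        return 1;
    }
    /* ... per-device monitoring calls would go here ... */
    rsmi_shut_down();
    return 0;
}
```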
Jens Glaser, thanks for the fix. We will pull this in shortly.