Use PID for unique mutex name in /dev/shm #89

Open
wants to merge 6 commits into master

Conversation

@jglaser commented Jan 5, 2022

Fixes a race condition if the library is simultaneously invoked from multiple processes

fixes #88
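
For readers skimming the diff, the core idea is to make the shared-memory object name unique per process by appending the caller's PID. A minimal sketch of that naming scheme (the prefix and helper name are illustrative, not the identifiers actually used in rocm_smi_lib):

```cpp
// Sketch only: derive a per-process /dev/shm object name from the PID.
// kMutexPrefix and MakeMutexShmName are hypothetical names for illustration.
#include <cstdint>
#include <string>
#include <unistd.h>  // getpid()

static std::string MakeMutexShmName(uint32_t device_index) {
  const std::string kMutexPrefix = "/rocm_smi_device_mutex_";  // assumed prefix
  // Appending the PID gives each process its own object under /dev/shm,
  // so simultaneously launched processes no longer collide on one file.
  return kMutexPrefix + std::to_string(device_index) + "_" +
         std::to_string(static_cast<long>(getpid()));
}
```

As the discussion below points out, a per-PID name also means the object no longer serializes different processes, which is the trade-off debated in this thread.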

Elena Sakhnovitch and others added 2 commits October 4, 2021 15:04
signed-off-by: Elena Sakhnovitch
Change-Id: I265ba32bc3777db5f04f1924547fe432ba78c3d0
(cherry picked from commit 2f84906)
this fixes a race condition if the library is simultaneously invoked from multiple processes
@jglaser (Author) commented Jan 21, 2022

ping. Is anyone seeing this? Do you need more context?

@bill-shuzhou-liu (Contributor):

Thanks for looking into this. Although multiple clients can access rocm_smi_lib at the same time, some functions allow only one process to access them at a time. The shared memory file is used as a mutex to protect those functions.

With this proposed change, we may create multiple mutexes, which could then allow multiple processes to access those functions concurrently.

When you got this error, did you have multiple processes using rocm_smi_lib concurrently? One possibility is that process 1 acquired the mutex and then crashed. After that, process 2 cannot acquire the mutex any more. In that case, you may need to delete the shared memory file manually.
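
For context on the mechanism being described here, a process-shared, robust mutex living in a POSIX shared-memory object (the file that shows up under /dev/shm) is typically set up roughly as follows. This is a generic sketch with abbreviated error handling, not rocm_smi_lib's actual code:

```cpp
// Generic sketch: map one pthread mutex that several processes share.
// Names are illustrative; real code must also decide who runs the init block.
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

pthread_mutex_t* map_shared_mutex(const char* shm_name, bool initialize) {
  int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0666);
  if (fd < 0) return nullptr;
  if (ftruncate(fd, sizeof(pthread_mutex_t)) != 0) { close(fd); return nullptr; }
  void* addr = mmap(nullptr, sizeof(pthread_mutex_t),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (addr == MAP_FAILED) return nullptr;
  auto* mtx = static_cast<pthread_mutex_t*>(addr);
  if (initialize) {  // only the process that created the object should do this
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
  }
  return mtx;
}
```

Because every process maps the same object, the mutex genuinely serializes them across processes; a per-PID name gives each process a private mutex and loses that guarantee, which is the concern raised above.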

@jglaser (Author) commented Jan 24, 2022

Hi... RCCL uses rocm_smi under the hood
https://github.com/ROCmSoftwarePlatform/rccl/blob/4643a17f83900dd84676fc61ebf03be0d9584d68/src/misc/rocm_smi_wrap.cc#L37-L43

PyTorch uses RCCL for distributed training and instantiates multiple processes per node when there are multiple GPUs in a node. This leads to the race condition. It is not possible to manually delete the shared memory files, because the processes are launched simultaneously.
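
For what it's worth, the collision can be provoked without PyTorch at all. A minimal reproducer sketch, assuming only the public rsmi_init()/rsmi_shut_down() entry points and the usual rocm_smi/rocm_smi.h header, is to fork several processes that initialize the library at the same moment, which is effectively what a multi-GPU launcher does:

```cpp
// Sketch of a concurrent-initialization reproducer; link against librocm_smi64.
// Only the public API is assumed; the failure mode is the /dev/shm mutex setup.
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>
#include "rocm_smi/rocm_smi.h"

int main() {
  const int kProcs = 8;  // one per GPU, as a distributed launcher would spawn
  for (int i = 0; i < kProcs; ++i) {
    if (fork() == 0) {
      rsmi_status_t st = rsmi_init(0);  // all children race on the shm file here
      std::printf("child %d: rsmi_init -> %d\n", i, static_cast<int>(st));
      if (st == RSMI_STATUS_SUCCESS) rsmi_shut_down();
      _exit(st == RSMI_STATUS_SUCCESS ? 0 : 1);
    }
  }
  while (wait(nullptr) > 0) {}  // reap all children
  return 0;
}
```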

@jglaser (Author) commented Jan 24, 2022

Perhaps the way the mutex is set up is not thread (multi-process) safe to begin with?
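
To make that point concrete: a standard way to make the first-time setup safe across processes is to let exactly one process win an O_CREAT | O_EXCL race on the shared-memory object and have the others wait until it is fully initialized. The sketch below is generic; the ready flag is a hypothetical field rather than rocm_smi_lib's actual layout, and it assumes std::atomic<int> is lock-free and reads as 0 in freshly zero-filled shared memory, which holds on the Linux targets ROCm supports:

```cpp
// Generic sketch: race-free one-time initialization of a process-shared mutex.
#include <atomic>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct SharedBlock {
  pthread_mutex_t mutex;
  std::atomic<int> ready;  // stays 0 until the creator finishes initialization
};

SharedBlock* open_block(const char* name) {
  // Try to be the creator first; O_EXCL guarantees exactly one process succeeds.
  int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0666);
  const bool creator = (fd >= 0);
  if (!creator) fd = shm_open(name, O_RDWR, 0666);
  if (fd < 0) return nullptr;
  if (creator) {
    if (ftruncate(fd, sizeof(SharedBlock)) != 0) { close(fd); return nullptr; }
  } else {
    struct stat st{};
    while (fstat(fd, &st) == 0 && st.st_size < (off_t)sizeof(SharedBlock))
      usleep(1000);  // wait until the creator has sized the object
  }
  void* addr = mmap(nullptr, sizeof(SharedBlock),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (addr == MAP_FAILED) return nullptr;
  auto* blk = static_cast<SharedBlock*>(addr);
  if (creator) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&blk->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    blk->ready.store(1);  // publish: the mutex is now safe to use
  } else {
    while (blk->ready.load() == 0) usleep(1000);  // wait for the creator
  }
  return blk;
}
```

An alternative is to serialize the setup with a file lock (e.g. flock()) on the backing file; either way, the point is that only one process runs pthread_mutex_init() and every other process only sees a fully constructed mutex.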

@bill-shuzhou-liu (Contributor):

I see.

Thank you. I will try to reproduce it.

bill-shuzhou-liu and others added 3 commits January 26, 2022 09:36
Install LICENSE.txt to share/doc/smi-lib

Change-Id: Idcbb70db8808111203e8e4a4c3ab4d1e070ac79d
Add rpm License header for cpack

Change-Id: I2f4a89015b6389cfde801f41d4f6e0f59e7087aa
pop_back() was causing a seg fault when the pp_dpm_pcie file is empty and returns whitespace (see the sketch after this commit list).

Signed-off-by: Divya Shikre <[email protected]>
Change-Id: I888f1f79751cd456e43751a5b96d08560a039677
(cherry picked from commit ec71380)
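
Regarding the pop_back() commit above: calling pop_back() on an empty std::string is undefined behavior, so the usual fix is to trim only when the string is non-empty. A generic sketch of that pattern, not the library's actual code:

```cpp
// Generic sketch: strip trailing whitespace/newline from a sysfs read without
// calling pop_back() on an empty string, which is undefined behavior.
#include <string>

static void TrimTrailingWhitespace(std::string* s) {
  while (s && !s->empty() &&
         (s->back() == '\n' || s->back() == ' ' || s->back() == '\t')) {
    s->pop_back();
  }
}
```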
@jglaser (Author) commented Mar 29, 2022

Has there been any progress on this issue?

The problem is still present in ROCm 5.0.2 when launching PyTorch with 8 GPUs/node on OLCF Crusher.

15: pthread_mutex_unlock: Success
15: pthread_mutex_unlock: No such file or directory
15: pthread_mutex_timedlock() returned 131
15: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
15: pthread_mutex_timedlock() returned 131
15: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
15: pthread_mutex_timedlock() returned 131
15: Failed to initialize RSMI device mutex after 5 seconds. Previous execution may not have shutdown cleanly. To fix problem, stop all rocm_smi programs, and then delete the rocm_smi* shared memory files in /dev/shm.: Success
15: rsmi_init() failed
15: rsmi_init() failed
15: Traceback (most recent call last):
15:   File "../contact_pred/finetune_structure.py", line 374, in <module>
15:     main()

@bill-shuzhou-liu (Contributor):

The error returned by pthread_mutex_timedlock() is different from last time, which was 110:
pthread_mutex_timedlock() returned 131
110: ETIMEDOUT, another process/thread held the lock for more than 5 seconds and the call timed out
131: ENOTRECOVERABLE, state not recoverable

Based on the man page:

If a mutex is initialized with the PTHREAD_MUTEX_ROBUST attribute and its owner dies without unlocking it, ... ... If the next owner unlocks the mutex using pthread_mutex_unlock(3) before making it consistent, the mutex will be permanently unusable and any subsequent attempts to lock it using pthread_mutex_lock(3) will fail with the error ENOTRECOVERABLE.

Did we observe some process crash? Thanks.
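
For reference, the recovery path the man page describes looks roughly like this. It is a generic sketch of robust-mutex handling, not the library's actual code; the 5-second timeout mirrors the message in the log above:

```cpp
// Generic sketch: lock a robust mutex and recover if a previous owner died
// while holding it. Unlocking without pthread_mutex_consistent() is what turns
// later lock attempts into ENOTRECOVERABLE (131); a plain timeout is ETIMEDOUT (110).
#include <cerrno>
#include <ctime>
#include <pthread.h>

int lock_robust(pthread_mutex_t* mtx, int timeout_sec) {
  timespec deadline{};
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_sec += timeout_sec;
  int rc = pthread_mutex_timedlock(mtx, &deadline);
  if (rc == EOWNERDEAD) {
    // The previous owner crashed while holding the lock. We now own it, but we
    // must mark the state consistent before unlocking, otherwise the mutex
    // becomes permanently unusable.
    pthread_mutex_consistent(mtx);
    rc = 0;
  }
  return rc;  // 0 on success; ETIMEDOUT or ENOTRECOVERABLE on failure
}
```

In other words, if one rank died earlier while holding the lock and a later rank unlocked it without this step, the 131 errors could persist until the /dev/shm file is recreated, which would match the symptoms above.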

@gounley commented Jul 11, 2022

Has there been any progress on this issue? This is still occurring with ROCm 5.2.0.

Successfully merging this pull request may close these issues.

Initialization sometimes fails on multi-GPU nodes due to race condition