occasional deadlock on mpi_barrier; what to do? #12746

gregfi opened this issue Aug 12, 2024 · 0 comments

gregfi commented Aug 12, 2024

My application sporadically deadlocks on mpi_barrier calls. It seems to happen when the network is under very heavy load and/or the machines are being oversubscribed. (I don't have any control over that.) The application runs OpenMPI 4.1.4 on SuSE Linux 12. My admins attached a debugger and printed backtraces for all the running processes; the result is

PID 62880:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002aead70f655d in poll () from /lib64/libc.so.6 
#0  0x00002aead70f655d in poll () from /lib64/libc.so.6
#1  0x00002aeae04d504e in poll_dispatch (base=0x2fd2ba0, tv=0x12) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2  0x00002aeae04c9881 in opal_libevent2022_event_base_loop (base=0x2fd2ba0, flags=18) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3  0x00002aeae047254e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4  0x00002aeaf0818d74 in mca_pml_ob1_send () from /tools/openmpi/4.1.4/lib/openmpi/mca_pml_ob1.so
#5  0x00002aead6b51f9d in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6  0x00002aead6b03e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7  0x00002aead6877f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8  0x00002aead6400be2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9  0x00000000005bb177 in (same location in application code)

All other PIDs:

Using host libthread_db library "/lib64/libthread_db.so.1".
0x00002afe6e6c355d in poll () from /lib64/libc.so.6 
#0  0x00002afe6e6c355d in poll () from /lib64/libc.so.6
#1  0x00002afe77aa204e in poll_dispatch (base=0x119bbe0, tv=0x9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/poll.c:165
#2  0x00002afe77a96881 in opal_libevent2022_event_base_loop (base=0x119bbe0, flags=9) at ../../../../.././openmpi-4.1.4/opal/mca/event/libevent2022/libevent/event.c:1630
#3  0x00002afe77a3f54e in opal_progress () from /tools/openmpi/4.1.4/lib/libopen-pal.so.40
#4  0x00002afe6e0ba42b in ompi_request_default_wait () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#5  0x00002afe6e11ef0e in ompi_coll_base_barrier_intra_recursivedoubling () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#6  0x00002afe6e0d0e11 in PMPI_Barrier () from /tools/openmpi/4.1.4/lib/libmpi.so.40
#7  0x00002afe6de44f43 in pmpi_barrier__ () from /tools/openmpi/4.1.4/lib/libmpi_mpifh.so.40
#8  0x00002afe6d9cdbe2 in mpi_barrier_f08_ () from /tools/openmpi/4.1.4/lib/libmpi_usempif08.so.40
#9  0x00000000005bb177 in (same location in application code)

What I see: PID 62880 is stuck in mca_pml_ob1_send, while all the other ranks are stuck in ompi_request_default_wait, both inside the recursive-doubling barrier.
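For reference, here is my rough mental model of the recursive-doubling pattern as a simplified C sketch (my own illustration, assuming a power-of-two communicator size; it is not OpenMPI's actual ompi_coll_base_barrier_intra_recursivedoubling). At each round every rank exchanges a zero-byte message with a partner at distance 2^round, so a single rank stalled in a send leaves its partner stalled in a wait, and the stall propagates:

/* Simplified sketch of a recursive-doubling barrier, assuming a
 * power-of-two communicator size.  Not OpenMPI's implementation;
 * just an illustration of the communication pattern in the traces. */
#include <mpi.h>

static int barrier_recursive_doubling(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* One exchange per round with the partner at distance 2^round. */
    for (int distance = 1; distance < size; distance <<= 1) {
        int partner = rank ^ distance;
        int err = MPI_Sendrecv(NULL, 0, MPI_BYTE, partner, 0,
                               NULL, 0, MPI_BYTE, partner, 0,
                               comm, MPI_STATUS_IGNORE);
        if (err != MPI_SUCCESS)
            return err;
    }
    return MPI_SUCCESS;
}

That roughly corresponds to the traces: frame #4 is the send side on PID 62880 and the request wait on everyone else, all within the same barrier round.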

This problem only occurs sporadically. I chewed through ~10,000 core-hours this weekend trying to reproduce the issue and failed, likely because the system was less loaded over the weekend. The jobs are run with -map-by socket --bind-to socket --rank-by core --mca btl_tcp_if_include 10.216.0.0/16 in order to force all traffic over a single interface.

Also, puzzlingly, I see the following printed to stderr:

[hpap14n4:08897] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08911] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08913] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)
[hpap14n4:08929] mca_base_component_repository_open: unable to open mca_fs_lustre: liblustreapi.so: cannot open shared object file: No such file or directory (ignored)

This is odd because liblustreapi.so should resolve reliably to /usr/lib64/liblustreapi.so, which is installed locally on each machine (so no funny business with network mappings).
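As a sanity check I'm planning to run a minimal dlopen test on a compute node, since (as I understand it) the component loader ultimately resolves the library via dlopen the same way. This is just my own diagnostic sketch, not OpenMPI code:

/* Minimal check of whether liblustreapi.so resolves in the job's
 * environment.  This only mimics the dynamic-load step that
 * mca_base_component_repository_open reports failing. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("liblustreapi.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    puts("liblustreapi.so resolved");
    dlclose(handle);
    return 0;
}

Compile with cc dlopen_check.c -ldl and run it in the same environment as the job; if only a versioned liblustreapi.so.1 is installed and the unversioned symlink is missing, the open fails with the same "cannot open shared object file" message.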

Does anyone have any guesses as to what might be going on, or how I might mitigate these kinds of failures?
