COMM_SPAWN is broken in 2.1.x HEAD #2030

Closed
jsquyres opened this issue Aug 30, 2016 · 46 comments

Comments

@jsquyres
Member

Recent Cisco and NVIDIA MTT results show that COMM_SPAWN and related dynamic operations are failing.

We need to find out whether it was also broken in v2.0.0, i.e., whether this is a regression. This will determine whether it's a v2.0.1 blocker or not (we'd really like to release the other important fixes in v2.0.1 ASAP and start working on the short timeline for v2.1.0, v2.0.2, etc.).

@sjeaugey is going to test on his tree and see if the recent v2.x commit(s) about bringing over PMIx 1.1.5 are the root of this particular problem.

@sjeaugey
Member

Strangely enough, IBM MTT shows the issue on master but not on v2.x.

Also, I looked at the 1.1.5rc commit and it is far from being a significant change. If this commit is the issue, it should be easy to fix.

@sjeaugey
Member

I'm currently unable to reproduce the problem on v2.x outside of MTT. Under MTT, though, it failed with both tcp and smcuda, and the error is a clear segmentation fault in orted (not the random pmix failure that occurs when there is a leftover socket in /tmp).

@jsquyres
Member Author

@sjeaugey Can you provide a backtrace?

@jsquyres
Member Author

@sjeaugey Also, can you force your MTT to use a tarball or git clone with that PMIx commit reverted?

@sjeaugey
Member

@jsquyres I still need to be able to reproduce the problem. I'm trying to launch MTT building only v2.x and running only the IBM dynamic/spawn tests, but maybe it only fails when spawn is launched in the middle of all the other tests.

@jsquyres
Member Author

@sjeaugey You might be able to run this indirectly via MTT. E.g. (I didn't look at your MTT results; I'm guessing it's the ibm section where you're seeing failures; fill in the ... below with your relevant .ini file and scratch directory):

$ client/mtt --file ... --scratch ... --verbose --mpi-get
$ client/mtt --file ... --scratch ... --verbose --mpi-install
$ client/mtt --file ... --scratch ... --verbose --test-get --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-build --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-run --section ibm-run-the-bad-test

I.e., make a test run section named ibm-run-the-bad-test that just runs the one test that is failing. That way you can run the test via MTT and hopefully be able to reproduce the error.

Make sense?

@sjeaugey
Member

@jsquyres Yes, that makes sense. Still, no luck: when I only run dynamic/spawn, the tests pass.

@rhc54 I'm starting to wonder if I was just unlucky and hit the pmix problem (when the socket is already in /tmp), except that in the case of spawns, instead of the usual "Error in file orted/pmix/pmix_server.c at line 254" message, I get a segmentation fault.

@sjeaugey
Member

I added /tmp cleanup before my nightly MTT run. We'll see tomorrow whether it solved the issue.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

@sjeaugey Looks like you ran clean last night - can you clarify exactly what you did? Did you just clean the tmp? Or did you also roll back the PMIx 1.1.5 commit?

@sjeaugey
Member

@rhc54 I wouldn't say that. I got rid of some failures (caused by leftover sockets in /tmp), but two failures remain: spawn and spawn_with_env_vars. And spawn_multiple timed out too.

I'm still unable to reproduce it outside of the nightly runs however. Working on it.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

How strange... the nightly summary didn't show those failures. Is there a core file you can look at? A line number where those daemons are crashing would really help.

@sjeaugey
Member

A new interesting fact: on my second node, the spawn processes are still there, stuck:

[sjeaugey@drossetti-ivy5 ~]$ ps -ef | grep spawn
sjeaugey 18626     1 99 Aug29 ?        2-04:35:50 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 22338     1  3 Aug30 ?        00:58:16 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 22447     1 99 Aug30 ?        1-04:08:22 dynamic/spawn_with_env_vars
sjeaugey 23013     1  3 06:02 ?        00:06:57 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 23064     1 99 06:02 ?        03:35:28 dynamic/spawn_with_env_vars
sjeaugey 23122     1 99 06:02 ?        03:35:31 dynamic/spawn_with_env_vars

I tried to gstack them. That killed some of the processes, but some survived and gave me a stack trace:

[sjeaugey@drossetti-ivy5 ~]$ gstack 18626
Thread 3 (Thread 0x7fbb3b437700 (LWP 18628)):
#0  0x00007fbb3d048173 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fbb3c9ea998 in epoll_dispatch () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#2  0x00007fbb3c9eddfe in opal_libevent2022_event_base_loop () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#3  0x00007fbb3b4650bd in progress_engine () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#4  0x00007fbb3d2fa9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fbb3d047b7d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fbb3a436700 (LWP 18631)):
#0  0x00007fbb3d00bced in nanosleep () from /lib64/libc.so.6
#1  0x00007fbb3d040e64 in usleep () from /lib64/libc.so.6
#2  0x00007fbb3b46d715 in OPAL_PMIX_PMIX112_PMIx_Abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#3  0x00007fbb3b44202a in pmix1_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#4  0x00007fbb3b6b7bf0 in rte_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_ess_pmi.so
#5  0x00007fbb3d608157 in ompi_rte_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#6  0x00007fbb3d566210 in ompi_mpi_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#7  0x00007fbb3d54ce35 in ompi_errhandler_runtime_callback () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#8  0x00007fbb3cce7bdd in orte_errmgr_base_execute_error_callbacks () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-rte.so.20
#9  0x00007fbb38a23972 in proc_errors () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_errmgr_default_app.so
#10 0x00007fbb3c9ee851 in opal_libevent2022_event_base_loop () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#11 0x00007fbb3c995b22 in progress_engine () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#12 0x00007fbb3d2fa9d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fbb3d047b7d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fbb3da91700 (LWP 18626)):
#0  0x00007fbb30c5cf61 in sm_fifo_read () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_btl_smcuda.so
#1  0x00007fbb30c5f111 in mca_btl_smcuda_component_progress () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_btl_smcuda.so
#2  0x00007fbb3c9903ae in opal_progress () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#3  0x00007fbb3d564c46 in ompi_request_wait_completion () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#4  0x00007fbb3d564c80 in ompi_request_default_wait () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#5  0x00007fbb3d5e059b in ompi_coll_base_bcast_intra_generic () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#6  0x00007fbb3d5e0c7f in ompi_coll_base_bcast_intra_binomial () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#7  0x00007fbb2aa05238 in ompi_coll_tuned_bcast_intra_dec_fixed () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_coll_tuned.so
#8  0x00007fbb3d543706 in ompi_comm_allreduce_intra_pmix () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#9  0x00007fbb3d540e13 in ompi_comm_nextcid () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#10 0x00007fbb3d5492bd in ompi_dpm_connect_accept () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#11 0x00007fbb3d54b793 in ompi_dpm_dyn_init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#12 0x00007fbb3d567174 in ompi_mpi_init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#13 0x00007fbb3d59a00c in PMPI_Init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#14 0x0000000000401652 in main ()

I'll kill those remaining processes eventually (they could be the cause of the failures), but if you want me to look at something in particular, please tell me.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

Yeah, I can believe they would be stuck: they are trying to Abort and the daemon is gone, so there is nobody they can tell. I believe we have solved that since the 1.1 series, but I'll check to make sure.

The real question is: why are the orteds crashing? Any chance of a line number from a core file there?

@sjeaugey
Member

Still trying to get a core ... worst case, I'll set ulimit in my nightly script.

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

@rhc54 I got a core file. Here is the orted backtrace:

Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(cuda-gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffedb1c9bcf in mca_oob_usock_send_handler (sd=37, flags=4, cbdata=0x1ba00b0) at oob_usock_sendrecv.c:239
#2  0x00007ffedeb8a406 in event_persist_closure (ev=<optimized out>, base=0x1adc580) at event.c:1321
#3  event_process_active_single_queue (activeq=0x1adcaf0, base=0x1adc580) at event.c:1365
#4  event_process_active (base=<optimized out>) at event.c:1440
#5  opal_libevent2022_event_base_loop (base=0x1adc580, flags=1) at event.c:1644
#6  0x00007ffedee6f2b6 in orte_daemon (argc=33, argv=0x7fffb6fca138) at orted/orted_main.c:848
#7  0x000000000040093e in main (argc=33, argv=0x7fffb6fca138) at orted.c:60

@rhc54
Contributor

rhc54 commented Sep 1, 2016

can you print the value of "msg" and, if not NULL, the contents (*msg)?

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

I followed the whole code path at line 239 (a big macro calling a lot of other macros), but everything seems normal.

Even looking at the assembly, the line where it crashed is:

   0x00007ffedb1c9bc7 <+3735>:  mov    0x58(%rsi),%edi
   0x00007ffedb1c9bca <+3738>:  mov    %r9,%rsi
   0x00007ffedb1c9bcd <+3741>:  callq  *%rax
=> 0x00007ffedb1c9bcf <+3743>:  mov    -0x30(%rbp),%rax
   0x00007ffedb1c9bd3 <+3747>:  mov    0x68(%rax),%rax
   0x00007ffedb1c9bd7 <+3751>:  mov    0x8(%rax),%rax
   0x00007ffedb1c9bdb <+3755>:  test   %rax,%rax

%rbp is equal to 0x7fffb6fc9cb0, and the value of -0x30(%rbp) is accessible and equal to 0x30.

I don't get it.

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

For reference, the contents of msg and msg->msg:

 (cuda-gdb) p *msg
$6 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffedb3cf9c0 <mca_oob_usock_send_t_class>, obj_reference_count = 1, 
      cls_init_file_name = 0x7ffedb1cc189 <__PRETTY_FUNCTION__.4085+361> "oob_usock.c", cls_init_lineno = 314}, opal_list_next = 0x0, opal_list_prev = 0x0, item_free = 1, 
    opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, hdr = {origin = {jobid = 2495021056, vpid = 0}, dst = {jobid = 2495021058, vpid = 0}, type = MCA_OOB_USOCK_USER, 
    tag = 37, nbytes = 0}, msg = 0x1b8d660, data = 0x0, hdr_sent = true, iovnum = 1, sdptr = 0x0, sdbytes = 0}
(cuda-gdb) p *msg->msg
$7 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffedf0f7e80 <orte_rml_send_t_class>, obj_reference_count = 1, 
      cls_init_file_name = 0x7ffedafbc48a <__PRETTY_FUNCTION__.12041+554> "oob_tcp_sendrecv.c", cls_init_lineno = 566}, opal_list_next = 0x0, opal_list_prev = 0x0, item_free = 1, 
    opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, dst = {jobid = 2495021058, vpid = 0}, origin = {jobid = 2495021056, vpid = 0}, status = 0, tag = 37, cbfunc = {
    iov = 0x0, buffer = 0x0}, cbdata = 0x0, iov = 0x0, count = 0, buffer = 0x0, data = 0x0}

@rhc54
Contributor

rhc54 commented Sep 1, 2016

I see the issue - it was fixed on master. Can you try this patch?

diff --git a/orte/mca/rml/base/base.h b/orte/mca/rml/base/base.h
index 66d1c63..6b29d07 100644
--- a/orte/mca/rml/base/base.h
+++ b/orte/mca/rml/base/base.h
@@ -229,7 +229,7 @@ OBJ_CLASS_DECLARATION(orte_rml_recv_request_t);
                                 (m)->iov, (m)->count,                   \
                                 (m)->tag, (m)->cbdata);                 \
             }                                                           \
-        } else {                                                        \
+        } else if (NULL != (m)->cbfunc.buffer) {                        \
             /* non-blocking buffer send */                              \
             (m)->cbfunc.buffer((m)->status, &((m)->origin),             \
                                (m)->buffer,                             \
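
For context on why that one-line guard matters: in the gdb output above, msg->msg->cbfunc has both iov and buffer equal to 0x0, i.e. this was a blocking send with no completion callback registered, so the macro's unconditional "else" branch ended up calling a NULL function pointer, which matches frame #0 sitting at 0x0000000000000000 in the orted backtrace. Below is a minimal, self-contained sketch of that pattern; it is not the actual RML macro or ORTE types, and all names in it are illustrative only.

/* Sketch (illustrative names, not ORTE code) of the completion pattern the
 * patch fixes: a send object carries either an iovec callback or a buffer
 * callback, and a blocking send carries neither, so the "else" branch must
 * check for NULL before invoking the callback. */
#include <stdio.h>
#include <stddef.h>

typedef void (*iov_cbfunc_t)(int status, void *cbdata);
typedef void (*buffer_cbfunc_t)(int status, void *cbdata);

typedef struct {
    int status;
    union {
        iov_cbfunc_t iov;
        buffer_cbfunc_t buffer;
    } cbfunc;
    int is_iov_send;   /* stand-in for the macro's test of (m)->iov */
    void *cbdata;
} send_t;

static void complete_send(send_t *m)
{
    if (m->is_iov_send) {
        if (NULL != m->cbfunc.iov) {          /* iovec path: already guarded */
            m->cbfunc.iov(m->status, m->cbdata);
        }
    } else if (NULL != m->cbfunc.buffer) {    /* the guard added by the patch */
        m->cbfunc.buffer(m->status, m->cbdata);
    }
    /* With a plain "else" and cbfunc.buffer == NULL (blocking send), the call
     * above would jump to address 0 and the daemon would segfault. */
}

int main(void)
{
    send_t blocking_send = { .status = 0, .cbfunc = { .buffer = NULL },
                             .is_iov_send = 0, .cbdata = NULL };
    complete_send(&blocking_send);   /* safe with the guard; a crash without it */
    printf("completed without invoking a NULL callback\n");
    return 0;
}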

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

Ahah... I was reading the code on master instead of v2.x. Of course I didn't see the issue. Thanks!

As for the patch: since I still haven't figured out how to reproduce the issue, I cannot really test it.

@rhc54
Contributor

rhc54 commented Sep 1, 2016

Okay, I have filed a PR. If you can "review" it, then we can see if it passes MTT.

@jsquyres
Member Author

jsquyres commented Sep 1, 2016

fixed in open-mpi/ompi-release#1359

@jsquyres jsquyres closed this as completed Sep 1, 2016
@sjeaugey
Member

sjeaugey commented Sep 1, 2016

@jsquyres please wait until tomorrow/Friday to close the bug. I couldn't verify that it actually fixed the bug (because I cannot reproduce the issue); hence the need to push the fix to v2.x and see if the MTT results improve.

@sjeaugey sjeaugey reopened this Sep 1, 2016
@jsquyres
Member Author

jsquyres commented Sep 1, 2016

Ok.

@sjeaugey
Member

sjeaugey commented Sep 2, 2016

I don't have any other output. Please also note that the Esslingen MTT is showing the exact same list of timed-out tests (c_reqops, intercomm_create, spawn, spawn_with_env_vars, spawn_multiple).

@hppritcha hppritcha modified the milestones: v2.0.2, v2.0.1 Sep 4, 2016
@hppritcha
Member

Moving to 2.0.2 since spawn appears to still be broken on 2.x.

@sjeaugey
Member

sjeaugey commented Sep 6, 2016

@jsquyres @rhc54 do you know who is running MTT at Esslingen? It would be useful to check whether they can manually reproduce the issue (since it also appears in their v2.x MTT).

@adrianreber
Member

I am running it. I will try to reproduce it outside of MTT.

@rhc54
Contributor

rhc54 commented Sep 6, 2016

I was able to get no-disconnect to hang when run on more than one node, so I can make that happen at will. I still cannot get any other dynamic test in that suite to fail or hang.

no-disconnect does not hang on master, so this appears to be something specific to v2.x. However, I am getting an error message on master from the TCP btl during finalize at the very end of the test:

 Warning :: opal_list_remove_item - the item 0xd142e0 is not on the list 0x7f99e64ea8e8 

Adding an assert in that spot generates the following stacktrace:

#0  0x00007f99f11d75f7 in raise () from /usr/lib64/libc.so.6
#1  0x00007f99f11d8e28 in abort () from /usr/lib64/libc.so.6
#2  0x00007f99f11d0566 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007f99f11d0612 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007f99e62da3b7 in opal_list_remove_item (list=0x7f99e64ea8e8 <mca_btl_tcp_component+552>, item=0xd142e0) at ../../../../opal/class/opal_list.h:492
#5  0x00007f99e62da892 in mca_btl_tcp_event_destruct (event=0xd142e0) at btl_tcp_component.c:199
#6  0x00007f99e62da19c in opal_obj_run_destructors (object=0xd142e0) at ../../../../opal/class/opal_object.h:462
#7  0x00007f99e62db982 in mca_btl_tcp_component_close () at btl_tcp_component.c:460
#8  0x00007f99f0bd3331 in mca_base_component_close (component=0x7f99e64ea6c0 <mca_btl_tcp_component>, output_id=-1) at mca_base_components_close.c:53
#9  0x00007f99f0bd33f1 in mca_base_components_close (output_id=-1, components=0x7f99f0e9c730 <opal_btl_base_framework+80>, skip=0x0)
    at mca_base_components_close.c:85
#10 0x00007f99f0bd3398 in mca_base_framework_components_close (framework=0x7f99f0e9c6e0 <opal_btl_base_framework>, skip=0x0)
    at mca_base_components_close.c:65
#11 0x00007f99f0bf7b39 in mca_btl_base_close () at base/btl_base_frame.c:203
#12 0x00007f99f0be1a83 in mca_base_framework_close (framework=0x7f99f0e9c6e0 <opal_btl_base_framework>) at mca_base_framework.c:214
#13 0x00007f99f1862f0e in mca_bml_base_close () at base/bml_base_frame.c:130
#14 0x00007f99f0be1a83 in mca_base_framework_close (framework=0x7f99f1b119a0 <ompi_bml_base_framework>) at mca_base_framework.c:214
#15 0x00007f99f17decc8 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:444
#16 0x00007f99f180e0ad in PMPI_Finalize () at pfinalize.c:47
#17 0x00000000004013c9 in main (argc=1, argv=0x7ffc02501568) at no-disconnect.c:130

So it could be that we have a race condition in finalize that is causing the problem.
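
One way this kind of warning arises (a generic illustration, not the actual opal_list or btl/tcp code; all names below are made up) is an event item being unlinked from the pending list twice, e.g. once when the event completes and once again from its destructor. With a debug-style membership check the second unlink is caught and reported; without it, the second unlink rewrites the prev/next pointers of nodes that no longer own the item, which is exactly the kind of corruption that can surface later as an assert or crash in finalize.

/* Generic sketch of the double-removal failure mode (illustrative code,
 * not OPAL's list implementation). */
#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct item {
    struct item *prev, *next;
} item_t;

typedef struct {
    item_t head;                    /* sentinel node of a circular list */
} list_t;

static void list_init(list_t *l) { l->head.prev = l->head.next = &l->head; }

static void list_append(list_t *l, item_t *it)
{
    it->prev = l->head.prev;
    it->next = &l->head;
    l->head.prev->next = it;
    l->head.prev = it;
}

static bool list_remove(list_t *l, item_t *it)
{
    /* Membership check analogous to the debug-build warning: an item that was
     * already unlinked has NULL prev/next and must not be unlinked again. */
    if (NULL == it->prev || NULL == it->next) {
        fprintf(stderr, "warning: item %p is not on the list %p\n",
                (void *)it, (void *)l);
        return false;
    }
    it->prev->next = it->next;
    it->next->prev = it->prev;
    it->prev = it->next = NULL;     /* mark as removed */
    return true;
}

int main(void)
{
    list_t pending;
    item_t ev;
    list_init(&pending);
    list_append(&pending, &ev);

    list_remove(&pending, &ev);     /* first removal, e.g. on event completion */
    list_remove(&pending, &ev);     /* second removal, e.g. in the destructor:
                                       caught here, corruption without the check */
    return 0;
}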

@hppritcha
Member

I was not able to reproduce this problem using uGNI BTL.

hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 13, 2016
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References open-mpi#2030

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres
Member Author

Per the call today:

For the TCP segv, @hjelmn thinks that the items are being removed from the list twice (e.g., in the destructor) in debug builds. He just filed #2077 to fix this, which should fix the segv.

But there's also a hang issue that has not yet been fully diagnosed. Suggestion:

  1. Let's finish the process of:
    1. Make the 2.0.x branch
    2. Merge in a bunch of v2.1 PRs into the 2.x branch
    3. Bring the release branches back from ompi-release to ompi
  2. Then merge over PMIx 2.0 to the 2.x branch
    1. Note that PMIx 2.0 isn't hugely different from what is already there
    2. But PMIx 2.0 does have some interface changes which will cause changes in ORTE PMIx server infrastructure. That will take a little time to back port properly -- @rhc54 doesn't have the cycles in the immediate future to do this.
  3. See if PMIx 2.0 magically fixes this problem. If it doesn't, continue diagnosing from there.
  4. More specifically:
    1. We should probably leave this problem alone in v2.0.x, unless the problem escalates to real users, etc.
    2. For v2.1.x, this problem probably needs to be fixed.

bosilca pushed a commit to bosilca/ompi that referenced this issue Sep 15, 2016
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References open-mpi#2030

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.2 Sep 22, 2016
@jsquyres
Member Author

Update on where we are on this issue:

@hppritcha
Member

This may not be fixed in 2.1.0 if we go with the external pmix2 solution.

@jsquyres jsquyres changed the title from "COMM_SPAWN is broken in 2.0.x HEAD" to "COMM_SPAWN is broken in 2.1.x HEAD" Oct 17, 2016
@rhc54
Contributor

rhc54 commented Oct 18, 2016

@hjelmn I just checked the v2.0.x branch again, and no-disconnect now hangs in the BTL finalize for the TCP BTL:

424       while (lock->u.lock == OPAL_ATOMIC_LOCKED) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libibverbs-1.1.8-8.el7.x86_64 libnl3-3.2.21-10.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 systemd-libs-219-19.el7_2.13.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0  0x00007f7865817a1d in opal_atomic_lock (lock=0x7f7865a268c4 <mca_btl_tcp_component+548>) at ../../../../opal/include/opal/sys/atomic_impl.h:424
#1  0x00007f7865817c5d in opal_mutex_atomic_lock (m=0x7f7865a26860 <mca_btl_tcp_component+448>) at ../../../../opal/threads/mutex_unix.h:183
#2  0x00007f786581813d in mca_btl_tcp_event_destruct (event=0x1b73180) at btl_tcp_component.c:194
#3  0x00007f7865817b99 in opal_obj_run_destructors (object=0x1b73180) at ../../../../opal/class/opal_object.h:460
#4  0x00007f7865818ee7 in mca_btl_tcp_component_close () at btl_tcp_component.c:418
#5  0x00007f786fce4ba9 in mca_base_component_close (component=0x7f7865a266a0 <mca_btl_tcp_component>, output_id=-1) at mca_base_components_close.c:53
#6  0x00007f786fce4c69 in mca_base_components_close (output_id=-1, components=0x7f786ffa4250 <opal_btl_base_framework+80>, skip=0x0)
    at mca_base_components_close.c:85
#7  0x00007f786fce4c10 in mca_base_framework_components_close (framework=0x7f786ffa4200 <opal_btl_base_framework>, skip=0x0) at mca_base_components_close.c:65
#8  0x00007f786fd07d64 in mca_btl_base_close () at base/btl_base_frame.c:158
#9  0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f786ffa4200 <opal_btl_base_framework>) at mca_base_framework.c:214
#10 0x00007f7870920bd2 in mca_bml_base_close () at base/bml_base_frame.c:130
#11 0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f7870bbc4c0 <ompi_bml_base_framework>) at mca_base_framework.c:214
#12 0x00007f78708b6161 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:419
#13 0x00007f78708dda05 in PMPI_Finalize () at pfinalize.c:45
#14 0x00000000004013c9 in main (argc=1, argv=0x7ffffe18d1b8) at no-disconnect.c:130

I can't get any other dynamic test to fail - any ideas why the TCP BTL is locking up?
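
One plausible reading of that trace (an assumption, not a confirmed diagnosis): the component close path already holds the TCP component lock while it destructs pending events, and the event destructor tries to take the same non-recursive lock again, so the thread spins forever in "while (lock->u.lock == OPAL_ATOMIC_LOCKED)". Below is a minimal, self-contained sketch of that self-deadlock pattern; it uses plain C11 atomics rather than the OPAL lock API, all names are illustrative, and running it spins forever at the second acquire, mirroring the hang.

/* Sketch of a self-deadlock on a non-recursive spin lock (illustrative,
 * not the btl/tcp code). */
#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    int expected = 0;
    /* Spins while the lock is held -- the same shape as the
     * "while (lock->u.lock == OPAL_ATOMIC_LOCKED)" loop in the backtrace. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;
    }
}

static void spin_unlock(spinlock_t *l) { atomic_store(&l->locked, 0); }

static spinlock_t component_lock;

static void event_destruct(void)
{
    spin_lock(&component_lock);     /* second acquire by the same thread: hangs */
    /* ... unlink the event from the component's pending list ... */
    spin_unlock(&component_lock);
}

int main(void)
{
    spin_lock(&component_lock);     /* cleanup path takes the lock ... */
    event_destruct();               /* ... then destructs events that retake it */
    spin_unlock(&component_lock);
    printf("not reached\n");
    return 0;
}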

@hjelmn
Member

hjelmn commented Oct 18, 2016

I thought I fixed this in 2.0.x. There was an easily identifiable bug in btl/tcp. Let me see if the commit came over.

@hjelmn
Member

hjelmn commented Oct 18, 2016

Bring this over.

a681837

@rhc54
Contributor

rhc54 commented Oct 18, 2016

will do - will report back later. Thx!

@hjelmn
Member

hjelmn commented Oct 18, 2016

np

@rhc54
Contributor

rhc54 commented Oct 19, 2016

Okay, we have 2.0.x fixed, but not 2.x - sorry for the confusion. On 2.x, we are getting a warning about removing an item that is no longer on a list:

 Warning :: opal_list_remove_item - the item 0xf30630 is not on the list 0x7fbb6497c8c8 

This comes at the end of no-disconnect, and is likely again from the TCP btl. I'll try to find out where.

@hjelmn
Member

hjelmn commented Oct 19, 2016

Same bug, different symptom.

@rhc54
Contributor

rhc54 commented Oct 19, 2016

Except that the patch you suggested is already there.


@jsquyres
Member Author

@rhc54 Sorry; I still saw COMM_SPAWN failures over the weekend in the v2.0.x branch. ☹️

@jsquyres
Member Author

This has been fixed.
