COMM_SPAWN is broken in 2.1.x HEAD #2030

Closed
jsquyres opened this issue Aug 30, 2016 · 46 comments

Comments

@jsquyres
Member

Recent Cisco and NVIDIA MTT results show that COMM_SPAWN and related dynamic operations are failing.

We need to find out whether it was also broken in v2.0.0, i.e., whether this is a regression. This will determine whether it's a v2.0.1 blocker or not (we'd really like to release the other important fixes in v2.0.1 ASAP and start working on the short timeline for v2.1.0, v2.0.2, etc.).

@sjeaugey is going to test on his tree and see if the recent v2.x commit(s) about bringing over PMIx 1.1.5 are the root of this particular problem.

@sjeaugey
Member

Strangely enough, IBM MTT shows the issue on master but not on v2.x.

Also, I looked at the 1.1.5rc commit and it is far from being a significant change. If this commit is the issue, it should be easy to fix.

@sjeaugey
Member

I'm currently unable to reproduce the problem on v2.x outside of MTT. Under MTT, though, it failed with both tcp and smcuda, and the error is a clear segmentation fault in orted (not the random pmix failure that occurs when there is a leftover socket in /tmp).

@jsquyres
Member Author

@sjeaugey Can you provide a backtrace?

@jsquyres
Member Author

@sjeaugey Also, can you force your MTT to use a tarball or git clone with that PMIx commit reverted?

@sjeaugey
Member

@jsquyres I still need to be able to reproduce the problem. I'm trying to launch MTT building only v2.x and running only the IBM dynamic/spawn tests, but maybe it only fails when spawn is launched in the middle of all the other tests.

@jsquyres
Member Author

@sjeaugey You might be able to run this indirectly via MTT. E.g. (I didn't look at your MTT results; I'm guessing it's the ibm section where you're seeing failures; fill in the ... below with your relevant .ini file and scratch directory):

$ client/mtt --file ... --scratch ... --verbose --mpi-get
$ client/mtt --file ... --scratch ... --verbose --mpi-install
$ client/mtt --file ... --scratch ... --verbose --test-get --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-build --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-run --section ibm-run-the-bad-test

I.e., make a test run section named ibm-run-the-bad-test that just runs the one test that is failing. That way you can run the test via MTT and hopefully be able to reproduce the error.

Make sense?

@sjeaugey
Member

@jsquyres Yes, that makes sense. Still, no luck: when I only run dynamic/spawn, the tests pass.

@rhc54 I'm starting to wonder if I was just unlucky and hit the pmix problem (when the socket is already in /tmp), except that in the case of spawns, instead of the usual "Error in file orted/pmix/pmix_server.c at line 254" message, I get a segmentation fault.

@sjeaugey
Member

I added /tmp cleanup before my nightly MTT run. We'll see tomorrow whether it solved the issue.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

@sjeaugey Looks like you ran clean last night - can you clarify exactly what you did? Did you just clean the tmp? Or did you also roll back the PMIx 1.1.5 commit?

@sjeaugey
Member

@rhc54 I wouldn't say that. I got rid of some failures (caused by leftover sockets in /tmp), but two failures remain: spawn and spawn_with_env_vars. And spawn_multiple timed out too.

I'm still unable to reproduce it outside of the nightly runs however. Working on it.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

How strange... the nightly summary didn't show those failures. Is there a core file you can look at? A line number where those daemons are crashing would really help.

@sjeaugey
Member

A new interesting fact: on my second node, the spawn processes are still there, stuck:

[sjeaugey@drossetti-ivy5 ~]$ ps -ef | grep spawn
sjeaugey 18626     1 99 Aug29 ?        2-04:35:50 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 22338     1  3 Aug30 ?        00:58:16 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 22447     1 99 Aug30 ?        1-04:08:22 dynamic/spawn_with_env_vars
sjeaugey 23013     1  3 06:02 ?        00:06:57 dynamic/spawn this is argv 1 this is argv 2
sjeaugey 23064     1 99 06:02 ?        03:35:28 dynamic/spawn_with_env_vars
sjeaugey 23122     1 99 06:02 ?        03:35:31 dynamic/spawn_with_env_vars

I tried to gstack them. That killed some of the processes, but some survived and gave me a stack trace:

[sjeaugey@drossetti-ivy5 ~]$ gstack 18626
Thread 3 (Thread 0x7fbb3b437700 (LWP 18628)):
#0  0x00007fbb3d048173 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fbb3c9ea998 in epoll_dispatch () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#2  0x00007fbb3c9eddfe in opal_libevent2022_event_base_loop () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#3  0x00007fbb3b4650bd in progress_engine () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#4  0x00007fbb3d2fa9d1 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fbb3d047b7d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fbb3a436700 (LWP 18631)):
#0  0x00007fbb3d00bced in nanosleep () from /lib64/libc.so.6
#1  0x00007fbb3d040e64 in usleep () from /lib64/libc.so.6
#2  0x00007fbb3b46d715 in OPAL_PMIX_PMIX112_PMIx_Abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#3  0x00007fbb3b44202a in pmix1_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_pmix_pmix112.so
#4  0x00007fbb3b6b7bf0 in rte_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_ess_pmi.so
#5  0x00007fbb3d608157 in ompi_rte_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#6  0x00007fbb3d566210 in ompi_mpi_abort () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#7  0x00007fbb3d54ce35 in ompi_errhandler_runtime_callback () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#8  0x00007fbb3cce7bdd in orte_errmgr_base_execute_error_callbacks () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-rte.so.20
#9  0x00007fbb38a23972 in proc_errors () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_errmgr_default_app.so
#10 0x00007fbb3c9ee851 in opal_libevent2022_event_base_loop () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#11 0x00007fbb3c995b22 in progress_engine () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#12 0x00007fbb3d2fa9d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fbb3d047b7d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fbb3da91700 (LWP 18626)):
#0  0x00007fbb30c5cf61 in sm_fifo_read () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_btl_smcuda.so
#1  0x00007fbb30c5f111 in mca_btl_smcuda_component_progress () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_btl_smcuda.so
#2  0x00007fbb3c9903ae in opal_progress () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libopen-pal.so.20
#3  0x00007fbb3d564c46 in ompi_request_wait_completion () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#4  0x00007fbb3d564c80 in ompi_request_default_wait () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#5  0x00007fbb3d5e059b in ompi_coll_base_bcast_intra_generic () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#6  0x00007fbb3d5e0c7f in ompi_coll_base_bcast_intra_binomial () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#7  0x00007fbb2aa05238 in ompi_coll_tuned_bcast_intra_dec_fixed () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/openmpi/mca_coll_tuned.so
#8  0x00007fbb3d543706 in ompi_comm_allreduce_intra_pmix () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#9  0x00007fbb3d540e13 in ompi_comm_nextcid () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#10 0x00007fbb3d5492bd in ompi_dpm_connect_accept () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#11 0x00007fbb3d54b793 in ompi_dpm_dyn_init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#12 0x00007fbb3d567174 in ompi_mpi_init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#13 0x00007fbb3d59a00c in PMPI_Init () from /ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-1/installs/EDVY/install/lib/libmpi.so.20
#14 0x0000000000401652 in main ()

I'll kill those remaining processes eventually (they could be the cause of the failures), but if you want me to look at something in particular, please tell me.

@rhc54
Contributor

rhc54 commented Aug 31, 2016

Yeah, I can believe they would be stuck: they are trying to Abort and the daemon is gone, so there is nobody they can tell. I believe we have solved that since the 1.1 series, but I'll check to make sure.

The real question is: why are the orteds crashing? Any chance of a line number from a core file there?

@sjeaugey
Member

Still trying to get a core ... worst case, I'll set ulimit in my nightly script.

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

@rhc54 I got a core file. Here is the orted backtrace:

Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(cuda-gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffedb1c9bcf in mca_oob_usock_send_handler (sd=37, flags=4, cbdata=0x1ba00b0) at oob_usock_sendrecv.c:239
#2  0x00007ffedeb8a406 in event_persist_closure (ev=<optimized out>, base=0x1adc580) at event.c:1321
#3  event_process_active_single_queue (activeq=0x1adcaf0, base=0x1adc580) at event.c:1365
#4  event_process_active (base=<optimized out>) at event.c:1440
#5  opal_libevent2022_event_base_loop (base=0x1adc580, flags=1) at event.c:1644
#6  0x00007ffedee6f2b6 in orte_daemon (argc=33, argv=0x7fffb6fca138) at orted/orted_main.c:848
#7  0x000000000040093e in main (argc=33, argv=0x7fffb6fca138) at orted.c:60

@rhc54
Contributor

rhc54 commented Sep 1, 2016

can you print the value of "msg" and, if not NULL, the contents (*msg)?

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

I followed the whole code path at line 239 (a big macro calling a lot of other macros), but everything seems normal.

Even looking at the assembly, the line where it crashed is:

   0x00007ffedb1c9bc7 <+3735>:  mov    0x58(%rsi),%edi
   0x00007ffedb1c9bca <+3738>:  mov    %r9,%rsi
   0x00007ffedb1c9bcd <+3741>:  callq  *%rax
=> 0x00007ffedb1c9bcf <+3743>:  mov    -0x30(%rbp),%rax
   0x00007ffedb1c9bd3 <+3747>:  mov    0x68(%rax),%rax
   0x00007ffedb1c9bd7 <+3751>:  mov    0x8(%rax),%rax
   0x00007ffedb1c9bdb <+3755>:  test   %rax,%rax

%rbp is equal to 0x7fffb6fc9cb0, and the value of -0x30(%rbp) is accessible and equal to 0x30.

I don't get it.

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

For reference, the contents of msg and msg->msg:

 (cuda-gdb) p *msg
$6 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffedb3cf9c0 <mca_oob_usock_send_t_class>, obj_reference_count = 1, 
      cls_init_file_name = 0x7ffedb1cc189 <__PRETTY_FUNCTION__.4085+361> "oob_usock.c", cls_init_lineno = 314}, opal_list_next = 0x0, opal_list_prev = 0x0, item_free = 1, 
    opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, hdr = {origin = {jobid = 2495021056, vpid = 0}, dst = {jobid = 2495021058, vpid = 0}, type = MCA_OOB_USOCK_USER, 
    tag = 37, nbytes = 0}, msg = 0x1b8d660, data = 0x0, hdr_sent = true, iovnum = 1, sdptr = 0x0, sdbytes = 0}
(cuda-gdb) p *msg->msg
$7 = {super = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffedf0f7e80 <orte_rml_send_t_class>, obj_reference_count = 1, 
      cls_init_file_name = 0x7ffedafbc48a <__PRETTY_FUNCTION__.12041+554> "oob_tcp_sendrecv.c", cls_init_lineno = 566}, opal_list_next = 0x0, opal_list_prev = 0x0, item_free = 1, 
    opal_list_item_refcount = 0, opal_list_item_belong_to = 0x0}, dst = {jobid = 2495021058, vpid = 0}, origin = {jobid = 2495021056, vpid = 0}, status = 0, tag = 37, cbfunc = {
    iov = 0x0, buffer = 0x0}, cbdata = 0x0, iov = 0x0, count = 0, buffer = 0x0, data = 0x0}

@rhc54
Contributor

rhc54 commented Sep 1, 2016

I see the issue - it was fixed on master. Can you try this patch?

diff --git a/orte/mca/rml/base/base.h b/orte/mca/rml/base/base.h
index 66d1c63..6b29d07 100644
--- a/orte/mca/rml/base/base.h
+++ b/orte/mca/rml/base/base.h
@@ -229,7 +229,7 @@ OBJ_CLASS_DECLARATION(orte_rml_recv_request_t);
                                 (m)->iov, (m)->count,                   \
                                 (m)->tag, (m)->cbdata);                 \
             }                                                           \
-        } else {                                                        \
+        } else if (NULL != (m)->cbfunc.buffer) {                        \
             /* non-blocking buffer send */                              \
             (m)->cbfunc.buffer((m)->status, &((m)->origin),             \
                                (m)->buffer,                             \
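
For context on why that one-line guard matters: in the gdb output above, msg->msg->cbfunc has both iov and buffer equal to 0x0, i.e. this was a blocking send with no completion callback registered, so the macro's unconditional "else" branch ended up calling a NULL function pointer, which matches frame #0 sitting at 0x0000000000000000 in the orted backtrace. Below is a minimal, self-contained sketch of that pattern; it is not the actual RML macro or ORTE types, and all names in it are illustrative only.

/* Sketch (illustrative names, not ORTE code) of the completion pattern the
 * patch fixes: a send object carries either an iovec callback or a buffer
 * callback, and a blocking send carries neither, so the "else" branch must
 * check for NULL before invoking the callback. */
#include <stdio.h>
#include <stddef.h>

typedef void (*iov_cbfunc_t)(int status, void *cbdata);
typedef void (*buffer_cbfunc_t)(int status, void *cbdata);

typedef struct {
    int status;
    union {
        iov_cbfunc_t iov;
        buffer_cbfunc_t buffer;
    } cbfunc;
    int is_iov_send;   /* stand-in for the macro's test of (m)->iov */
    void *cbdata;
} send_t;

static void complete_send(send_t *m)
{
    if (m->is_iov_send) {
        if (NULL != m->cbfunc.iov) {          /* iovec path: already guarded */
            m->cbfunc.iov(m->status, m->cbdata);
        }
    } else if (NULL != m->cbfunc.buffer) {    /* the guard added by the patch */
        m->cbfunc.buffer(m->status, m->cbdata);
    }
    /* With a plain "else" and cbfunc.buffer == NULL (blocking send), the call
     * above would jump to address 0 and the daemon would segfault. */
}

int main(void)
{
    send_t blocking_send = { .status = 0, .cbfunc = { .buffer = NULL },
                             .is_iov_send = 0, .cbdata = NULL };
    complete_send(&blocking_send);   /* safe with the guard; a crash without it */
    printf("completed without invoking a NULL callback\n");
    return 0;
}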

@sjeaugey
Member

sjeaugey commented Sep 1, 2016

Ahah... I was reading the code on master instead of v2.x. Of course I didn't see the issue. Thanks!

As for the patch: since I still haven't figured out how to reproduce the issue, I cannot really test it.

@rhc54
Contributor

rhc54 commented Sep 1, 2016

Okay, I have filed a PR. If you can "review" it, then we can see if it passes MTT.

@jsquyres
Member Author

jsquyres commented Sep 1, 2016

fixed in open-mpi/ompi-release#1359

@jsquyres jsquyres closed this as completed Sep 1, 2016
@sjeaugey
Member

sjeaugey commented Sep 1, 2016

@jsquyres please wait until tomorrow/Friday to close the bug. I couldn't verify that it actually fixed the bug (because I cannot reproduce the issue); hence the need to push the fix to v2.x and see if the MTT results improve.

@sjeaugey sjeaugey reopened this Sep 1, 2016
@jsquyres
Member Author

jsquyres commented Sep 1, 2016

Ok.

@sjeaugey
Member

sjeaugey commented Sep 2, 2016

I don't have any other output. Please also note that the Esslingen MTT is showing the exact same list of timed-out tests (c_reqops, intercomm_create, spawn, spawn_with_env_vars, spawn_multiple).

@hppritcha hppritcha modified the milestones: v2.0.2, v2.0.1 Sep 4, 2016
@hppritcha
Member

Moving to 2.0.2 since spawn appears to still be broken on 2.x.

@sjeaugey
Member

sjeaugey commented Sep 6, 2016

@jsquyres @rhc54 do you know who is running MTT at Esslingen? It would be useful to check whether they can manually reproduce the issue (since it also appears in their v2.x MTT).

@adrianreber
Member

I am running it. I will try to reproduce it outside of MTT.

@rhc54
Contributor

rhc54 commented Sep 6, 2016

I was able to get no-disconnect to hang when run on more than one node, so I can make that happen at will. I still cannot get any other dynamic test in that suite to fail or hang.

no-disconnect does not hang on master, so this appears to be something specific to v2.x. However, I am getting an error message on master from the TCP btl during finalize at the very end of the test:

 Warning :: opal_list_remove_item - the item 0xd142e0 is not on the list 0x7f99e64ea8e8 

Adding an assert in that spot generates the following stacktrace:

#0  0x00007f99f11d75f7 in raise () from /usr/lib64/libc.so.6
#1  0x00007f99f11d8e28 in abort () from /usr/lib64/libc.so.6
#2  0x00007f99f11d0566 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x00007f99f11d0612 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x00007f99e62da3b7 in opal_list_remove_item (list=0x7f99e64ea8e8 <mca_btl_tcp_component+552>, item=0xd142e0) at ../../../../opal/class/opal_list.h:492
#5  0x00007f99e62da892 in mca_btl_tcp_event_destruct (event=0xd142e0) at btl_tcp_component.c:199
#6  0x00007f99e62da19c in opal_obj_run_destructors (object=0xd142e0) at ../../../../opal/class/opal_object.h:462
#7  0x00007f99e62db982 in mca_btl_tcp_component_close () at btl_tcp_component.c:460
#8  0x00007f99f0bd3331 in mca_base_component_close (component=0x7f99e64ea6c0 <mca_btl_tcp_component>, output_id=-1) at mca_base_components_close.c:53
#9  0x00007f99f0bd33f1 in mca_base_components_close (output_id=-1, components=0x7f99f0e9c730 <opal_btl_base_framework+80>, skip=0x0)
    at mca_base_components_close.c:85
#10 0x00007f99f0bd3398 in mca_base_framework_components_close (framework=0x7f99f0e9c6e0 <opal_btl_base_framework>, skip=0x0)
    at mca_base_components_close.c:65
#11 0x00007f99f0bf7b39 in mca_btl_base_close () at base/btl_base_frame.c:203
#12 0x00007f99f0be1a83 in mca_base_framework_close (framework=0x7f99f0e9c6e0 <opal_btl_base_framework>) at mca_base_framework.c:214
#13 0x00007f99f1862f0e in mca_bml_base_close () at base/bml_base_frame.c:130
#14 0x00007f99f0be1a83 in mca_base_framework_close (framework=0x7f99f1b119a0 <ompi_bml_base_framework>) at mca_base_framework.c:214
#15 0x00007f99f17decc8 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:444
#16 0x00007f99f180e0ad in PMPI_Finalize () at pfinalize.c:47
#17 0x00000000004013c9 in main (argc=1, argv=0x7ffc02501568) at no-disconnect.c:130

So it could be that we have a race condition in finalize that is causing the problem.
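
One way this kind of warning arises (a generic illustration, not the actual opal_list or btl/tcp code; all names below are made up) is an event item being unlinked from the pending list twice, e.g. once when the event completes and once again from its destructor. With a debug-style membership check the second unlink is caught and reported; without it, the second unlink rewrites the prev/next pointers of nodes that no longer own the item, which is exactly the kind of corruption that can surface later as an assert or crash in finalize.

/* Generic sketch of the double-removal failure mode (illustrative code,
 * not OPAL's list implementation). */
#include <stdio.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct item {
    struct item *prev, *next;
} item_t;

typedef struct {
    item_t head;                    /* sentinel node of a circular list */
} list_t;

static void list_init(list_t *l) { l->head.prev = l->head.next = &l->head; }

static void list_append(list_t *l, item_t *it)
{
    it->prev = l->head.prev;
    it->next = &l->head;
    l->head.prev->next = it;
    l->head.prev = it;
}

static bool list_remove(list_t *l, item_t *it)
{
    /* Membership check analogous to the debug-build warning: an item that was
     * already unlinked has NULL prev/next and must not be unlinked again. */
    if (NULL == it->prev || NULL == it->next) {
        fprintf(stderr, "warning: item %p is not on the list %p\n",
                (void *)it, (void *)l);
        return false;
    }
    it->prev->next = it->next;
    it->next->prev = it->prev;
    it->prev = it->next = NULL;     /* mark as removed */
    return true;
}

int main(void)
{
    list_t pending;
    item_t ev;
    list_init(&pending);
    list_append(&pending, &ev);

    list_remove(&pending, &ev);     /* first removal, e.g. on event completion */
    list_remove(&pending, &ev);     /* second removal, e.g. in the destructor:
                                       caught here, corruption without the check */
    return 0;
}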

@hppritcha
Member

I was not able to reproduce this problem using uGNI BTL.

hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 13, 2016
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References open-mpi#2030

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres
Member Author

Per the call today:

For the TCP segv, @hjelmn thinks that the items are being removed from the list twice (e.g., in the destructor) in debug builds. He just filed #2077 to fix this, which should fix the segv.

But there's also a hang issue that has not yet been fully diagnosed. Suggestion:

  1. Let's finish the process of:
    1. Make the 2.0.x branch
    2. Merge in a bunch of v2.1 PRs into the 2.x branch
    3. Bring the release branches back from ompi-release to ompi
  2. Then merge over PMIx 2.0 to the 2.x branch
    1. Note that PMIx 2.0 isn't hugely different from what is already there
    2. But PMIx 2.0 does have some interface changes which will cause changes in ORTE PMIx server infrastructure. That will take a little time to back port properly -- @rhc54 doesn't have the cycles in the immediate future to do this.
  3. See if PMIx 2.0 magically fixes this problem. If it doesn't, continue diagnosing from there.
  4. More specifically:
    1. We should probably leave this problem alone in v2.0.x, unless the problem escalates to real users, etc.
    2. For v2.1.x, this problem probably needs to be fixed.

bosilca pushed a commit to bosilca/ompi that referenced this issue Sep 15, 2016
This commit fixes an abort during finalize because pending events were
removed from the list twice.

References open-mpi#2030

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres jsquyres modified the milestones: v2.1.0, v2.0.2 Sep 22, 2016
@jsquyres
Member Author

Update on where we are on this issue:

@hppritcha
Member

This may not be fixed in 2.1.0 if we go with the external pmix2 solution.

@jsquyres jsquyres changed the title from "COMM_SPAWN is broken in 2.0.x HEAD" to "COMM_SPAWN is broken in 2.1.x HEAD" Oct 17, 2016
@rhc54
Contributor

rhc54 commented Oct 18, 2016

@hjelmn I just checked the v2.0.x branch again, and no-disconnect now hangs in the BTL finalize for the TCP BTL:

424       while (lock->u.lock == OPAL_ATOMIC_LOCKED) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libibverbs-1.1.8-8.el7.x86_64 libnl3-3.2.21-10.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 systemd-libs-219-19.el7_2.13.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0  0x00007f7865817a1d in opal_atomic_lock (lock=0x7f7865a268c4 <mca_btl_tcp_component+548>) at ../../../../opal/include/opal/sys/atomic_impl.h:424
#1  0x00007f7865817c5d in opal_mutex_atomic_lock (m=0x7f7865a26860 <mca_btl_tcp_component+448>) at ../../../../opal/threads/mutex_unix.h:183
#2  0x00007f786581813d in mca_btl_tcp_event_destruct (event=0x1b73180) at btl_tcp_component.c:194
#3  0x00007f7865817b99 in opal_obj_run_destructors (object=0x1b73180) at ../../../../opal/class/opal_object.h:460
#4  0x00007f7865818ee7 in mca_btl_tcp_component_close () at btl_tcp_component.c:418
#5  0x00007f786fce4ba9 in mca_base_component_close (component=0x7f7865a266a0 <mca_btl_tcp_component>, output_id=-1) at mca_base_components_close.c:53
#6  0x00007f786fce4c69 in mca_base_components_close (output_id=-1, components=0x7f786ffa4250 <opal_btl_base_framework+80>, skip=0x0)
    at mca_base_components_close.c:85
#7  0x00007f786fce4c10 in mca_base_framework_components_close (framework=0x7f786ffa4200 <opal_btl_base_framework>, skip=0x0) at mca_base_components_close.c:65
#8  0x00007f786fd07d64 in mca_btl_base_close () at base/btl_base_frame.c:158
#9  0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f786ffa4200 <opal_btl_base_framework>) at mca_base_framework.c:214
#10 0x00007f7870920bd2 in mca_bml_base_close () at base/bml_base_frame.c:130
#11 0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f7870bbc4c0 <ompi_bml_base_framework>) at mca_base_framework.c:214
#12 0x00007f78708b6161 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:419
#13 0x00007f78708dda05 in PMPI_Finalize () at pfinalize.c:45
#14 0x00000000004013c9 in main (argc=1, argv=0x7ffffe18d1b8) at no-disconnect.c:130

I can't get any other dynamic test to fail - any ideas why the TCP BTL is locking up?
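
One plausible reading of that trace (an assumption, not a confirmed diagnosis): the component close path already holds the TCP component lock while it destructs pending events, and the event destructor tries to take the same non-recursive lock again, so the thread spins forever in "while (lock->u.lock == OPAL_ATOMIC_LOCKED)". Below is a minimal, self-contained sketch of that self-deadlock pattern; it uses plain C11 atomics rather than the OPAL lock API, all names are illustrative, and running it spins forever at the second acquire, mirroring the hang.

/* Sketch of a self-deadlock on a non-recursive spin lock (illustrative,
 * not the btl/tcp code). */
#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    int expected = 0;
    /* Spins while the lock is held -- the same shape as the
     * "while (lock->u.lock == OPAL_ATOMIC_LOCKED)" loop in the backtrace. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        expected = 0;
    }
}

static void spin_unlock(spinlock_t *l) { atomic_store(&l->locked, 0); }

static spinlock_t component_lock;

static void event_destruct(void)
{
    spin_lock(&component_lock);     /* second acquire by the same thread: hangs */
    /* ... unlink the event from the component's pending list ... */
    spin_unlock(&component_lock);
}

int main(void)
{
    spin_lock(&component_lock);     /* cleanup path takes the lock ... */
    event_destruct();               /* ... then destructs events that retake it */
    spin_unlock(&component_lock);
    printf("not reached\n");
    return 0;
}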

@hjelmn
Member

hjelmn commented Oct 18, 2016

I thought I fixed this in 2.0.x. There was an easily identifiable bug in btl/tcp. Let me see if the commit came over.

@hjelmn
Member

hjelmn commented Oct 18, 2016

Bring this over.

a681837

@rhc54
Contributor

rhc54 commented Oct 18, 2016

will do - will report back later. Thx!

@hjelmn
Member

hjelmn commented Oct 18, 2016

np

@rhc54
Contributor

rhc54 commented Oct 19, 2016

Okay, we have 2.0.x fixed, but not 2.x - sorry for the confusion. On 2.x, we are getting a warning about removing an item that is no longer on a list:

 Warning :: opal_list_remove_item - the item 0xf30630 is not on the list 0x7fbb6497c8c8 

This comes at the end of no-disconnect, and is likely again from the TCP btl. I'll try to find out where.

@hjelmn
Member

hjelmn commented Oct 19, 2016

Same bug, different symptom.

@rhc54
Contributor

rhc54 commented Oct 19, 2016

Except that the patch you suggested is already there.


@jsquyres
Member Author

@rhc54 Sorry; I still saw COMM_SPAWN failures over the weekend in the v2.0.x branch. ☹️

@jsquyres
Member Author

This has been fixed.
