
Cleanup grpcomm race conditions #1335

Merged: 1 commit, merged into open-mpi:master on Feb 4, 2016
Conversation

rhc54 (Contributor) commented Feb 1, 2016

@jsquyres I believe this may clean up the problem you've been seeing at scale. Please give it a try and let me know what you see.

jsquyres (Member) commented Feb 1, 2016

@rhc54 Unfortunately, no. I'm getting a segv in the orted with a fairly consistent backtrace:

#0  0x00007f86194906d6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007f8615f9d6ce in send_msg (fd=-1, args=4, cbdata=0x10970e0) at rml_oob_send.c:176
#2  0x00007f861a6e81ec in event_process_active_single_queue (activeq=0xfdc9b0, base=0xfdc440)
    at event.c:1370
#3  event_process_active (base=<optimized out>) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0xfdc440, flags=1) at event.c:1644
#5  0x00007f861a9db3a6 in orte_daemon (argc=27, argv=0x7ffe52ac55c8) at orted/orted_main.c:866
#6  0x0000000000400906 in main (argc=27, argv=0x7ffe52ac55c8) at orted.c:60

It looks like all the pointers passed to memcpy are bogus:

(gdb) up
#1  0x00007f8615f9d6ce in send_msg (fd=-1, args=4, cbdata=0x10970e0) at rml_oob_send.c:176
176                 memcpy(rcv->iov.iov_base, req->post.send.buffer->base_ptr, req->post.send.buffer->bytes_used);
(gdb) p rcv->iov.iov_base 
$1 = (void *) 0x0
(gdb) p req->post.send.buffer->base_ptr
$2 = 0x0
(gdb) p req->post.send.buffer->bytes_used 
$3 = 0
(gdb) 

jsquyres (Member) commented Feb 1, 2016

Also, can you put "Fixes #1215" in the final squashed commit message? Thanks!

rhc54 (Contributor, author) commented Feb 1, 2016

once I know it actually does... 😃

rhc54 (Contributor, author) commented Feb 1, 2016

@jsquyres Can you try this again and send me the debug output? I'm trying to figure out how you got into that situation as it isn't obvious.

annu13 (Contributor) commented Feb 1, 2016


@rhc54 - I saw that you are using the same recv buffer to send to self when the job info is not available. Could that be causing a seg fault (bad unpack ptr)??

rhc54 (Contributor, author) commented Feb 2, 2016

Technically, the OBJ_RETAIN should be adequate to protect us when recirculating the buffer. I've copied the buffer now, so let's see if this works - if it does, then we know we have a problem in the OOB.

jsquyres (Member) commented Feb 2, 2016

Now it just loops forever, endlessly spitting out those "LOOPING BRKS" debug messages.

rhc54 (Contributor, author) commented Feb 2, 2016

@jsquyres Can you give this one a try? I cannot figure out why those daemons aren't getting the launch msg, and so this adds some delay in that loop to provide an opportunity for the event lib to progress the rest of the messages. I also added a failsafe so we only loop a few times and then abort.

jsquyres (Member) commented Feb 2, 2016

@rhc54 I tried the latest. It worked once (i.e., launched and ran the MPI job successfully), but the next 20-30 runs all failed with lots of output like this:

[pacini038.arcetri.cisco.com:17906] [[32249,0],24] LOOPING BRKS
[pacini020.arcetri.cisco.com:23045] [[32249,0],6] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini022.arcetri.cisco.com:23070] [[32249,0],8] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini046.arcetri.cisco.com:21924] [[32249,0],32] LOOPING BRKS
[pacini016.arcetri.cisco.com:24829] [[32249,0],3] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini042.arcetri.cisco.com:18241] [[32249,0],28] LOOPING BRKS
[pacini038.arcetri.cisco.com:17906] [[32249,0],24] LOOPING BRKS
[pacini038.arcetri.cisco.com:17906] [[32249,0],24] BRKS GIVING UP
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini027.arcetri.cisco.com:22552] [[32249,0],13] LOOPING BRKS
[pacini027.arcetri.cisco.com:22552] [[32249,0],13] BRKS GIVING UP
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  pacini038

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini022.arcetri.cisco.com:23070] [[32249,0],8] LOOPING BRKS
[pacini061.arcetri.cisco.com:21470] [[32249,0],47] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini046.arcetri.cisco.com:21924] [[32249,0],32] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] BRKS GIVING UP
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] LOOPING BRKS
[pacini020.arcetri.cisco.com:23045] [[32249,0],6] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini016.arcetri.cisco.com:24829] [[32249,0],3] LOOPING BRKS
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] LOOPING BRKS
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] BRKS GIVING UP
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] BRKS GIVING UP
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS

(The node displayed in the show-help message varied from run to run.)

hjelmn (Member) commented Feb 3, 2016

@rhc54 I will try to find some time to debug brks at scale here. Probably will be sometime next week.

rhc54 (Contributor, author) commented Feb 3, 2016

@hjelmn From what we have seen, it looks to me like Jeff is simply losing messages somewhere. The root cause of the problem is that downstream daemons never receive the launch message, and thus are never able to locate the job object since it doesn't get created.

So I'm not sure if there is a bug in brks or not. I'm unable to replicate even on a slow ssh machine, so this may be something specific to Jeff's machine.

I'm going to restore the direct component's rollup method (currently what we use in 1.10 and prior series) anyway. It may actually be faster than the other methods, but at least we know it worked.

jsquyres (Member) commented Feb 3, 2016

I confirmed to @rhc54 on the phone: this fixed my problem. I am now able to launch successfully (and consistently) across 64 servers with ssh.

@rhc54 said he would clean up this PR and merge to master.

rhc54 (Contributor, author) commented Feb 4, 2016

@jsquyres Okay, I've cleaned this up - please give it a smoke test and commit if all is well.

@@ -55,7 +55,7 @@ static int direct_register(void)
     /* make the priority adjustable so users can select
      * direct for use by apps without affecting daemons
      */
-    my_priority = 1;
+    my_priority = 100;
jsquyres (Member) commented on the diff:

Do you want to leave some wiggle room here? I.e., if you set this to 100, then it's not possible to override it -- even the user on the command line can't change priorities to something that would guarantee that direct wouldn't be chosen.

rhc54 (Contributor, author) replied:

wellll...as we both know, that isn't technically true. we set the priority to much higher values in a number of places. i picked this one because rcd is at 80, but now that we have opal_ignore'd it, i can turn down the heat

rhc54 (Contributor, author) commented Feb 4, 2016

@jsquyres okay, quibbles addressed 😄

jsquyres (Member) commented Feb 4, 2016

Tested and worked flawlessly in 20 out of 20 runs of 1000+ processes across 64 servers.

One last quibble (which you can choose to do or not 😄 ), add "Fixes #1215" to the commit message. Otherwise, 👍

…be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them.

Adjust to Jeff's quibbles

Fixes open-mpi/ompi#1215
rhc54 (Contributor, author) commented Feb 4, 2016

grumble...grumble.... 😩

rhc54 pushed a commit that referenced this pull request Feb 4, 2016
Cleanup grpcomm race conditions
rhc54 merged commit f38ad4a into open-mpi:master on Feb 4, 2016
rhc54 deleted the topic/gcom branch on February 14, 2016
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
Revive the coll/sync component, adding a test to show it. Clean some permissions