
Cleanup grpcomm race conditions #1335

Merged: 1 commit, merged into open-mpi:master on Feb 4, 2016
Conversation

rhc54 (Contributor) commented Feb 1, 2016

@jsquyres I believe this may clean up the problem you've been seeing at scale. Please give it a try and let me know what you see.

jsquyres (Member) commented Feb 1, 2016

@rhc54 Unfortunately, no. I'm getting a segv in the orted with a fairly consistent backtrace:

#0  0x00007f86194906d6 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x00007f8615f9d6ce in send_msg (fd=-1, args=4, cbdata=0x10970e0) at rml_oob_send.c:176
#2  0x00007f861a6e81ec in event_process_active_single_queue (activeq=0xfdc9b0, base=0xfdc440)
    at event.c:1370
#3  event_process_active (base=<optimized out>) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0xfdc440, flags=1) at event.c:1644
#5  0x00007f861a9db3a6 in orte_daemon (argc=27, argv=0x7ffe52ac55c8) at orted/orted_main.c:866
#6  0x0000000000400906 in main (argc=27, argv=0x7ffe52ac55c8) at orted.c:60

It looks like all the pointers passed to memcpy are bogus:

(gdb) up
#1  0x00007f8615f9d6ce in send_msg (fd=-1, args=4, cbdata=0x10970e0) at rml_oob_send.c:176
176                 memcpy(rcv->iov.iov_base, req->post.send.buffer->base_ptr, req->post.send.buffer->bytes_used);
(gdb) p rcv->iov.iov_base 
$1 = (void *) 0x0
(gdb) p req->post.send.buffer->base_ptr
$2 = 0x0
(gdb) p req->post.send.buffer->bytes_used 
$3 = 0
(gdb) 

jsquyres (Member) commented Feb 1, 2016

Also, can you put "Fixes #1215" in the final squashed commit message? Thanks!

rhc54 (Contributor, author) commented Feb 1, 2016

once I know it actually does... 😃

rhc54 (Contributor, author) commented Feb 1, 2016

@jsquyres Can you try this again and send me the debug output? I'm trying to figure out how you got into that situation as it isn't obvious.

annu13 (Contributor) commented Feb 1, 2016


@rhc54 - I saw that you are using the same recv buffer to send to self when the job info is not available. Could that be causing a seg fault (bad unpack ptr)??

rhc54 (Contributor, author) commented Feb 2, 2016

Technically, the OBJ_RETAIN should be adequate to protect us when recirculating the buffer. I've copied the buffer now, so let's see if this works - if it does, then we know we have a problem in the OOB.

jsquyres (Member) commented Feb 2, 2016

Now it just loops forever, endlessly spitting out those "LOOPING BRKS" debug messages.

rhc54 (Contributor, author) commented Feb 2, 2016

@jsquyres Can you give this one a try? I cannot figure out why those daemons aren't getting the launch msg, and so this adds some delay in that loop to provide an opportunity for the event lib to progress the rest of the messages. I also added a failsafe so we only loop a few times and then abort.

jsquyres (Member) commented Feb 2, 2016

@rhc54 I tried the latest. It worked once (i.e., launched and ran the MPI job successfully), but the next 20-30 runs all failed with lots of output like this:

[pacini038.arcetri.cisco.com:17906] [[32249,0],24] LOOPING BRKS
[pacini020.arcetri.cisco.com:23045] [[32249,0],6] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini022.arcetri.cisco.com:23070] [[32249,0],8] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini046.arcetri.cisco.com:21924] [[32249,0],32] LOOPING BRKS
[pacini016.arcetri.cisco.com:24829] [[32249,0],3] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini042.arcetri.cisco.com:18241] [[32249,0],28] LOOPING BRKS
[pacini038.arcetri.cisco.com:17906] [[32249,0],24] LOOPING BRKS
[pacini038.arcetri.cisco.com:17906] [[32249,0],24] BRKS GIVING UP
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini027.arcetri.cisco.com:22552] [[32249,0],13] LOOPING BRKS
[pacini027.arcetri.cisco.com:22552] [[32249,0],13] BRKS GIVING UP
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  pacini038

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini022.arcetri.cisco.com:23070] [[32249,0],8] LOOPING BRKS
[pacini061.arcetri.cisco.com:21470] [[32249,0],47] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini046.arcetri.cisco.com:21924] [[32249,0],32] LOOPING BRKS
[pacini055.arcetri.cisco.com:21319] [[32249,0],41] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] LOOPING BRKS
[pacini023.arcetri.cisco.com:22466] [[32249,0],9] BRKS GIVING UP
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] LOOPING BRKS
[pacini020.arcetri.cisco.com:23045] [[32249,0],6] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini016.arcetri.cisco.com:24829] [[32249,0],3] LOOPING BRKS
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini049.arcetri.cisco.com:21309] [[32249,0],35] LOOPING BRKS
[pacini071.arcetri.cisco.com:21255] [[32249,0],56] LOOPING BRKS
[pacini070.arcetri.cisco.com:26907] [[32249,0],55] LOOPING BRKS
[pacini059.arcetri.cisco.com:20963] [[32249,0],45] LOOPING BRKS
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] LOOPING BRKS
[pacini074.arcetri.cisco.com:20928] [[32249,0],59] BRKS GIVING UP
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] LOOPING BRKS
[pacini043.arcetri.cisco.com:17948] [[32249,0],29] BRKS GIVING UP
[pacini032.arcetri.cisco.com:18452] [[32249,0],18] LOOPING BRKS

(The node displayed in the show-help message varied from run to run.)

hjelmn (Member) commented Feb 3, 2016

@rhc54 I will try to find some time to debug brks at scale here. Probably will be sometime next week.

rhc54 (Contributor, author) commented Feb 3, 2016

@hjelmn From what we have seen, it looks to me like Jeff is simply losing messages somewhere. The root cause of the problem is that downstream daemons never receive the launch message, and thus are never able to locate the job object since it doesn't get created.

So I'm not sure if there is a bug in brks or not. I'm unable to replicate even on a slow ssh machine, so this may be something specific to Jeff's machine.

I'm going to restore the direct component's rollup method (currently what we use in 1.10 and prior series) anyway. It may actually be faster than the other methods, but at least we know it worked.

jsquyres (Member) commented Feb 3, 2016

I confirmed to @rhc54 on the phone: this fixed my problem. I am now able to launch successfully (and consistently) across 64 servers with ssh.

@rhc54 said he would clean up this PR and merge to master.

rhc54 (Contributor, author) commented Feb 4, 2016

@jsquyres Okay, I've cleaned this up - please give it a smoke test and commit if all is well.

@@ -55,7 +55,7 @@ static int direct_register(void)
     /* make the priority adjustable so users can select
      * direct for use by apps without affecting daemons
      */
-    my_priority = 1;
+    my_priority = 100;
jsquyres (Member) commented on the diff:

Do you want to leave some wiggle room here? I.e., if you set this to 100, then it's not possible to override it -- even the user on the command line can't change priorities to something that would guarantee that direct wouldn't be chosen.

rhc54 (Contributor, author) replied:

wellll...as we both know, that isn't technically true. we set the priority to much higher values in a number of places. i picked this one because rcd is at 80, but now that we have opal_ignore'd it, i can turn down the heat

rhc54 (Contributor, author) commented Feb 4, 2016

@jsquyres okay, quibbles addressed 😄

jsquyres (Member) commented Feb 4, 2016

Tested and worked flawlessly in 20 out of 20 runs of 1000+ processes across 64 servers.

One last quibble (which you can choose to do or not 😄 ), add "Fixes #1215" to the commit message. Otherwise, 👍

…be the default, and to execute a rollup collective. This may in fact be faster than the alternatives, and something appears broken at scale when using brks in particular. Turn off the rcd and brks components as they don't work at scale right now - they can be restored at some future point when someone can debug them.

Adjust to Jeff's quibbles

Fixes open-mpi/ompi#1215
rhc54 (Contributor, author) commented Feb 4, 2016

grumble...grumble.... 😩

rhc54 pushed a commit that referenced this pull request Feb 4, 2016
Cleanup grpcomm race conditions
rhc54 merged commit f38ad4a into open-mpi:master on Feb 4, 2016
rhc54 deleted the topic/gcom branch on February 14, 2016
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
Revive the coll/sync component, adding a test to show it. Clean some permissions