osc pt2pt wrong answer #2505

Closed
markalle opened this issue Dec 2, 2016 · 13 comments

markalle commented Dec 2, 2016

Using the test case small_1sided.c from this test-harness pull request:
https://github.com/open-mpi/ompi-tests/pull/25
I get wrong answers from the osc pt2pt component:

% mpicc -o x small_1sided.c
% mpirun -host hostA,hostB -mca osc pt2pt -mca pml ob1 -mca btl openib,self,vader ./x
% mpirun -host hostA,hostB -mca osc pt2pt -mca pml ob1 -mca btl tcp,self,vader ./x

In cases where it's able to use osc rdma it seems to work.

This was from a vanilla build of openmpi-master-201612022109-0366f3a.


jsquyres commented Dec 7, 2016

@hjelmn Can you look at this one, too? (perhaps related to #2530?)


jjhursey commented Dec 8, 2016

A couple more data points running 1sided.c like this on a single node:

mpirun -np 2 -mca pml ob1 -mca osc pt2pt -mca btl VALUE ./small_1sided

| btl VALUE    | Result                                     |
|--------------|--------------------------------------------|
| self,sm      | PASS                                       |
| self,vader   | Wrong answer (near the start of the test)  |
| self,tcp     | HANG (near the start of the test)          |
| self,openib  | HANG                                       |

For one of the hangs this is the backtrace (both processes are in the same location):

(gdb) bt
#0  0x00003fff77da2e44 in clock_gettime () from /lib64/power8/libc.so.6
#1  0x00003fff7773d37c in gettime (base=0x10034a1f2e0, tp=0x10034a1f450) at event.c:370
#2  0x00003fff77741f40 in event_base_loop (base=0x10034a1f2e0, flags=<optimized out>) at event.c:1624
#3  0x00003fff778a38b8 in opal_progress () at ompi/opal/runtime/opal_progress.c:207
#4  0x00003fff74f9a990 in opal_condition_wait (c=0x10034bbac38, m=0x10034bbabd0)
    at ompi/opal/threads/condition.h:72
#5  0x00003fff74f9d0dc in ompi_osc_pt2pt_wait (win=0x10034bba9a0)
    at ompi/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c:519
#6  0x00003fff780c8318 in PMPI_Win_wait (win=0x10034bba9a0) at pwin_wait.c:52
#7  0x000000001000317c in main_test_fn ()
#8  0x0000000010004348 in runtest ()
#9  0x00000000100043d4 in main ()
(gdb) 
#5  0x00003fff74f9d0dc in ompi_osc_pt2pt_wait (win=0x10034bba9a0)
    at ompi/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c:519
519	        opal_condition_wait(&module->cond, &module->lock);
(gdb) l
514	             module->active_incoming_frag_count != module->active_incoming_frag_signal_count) {
515	        OPAL_OUTPUT_VERBOSE((25, ompi_osc_base_framework.framework_output, "num_complete_msgs = %d, "
516	                             "active_incoming_frag_count = %d, active_incoming_frag_signal_count = %d",
517	                             module->num_complete_msgs, module->active_incoming_frag_count,
518	                             module->active_incoming_frag_signal_count));
519	        opal_condition_wait(&module->cond, &module->lock);
520	    }
521	
522	    group = module->pw_group;
523	    module->pw_group = NULL;

The whole module structure is a bit big, but here are some of the top-level variables from the first process:

  accumulate_ordering = true, 
  no_locks = false, 
  free_after = 0x0, 
  baseptr = 0x3fff99df0010, 
  comm = 0x1003d0fea50, 
  disp_unit = 4, 
  epoch_outgoing_frag_count = 0x1003d0f53e0, 
  tag_counter = 2473132, 
  outgoing_frag_count = 1911377, 
  outgoing_frag_signal_count = 1911377, 
  active_incoming_frag_count = 933106, 
  active_incoming_frag_signal_count = 933106, 
  passive_target_access_epoch = 0, 
  pw_group = 0x1003d1188d0, 
  num_complete_msgs = -1, 
  lock_status = 0, 
  recv_frags = 0x1003d101230, 
  recv_frag_count = 4, 

Here is the same information from the second process:

  accumulate_ordering = true, 
  no_locks = false, 
  free_after = 0x0, 
  baseptr = 0x3fff74a20010, 
  comm = 0x10034bbe9f0, 
  disp_unit = 4, 
  epoch_outgoing_frag_count = 0x10034ba17d0, 
  tag_counter = 2473128, 
  outgoing_frag_count = 1911374, 
  outgoing_frag_signal_count = 1911374, 
  active_incoming_frag_count = 933108, 
  active_incoming_frag_signal_count = 933108, 
  passive_target_access_epoch = 0, 
  pw_group = 0x10034bd8870, 
  num_complete_msgs = 1, 
  lock_status = 0, 
  recv_frags = 0x10034bc12a0, 
  recv_frag_count = 4, 

I don't know the one-sided code, but from the variables above, num_complete_msgs is -1 in one of the processes. Is that an allowed sentinel value or something?


jjhursey commented Dec 9, 2016

It should be noted that v1.10 seems to be fine. So this is a regression in the v2.x series and in master.

gpaulsen commented:

@hjelmn Have you been able to make any progress on this?

I know @jjhursey and @markalle were working on this over the weekend, and noticed some inconsistencies in completion counting in osc_pt2pt in v2.0.x (and master). Comparing with v1.10, it looks like that code was restructured quite a bit, so it's difficult to pinpoint a specific commit.

gpaulsen commented:

@jjhursey made some progress today as well. I'm not posting the full details here, but if others are working on this, it'd be good to sync up with Josh.

The problem seems to be that Rank 0 gets ahead and is waiting in the Complete call for Epoch #2 while Rank 1 is just past the Wait call for Epoch #1. In general, this is OK. The problem is what happens next.

The Wait call in Rank 1 notices that Rank 0 has sent it a Complete request, and sends back the data (but this is the data exposed in Epoch 1, not Epoch 2, which Rank 1 has not started yet). This causes Rank 0 to receive the previous epoch's data, which eventually fails as a Wrong Answer.

If you look at the Wrong Answer dump, you will notice that the buffer is only off by one, which makes sense if it received the previous epoch's data.

I don't know what the fix is yet, but at least we can easily reproduce the problem and know what to look for.
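
To make the epoch numbering above concrete, here is a minimal sketch of the fence-then-Post/Start/Get/Complete/Wait pattern being discussed. This is not the actual small_1sided.c from the ompi-tests PR; the buffer contents and peer selection are purely illustrative. The failure mode described above would show up here as the Epoch-2 Get returning Epoch-1 data.

```c
/* Minimal illustrative sketch (not the actual small_1sided.c): each rank
 * exposes an int window with Post/Wait and reads the peer's value with
 * Start/Get/Complete.  "Epoch #1" and "Epoch #2" correspond to the two
 * loop iterations; the initial fence mirrors the T1 Fence in the test. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, peer, buf, result, expected;
    MPI_Group world_group, peer_group;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    peer = (rank + 1) % nprocs;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    buf = rank;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* T1: Fence at both ranks       */

    for (int epoch = 1; epoch <= 2; epoch++) {
        buf = rank + 100 * epoch;          /* new data exposed each epoch   */
        result = -1;

        MPI_Win_post(peer_group, 0, win);  /* exposure epoch begins         */
        MPI_Win_start(peer_group, 0, win); /* access epoch begins           */
        MPI_Get(&result, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);             /* access epoch ends             */
        MPI_Win_wait(win);                 /* exposure epoch ends           */

        expected = peer + 100 * epoch;
        if (result != expected)            /* epoch-1 data arriving in epoch 2
                                              would be caught here          */
            printf("rank %d epoch %d: got %d, expected %d\n",
                   rank, epoch, result, expected);
    }

    MPI_Win_free(&win);
    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
```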

jjhursey commented:

The timeline below shows an interleaving of two ranks that always leads to a wrong answer.

| Time | Rank 0                 | Rank 1                 |
|------|------------------------|------------------------|
| T1   | Fence                  | Fence                  |
| T2   | Post                   |                        |
| T3   | Start                  |                        |
| T4   | Get                    |                        |
| T5   | Complete (blocks here) |                        |
| T6   |                        | Post                   |
| T7   |                        | Start                  |
| T8   |                        | Get                    |
| T9   |                        | Complete               |
| T10  | Wait                   |                        |
| T11  | Bcast                  |                        |
| T12  | Post                   |                        |
| T13  | Start                  |                        |
| T14  | Get                    |                        |
| T15  | Complete (blocks here) |                        |
| T16  |                        | Wait                   |
| T17  |                        | Bcast                  |
| T18  | Wait (blocks here)     |                        |
| T19  |                        | Post                   |
| T20  |                        | Start                  |
| T21  |                        | Get                    |
| T22  |                        | Complete               |
| T23  | Wrong Answer detected  |                        |

Notice that by T18 the Complete at Rank 0 has finished erroneously, at which point the data is exchanged from Rank 1, which is still in the previous epoch. Since the data exchanged came from the previous exposure epoch, it is a Wrong Answer when checked at T23.

Notice also that the Post at Rank 1 (T6) occurs after Rank 0 is already waiting in the Complete (T5), which is fine. However, that Post is marked by Rank 0 as an unexpected Post for the next epoch, which is not correct. This may contribute to the problem noticed at T18.

We believe the Fence before the test starts is necessary to trigger this. So we also suspect that the eager_send_active variable (and/or some other member of the sync data structure) is not being reset correctly to mark the start of a new epoch.

jsquyres commented:

Per the 12/13 WebEx discussion between @jjhursey and @hjelmn, IBM expects to have a fix today.


jjhursey commented Dec 13, 2016

The patch below seems to fix the issue with the single-threaded test (1sided.c). We are still testing/reviewing. Once that is finished, and assuming all is well, I'll put up a PR. I just wanted to post here for those playing along at home:

diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
index 3c086a42..4ecad212 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
@@ -227,6 +227,12 @@ int ompi_osc_pt2pt_start (ompi_group_t *group, int assert, ompi_win_t *win)
     /* haven't processed any post messages yet */
     sync->sync_expected = sync->num_peers;
 
+    /* If the previous epoch was from Fence, then eager_send_active is still
+     * set to true at this time, but it shouldn't be true until we get our
+     * incoming Posts. So reset to 'false' for this new epoch.
+     */
+    sync->eager_send_active = false;
+
     OPAL_OUTPUT_VERBOSE((50, ompi_osc_base_framework.framework_output,
                          "ompi_osc_pt2pt_start entering with group size %d...",
                          sync->num_peers));

We are also tracking a multithreaded one-sided issue at the moment which is related to this one. That is the mt_1sided.c test case, which just runs the 1sided.c test case in two parallel threads (on two different dups of MPI_COMM_WORLD). That seems to hang or segv. Likely related to Issue #2530.
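
For reference, here is a rough sketch of the shape just described for mt_1sided.c. This is not the real test; it only illustrates "the one-sided pattern run from two threads on two different dups of MPI_COMM_WORLD" (each thread uses a simple fence/Get epoch here), which requires MPI_THREAD_MULTIPLE.

```c
/* Hypothetical sketch (not the real mt_1sided.c): two threads, each
 * driving a one-sided test on its own dup of MPI_COMM_WORLD. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *run_1sided(void *arg)
{
    MPI_Comm comm = *(MPI_Comm *)arg;
    int rank, nprocs, peer, buf = 0, result = -1;
    MPI_Win win;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    peer = (rank + 1) % nprocs;

    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    buf = rank;                     /* data exposed for the peer's Get       */
    MPI_Win_fence(0, win);
    MPI_Get(&result, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (result != peer)
        printf("rank %d: got %d, expected %d\n", rank, result, peer);

    MPI_Win_free(&win);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t tid[2];
    MPI_Comm comm[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* One communicator dup per thread so the windows never share a comm. */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm[0]);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm[1]);

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, run_1sided, &comm[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    MPI_Comm_free(&comm[0]);
    MPI_Comm_free(&comm[1]);
    MPI_Finalize();
    return 0;
}
```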

jjhursey added a commit to jjhursey/ompi that referenced this issue Dec 14, 2016
jjhursey commented:

Although PR #2590 fixed a number of the problems, in more extensive testing over the weekend we found another wrong answer with osc/pt2pt.

@markalle Can we share a smallish version of that test case so @hjelmn can help out with it?

hppritcha commented:

@jjhursey Could you provide more info about the particular case that gave the wrong answer? Does it look like a particular use case for OSC, or was it just a matter of chance/timing that the test failed?

jjhursey commented:

The outline of this phase of the test program looks roughly like this (at all ranks):

fence
for N iterations
   accumulate (against various ranks)
fence
fence
put
fence
fence
for N iterations
   get
fence

I'll work with @markalle on trying to get a smaller version to share.
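
Purely to make that outline concrete, here is a rough sketch of this phase as MPI calls. This is not the actual test; the counts, displacements, targets, and window layout are placeholders, and `win` is assumed to be a window covering at least one int on every rank.

```c
/* Rough illustration of the fence/accumulate/put/get phase outlined
 * above; all parameters are placeholders, not the real test. */
#include <mpi.h>
#include <stdlib.h>

static void fence_phase(MPI_Win win, int n, int nprocs, int peer)
{
    int val = 1;
    int *results = malloc(n * sizeof(int));

    MPI_Win_fence(0, win);
    for (int i = 0; i < n; i++)                      /* accumulate against   */
        MPI_Accumulate(&val, 1, MPI_INT, i % nprocs, /* various ranks        */
                       0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);

    MPI_Win_fence(0, win);
    MPI_Put(&val, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    MPI_Win_fence(0, win);
    for (int i = 0; i < n; i++)                      /* distinct local slots */
        MPI_Get(&results[i], 1, MPI_INT,             /* for each Get         */
                i % nprocs, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    free(results);
}
```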

jjhursey commented:

@markalle and I worked on this test case a bit more yesterday, and we think it is a test-case issue. I'm working this morning to confirm that. If confirmed, we can close this issue and open another to track the multi-threading issues.

jjhursey commented:

Internal testing confirmed that, for the single-threaded case, the patches taken in as part of PR #2590 and PR #2594 fix this issue.

To avoid a hang, we also need PR #2593.
