osc pt2pt wrong answer #2505

Closed
markalle opened this issue Dec 2, 2016 · 13 comments

markalle commented Dec 2, 2016

Using the test case small_1sided.c from this test-harness pull request:
https://github.com/open-mpi/ompi-tests/pull/25
I get wrong answers from the osc pt2pt component:

% mpicc -o x small_1sided.c
% mpirun -host hostA,hostB -mca osc pt2pt -mca pml ob1 -mca btl openib,self,vader ./x
% mpirun -host hostA,hostB -mca osc pt2pt -mca pml ob1 -mca btl tcp,self,vader ./x

In cases where it's able to use osc rdma it seems to work.

This was from a vanilla build of openmpi-master-201612022109-0366f3a.


jsquyres commented Dec 7, 2016

@hjelmn Can you look at this one, too? (perhaps related to #2530?)


jjhursey commented Dec 8, 2016

A couple more data points running 1sided.c like this on a single node:

mpirun -np 2 -mca pml ob1 -mca osc pt2pt -mca btl VALUE ./small_1sided

| btl VALUE    | Result                                     |
|--------------|--------------------------------------------|
| self,sm      | PASS                                       |
| self,vader   | Wrong answer (near the start of the test)  |
| self,tcp     | HANG (near the start of the test)          |
| self,openib  | HANG                                       |

For one of the hangs this is the backtrace (both processes are in the same location):

(gdb) bt
#0  0x00003fff77da2e44 in clock_gettime () from /lib64/power8/libc.so.6
#1  0x00003fff7773d37c in gettime (base=0x10034a1f2e0, tp=0x10034a1f450) at event.c:370
#2  0x00003fff77741f40 in event_base_loop (base=0x10034a1f2e0, flags=<optimized out>) at event.c:1624
#3  0x00003fff778a38b8 in opal_progress () at ompi/opal/runtime/opal_progress.c:207
#4  0x00003fff74f9a990 in opal_condition_wait (c=0x10034bbac38, m=0x10034bbabd0)
    at ompi/opal/threads/condition.h:72
#5  0x00003fff74f9d0dc in ompi_osc_pt2pt_wait (win=0x10034bba9a0)
    at ompi/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c:519
#6  0x00003fff780c8318 in PMPI_Win_wait (win=0x10034bba9a0) at pwin_wait.c:52
#7  0x000000001000317c in main_test_fn ()
#8  0x0000000010004348 in runtest ()
#9  0x00000000100043d4 in main ()
(gdb) 
#5  0x00003fff74f9d0dc in ompi_osc_pt2pt_wait (win=0x10034bba9a0)
    at ompi/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c:519
519	        opal_condition_wait(&module->cond, &module->lock);
(gdb) l
514	             module->active_incoming_frag_count != module->active_incoming_frag_signal_count) {
515	        OPAL_OUTPUT_VERBOSE((25, ompi_osc_base_framework.framework_output, "num_complete_msgs = %d, "
516	                             "active_incoming_frag_count = %d, active_incoming_frag_signal_count = %d",
517	                             module->num_complete_msgs, module->active_incoming_frag_count,
518	                             module->active_incoming_frag_signal_count));
519	        opal_condition_wait(&module->cond, &module->lock);
520	    }
521	
522	    group = module->pw_group;
523	    module->pw_group = NULL;

The whole module structure is a bit big, but here are some of the top-level variables from the first process:

  accumulate_ordering = true, 
  no_locks = false, 
  free_after = 0x0, 
  baseptr = 0x3fff99df0010, 
  comm = 0x1003d0fea50, 
  disp_unit = 4, 
  epoch_outgoing_frag_count = 0x1003d0f53e0, 
  tag_counter = 2473132, 
  outgoing_frag_count = 1911377, 
  outgoing_frag_signal_count = 1911377, 
  active_incoming_frag_count = 933106, 
  active_incoming_frag_signal_count = 933106, 
  passive_target_access_epoch = 0, 
  pw_group = 0x1003d1188d0, 
  num_complete_msgs = -1, 
  lock_status = 0, 
  recv_frags = 0x1003d101230, 
  recv_frag_count = 4, 

Here is the same information from the second process:

  accumulate_ordering = true, 
  no_locks = false, 
  free_after = 0x0, 
  baseptr = 0x3fff74a20010, 
  comm = 0x10034bbe9f0, 
  disp_unit = 4, 
  epoch_outgoing_frag_count = 0x10034ba17d0, 
  tag_counter = 2473128, 
  outgoing_frag_count = 1911374, 
  outgoing_frag_signal_count = 1911374, 
  active_incoming_frag_count = 933108, 
  active_incoming_frag_signal_count = 933108, 
  passive_target_access_epoch = 0, 
  pw_group = 0x10034bd8870, 
  num_complete_msgs = 1, 
  lock_status = 0, 
  recv_frags = 0x10034bc12a0, 
  recv_frag_count = 4, 

I don't know the one-sided code, but from the variables above, num_complete_msgs is -1 in one of the processes. Is that an allowed sentinel value or something?


jjhursey commented Dec 9, 2016

It should be noted that v1.10 seems to be fine. So this is a regression in the v2.x series and in master.

gpaulsen commented:

@hjelmn Have you been able to make any progress on this?

I know @jjhursey and @markalle were working on this over the weekend, and noticed some inconsistencies in completion counting in osc_pt2pt in v2.0.x (and master). Comparing with v1.10, it looks like that code was restructured quite a bit, so it's difficult to pinpoint a specific commit.

gpaulsen commented:

@jjhursey made some progress today as well. I'm not posting the full details here, but if others are working on this, it'd be good to sync up with Josh.

The problem seems to be that Rank 0 gets ahead and is waiting in the Complete call for Epoch #2 while Rank 1 is just past the Wait call for Epoch #1. In general, this is OK. The problem is what happens next.

The Wait call in Rank 1 notices that Rank 0 has sent it a Complete request, and sends back the data (but this is the data exposed in Epoch 1, not Epoch 2, which Rank 1 has not started yet). This causes Rank 0 to receive the previous epoch's data, which eventually fails as a Wrong Answer.

If you look at the Wrong Answer dump, you will notice that the buffer is only off by one, which makes sense if it received the previous epoch's data.

I don't know what the fix is yet, but at least we can easily reproduce the problem and know what to look for.
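
To make the epoch numbering above concrete, here is a minimal sketch of the fence-then-Post/Start/Get/Complete/Wait pattern being discussed. This is not the actual small_1sided.c from the ompi-tests PR; the buffer contents and peer selection are purely illustrative. The failure mode described above would show up here as the Epoch-2 Get returning Epoch-1 data.

```c
/* Minimal illustrative sketch (not the actual small_1sided.c): each rank
 * exposes an int window with Post/Wait and reads the peer's value with
 * Start/Get/Complete.  "Epoch #1" and "Epoch #2" correspond to the two
 * loop iterations; the initial fence mirrors the T1 Fence in the test. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, peer, buf, result, expected;
    MPI_Group world_group, peer_group;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    peer = (rank + 1) % nprocs;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    buf = rank;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* T1: Fence at both ranks       */

    for (int epoch = 1; epoch <= 2; epoch++) {
        buf = rank + 100 * epoch;          /* new data exposed each epoch   */
        result = -1;

        MPI_Win_post(peer_group, 0, win);  /* exposure epoch begins         */
        MPI_Win_start(peer_group, 0, win); /* access epoch begins           */
        MPI_Get(&result, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);             /* access epoch ends             */
        MPI_Win_wait(win);                 /* exposure epoch ends           */

        expected = peer + 100 * epoch;
        if (result != expected)            /* epoch-1 data arriving in epoch 2
                                              would be caught here          */
            printf("rank %d epoch %d: got %d, expected %d\n",
                   rank, epoch, result, expected);
    }

    MPI_Win_free(&win);
    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
```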

jjhursey commented:

The timeline below shows an interleaving of two ranks that always leads to a wrong answer.

| Time | Rank 0                 | Rank 1                 |
|------|------------------------|------------------------|
| T1   | Fence                  | Fence                  |
| T2   | Post                   |                        |
| T3   | Start                  |                        |
| T4   | Get                    |                        |
| T5   | Complete (blocks here) |                        |
| T6   |                        | Post                   |
| T7   |                        | Start                  |
| T8   |                        | Get                    |
| T9   |                        | Complete               |
| T10  | Wait                   |                        |
| T11  | Bcast                  |                        |
| T12  | Post                   |                        |
| T13  | Start                  |                        |
| T14  | Get                    |                        |
| T15  | Complete (blocks here) |                        |
| T16  |                        | Wait                   |
| T17  |                        | Bcast                  |
| T18  | Wait (blocks here)     |                        |
| T19  |                        | Post                   |
| T20  |                        | Start                  |
| T21  |                        | Get                    |
| T22  |                        | Complete               |
| T23  | Wrong Answer detected  |                        |

Notice that by T18 the Complete at Rank 0 has finished erroneously, at which point the data is exchanged from Rank 1, which is still in the previous epoch. Since the data exchanged came from the previous exposure epoch, it is a Wrong Answer when checked at T23.

Notice also that the Post at Rank 1 (T6) occurs after Rank 0 is already waiting in the Complete (T5), which is fine. However, that Post is marked by Rank 0 as an unexpected Post for the next epoch, which is not correct. This may contribute to the problem noticed at T18.

We believe the Fence before the test starts is necessary to trigger this. So we also suspect that the eager_send_active variable (and/or some other member of the sync data structure) is not being reset correctly to mark the start of a new epoch.

jsquyres commented:

Per the 12/13 WebEx discussion between @jjhursey and @hjelmn, IBM expects to have a fix today.


jjhursey commented Dec 13, 2016

The patch below seems to fix the issue with the single-threaded test (1sided.c). We are still testing/reviewing. Once that is finished, and assuming all is well, I'll put up a PR. I just wanted to post here for those playing along at home:

diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
index 3c086a42..4ecad212 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
@@ -227,6 +227,12 @@ int ompi_osc_pt2pt_start (ompi_group_t *group, int assert, ompi_win_t *win)
     /* haven't processed any post messages yet */
     sync->sync_expected = sync->num_peers;
 
+    /* If the previous epoch was from Fence, then eager_send_active is still
+     * set to true at this time, but it shouldn't be true until we get our
+     * incoming Posts. So reset to 'false' for this new epoch.
+     */
+    sync->eager_send_active = false;
+
     OPAL_OUTPUT_VERBOSE((50, ompi_osc_base_framework.framework_output,
                          "ompi_osc_pt2pt_start entering with group size %d...",
                          sync->num_peers));

We are also tracking a multithreaded one-sided issue at the moment which is related to this one. That is the mt_1sided.c test case, which just runs the 1sided.c test case in two parallel threads (on two different dups of MPI_COMM_WORLD). That seems to hang or segv. Likely related to Issue #2530.
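
For reference, here is a rough sketch of the shape just described for mt_1sided.c. This is not the real test; it only illustrates "the one-sided pattern run from two threads on two different dups of MPI_COMM_WORLD" (each thread uses a simple fence/Get epoch here), which requires MPI_THREAD_MULTIPLE.

```c
/* Hypothetical sketch (not the real mt_1sided.c): two threads, each
 * driving a one-sided test on its own dup of MPI_COMM_WORLD. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *run_1sided(void *arg)
{
    MPI_Comm comm = *(MPI_Comm *)arg;
    int rank, nprocs, peer, buf = 0, result = -1;
    MPI_Win win;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    peer = (rank + 1) % nprocs;

    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);
    buf = rank;                     /* data exposed for the peer's Get       */
    MPI_Win_fence(0, win);
    MPI_Get(&result, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    if (result != peer)
        printf("rank %d: got %d, expected %d\n", rank, result, peer);

    MPI_Win_free(&win);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t tid[2];
    MPI_Comm comm[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* One communicator dup per thread so the windows never share a comm. */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm[0]);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm[1]);

    for (int i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, run_1sided, &comm[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    MPI_Comm_free(&comm[0]);
    MPI_Comm_free(&comm[1]);
    MPI_Finalize();
    return 0;
}
```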

jjhursey added a commit to jjhursey/ompi that referenced this issue Dec 14, 2016
jjhursey commented:

Although PR #2590 fixed a number of the problems, in more extensive testing over the weekend we found another wrong answer with osc/pt2pt.

@markalle Can we share a smallish version of that test case so @hjelmn can help out with it?

hppritcha commented:

@jjhursey Could you provide more info about the particular case that gave the wrong answer? Does it look like a particular use case for OSC, or was it just a matter of chance/timing that the test failed?

jjhursey commented:

The outline of this phase of the test program looks roughly like this (at all ranks):

fence
for N iterations
   accumulate (against various ranks)
fence
fence
put
fence
fence
for N iterations
   get
fence

I'll work with @markalle on trying to get a smaller version to share.
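
Purely to make that outline concrete, here is a rough sketch of this phase as MPI calls. This is not the actual test; the counts, displacements, targets, and window layout are placeholders, and `win` is assumed to be a window covering at least one int on every rank.

```c
/* Rough illustration of the fence/accumulate/put/get phase outlined
 * above; all parameters are placeholders, not the real test. */
#include <mpi.h>
#include <stdlib.h>

static void fence_phase(MPI_Win win, int n, int nprocs, int peer)
{
    int val = 1;
    int *results = malloc(n * sizeof(int));

    MPI_Win_fence(0, win);
    for (int i = 0; i < n; i++)                      /* accumulate against   */
        MPI_Accumulate(&val, 1, MPI_INT, i % nprocs, /* various ranks        */
                       0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_fence(0, win);

    MPI_Win_fence(0, win);
    MPI_Put(&val, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    MPI_Win_fence(0, win);
    for (int i = 0; i < n; i++)                      /* distinct local slots */
        MPI_Get(&results[i], 1, MPI_INT,             /* for each Get         */
                i % nprocs, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    free(results);
}
```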

jjhursey commented:

@markalle and I worked on this test case a bit more yesterday, and we think it is a test-case issue. I'm working this morning to confirm that. If confirmed, we can close this issue and open another to track the multi-threading issues.

jjhursey commented:

Internal testing confirmed that, for the single-threaded case, the patches taken in as part of PR #2590 and PR #2594 fix this issue.

To avoid a hang, we also need PR #2593.
