
osc rdma hang with multiple windows #2530

Closed
markalle opened this issue Dec 6, 2016 · 13 comments
markalle commented Dec 6, 2016

Using the testcase mt_1sided.c (and its support files 1sided.c mt_1sided_td1.c mt_1sided_td2.c) from this test harness pull request:
https://github.com/open-mpi/ompi-tests/pull/25
I get a hang from the osc rdma component:

% mpicc -o x mt_1sided.c mt_1sided_td1.c mt_1sided_td2.c
% mpirun -host hostA,hostB -mca osc rdma -mca pml ob1 -mca btl openib,self,vader ./x

This was from a vanilla build of openmpi-master-201612022109-0366f3a.

The single-threaded "1sided.c" passes. I think the key difference in the multi-threaded version is that there are two windows active (one per thread).

1sided.c is a fairly broad test, looping over several synchronization types as well as contiguous and non-contiguous datatypes. I could probably whittle the test to be a bit more targeted for this particular hang if needed.
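For readers without access to the test files, the pattern under discussion can be sketched roughly as below. This is not the actual mt_1sided.c; it is a minimal illustrative sketch of "two threads, each driving passive-target RMA on its own window" (the names `win_ctx_t`, `rma_loop`, and the counts/iteration numbers are invented here; the real test also loops over several synchronization types and datatypes):

```c
/* Minimal sketch (NOT the actual mt_1sided.c): two threads, each doing
 * lock/put/unlock on its own MPI window. Requires MPI_THREAD_MULTIPLE.
 * Run with two or more ranks, e.g.: mpirun -np 2 ./sketch */
#include <mpi.h>
#include <pthread.h>

#define COUNT 1024
#define ITERS 100

typedef struct { MPI_Win win; int *base; int rank, nranks; } win_ctx_t;

static void *rma_loop(void *arg)
{
    win_ctx_t *ctx = (win_ctx_t *)arg;
    int peer = (ctx->rank + 1) % ctx->nranks;
    int buf[COUNT];
    for (int i = 0; i < COUNT; ++i) buf[i] = ctx->rank;

    for (int iter = 0; iter < ITERS; ++iter) {
        /* Passive-target epoch on this thread's own window; the hang
         * reported above occurred with two such windows active at once. */
        MPI_Win_lock(MPI_LOCK_SHARED, peer, 0, ctx->win);
        MPI_Put(buf, COUNT, MPI_INT, peer, 0, COUNT, MPI_INT, ctx->win);
        MPI_Win_unlock(peer, ctx->win);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) MPI_Abort(MPI_COMM_WORLD, 1);

    win_ctx_t ctx[2];
    pthread_t tid[2];
    for (int t = 0; t < 2; ++t) {
        MPI_Comm_rank(MPI_COMM_WORLD, &ctx[t].rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ctx[t].nranks);
        /* Each thread gets its own window (collective call, so done
         * on the main thread before the workers start). */
        MPI_Win_allocate(COUNT * sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &ctx[t].base, &ctx[t].win);
    }
    for (int t = 0; t < 2; ++t) pthread_create(&tid[t], NULL, rma_loop, &ctx[t]);
    for (int t = 0; t < 2; ++t) pthread_join(tid[t], NULL);
    for (int t = 0; t < 2; ++t) MPI_Win_free(&ctx[t].win);
    MPI_Finalize();
    return 0;
}
```

The actual test files are in the gist linked below in this thread.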

jsquyres commented Dec 7, 2016

@hjelmn Can you have a look?

@rkowalewski

@markalle Your link URL does not work. We probably have the same issue regarding communication progress with multiple windows in our project. Could you please fix the URL so I can verify my assumption?

jsquyres commented Dec 7, 2016

@rkowalewski Our ompi-tests repository is not public; that's why the link is broken for you.

@markalle Are you amenable to sharing your new tests with @rkowalewski? You could post them in a gist, or attach them here on this issue.

rkowalewski commented Dec 7, 2016

@jsquyres @markalle I spent more time with this. Finally, I have pinpointed the bug and created another issue: #2538
Maybe there is some relation between these two issues.

markalle commented Dec 7, 2016

@jsquyres @rkowalewski Yeah, since we put it in ompi-tests it's considered at least public-ish. Even what I've checked in there is a whittled-down version of our original tests; these smaller, more targeted tests are easier to get management approval to share.

markalle commented Dec 7, 2016

Here, see if I made this gist right:
https://gist.github.com/markalle/56f739c3c907e53960d22d5de2fe4bc9
It should have 1sided.c, mt_1sided.c, mt_1sided_td1.c and mt_1sided_td2.c.

jsquyres commented Dec 7, 2016

Perfect -- thanks @markalle!

hjelmn commented Jan 31, 2017

Please test with the latest 2.0.2 release candidate.

@markalle

I guess we've got two issue reports open for the same testcase: one for osc=rdma (#2530) and one for osc=pt2pt (#2614). I updated the other bug report, but will paste the same info here:

I built a vanilla OMPI v2.0.x with --enable-mpi-thread-multiple, and my results were:

mpirun -mca osc rdma -host hostA:2 ... : passed
mpirun -mca osc rdma -host hostA:4 ... : passed
mpirun -mca osc rdma -host hostA:1,hostB:1 ... : segv quickly
mpirun -mca osc pt2pt ... : message that "OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release"

hjelmn commented Feb 9, 2018

Finally found the issue. It is not a threading issue per se. There is an error in the accumulate master function that ends up leaking requests. Eventually things fall apart and it hangs, crashes, etc. I am testing a fix now.

hjelmn commented Feb 12, 2018

Testing is looking good. PR coming shortly.

hjelmn added a commit to hjelmn/ompi that referenced this issue Mar 15, 2018
This commit is a large update to the osc/rdma component. Included in
this commit:

 - Add support for using hardware atomics for fetch-and-op and
   single-count accumulate when using the accumulate lock. This will
   improve the performance of these operations even when the
   single intrinsic info key is not set.

 - Rework how large accumulates are done. They now block on the get
   operation to fix some bugs discovered by an IBM one-sided test. I
   may roll back some of the changes if the underlying bug in the
   original design is discovered. There appears to be no real
   difference in performance (on the hardware this was tested with),
   so it's probably a non-issue. References open-mpi#2530.

 - Add support for an additional lock-all algorithm: on-demand. The
   on-demand algorithm will attempt to acquire the peer lock when
   starting an RMA operation. The lock algorithm default has not
   changed. The algorithm can be selected by setting the
   osc_rdma_locking_mode MCA variable. The valid values are two_level
   and on_demand.
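
   As a usage note (the host names and executable here are placeholders
   borrowed from the reproducer earlier in this thread), the new locking
   mode would be selected on the command line like so:

   ```shell
   # Select the on-demand lock-all algorithm for osc/rdma
   # (the default remains two_level)
   mpirun -host hostA,hostB \
          --mca osc rdma \
          --mca osc_rdma_locking_mode on_demand \
          ./x
   ```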

 - Make use of the btl_flush function if available. This can improve
   performance with some btls.

 - When using btl_flush do not keep track of the number of put
   operations. This reduces the number of atomic operations in the
   critical path.

 - Make the window buffers more friendly to multi-threaded
   applications. This was done by dropping support for multiple
   buffers per MPI window. I intend to re-add that support once the
   underlying performance bug under the old buffering scheme is
   fixed.

 - Fix a bug in request completion in the accumulate, get, and put
   paths. This also helps with open-mpi#2530.

 - General code cleanup and fixes.

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit that referenced this issue Mar 15, 2018
(Same commit message as above; references #2530.)
@hjelmn hjelmn added this to the v3.1.1 milestone May 8, 2018
hjelmn added a commit to hjelmn/ompi that referenced this issue May 8, 2018
This commit contains the contents of:

45db363
7f4872d

to the v3.1.x branch. These commits fix a couple of bugs and improve
the threading support (references open-mpi#2530).

To keep the code mostly in sync with master I added code to
osc_rdma_types.h to convert between the atomics support on master and
v3.1.x.

Signed-off-by: Nathan Hjelm <[email protected]>
bwbarrett pushed a commit that referenced this issue May 14, 2018
(Same commit message as above; references #2530.)
@bwbarrett

@hjelmn are you ok with closing this ticket?

@bwbarrett bwbarrett modified the milestones: v3.1.1, v3.1.2 Jun 12, 2018
hjelmn commented Jun 12, 2018

Yes.

@hjelmn hjelmn closed this as completed Jun 12, 2018