Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coll/han/alltoallv Bugfixes #12806

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lrbison
Copy link
Contributor

@lrbison lrbison commented Sep 9, 2024

This series of commits fixes several errors, and changes behavior on error conditions:

  1. This alltoallv was not written to support in-place transforms, but I forgot to check for it. Add a simple catch for that case.
  2. The looping logic was bad in cases where ranks had no data to receive. Fix that logic.
  3. There was an incorrect swap between send/recv datatypes.
  4. The support for datatypes with negative lower bounds was broken.
  5. Fix a logic error which resulted in consuming data out-of-order
  6. Fix cases where the convertor would refuse to use the last few bytes of a buffer.

While debugging an application which exposed (2) above, I found that the rank which first experienced the error tried to cleanly exit with an error return code. However doing so while other ranks are accessing the shared memory region exposed by SMSC caused all the other local ranks to segfault. This hampered my progress during debug. To address it, I have moved the barrier that we previously only passed during successful exits, to be passed during all exit conditions. This will hang the execution instead of segfaulting other local ranks. I think this is better but I welcome other perspectives here.

Copy link

github-actions bot commented Sep 9, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

e6c34d3: coll/han/allreduce: squelch a particularly noisy l...

  • check_signed_off: does not contain a valid Signed-off-by line

cb82581: coll/han/alltoallv: handle errors better

  • check_signed_off: does not contain a valid Signed-off-by line

e3e9e1f: coll/han/alltoallv: correct loop condition on empt...

  • check_signed_off: does not contain a valid Signed-off-by line

577968e: coll/han: alltoallv should fall-back if transform ...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@lrbison
Copy link
Contributor Author

lrbison commented Sep 11, 2024

A status update: These commits are good as they are, but I think I have found evidence of possible data corruption. I am working to identify and fix the issue, and then I'll add them to this PR.

Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

a50519c: coll/han/alltoallv: Correct swapped send/recv data...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@lrbison lrbison force-pushed the alltoallv_empty_msgs branch 2 times, most recently from 18a5ada to bf267b3 Compare September 20, 2024 03:01
@lrbison
Copy link
Contributor Author

lrbison commented Sep 20, 2024

@bosilca At long last this is ready for review again. Thank you for your patience. I have made a rather exhaustive test suite. I am trying to figure out the best place to add it. Perhaps to MTT I suppose. Any other suggestions?

@lrbison lrbison changed the title coll/han/alltoallv Bugfix: empty messages coll/han/alltoallv Bugfixes Sep 20, 2024
ompi/mca/coll/han/coll_han_allreduce.c Outdated Show resolved Hide resolved
ompi/mca/coll/han/coll_han_alltoallv.c Outdated Show resolved Hide resolved
ompi/mca/coll/han/coll_han_alltoallv.c Show resolved Hide resolved
ompi/mca/coll/han/coll_han_alltoallv.c Outdated Show resolved Hide resolved
if (rc) { break; };
}
if (rc) {
opal_output_verbose(1, mca_coll_han_component.han_output,
"Failed in alltoallv_sendrecv_w_direct_for_debugging: jloop=%d, rc=%d\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what debugging ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function is inefficient and not used. However if you doubt that something is correct in all the business logic of the main sendrecv_w, then it is a easy code change to use this function to double-check.

Is there a better way to keep such a function but also keep it out of the way?

ompi/mca/coll/han/coll_han_alltoallv.c Outdated Show resolved Hide resolved
We take path 2 here, to avoid possible race conditions between the
cancel and the possibility of a matching recv.

After realizing that the convertor may decide to not fully pack the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly confused about the meaning of this ? The convertor might decide to avoid stopping in the middle of a predefined type. Is this what this comment is about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's what I intended to say. I observed a strange type consisting of 1 char and 1 int, and the convertor refused to use the last 3 bytes of the buffer, which threw off my original counting scheme.

@lrbison
Copy link
Contributor Author

lrbison commented Sep 20, 2024

Thank you George. I will be on a vacation for the next week and then I will address your comments!

@lrbison
Copy link
Contributor Author

lrbison commented Oct 4, 2024

@bosilca I believe I have addressed all your comments. Do you have any further review?

@bosilca
Copy link
Member

bosilca commented Oct 7, 2024

please squash before merge.

Fix several bugs in the original implementation:
 - Fallback if transform is in-place
 - Fix bug related to empty messages
 - Introduce barrier before exit during error
 - Handle datatypes with negative lower bounds
 - Fix a completion-ordering bug
 - Cleanup some unused variables

Signed-off-by: Luke Robison <[email protected]>
Allow the user to change the packing buffer size and how many maximum
buffers will be allocated.  Currently only han's alltoallv uses the
buffers.

Signed-off-by: Luke Robison <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants