Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predefined error handler such as MPI_ERRORS_ARE_FATAL does not work with main branch #11817

Closed
wzamazon opened this issue Jul 10, 2023 · 9 comments

Comments

@wzamazon
Copy link
Contributor

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Compiled from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

22fe51cb7a961b6060fc5c48e659237cbe162566 3rd-party/openpmix (v1.1.3-3872-g22fe51cb)
 ece4f3c45a07a069e5b8f9c5e641613dfcaeffc3 3rd-party/prrte (psrvr-v2.0.0rc1-4638-gece4f3c45a)
 c1cfc910d92af43f8c27807a9a84c9c13f4fbc65 config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: Intel CPU
  • Network type: EFA

Details of the problem

As can be seen from mtt test result, the MPI_Errhandler_fatal test is not working as expected.

The test set the error handler to MPI_ERRORS_ARE_FATAL, then called MPI_Send to a non-existing rank.

It should have caused the application to abort immediately, but MPI_Send returned an error.

@wzamazon
Copy link
Contributor Author

I did some debug and I found the error was caused by the following commit:

commit 2d68804b21cd759be1feca85a8059eba82879882
Author: Edgar Gabriel <[email protected]>
Date:   Sat Feb 11 09:02:48 2023 -0600

    communicator: make c_name a dynamic array and reorder struct
    
    make the c_name element of the communicator structure a dynamic
    element. This allows us to reduce the size of PREDEFINED_COMMUNICATOR_PAD
    back to 512 to maintain backwards compatibility with the ompi 4.1.x release
    series.
    
    Reorder the communicator fields to reduce the struct size.
    This brings the communicator size at 536 bytes with FT, PERUSE enabled
    and compiled in debug mode.
    
    Fixes issue #11373
    
    Signed-off-by: Edgar Gabriel <[email protected]>
    Signed-off-by: George Bosilca <[email protected]>

This commit removed the errhandler_type field from ompi_communicator_t to save space, and used comm->error_handler->eh_mpi_object_typeas replacement.

However, for predefined error handle, its eh_mpi_object_type is OMPI_ERRHANDLER_TYPE_PREDEFINED, not OMPI_ERRHANDLER_TYPE_COMM. As a result, when a communicator encountered error, the error handle was not called.

@edgargabriel
Copy link
Member

@wzamazon thank you for the debugging, I can have a look, but it might be the end of the week until I get to it.

@wzamazon
Copy link
Contributor Author

@edgargabriel

Thank you!

I think I have a solution.

I looked into the change. IIUC, the removal of errhandler_type is not absolutely needed. The current mpi_communicator_t is 448 byte, well below the 512 byte limit.

So I think we can add errhandler_type back.

@edgargabriel
Copy link
Member

@wzamazon if that fixes the problem, that would be fantastic. This is what I would have looked into as well btw.

@wzamazon
Copy link
Contributor Author

#11818 did that. Please take a look. Thank you!

@jsquyres
Copy link
Member

@wzamazon Did we break ERRORS_ARE_FATAL on v5.0.x as well? If so, this is a blocker.

@wzamazon
Copy link
Contributor Author

wzamazon commented Jul 12, 2023

Did we break ERRORS_ARE_FATAL on v5.0.x as well? If so, this is a blocker.

Yes. As can be seen in mtt output:

link

.....
 MPITEST_ERROR(         0): Test should have aborted, FAILED
MPITEST_results: MPI_Errhandler_fatal          1 tests FAILED  (of        145)

@wzamazon
Copy link
Contributor Author

#11818 was merged.

I opened #11821 to backport.

@wzamazon
Copy link
Contributor Author

backport PR has been merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants