
in-place MPI_Alltoallw crashes #9329

Closed

rabauke opened this issue Aug 29, 2021 · 7 comments

Comments

@rabauke

rabauke commented Aug 29, 2021

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI 4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Downloaded sources from https://www.open-mpi.org/software/ompi/v4.1/ and compiled with

./configure --enable-mem-debug --enable-mem-profile --enable-debug

on Ubuntu 20.04 x64.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 20.04
  • Computer hardware: x64 laptop
  • Network type: no network

Details of the problem

The in-place variant of MPI_Alltoallw crashes, as demonstrated by the following test program:

#include "mpi.h"
#include <vector>

int main() {
  MPI_Init(nullptr, nullptr);

  int size, rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::vector<double> v(size, rank);

  std::vector<MPI_Datatype> types;
  for (int i{0}; i < size; ++i) {
    const int length[1] = {1};
    const int displacement[1] = {i};
    MPI_Datatype new_type;
    MPI_Type_indexed(1, length, displacement, MPI_DOUBLE, &new_type);
    MPI_Type_commit(&new_type);
    types.push_back(new_type);
  }

  std::vector<int> counts(size, 1);
  std::vector<int> displacements(size, 0);

  MPI_Alltoallw(MPI_IN_PLACE, nullptr, nullptr, nullptr, v.data(), counts.data(),
                displacements.data(), types.data(), MPI_COMM_WORLD);

  MPI_Finalize();
}

The above program essentially implements a standard in-place MPI_Alltoall and is not useful in itself; it serves purely for demonstration, as the sketch below illustrates.
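For reference, the in-place call should behave like a plain all-to-all exchanging one double per peer. A minimal out-of-place equivalent and a result check (hypothetical, not part of the original reproducer) would be:

// out-of-place equivalent of the exchange above
std::vector<double> recv(size);
MPI_Alltoall(v.data(), 1, MPI_DOUBLE, recv.data(), 1, MPI_DOUBLE, MPI_COMM_WORLD);
// after either variant, slot j on every rank holds rank j's value,
// i.e. v[j] == j (respectively recv[j] == j)

Running the program with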

$ mpirun -np 4 debug 

yields

[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_tcp: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
[tron:93567] pmix_mca_base_component_repository_open: unable to open mca_pnet_test: perhaps a missing symbol, or compiled for a different version of OpenPMIx (ignored)
free(): invalid pointer
double free or corruption (out)
[tron:93572] *** Process received signal ***
[tron:93572] Signal: Aborted (6)
[tron:93572] Signal code:  (-6)
[tron:93572] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fb7f6a2b210]
[tron:93572] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fb7f6a2b18b]
[tron:93572] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fb7f6a0a859]
[tron:93572] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x903ee)[0x7fb7f6a753ee]
[tron:93572] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9847c)[0x7fb7f6a7d47c]
[tron:93572] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x9a120)[0x7fb7f6a7f120]
[tron:93572] [ 6] /usr/local/lib/libopen-pal.so.40(opal_free+0x23)[0x7fb7f67df974]
[tron:93572] [ 7] /usr/local/lib/openmpi/mca_coll_basic.so(+0x473d)[0x7fb7f443873d]
[tron:93572] [ 8] /usr/local/lib/openmpi/mca_coll_basic.so(mca_coll_basic_alltoallw_intra+0x9b)[0x7fb7f44387f1]
[tron:93572] [ 9] /usr/local/lib/libmpi.so.40(PMPI_Alltoallw+0x5ad)[0x7fb7f6e88ebd]
[tron:93572] [10] debug(+0x1510)[0x55a7339ab510]
[tron:93572] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fb7f6a0c0b3]
[tron:93572] [12] debug(+0x122e)[0x55a7339ab22e]
[tron:93572] *** End of error message ***

(The processes with PIDs 93571 and 93574 abort with equivalent backtraces, likewise failing in opal_free called from mca_coll_basic_alltoallw_intra.)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node tron exited on signal 6 (Aborted).
--------------------------------------------------------------------------
@ggouaillardet
Contributor

Thanks for the report!

There is indeed a bug, and I will post a fix shortly.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Aug 30, 2021
The temporary buffer must be shifted by the true_extent on a
per type basis (since the various datatypes might have different
true_extent).

Thanks Heiko Bauke for reporting this.

Refs. open-mpi#9329

Signed-off-by: Gilles Gouaillardet <[email protected]>
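
The reproducer above makes the point concrete: each indexed datatype selects a single double at a different offset, so every type has its own true lower bound. A minimal illustration of querying these per-type values (a sketch reusing the reproducer's types vector, not the Open MPI patch itself):

MPI_Aint true_lb, true_extent;
for (int i{0}; i < size; ++i) {
  MPI_Type_get_true_extent(types[i], &true_lb, &true_extent);
  // for types[i] from the reproducer: true_lb == i * sizeof(double) and
  // true_extent == sizeof(double), so a staging buffer must be shifted
  // per type rather than by a single global extent
}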
@ggouaillardet
Contributor

@rabauke meanwhile, you can manually download and apply the patch at https://github.com/open-mpi/ompi/pull/9330.patch
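
A typical way to apply it to the 4.1.1 tree (assuming the tarball was unpacked into openmpi-4.1.1; the paths are illustrative) is:

$ curl -LO https://github.com/open-mpi/ompi/pull/9330.patch
$ cd openmpi-4.1.1
$ patch -p1 < ../9330.patch

followed by rebuilding and reinstalling.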

@rabauke
Author

rabauke commented Aug 30, 2021

I can confirm that MPI_Alltoallw works as expected after applying patch #9330 to Open MPI 4.1.1. The test program given in this ticket no longer crashes. Furthermore, the unit tests that I am currently writing for MPL, which use MPI_Alltoallw in a slightly more complex context, no longer fail or crash.

@jsquyres
Member

jsquyres commented Oct 7, 2021

Fix merged to master; awaiting PR for v5.0.x.

bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 7, 2021
The temporary buffer must be shifted by the true_extent on a
per type basis (since the various datatypes might have different
true_extent).

Thanks Heiko Bauke for reporting this.

Refs. open-mpi#9329

Signed-off-by: Gilles Gouaillardet <[email protected]>
jsquyres pushed a commit to bosilca/ompi that referenced this issue Oct 7, 2021
The temporary buffer must be shifted by the true_extent on a
per type basis (since the various datatypes might have different
true_extent).

Thanks Heiko Bauke for reporting this.

Refs. open-mpi#9329

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 0041ce8)
@awlauria
Contributor

awlauria commented Oct 12, 2021

@rabauke fyi - we suspect this patch may have created (or maybe exposed?) another bug with IN_PLACE + MPI_Alltoallv(). See #9501.

We're going to hold off bringing back #9330 to release branches until the new issue is resolved.

@rabauke
Author

rabauke commented Oct 12, 2021

Oh, what a pity!

bwbarrett pushed a commit to bwbarrett/ompi that referenced this issue Nov 16, 2021
The temporary buffer must be shifted by the true_extent on a
per type basis (since the various datatypes might have different
true_extent).

Thanks Heiko Bauke for reporting this.

Refs. open-mpi#9329

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 0041ce8)
Signed-off-by: Brian Barrett <[email protected]>
@awlauria
Contributor

awlauria commented Mar 7, 2022

Fix merged to v5.0.x in #9493

Closing.

@awlauria awlauria closed this as completed Mar 7, 2022
awlauria pushed a commit to awlauria/ompi that referenced this issue Jun 23, 2022
The temporary buffer must be shifted by the true_extent on a
per type basis (since the various datatypes might have different
true_extent).

Thanks Heiko Bauke for reporting this.

Refs. open-mpi#9329

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 0041ce8)
Signed-off-by: Brian Barrett <[email protected]>
(cherry picked from commit 8ff0a09)