
Hang in mca_mpool_hugepage_module_init() on ARM64 #3697

Open
yosefe opened this issue Jun 14, 2017 · 28 comments

yosefe commented Jun 14, 2017

Background information

               Package: Open MPI mtt@hpc-arm-01 Distribution
                Open MPI: 2.1.2a1
  Open MPI repo revision: v2.1.1-55-g4d82554
   Open MPI release date: Unreleased developer copy
                Open RTE: 2.1.2a1
  Open RTE repo revision: v2.1.1-55-g4d82554
   Open RTE release date: Unreleased developer copy
                    OPAL: 2.1.2a1
      OPAL repo revision: v2.1.1-55-g4d82554
       OPAL release date: Unreleased developer copy
  • Operating system/version: RedHat 7.2
  • Computer hardware: aarch64 (ARM 64-bit)
  • Network type: Infiniband

Details of the problem

A hang occurs in the ctxalloc test during MPI_Init.
A similar hang is observed in many other tests.
ctxalloc can be found here.

mpirun -np 192 -mca btl_openib_warn_default_gid_prefix 0 \
  --bind-to core -mca pml ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 \
  -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 \
   -mca mpi_add_procs_cutoff 100000 --map-by node \
ctxalloc 2 1500 100

stack trace:

Thread 1 (Thread 0x3ffb800a480 (LWP 27563)):
#0  0x000003ffb7ccfa74 in nanosleep () from /usr/lib64/libpthread.so.0
#1  0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2  0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x620520) at ../opal/class/opal_lifo.h:210
#3  0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x620520) at ../opal/class/opal_free_list.h:213
#4  0x000003ffb7911ff4 in opal_free_list_get (flist=0x620520) at ../opal/class/opal_free_list.h:225
#5  0x000003ffb7912264 in opal_rb_tree_init (tree=0x6204e0, comp=0x3ffb4671ae4 <mca_mpool_rb_hugepage_compare>) at class/opal_rb_tree.c:86
#6  0x000003ffb4671d44 in mca_mpool_hugepage_module_init (mpool=0x620410, huge_page=0x61ef30) at mpool_hugepage_module.c:107
#7  0x000003ffb4672a10 in mca_mpool_hugepage_open () at mpool_hugepage_component.c:166
#8  0x000003ffb7948ab4 in open_components (framework=0x3ffb7a34788 <opal_mpool_base_framework>) at mca_base_components_open.c:117
#9  0x000003ffb79489e4 in mca_base_framework_components_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_components_open.c:65
#10 0x000003ffb79bd2a0 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:89
#11 0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#12 0x000003ffb7d59bb8 in ompi_mpi_init (argc=4, argv=0x3ffffffdc38, requested=0, provided=0x3ffffffda2c) at runtime/ompi_mpi_init.c:589
#13 0x000003ffb7d990e0 in PMPI_Init (argc=0x3ffffffdaac, argv=0x3ffffffdaa0) at pinit.c:66
#14 0x0000000000400c5c in main (argc=4, argv=0x3ffffffdc38) at ctxalloc.c:20

shamisp commented Jun 14, 2017 via email


yosefe commented Jun 14, 2017

Unlikely, this is a fixed 100ns interval:

static inline void _opal_lifo_release_cpu (void)
{
    /* NTH: there are many ways to cause the current thread to be suspended. This one
     * should work well in most cases. Another approach would be to use poll (NULL, 0, ) but
     * the interval will be forced to be in ms (instead of ns or us). Note that there
     * is a performance improvement for the lifo test when this call is made on detection
     * of contention but it may not translate into actually MPI or application performance
     * improvements. */
    static struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep (&interval, NULL);
}
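
A minimal, self-contained sketch of the pattern this helper supports (assumed names, not the Open MPI source): spin on an atomic operation and yield the CPU briefly once contention is detected, so other threads can make progress.

#include <stdatomic.h>
#include <time.h>

/* illustrative stand-in for _opal_lifo_release_cpu() */
static void release_cpu (void)
{
    static const struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep (&interval, NULL);
}

static void spin_until_set (atomic_int *flag)
{
    int attempts = 0;

    while (!atomic_load_explicit (flag, memory_order_acquire)) {
        if (5 == ++attempts) {      /* back-off threshold is illustrative */
            release_cpu ();
            attempts = 0;
        }
    }
}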

jsquyres commented:

Can you provide more information about what exactly is hanging? I.e., this is clearly one stack trace of the hang, but it must be looping on something that never completes.


shamisp commented Jun 14, 2017

@yosefe Can I reproduce this with a single node? Do I really need IB for this? Thanks.


bosilca commented Jun 14, 2017

You don't need IB for this; just force the load of the hugepage mpool. Also, this happens in a very peculiar place, during a lifo_pop operation, which in this particular context uses the conditional load/store of ARM64. As our tests pass on ARM64, I don't think the issue is in the atomic itself, but rather in the way the rb tree is initialized or how the free list is grown in this particular instance.
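
For readers unfamiliar with the pattern, a minimal sketch of an LL/SC-based pop (hypothetical wrapper names, not Open MPI's actual code): the load-linked (ldxr on ARM64) opens a reservation on the head pointer, and the store-conditional (stxr) succeeds only if nothing disturbed that reservation in between; note the read of item->next happens inside that window and can itself cancel it.

#include <stdbool.h>
#include <stddef.h>

/* hypothetical wrappers around the AArch64 ldxr/stxr instructions */
void *atomic_ll_ptr (void * volatile *addr);
bool  atomic_sc_ptr (void * volatile *addr, void *value);

typedef struct item { struct item *next; } item_t;

static item_t *lifo_pop_llsc (item_t * volatile *head)
{
    item_t *item, *next;

    do {
        item = (item_t *) atomic_ll_ptr ((void * volatile *) head); /* open the reservation */
        if (NULL == item) {
            return NULL;            /* list is empty */
        }
        next = item->next;          /* this read happens inside the LL/SC window */
    } while (!atomic_sc_ptr ((void * volatile *) head, (void *) next)); /* retry if reservation lost */

    return item;
}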


shamisp commented Jun 14, 2017

Is this a multi-threaded run? How many threads?


shamisp commented Jun 14, 2017

@yosefe, can you please replace

opal_atomic_wmb ();

with

opal_atomic_mb ();


yosefe commented Jun 19, 2017

@shamisp it's looping in

} while (!opal_atomic_sc_ptr (&lifo->opal_lifo_head.data.item, next));

Changing to opal_atomic_mb () does not help.
It can be reproduced on 1 node.
Disabling the hugepage mpool moves the hang to another place, so it's something more fundamental:

(gdb) bt
#0  0x000003ffb7ccfa78 in nanosleep () from /usr/lib64/libpthread.so.0
#1  0x000003ffb7911e0c in _opal_lifo_release_cpu () at ../opal/class/opal_lifo.h:195
#2  0x000003ffb7911e40 in opal_lifo_pop_atomic (lifo=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_lifo.h:210
#3  0x000003ffb7911fd0 in opal_free_list_get_st (flist=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_free_list.h:213
#4  0x000003ffb7911ff4 in opal_free_list_get (flist=0x3ffb7a3b778 <mca_mpool_base_tree+64>) at ../opal/class/opal_free_list.h:225
#5  0x000003ffb7912264 in opal_rb_tree_init (tree=0x3ffb7a3b738 <mca_mpool_base_tree>, comp=0x3ffb79bdba0 <mca_mpool_base_tree_node_compare>) at class/opal_rb_tree.c:86
#6  0x000003ffb79bde28 in mca_mpool_base_tree_init () at base/mpool_base_tree.c:87
#7  0x000003ffb79bd378 in mca_mpool_base_open (flags=MCA_BASE_OPEN_DEFAULT) at base/mpool_base_frame.c:102
#8  0x000003ffb7957d1c in mca_base_framework_open (framework=0x3ffb7a34788 <opal_mpool_base_framework>, flags=MCA_BASE_OPEN_DEFAULT) at mca_base_framework.c:174
#9  0x000003ffb7d59bc0 in ompi_mpi_init (argc=4, argv=0x3ffffffe5c8, requested=0, provided=0x3ffffffe3bc) at runtime/ompi_mpi_init.c:589
#10 0x000003ffb7d990e8 in PMPI_Init (argc=0x3ffffffe43c, argv=0x3ffffffe430) at pinit.c:66
#11 0x0000000000400c5c in main (argc=4, argv=0x3ffffffe5c8) at ctxalloc.c:20


yosefe commented Jun 19, 2017

command line on single node:

mpirun -np 1 -mca btl self --bind-to core --display-map -mca pml ob1 ./ctxalloc 2 1500 100


kawashima-fj commented Jun 20, 2017

I can reproduce the hang on my ARM64 machine.

Open MPI: openmpi-v2.x-201706150321-b562082 (nightly snapshot tarball)
CPU: Cavium Thunder X (ARMv8 (64-bit))
OS: CentOS Linux release 7.2.1603 (AltArch)
Compiler: GCC 4.8.5 (in CentOS)
configure option: --enable-debug or --disable-debug
compiler option: -O0

I cannot reproduce the hang with -O1 or -O2 (either --enable-debug or --disable-debug).

Test programs in the Open MPI source tree also hang.

test/class/opal_lifo

(gdb) bt
#0  0x0000ffffb7d009a8 in __nanosleep_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000401384 in _opal_lifo_release_cpu ()
    at ../../../../opal/class/opal_lifo.h:195
#2  0x00000000004013b8 in opal_lifo_pop_atomic (lifo=0xffffffffed80)
    at ../../../../opal/class/opal_lifo.h:210
#3  0x00000000004014f4 in thread_test (arg=0xffffffffed80)
    at ../../../../test/class/opal_lifo.c:50
#4  0x0000000000401ab8 in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/opal_lifo.c:147

test/class/opal_fifo

(gdb) bt
#0  opal_fifo_pop_atomic (fifo=0xffffffffed68)
    at ../../../../opal/class/opal_fifo.h:238
#1  0x00000000004015d4 in thread_test (arg=0xffffffffed68)
    at ../../../../test/class/opal_fifo.c:51
#2  0x0000000000401e3c in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/opal_fifo.c:184

test/class/ompi_rb_tree

(gdb) bt
#0  0x0000ffffb7a589a8 in __nanosleep_nocancel () from /lib64/libpthread.so.0
#1  0x0000ffffb7bfb338 in _opal_lifo_release_cpu ()
    at ../../../opal/class/opal_lifo.h:195
#2  0x0000ffffb7bfb36c in opal_lifo_pop_atomic (lifo=0xffffffffec90)
    at ../../../opal/class/opal_lifo.h:210
#3  0x0000ffffb7bfb4fc in opal_free_list_get_st (flist=0xffffffffec90)
    at ../../../opal/class/opal_free_list.h:213
#4  0x0000ffffb7bfb520 in opal_free_list_get (flist=0xffffffffec90)
    at ../../../opal/class/opal_free_list.h:225
#5  0x0000ffffb7bfb790 in opal_rb_tree_init (tree=0xffffffffec50, comp=0x4016c4 <comp_fn>)
    at ../../../opal/class/opal_rb_tree.c:86
#6  0x0000000000401c04 in test1 ()
    at ../../../../test/class/ompi_rb_tree.c:145
#7  0x0000000000402850 in main (argc=1, argv=0xfffffffff018)
    at ../../../../test/class/ompi_rb_tree.c:408

In the opal_lifo case, it is looping in the do-while loop of the opal_lifo_pop_atomic function (the OPAL_HAVE_ATOMIC_LLSC_PTR == 1 case).

In the opal_fifo case, it is looping in the do-while loop of the opal_fifo_pop_atomic function (the OPAL_HAVE_ATOMIC_LLSC_PTR == 1 case), not the continue case.


hjelmn commented Jun 21, 2017

Does it run with --disable-builtin-atomics?

kawashima-fj commented:

@hjelmn Builtin atomics are disabled by default in the v2.x branch. I enabled BUILTIN_GCC (__atomic_*) with --enable-builtin-atomics; then make check (which includes the three test programs above) passed without hanging. Both -O0 and -O2 were tested.
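
As an illustration of what BUILTIN_GCC means (an assumed, simplified mapping, not Open MPI's actual macros): the atomic wrappers compile down to GCC's __atomic builtins instead of hand-written LL/SC assembly, e.g. a compare-and-swap becomes __atomic_compare_exchange_n:

#include <stdbool.h>
#include <stdint.h>

static inline bool cas_ptr (intptr_t *addr, intptr_t *expected, intptr_t desired)
{
    /* weak = false; acq_rel ordering on success, relaxed on failure */
    return __atomic_compare_exchange_n (addr, expected, desired, false,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED);
}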


shamisp commented Jun 30, 2017

@yosefe - what compiler version is used?


yosefe commented Jun 30, 2017

$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

kawashima-fj commented:

@PHHargrove reported similar issues on the devel list.

Regarding make check, the version/platform/compiler/test patterns reported so far to fail or hang are:

  • Open MPI 2.1.2a1 / ARM64 / GCC 4.8.5 / opal_fifo, opal_lifo, ompi_rb_tree: hang
  • Open MPI 3.0.0rc1 / PPC64 / GCC 4.8.3 / opal_fifo: fail
  • Open MPI 3.0.0rc1 / PPC64LE / GCC 7.0.1 / opal_fifo: hang

A common point of ARM64 and PPC64(LE) is OPAL_HAVE_ATOMIC_LLSC_PTR == 1 (grep -r OPAL_HAVE_ATOMIC_LLSC_ opal/include/opal/sys shows it). But I'm wondering why the following patterns succeed; ARM64 and PPC64 show opposite results regarding --enable-builtin-atomics.

  • Open MPI 2.1.2a1 / ARM64 / GCC 4.8.5 / --enable-builtin-atomics (I tested)
  • Open MPI 3.0.0rc1 / ARM64 / GCC 4.8.5 / default (I tested)
  • Open MPI 2.1.1rc1 / PPC64 / GCC 4.8.3 (?) / default (?) (according to @PHHargrove's mail)

--disable-builtin-atomics is the default in the Open MPI 2.1 series.
--enable-builtin-atomics is the default in the Open MPI 3.0 series.
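
To reproduce the failing combination explicitly (a sketch based on the reports in this thread):

./configure --disable-builtin-atomics CFLAGS="-O0 -g"
make
make check   # hangs in test/class (opal_lifo, opal_fifo, ompi_rb_tree) on ARM64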

jsquyres added this to the v3.0.0 milestone Jul 5, 2017

jsquyres commented Jul 5, 2017

Added a v3.0.0 milestone since @PHHargrove saw this on 3.0.0rc1, per the above comment.


shamisp commented Jul 11, 2017

I will try to reproduce it on one of my systems.


shamisp commented Jul 13, 2017

I can confirm that the problem only shows up in -O0 mode.


hjelmn commented Jul 18, 2017

I know what is happening. With -O0 the opal_atomic_* functions are not inlined. That makes the LL/SC atomics a function call. This will likely cause livelock with the LL/SC fifo/lifo implementations as it increases the chance that a read will cancel the LL reservation. The correct fix is to force those atomics to always be inlined. I will make the change and see if it fixes the issue.
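
A hedged sketch of that approach on AArch64 (illustrative names and asm, not the actual Open MPI patch; note that a later fix commit reports always_inline alone still left gcc emitting the extra loads/stores at -O0):

#include <stdbool.h>

#define ALWAYS_INLINE __attribute__ ((__always_inline__))

/* ldaxr opens the exclusive reservation */
static inline ALWAYS_INLINE void *atomic_ll_ptr (void * volatile *addr)
{
    void *ret;
    __asm__ __volatile__ ("ldaxr %0, %1"
                          : "=&r" (ret)
                          : "Q" (*addr)
                          : "memory");
    return ret;
}

/* stlxr succeeds only if the reservation is still held */
static inline ALWAYS_INLINE bool atomic_sc_ptr (void * volatile *addr, void *value)
{
    int fail;
    __asm__ __volatile__ ("stlxr %w0, %2, %1"
                          : "=&r" (fail), "+Q" (*addr)
                          : "r" (value)
                          : "memory");
    return 0 == fail;   /* stlxr writes 0 on success, 1 on failure */
}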

kawashima-fj commented:

@hjelmn I confirmed the current status on AArch64.

  • CPU: Cavium Thunder X (ARMv8 (64-bit))
  • OS: CentOS Linux release 7.2.1603 (AltArch)
  • Compiler: GCC 4.8.5 (in CentOS)
  • Open MPI:
    • 2.0.3
    • 2.1.1
    • 3.0.0rc1
    • 2.0.x nightly tarball (openmpi-v2.0.x-201707120323-25919a1)
    • 2.1.x nightly tarball (openmpi-v2.x-201707130323-bbbe264)
    • 3.0.x nightly tarball (openmpi-v3.0.x-201707180322-059cf32)
  • Configure option:
    • --enable-builtin-atomics
    • --disable-builtin-atomics
  • Compiler option:
    • -O0
    • -O3

make check result:

               enable + -O0   enable + -O3   disable + -O0   disable + -O3
2.0.3          OK             OK             OK              OK
2.1.1          OK             OK             hang-up         OK
3.0.0rc1       OK             OK             hang-up         OK
2.0.x nightly  OK             OK             OK              OK
2.1.x nightly  OK             OK             hang-up         OK
3.0.x nightly  OK             OK             hang-up         OK

All hang-ups occur in the test/class directory (opal_fifo etc.).

As you said, the bad combination is: Open MPI 2.1 or higher + --disable-builtin-atomics + -O0.

Open MPI 2.0.x does not have this issue because it does not have the opal/include/opal/sys/arm64 code, so the LL/SC fifo/lifo implementations are not used.

I'll confirm the performance difference between --enable-builtin-atomics and --disable-builtin-atomics.

kawashima-fj commented:

@hjelmn I ran osu_latency and osu_latency_mt on my ARM machine and found --disable-builtin-atomics is slightly faster than --enable-builtin-atomics in osu_latency_mt.

  • CPU/OS/Compiler: same as my previous comment
  • Open MPI: 3.0.0rc1
  • Compiler Option: -O3 (both Open MPI and OSU Micro-Benchmarks)
  • BTL: vader

osu_latency latency (us):

size (byte)   --enable-builtin-atomics   --disable-builtin-atomics
0             0.88                       0.87
1             1.07                       1.08
2             1.08                       1.08
4             1.09                       1.08
8             1.11                       1.11
16            1.13                       1.13
32            1.13                       1.13
64            1.16                       1.15

osu_latency_mt latency (us):

size (byte)   --enable-builtin-atomics   --disable-builtin-atomics
0             4.45                       4.44
1             4.73                       4.70
2             4.73                       4.70
4             4.74                       4.71
8             4.76                       4.74
16            4.78                       4.76
32            4.79                       4.78
64            4.83                       4.80

Each value is the median of 10 runs.

If you need more data, let me know.

kawashima-fj commented:

@hjelmn At the f2f meeting you asked me which commits I backported into the v2.0.2-based Fujitsu MPI. I backported the following commits. Open MPI 2.0 can run on AArch64 without these commits, but I backported them for better AArch64 support.


shamisp commented Jul 20, 2017

I have tried gcc/6.1.0 and gcc/7.1.0 and I still observe the same issue.

hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 1, 2017
Enabling debugging can cause the load-link store-conditional
atomic operations to hit a live-lock condition. To prevent the
live-lock, always inline these atomics.

Fixes open-mpi#3697

Signed-off-by: Nathan Hjelm <[email protected]>

hjelmn commented Aug 1, 2017

@kawashima-fj #3988 should fix the hang.

It's no surprise that the built-in atomics version is slower. The LL/SC lifo is significantly faster than the compare-and-swap version.
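
For contrast, a minimal C11 sketch of a compare-and-swap based pop (illustrative, not Open MPI's code). A production CAS lifo must also defeat ABA, typically by packing a counter next to the head pointer and using a double-width CAS; that extra bookkeeping is part of why the LL/SC variant is faster.

#include <stdatomic.h>
#include <stddef.h>

typedef struct item { struct item *next; } item_t;

static item_t *lifo_pop_cas (_Atomic(item_t *) *head)
{
    item_t *old = atomic_load_explicit (head, memory_order_acquire);

    /* the read of old->next is where the ABA hazard lives */
    while (NULL != old &&
           !atomic_compare_exchange_weak_explicit (head, &old, old->next,
                                                   memory_order_acq_rel,
                                                   memory_order_acquire)) {
        /* on failure, old was refreshed with the current head; retry */
    }
    return old;
}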


hjelmn commented Aug 1, 2017

This is the correct one. Think we have a fix.

bwbarrett commented:

@hjelmn / @jjhursey, where are we on fixing this issue? What branches are still impacted?

bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Dec 19, 2017
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and
POWER platforms when the atomic fifo assembly isn't inlined,
which manifests as a hang.  Document the issue and the
work-around until a proper fix is committed.

Signed-off-by: Brian Barrett <[email protected]>
bwbarrett added a commit that referenced this issue Dec 20, 2017 (same commit message as above)

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Dec 20, 2017 (same commit message as above; cherry picked from commit 4658422)
bwbarrett modified the milestones: v3.0.1, vNEXT, Future Mar 1, 2018
jsquyres commented:

Per the 2018-03 Dallas face-to-face meeting, this is still happening for Fujitsu on ARMv8; @hjelmn is looking into it.


shamisp commented Mar 22, 2018

Can we have a bit more detail on this? Which AMO is broken? Does it happen only with built-in AMOs?

hjelmn added a commit to hjelmn/ompi that referenced this issue May 31, 2018
This commit fixes a hang that occurs with debug builds of Open MPI on
aarch64 and power/powerpc systems. When the ll/sc atomics are inline
functions, the compiler emits load/store instructions for the function
arguments with -O0. These extra loads/stores can cause the ll
reservation to be cancelled, causing live-lock.

Note that we did attempt to fix this with always_inline, but the extra
instructions are still emitted by the compiler (gcc). There may be
another fix, but this has been tested and is working well.

References open-mpi#3697. Close when applied to v3.0.x and v3.1.x.

Signed-off-by: Nathan Hjelm <[email protected]>
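
For illustration, roughly what goes wrong at -O0 when the ll/sc helpers are real function calls (schematic AArch64, assumed rather than actual compiler output):

bl   atomic_ll_ptr      // ldxr executes inside the callee, opening the
str  x0, [sp, #24]      //   reservation; -O0 then spills the result ...
ldr  x0, [sp, #24]      //   ... and reloads it
ldr  x1, [x0]           // stack traffic between the paired ldxr/stxr can
bl   atomic_sc_ptr      //   clear the exclusive monitor, so the stxr keeps
                        //   failing: live-lock
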
hjelmn added a commit to hjelmn/ompi that referenced this issue May 31, 2018 (same commit message as above)

hjelmn added a commit that referenced this issue Jun 1, 2018 (same commit message as above)

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 4, 2018 (same commit message as above; back-port from master, cherry picked from commit f8dbf62)

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 5, 2018 (same commit message as above; back-port from master, cherry picked from commits f8dbf62 and b09f0b1)

hoopoepg pushed a commit to hoopoepg/ompi that referenced this issue Jun 18, 2018 (same commit message as above)