opal_fifo check hanging on aarch64 / PowerPC Big Endian #4563

Closed
hppritcha opened this issue Dec 4, 2017 · 20 comments

@hppritcha
Member

hppritcha commented Dec 4, 2017

At least on master at b160cf6 opal_fifo appears to be regularly hanging on aarch64. I'm using gcc 4.8.5. This test had been passing regularly with jenkins CI PR testing until sometime in the last several days/week.

At first the test did not appear to hang when Open MPI is configured with --enable-debug, but it actually does sometimes. A bulletproof way to avoid the problem is to configure with
--disable-builtin-atomics.

@kawashima-fj
Member

@hppritcha I cannot observe the hang in my environment. Is there more information to reproduce it?

My environment:

  • CPU: Thunder X
  • OS: CentOS 7.2.1603
  • GCC: 4.8.5 20150623 (CentOS rpm)
  • Open MPI: 2c86b87 (latest master)
  • configure: both --disable-debug and --enable-debug
  • test: make -C test check

@kawashima-fj
Member

@hppritcha I can observe the hang in Jenkins.
https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/Platform=ARMv8/1849/consoleFull
But I still cannot observe the hang in my environment. (I ran make check 100 times.)

This may be related to an old issue #3697 (#3988).

@hppritcha
Member Author

@kawashima-fj how many cpus does your aarch64 system have?

@shamisp
Contributor

shamisp commented Dec 5, 2017

Lock implementation can depend on compiler and uarch.

@shamisp
Contributor

shamisp commented Dec 5, 2017

Does it use gcc's AMO implementation or OMPI's?

@hppritcha
Member Author

Hmm...If I take the patch from #4566 and configure with --disable-builtin-atomics, the opal_fifo doesn't hang on my aarch64 system. gcc 4.8.5, cpuinfo:

processor	: 0
BogoMIPS	: 500.00
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0xd07
CPU revision	: 2

I'm running opal_fifo in a loop, letting it go for hundreds of iterations. Using default configure options, the test typically hangs after a few iterations.

@hjelmn
Member

hjelmn commented Dec 5, 2017

@shamisp When using the gcc builtins we end up using a bad (lock-based) 128-bit compare-exchange implementation. I think we need to improve the configury and make ompi use its own atomic implementations for aarch64.
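
A minimal sketch of how one might check this property, assuming a GCC-compatible compiler. It uses the `__atomic_always_lock_free` builtin, which reports at compile time whether atomics of a given size always map to lock-free instructions; when they do not, the `__atomic` builtins may be routed through a library that can fall back to a lock. The `sketch_*` helper names are hypothetical, and this is not Open MPI's configure check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if 16-byte (128-bit) atomics are always lock-free
 * for this compiler and target, 0 otherwise. The size argument must be a
 * compile-time constant, so it is hard-coded rather than passed in. */
static int sketch_cswap128_is_lock_free(void)
{
    return __atomic_always_lock_free(16, 0) ? 1 : 0;
}

/* Same question for a plain int, as a baseline for comparison. */
static int sketch_int_is_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(int), 0) ? 1 : 0;
}

/* Hypothetical helper: turn the 0/1 answer into a readable label. */
static const char *sketch_describe(int lock_free)
{
    assert(lock_free == 0 || lock_free == 1);
    return lock_free ? "always lock-free" : "may fall back to a lock";
}
```

A 0 from `sketch_cswap128_is_lock_free()` would be consistent with the 128-bit builtin taking a lock-based path on that target, though it does not by itself show that this path causes the hang.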

@shamisp
Contributor

shamisp commented Dec 5, 2017

What version of GCC?

@hjelmn
Member

hjelmn commented Dec 6, 2017

All recent versions (5.x, 6.x, 7.x) advertise 128-bit compare-exchange support. It gets past our current configure check.

@hjelmn
Member

hjelmn commented Dec 6, 2017

Easy enough to hard-code the builtins off until we have a better way to deal with this.

@hjelmn
Member

hjelmn commented Dec 6, 2017

This problem exists with other LL/SC architectures. Let me finish my debug mode fix for the LL/SC implementation of the lifo and fifo and I will make sure we disable the builtins by default on those architectures.
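
For readers less familiar with the term: on LL/SC architectures a compare-exchange is typically built from a load-linked/store-conditional pair, and the store-conditional may fail even when the value has not changed, so callers retry in a loop. The sketch below shows that retry pattern for a list push, using GCC's `__atomic_compare_exchange_n` in weak mode. The `sketch_*` types and function are hypothetical; this is an illustration of the general pattern, not Open MPI's lifo or fifo code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list item and list head, for illustration only. */
typedef struct sketch_item {
    struct sketch_item *next;
    int value;
} sketch_item_t;

typedef struct {
    sketch_item_t *head;
} sketch_lifo_t;

/* Push an item using a weak compare-exchange retry loop.
 * A weak compare-exchange is allowed to fail spuriously, which matches
 * what a single LL/SC attempt can do, so the loop simply tries again. */
static void sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    assert(lifo != NULL && item != NULL);

    sketch_item_t *old_head = __atomic_load_n(&lifo->head, __ATOMIC_RELAXED);
    do {
        /* Link the new item in front of the head we last observed.
         * On failure the builtin refreshes old_head with the current head. */
        item->next = old_head;
    } while (!__atomic_compare_exchange_n(&lifo->head, &old_head, item,
                                          1 /* weak */,
                                          __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}
```

Only push is shown because a correct concurrent pop needs extra protection against the ABA problem, which is outside the scope of this sketch.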

@hppritcha
Member Author

I don't see this 128-bit compare-exchange thing. Here's what's in my opal/include/opal_config.h:

/* Whether the __atomic builtin atomic compare and swap is lock-free on
   128-bit values */
/* #undef OPAL_HAVE_GCC_BUILTIN_CSWAP_INT128 */

This is what I get when building on my Cortex-A57 box using either gcc 4.8.5 or gcc 7.3.0.

@hjelmn
Member

hjelmn commented Dec 6, 2017

Huh, odd. Ok. That is different than on power. I assumed it would be the same here. How about OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128?

@hppritcha
Member Author

In opal/include/opal_config.h I see

/* Whether the __sync builtin atomic compare and swap supports 128-bit values
   */
/* #undef OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128 */
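
A minimal sketch of how code might branch on the two macros quoted above. The macro names come from the opal_config.h excerpts in this thread; the `sketch_*` function names and the returned labels are hypothetical, and the real selection logic in Open MPI may differ. With both macros left undefined, as reported here, the sketch reports that no 128-bit builtin compare-and-swap is selected.

```c
#include <assert.h>
#include <string.h>

/* In a real build these macros would come from opal/include/opal_config.h.
 * Here they are deliberately left undefined, matching the report above. */

/* Hypothetical helper: name the 128-bit compare-and-swap source that the
 * preprocessor would pick for this configuration. */
static const char *sketch_cswap128_source(void)
{
#if defined(OPAL_HAVE_GCC_BUILTIN_CSWAP_INT128) && OPAL_HAVE_GCC_BUILTIN_CSWAP_INT128
    return "__atomic builtin";
#elif defined(OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128) && OPAL_HAVE_SYNC_BUILTIN_CSWAP_INT128
    return "__sync builtin";
#else
    return "none";
#endif
}

/* Hypothetical helper: 1 if either builtin was detected, 0 otherwise. */
static int sketch_has_builtin_cswap128(void)
{
    const char *source = sketch_cswap128_source();
    assert(source != NULL);
    return strcmp(source, "none") != 0;
}
```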

@hppritcha
Member Author

Anyway, the Jenkins scripts allow setting an optional configure option, so I've defined that to add

--disable-builtin-atomics

as that appears to be a reliable way to get opal_fifo to pass on this system.

@shamisp
Contributor

shamisp commented Dec 6, 2017

AFAIK gcc 7 and the latest clang are doing a pretty good job on AMOs, including support for Arm v8.1 atomics. Do we know exactly which operation is broken?

bwbarrett added a commit to bwbarrett/ompi that referenced this issue Dec 19, 2017
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and
POWER platforms when the atomic fifo assembly isn't inlined,
which manifests as a hang.  Document the issue and the
work-around until a proper fix is committed.

Signed-off-by: Brian Barrett <[email protected]>
bwbarrett added a commit that referenced this issue Dec 20, 2017
As documented in #4563 and #3697, there is an issue on ARM and
POWER platforms when the atomic fifo assembly isn't inlined,
which manifests as a hang.  Document the issue and the
work-around until a proper fix is committed.

Signed-off-by: Brian Barrett <[email protected]>
bwbarrett added a commit to bwbarrett/ompi that referenced this issue Dec 20, 2017
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and
POWER platforms when the atomic fifo assembly isn't inlined,
which manifests as a hang.  Document the issue and the
work-around until a proper fix is committed.

Signed-off-by: Brian Barrett <[email protected]>
(cherry picked from commit 4658422)
@bwbarrett bwbarrett changed the title opal_fifo check hanging on aarch64 opal_fifo check hanging on aarch64 / PowerPC Big Endian Mar 20, 2018
@bwbarrett
Member

Discussion in the room about Power: when we re-enable Power BE because this hang no longer happens, we should add a NEWS item saying that, despite fixing the error message, we still don't actually support Power BE. We should also remove the block now that we know it wasn't a silent data corruption problem.

@jsquyres
Member

jsquyres commented Apr 6, 2018

@hjelmn Have you had a chance to finish this yet, perchance?

@jsquyres
Member

jsquyres commented Apr 6, 2018

I note that there's a README bullet that will need to be updated once this issue is fixed:

Platform Notes

  • ARM and POWER users may experience intermittent hangs when Open MPI
    is compiled with low optimization settings, due to an issue with our
    atomic list implementation. We recommend compiling with -O3
    optimization, both for performance reasons and to avoid this hang.

jsquyres added a commit to jsquyres/ompi that referenced this issue Apr 10, 2018
Also note that ARM and POWER users may experience hangs (until
open-mpi#4563 is fixed).

Signed-off-by: Jeff Squyres <[email protected]>
@hppritcha
Member Author

Resolved several years ago. Closing.
