{perf}[gompi/2021b] OSU-Micro-Benchmarks v5.9 #15344

branfosj
Member

@branfosj branfosj commented Apr 26, 2022

@branfosj branfosj marked this pull request as draft April 26, 2022 08:10
('CUDA', '11.4.1', '', True),
('NCCL', '2.10.3', versionsuffix),
('OpenMPI-CUDA', '4.1.1', versionsuffix),
('UCX-CUDA', '1.11.2', versionsuffix),
Contributor

This one (UCX-CUDA) is not strictly needed, since OpenMPI-CUDA already includes it as a dependency.

Member Author

done in c6e310a

@boegelbot
Collaborator

@branfosj: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/2225335681
Output from first failing test suite run:

ERROR: test_conflicts (test.easyconfigs.easyconfigs.EasyConfigTest)
Check whether any conflicts occur in software dependency graphs.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 325, in test_conflicts
    self.assertFalse(check_conflicts(self.ordered_specs, modules_tool(), check_inter_ec_conflicts=False),
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

======================================================================
ERROR: test_dep_graph (test.easyconfigs.easyconfigs.EasyConfigTest)
Unit test that builds a full dependency graph.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 312, in test_dep_graph
    dep_graph(fn, self.ordered_specs)
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

======================================================================
ERROR: test_dep_versions_per_toolchain_generation (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 800, in test_dep_versions_per_toolchain_generation
    for ec in self.ordered_specs:
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

----------------------------------------------------------------------
Ran 14544 tests in 645.639s

FAILED (errors=3)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.
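
To make the missing name in the error easier to read, here is a minimal sketch of how such a module name is put together, assuming the default EasyBuild naming scheme of name/version-toolchain-versionsuffix. The helper `full_module_name` is hypothetical and written only for illustration; it is not part of the EasyBuild API. It shows that the name the tests looked for differs from a GCC-level build of the same software only in the toolchain part.

```python
def full_module_name(name, version, toolchain, versionsuffix=""):
    """Compose name/version-toolchain-versionsuffix (hypothetical helper)."""
    tc_name, tc_version = toolchain
    return f"{name}/{version}-{tc_name}-{tc_version}{versionsuffix}"


# The dependency the test suite could not resolve, as quoted in the error above.
wanted = full_module_name(
    "OpenMPI-CUDA", "4.1.1", ("gompi", "2021b"), "-CUDA-11.4.1"
)

# The same software and version built with the GCC 11.2.0 toolchain instead.
gcc_level = full_module_name(
    "OpenMPI-CUDA", "4.1.1", ("GCC", "11.2.0"), "-CUDA-11.4.1"
)

print(wanted)
print(gcc_level)
```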

@branfosj
Member Author

branfosj commented Apr 26, 2022

For non-NCCL tests:

mpirun -np 2 ${t} -d cuda D D
mpirun -np 2 ${t} -d cuda H D
mpirun -np 2 ${t} -d cuda D H
mpirun -np 2 ${t} -d cuda H H
  • Pass: osu_bibw osu_bw osu_latency osu_mbw_mr osu_multi_lat osu_allgather osu_allgatherv osu_allreduce osu_alltoall osu_alltoallv osu_bcast osu_gather osu_gatherv osu_reduce osu_reduce_scatter osu_scatter osu_scatterv osu_iallgather osu_ialltoall osu_ibcast osu_igather osu_iscatter
  • Fail: osu_put_latency osu_get_latency osu_put_bw osu_get_bw osu_put_bibw osu_acc_latency osu_cas_latency osu_fop_latency osu_ireduce osu_iallreduce
  • Does not support the -d cuda option: osu_latency_mt (but passes with that option removed)

For NCCL tests:

mpirun -np 2 ${t} -d cuda D D
  • Pass: osu_nccl_bibw osu_nccl_bw osu_nccl_latency osu_nccl_allgather osu_nccl_allreduce osu_nccl_bcast osu_nccl_reduce osu_nccl_reduce_scatter
  • Fail: none
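
The test matrix above could be generated and run with something like the following sketch. It is an illustration under assumptions, not the script actually used for these results: the function names `build_commands` and `run_all` are made up here, the benchmarks are assumed to be on `PATH`, and a zero exit code is assumed to mean "pass". `D` and `H` select device and host buffers respectively.

```python
import itertools
import subprocess

PLACEMENTS = ["D", "H"]  # D = device (GPU) buffer, H = host buffer


def build_commands(benchmarks, nccl=False, nprocs=2):
    """Return one mpirun argv list per benchmark and buffer placement.

    Non-NCCL benchmarks get all four placement pairs; NCCL ones only D D,
    matching the commands listed above.
    """
    if nccl:
        pairs = [("D", "D")]
    else:
        pairs = list(itertools.product(PLACEMENTS, repeat=2))
    cmds = []
    for bench in benchmarks:
        for first, second in pairs:
            cmds.append(
                ["mpirun", "-np", str(nprocs), bench, "-d", "cuda", first, second]
            )
    return cmds


def run_all(cmds):
    """Run each command and map it to True (exit code 0) or False.

    Assumption: requires mpirun and the OSU binaries to be available.
    """
    results = {}
    for cmd in cmds:
        proc = subprocess.run(cmd, capture_output=True)
        results[tuple(cmd)] = proc.returncode == 0
    return results


# Only print the commands here; call run_all(cmds) on a GPU node to execute.
cmds = build_commands(["osu_bw", "osu_latency"])
for cmd in cmds:
    print(" ".join(cmd))
```

Keeping command generation separate from execution makes it possible to review the matrix before submitting it to a GPU node.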

@branfosj
Member Author

osu_put_latency failure:

$ mpirun -np 2 osu_put_latency D D
# OSU MPI_Put-CUDA Latency Test v5.9
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size          Latency (us)
[bask-pg0308u06a:3785095:0:3785095] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f40c3200000)
==== backtrace (tid:3785095) ====
 0 0x0000000000012c20 __funlockfile()  :0
 1 0x000000000016079c __memmove_avx_unaligned_erms()  :0
 2 0x000000000003f09a ucp_rma_sw_put_pack_cb()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma_sw.c:35
 3 0x000000000003202f uct_rc_mlx5_ep_am_bcopy()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/uct/ib/rc/accel/rc_mlx5_ep.c:347
 4 0x000000000003eff1 uct_ep_am_bcopy()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/uct/api/uct.h:2844
 5 0x000000000003eff1 ucp_rma_sw_progress_put()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma_sw.c:47
 6 0x000000000003d737 ucp_request_try_send()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/core/ucp_request.inl:302
 7 0x000000000003d737 ucp_request_send()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/core/ucp_request.inl:327
 8 0x000000000003d737 ucp_rma_send_request()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma.inl:42
 9 0x000000000003d737 ucp_rma_nonblocking()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma_send.c:183
10 0x000000000003d737 ucp_put_nbx()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma_send.c:301
11 0x000000000003df40 ucp_put_nbi()  /dev/shm/build-branfosj-up/UCX/1.11.2/GCCcore-11.2.0/ucx-1.11.2/src/ucp/rma/rma_send.c:191
12 0x0000000000005351 ompi_osc_ucx_put()  ???:0
13 0x000000000009228a MPI_Put()  ???:0
14 0x00000000004031a7 run_put_with_flush()  ???:0
15 0x0000000000402bcb main()  ???:0
16 0x0000000000023493 __libc_start_main()  ???:0
17 0x0000000000402cce _start()  ???:0
=================================
[bask-pg0308u06a:3785095] *** Process received signal ***
[bask-pg0308u06a:3785095] Signal: Segmentation fault (11)
[bask-pg0308u06a:3785095] Signal code:  (-6)
[bask-pg0308u06a:3785095] Failing at address: 0x937d90039c187
[bask-pg0308u06a:3785095] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7f40ef526c20]
[bask-pg0308u06a:3785095] [ 1] /lib64/libc.so.6(+0x16079c)[0x7f40eed0179c]
[bask-pg0308u06a:3785095] [ 2] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/UCX/1.11.2-GCCcore-11.2.0/lib/libucp.so.0(+0x3f09a)[0x7f40e609209a]
[bask-pg0308u06a:3785095] [ 3] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/UCX/1.11.2-GCCcore-11.2.0/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_ep_am_bcopy+0xaf)[0x7f40e1a6402f]
[bask-pg0308u06a:3785095] [ 4] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/UCX/1.11.2-GCCcore-11.2.0/lib/libucp.so.0(+0x3eff1)[0x7f40e6091ff1]
[bask-pg0308u06a:3785095] [ 5] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/UCX/1.11.2-GCCcore-11.2.0/lib/libucp.so.0(ucp_put_nbx+0x2e7)[0x7f40e6090737]
[bask-pg0308u06a:3785095] [ 6] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/UCX/1.11.2-GCCcore-11.2.0/lib/libucp.so.0(ucp_put_nbi+0x10)[0x7f40e6090f40]
[bask-pg0308u06a:3785095] [ 7] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_put+0x1a1)[0x7f40e10cc351]
[bask-pg0308u06a:3785095] [ 8] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1/lib/libmpi.so.40(PMPI_Put+0xaa)[0x7f40f2b3328a]
[bask-pg0308u06a:3785095] [ 9] osu_put_latency[0x4031a7]
[bask-pg0308u06a:3785095] [10] osu_put_latency[0x402bcb]
[bask-pg0308u06a:3785095] [11] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f40eebc4493]
[bask-pg0308u06a:3785095] [12] osu_put_latency[0x402cce]
[bask-pg0308u06a:3785095] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bask-pg0308u06a exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

osu_fop_latency failure:

$ mpirun -np 2 osu_fop_latency D D
# OSU MPI_Fetch_and_op-CUDA latency Test v5.9
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size          Latency (us)
[bask-pg0308u06a:3785528:0:3785528] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f12c1200000)
==== backtrace (tid:3785528) ====
 0 0x0000000000012c20 __funlockfile()  :0
 1 0x0000000000007466 ompi_osc_ucx_fetch_and_op()  ???:0
 2 0x000000000006f8d5 MPI_Fetch_and_op()  ???:0
 3 0x000000000040310e run_fop_with_flush()  ???:0
 4 0x0000000000402bed main()  ???:0
 5 0x0000000000023493 __libc_start_main()  ???:0
 6 0x0000000000402cee _start()  ???:0
=================================
[bask-pg0308u06a:3785528] *** Process received signal ***
[bask-pg0308u06a:3785528] Signal: Segmentation fault (11)
[bask-pg0308u06a:3785528] Signal code:  (-6)
[bask-pg0308u06a:3785528] Failing at address: 0x937d90039c338
[bask-pg0308u06a:3785528] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x7f12e8fe2c20]
[bask-pg0308u06a:3785528] [ 1] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_fetch_and_op+0xd6)[0x7f12a8b84466]
[bask-pg0308u06a:3785528] [ 2] /bask/projects/e/edmondac-rsg/easybuild/eb-sjb-up/EL8-ice/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1/lib/libmpi.so.40(MPI_Fetch_and_op+0xe5)[0x7f12ec5c68d5]
[bask-pg0308u06a:3785528] [ 3] osu_fop_latency[0x40310e]
[bask-pg0308u06a:3785528] [ 4] osu_fop_latency[0x402bed]
[bask-pg0308u06a:3785528] [ 5] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f12e8680493]
[bask-pg0308u06a:3785528] [ 6] osu_fop_latency[0x402cee]
[bask-pg0308u06a:3785528] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node bask-pg0308u06a exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@boegelbot
Collaborator

@branfosj: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/2225744002
Output from first failing test suite run:

ERROR: test_conflicts (test.easyconfigs.easyconfigs.EasyConfigTest)
Check whether any conflicts occur in software dependency graphs.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 325, in test_conflicts
    self.assertFalse(check_conflicts(self.ordered_specs, modules_tool(), check_inter_ec_conflicts=False),
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

======================================================================
ERROR: test_dep_graph (test.easyconfigs.easyconfigs.EasyConfigTest)
Unit test that builds a full dependency graph.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 312, in test_dep_graph
    dep_graph(fn, self.ordered_specs)
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

======================================================================
ERROR: test_dep_versions_per_toolchain_generation (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 800, in test_dep_versions_per_toolchain_generation
    for ec in self.ordered_specs:
  File "test/easyconfigs/easyconfigs.py", line 277, in ordered_specs
    EasyConfigTest.resolve_all_dependencies()
  File "test/easyconfigs/easyconfigs.py", line 215, in resolve_all_dependencies
    cls._parsed_easyconfigs, modules_tool(), retain_all_deps=True)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 461, in resolve_dependencies
    raise_error_missing_deps(totally_missing, extra_msg="no easyconfig file or existing module found")
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/tools/robot.py", line 324, in raise_error_missing_deps
    raise EasyBuildError(error_msg)
EasyBuildError: 'Missing dependencies: OpenMPI-CUDA/4.1.1-gompi-2021b-CUDA-11.4.1 (no easyconfig file or existing module found)'

----------------------------------------------------------------------
Ran 14544 tests in 710.545s

FAILED (errors=3)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@Micket
Contributor

Micket commented Apr 26, 2022

For reference: https://www.open-mpi.org/faq/?category=runcuda#mpi-apis-cuda-ucx
At the time of writing, it lists:

| MPI API | Support Added In Version |
| --- | --- |
| MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, MPI_Irsend, MPI_Send_init, MPI_Bsend_init, MPI_Ssend_init, MPI_Rsend_init, MPI_Recv, MPI_Irecv, MPI_Recv_init, MPI_Sendrecv, MPI_Bcast, MPI_Gather, MPI_Gatherv, MPI_Allgather, MPI_Reduce, MPI_Reduce_scatter, MPI_Reduce_scatter_block, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Allgatherv, MPI_Alltoall, MPI_Alltoallv, MPI_Alltoallw, MPI_Scatter, MPI_Scatterv, MPI_Iallgather, MPI_Iallgatherv, MPI_Ialltoall, MPI_Ialltoallv, MPI_Ialltoallw, MPI_Ibcast, MPI_Iexscan | UCX v1.4 |

| MPI API | Expected Support |
| --- | --- |
| One-sided operations such as MPI_Put, MPI_Get, MPI_Accumulate, MPI_Rget, MPI_Rput, MPI_Get_Accumulate, MPI_Fetch_and_op, MPI_Compare_and_swap, etc. | Future |
| Window creation calls such as MPI_Win_create | Future |
| Non-blocking reduction collectives like MPI_Ireduce, MPI_Iallreduce, etc. | Future |

@branfosj
Member Author

That explains it. The failures line up with the APIs listed as future support (one-sided operations and non-blocking reduction collectives), so those are expected failures and everything else is working.
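
As a rough cross-check of that conclusion, the sketch below compares the failing benchmarks with the APIs the FAQ table marks as "Future". The mapping from benchmark to MPI call is an assumption inferred from the benchmark names (the two logs above confirm it only for `osu_put_latency` and `osu_fop_latency`), and the API names are copied as spelled in the table.

```python
# Assumed mapping: failing benchmark -> MPI call it exercises (inferred from names).
FAILED = {
    "osu_put_latency": "MPI_Put",
    "osu_get_latency": "MPI_Get",
    "osu_put_bw": "MPI_Put",
    "osu_get_bw": "MPI_Get",
    "osu_put_bibw": "MPI_Put",
    "osu_acc_latency": "MPI_Accumulate",
    "osu_cas_latency": "MPI_Compare_and_swap",
    "osu_fop_latency": "MPI_Fetch_and_op",
    "osu_ireduce": "MPI_Ireduce",
    "osu_iallreduce": "MPI_Iallreduce",
}

# APIs named explicitly in the "Future" rows of the FAQ table quoted above.
FUTURE = {
    "MPI_Put", "MPI_Get", "MPI_Accumulate", "MPI_Rget", "MPI_Rput",
    "MPI_Get_Accumulate", "MPI_Fetch_and_op", "MPI_Compare_and_swap",
    "MPI_Win_create", "MPI_Ireduce", "MPI_Iallreduce",
}

# Any failure whose API is not marked "Future" would need a closer look.
unexpected = sorted(bench for bench, api in FAILED.items() if api not in FUTURE)

if unexpected:
    print("Unexpected failures:", ", ".join(unexpected))
else:
    print("All failures map to APIs listed as future support")
```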

@branfosj
Member Author

closing this, as we are going with #15528 instead of #14919

@branfosj branfosj closed this May 26, 2022
@branfosj branfosj deleted the 20220426091024_new_pr_OSU-Micro-Benchmarks59 branch May 26, 2022 07:42