

PoC: Add groundwork for separating actual cuda parts in OpenMPI. WIP WIP WIP #14919

Conversation

akesandgren
Contributor

@akesandgren akesandgren commented Feb 3, 2022

(created using eb --new-pr)

NOTE: PoC only (even if it actually works)

This shows how we can separate the CUDA parts in OpenMPI in such a way that we can handle OpenMPI the same way we do UCX and UCX-CUDA.

To build the CUDA-split base version (i.e. plain OpenMPI), configure with --with-cuda=enable and no CUDA dependencies.
To build the fully CUDA-enabled OpenMPI-CUDA, configure with --with-cuda=$EBROOTCUDA (as usual) and install only

libmca_common_cuda*
mca_btl_smcuda*
mca_pml_ob1*
mca_pml_ucx*
mca_btl_openib*
mca_btl_tcp*
mca_rcache_gpusm*
mca_rcache_grdma*
mca_rcache_rgpusm*

into OpenMPI-CUDA.../lib and lib/openmpi
And finally make sure OMPI_MCA_mca_component_path first lists OpenMPI-CUDA's lib and lib/openmpi, and then the base OpenMPI's lib and lib/openmpi
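The plugin-path override described above can be sketched as a shell snippet. The install prefixes here are hypothetical (adjust them to your actual module tree); only the ordering of the path entries is the point:

```shell
# Hypothetical install prefixes for the split packages
OPENMPI_ROOT=/apps/software/OpenMPI/4.1.1-GCC-11.2.0
OPENMPI_CUDA_ROOT=/apps/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1

# List the CUDA-enabled plugin directories first, then the base ones,
# so the CUDA variants of the MCA components win at component load time
export OMPI_MCA_mca_component_path="${OPENMPI_CUDA_ROOT}/lib:${OPENMPI_CUDA_ROOT}/lib/openmpi:${OPENMPI_ROOT}/lib:${OPENMPI_ROOT}/lib/openmpi"

echo "$OMPI_MCA_mca_component_path"
```

In an EasyBuild module this would typically be done via modextravars in the OpenMPI-CUDA easyconfig rather than by hand.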

The OpenMPI-4.1.1_fix_missing_OPAL_CUDA_GDR_SUPPORT_protection.patch doesn't really have anything to do with this; it's just a fix for an actual bug in OpenMPI that was uncovered by this work.

And CUDA_GDR support is still not handled since that will also need manual intervention during configure.

Depends on: easybuilders/easybuild-easyblocks#2710

@akesandgren akesandgren changed the title Add groundwork for separating actual cuda parts in OpenMPI. PoC: Add groundwork for separating actual cuda parts in OpenMPI. WIP WIP WIP Feb 3, 2022
@boegelbot
Collaborator

@akesandgren: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/1790411763
Output from first failing test suite run:

FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 873, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches in OpenMPI-4.1.1-GCC-11.2.0.eb: found 1 sources + 8 patches vs 6 checksums

----------------------------------------------------------------------
Ran 14165 tests in 728.390s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@boegel
Member

boegel commented Feb 3, 2022

@akesandgren Did you open PRs upstream for these changes? If so, please drop the PR URLs in the PR description

@akesandgren
Contributor Author

I haven't done that yet. Was kind of hoping for some constructive criticism here first...
The problem with the above is that there are still 4 mca modules that need to be built and installed twice.

@boegelbot
Collaborator

@akesandgren: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/2195839723
Output from first failing test suite run:

FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 910, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches in OpenMPI-4.1.1-GCC-11.2.0.eb: found 1 sources + 8 patches vs 6 checksums
Checksums missing for one or more sources/patches in OpenMPI-CUDA-4.1.1-GCC-11.2.0-CUDA-11.4.1.eb: found 1 sources + 8 patches vs 6 checksums

----------------------------------------------------------------------
Ran 14517 tests in 532.978s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@akesandgren
Contributor Author

@boegelbot Please build @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'

@boegelbot
Collaborator

Got message " Please build @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'", but I don't know what to do with it, sorry...

- notification for comment with ID 1104186093 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@akesandgren
Contributor Author

@boegelbot Please test @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'

@boegelbot
Collaborator

@akesandgren: Request for testing this PR well received on login1

PR test command 'EB_PR=14919 EB_ARGS="--include-easyblocks-from-pr 2710" /opt/software/slurm/bin/sbatch --job-name test_PR_14919 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8390

Test results coming soon (I hope)...

- notification for comment with ID 1104201893 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2710
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/4025828c5fec4e9ba74622b4acc7743f for a full test report.

@bartoldeman
Contributor

I think I have a simpler solution, which is to include those parts of cuda.h that are necessary to build Open MPI.
This was done by other projects as well, e.g. GCC for nvptx offloading, see https://patchwork.ozlabs.org/project/gcc/patch/20170113181123.GA1867@tucnak/

I'm attaching my patch here, which can still use a bit of cleanup. The resulting Open MPI should be equivalent to one built with CUDA as a builddependency, so no special OMPI_mca environment variables need to be set, and there's just one Open MPI, with two UCXes, as before.
OpenMPI-4.1.1_build_without_cuda_header.patch.txt

The only concern left is then the potential performance issue for small transfers since there is always a runtime check for CUDA.

@Micket
Contributor

Micket commented May 4, 2022

This is unexpected to me. Isn't OpenMPI linking to libcudart.so (and calling functions from there)?

@bartoldeman
Contributor

bartoldeman commented May 4, 2022

This is unexpected to me. Isn't OpenMPI linking to libcudart.so (and calling functions from there)?

No, what happens is that the code in common_cuda.c dlopen()s libcuda.so.1, and imports various symbols from there. Nothing is linked directly to libcudart.so or libcuda.so.
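This can be verified from the outside: with the dlopen() approach, ldd shows no libcuda/libcudart entries for the installed plugins. A small sketch (the plugin path is hypothetical; point it at your actual install):

```shell
# Hypothetical path to an Open MPI CUDA-capable MCA plugin
MCA_LIB="/apps/software/OpenMPI/4.1.1-GCC-11.2.0/lib/openmpi/mca_btl_smcuda.so"

if [ -e "$MCA_LIB" ]; then
    # With the dlopen()-based build this grep finds no direct CUDA dependency
    ldd "$MCA_LIB" | grep -E 'libcuda(rt)?\.so' || echo "no direct CUDA linkage"
else
    echo "plugin not found: $MCA_LIB"
fi
```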

@Micket
Contributor

Micket commented May 4, 2022

I see. Well, ldd libmpi.so on the old fosscuda OpenMPI does show it linking to libcudart.so, but that's just OpenMPI silliness then

@bartoldeman
Contributor

Simplified patch attached
OpenMPI-4.1.1_build_with_internal_cuda_header.patch.txt

@bartoldeman
Contributor

I see. Well, ldd libmpi.so on the old fosscuda OpenMPI does show it linking to libcudart.so, but that's just OpenMPI silliness then

EasyBuild silliness really, since gcccuda toolchains and friends add -lrt -lcudart to LIBS.

@ocaisa
Member

ocaisa commented May 5, 2022

I think it's called sideloading or something, OpenMPI does that for lots of libraries that it may or may not use.

@Micket
Contributor

Micket commented May 6, 2022

@bartoldeman Will you open a PR for this approach?
Do we need a PR for the easyblock, to introduce the option?
disable_cuda_awareness = True
or some such that forcibly sets --without-cuda ?

@ocaisa
Member

ocaisa commented May 19, 2022

@Micket the patch introduces a new potential value for the --with-cuda option (--with-cuda=internal), but it looks like we would need a way of ensuring that the patch is not applied when CUDA is a dependency (since that may have a little less overhead). We could tweak the OpenMPI easyblock to take a cuda key that accepts internal and no (default), and forcefully set the value if CUDA is a dependency.
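The three-way choice could map onto configure flags roughly like this. This is only a sketch of the proposal above: the cuda key and its values are not an existing easyblock option.

```shell
# Hypothetical 'cuda' key values: "no" (default), "internal",
# or a CUDA install prefix when CUDA is an actual dependency
CUDA_MODE="internal"

case "$CUDA_MODE" in
    no)       CONFIG_FLAG="--without-cuda" ;;
    internal) CONFIG_FLAG="--with-cuda=internal" ;;  # needs the internal-header patch
    *)        CONFIG_FLAG="--with-cuda=$CUDA_MODE" ;;
esac

echo "$CONFIG_FLAG"   # → --with-cuda=internal
```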

@ocaisa
Member

ocaisa commented May 19, 2022

Looking at the easyblock, I don't think there's any need for a change to the easyblock. You just add the patch and set

configopts = '--with-cuda=internal'

@ocaisa ocaisa mentioned this pull request May 19, 2022
@Micket
Contributor

Micket commented May 19, 2022

@Micket the patch introduces a new potential value for the --with-cuda option (--with-cuda=internal), but it looks like we would need a way of ensuring that the patch is not applied when CUDA is a dependency (since that may have a little less overhead).

I didn't think any of the patches we have discussed would induce any more or less overhead than what we have always had in the old fosscuda toolchains. There is simply a price to pay for making the memory-copying routines check for device buffers.

The only downside of enabling this cuda-awareness is that it does induce an overhead even for those who aren't using CUDA at all due to the extra branch + function call.
Though, this is just for collectives, and just matters for smaller message sizes. Also, @bartoldeman wrote some patches to reduce the overhead significantly. We should apply those patches even for the old fosscuda toolchains (no reason not to).

quoting @bartoldeman from the slack thread:


I did some profiling and with a small patch I could reduce the difference to <1% for small alltoall.
Basically what happens is that an often called small memcpy, which is normally inlined in non-CUDA-enabled Open MPI gets transformed into a function pointer for CUDA-enabled Open MPI. If I use

#define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
do { \
    if (CONVERTOR->flags & CONVERTOR_CUDA) \
        CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) ); \
    else \
        MEMCPY( (DST), (SRC), (BLENGTH) ); \
} while(0)

instead of

#    define MEMCPY_CUDA(DST, SRC, BLENGTH, CONVERTOR) \
        CONVERTOR->cbmemcpy((DST), (SRC), (BLENGTH), (CONVERTOR))

and similar in 2 other places, the performance difference goes down to <1%.


At that point, I don't even think it's worth caring about disabling CUDA awareness. It might already be possible by simply using:

configopts = '--without-cuda'

if you don't care about CUDA and don't want to pay even the <1% overhead.

@boegel
Member

boegel commented Jun 8, 2022

@akesandgren This can be closed now that #15528 is merged?

@akesandgren
Contributor Author

Jup, that it can, closing.

@akesandgren akesandgren closed this Jun 8, 2022
@akesandgren akesandgren deleted the 20220203165712_new_pr_OpenMPI411 branch June 8, 2022 12:28