

PoC: Add groundwork for separating actual cuda parts in OpenMPI. WIP WIP WIP #14919

Conversation

akesandgren
Contributor

@akesandgren akesandgren commented Feb 3, 2022

(created using eb --new-pr)

NOTE: PoC only (even if it actually works)

This shows how we can separate the CUDA parts in OpenMPI in such a way that we can handle OpenMPI the same way we do UCX and UCX-CUDA.

To build the CUDA-split base version (i.e. plain OpenMPI), configure with --with-cuda=enable and no CUDA dependencies.
To build the fully CUDA-enabled OpenMPI-CUDA, configure with --with-cuda=$EBROOTCUDA (as usual) and install only

libmca_common_cuda*
mca_btl_smcuda*
mca_pml_ob1*
mca_pml_ucx*
mca_btl_openib*
mca_btl_tcp*
mca_rcache_gpusm*
mca_rcache_grdma*
mca_rcache_rgpusm*

into OpenMPI-CUDA.../lib and lib/openmpi
And finally make sure OMPI_MCA_mca_component_path first lists OpenMPI-CUDA's lib and lib/openmpi, and then the base OpenMPI's lib and lib/openmpi
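The plugin-path override described above can be sketched as a shell snippet. The install prefixes here are hypothetical (adjust them to your actual module tree); only the ordering of the path entries is the point:

```shell
# Hypothetical install prefixes for the split packages
OPENMPI_ROOT=/apps/software/OpenMPI/4.1.1-GCC-11.2.0
OPENMPI_CUDA_ROOT=/apps/software/OpenMPI-CUDA/4.1.1-GCC-11.2.0-CUDA-11.4.1

# List the CUDA-enabled plugin directories first, then the base ones,
# so the CUDA variants of the MCA components win at component load time
export OMPI_MCA_mca_component_path="${OPENMPI_CUDA_ROOT}/lib:${OPENMPI_CUDA_ROOT}/lib/openmpi:${OPENMPI_ROOT}/lib:${OPENMPI_ROOT}/lib/openmpi"

echo "$OMPI_MCA_mca_component_path"
```

In an EasyBuild module this would typically be done via modextravars in the OpenMPI-CUDA easyconfig rather than by hand.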

The OpenMPI-4.1.1_fix_missing_OPAL_CUDA_GDR_SUPPORT_protection.patch doesn't really have anything to do with this; it's just a fix for an actual bug in OpenMPI that was uncovered by this work.

And CUDA_GDR support is still not handled since that will also need manual intervention during configure.

Depends on: easybuilders/easybuild-easyblocks#2710

@akesandgren akesandgren changed the title Add groundwork for separating actual cuda parts in OpenMPI. PoC: Add groundwork for separating actual cuda parts in OpenMPI. WIP WIP WIP Feb 3, 2022
@boegelbot
Collaborator

@akesandgren: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/1790411763
Output from first failing test suite run:

FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 873, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches in OpenMPI-4.1.1-GCC-11.2.0.eb: found 1 sources + 8 patches vs 6 checksums

----------------------------------------------------------------------
Ran 14165 tests in 728.390s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@boegel
Member

boegel commented Feb 3, 2022

@akesandgren Did you open PRs upstream for these changes? If so, please drop the PR URLs in the PR description

@akesandgren
Contributor Author

I haven't done that yet. Was kind of hoping for some constructive criticism here first...
The problem with the above is that there are still 4 mca modules that need to be built and installed twice.

@boegelbot
Collaborator

@akesandgren: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/2195839723
Output from first failing test suite run:

FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 910, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches in OpenMPI-4.1.1-GCC-11.2.0.eb: found 1 sources + 8 patches vs 6 checksums
Checksums missing for one or more sources/patches in OpenMPI-CUDA-4.1.1-GCC-11.2.0-CUDA-11.4.1.eb: found 1 sources + 8 patches vs 6 checksums

----------------------------------------------------------------------
Ran 14517 tests in 532.978s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@akesandgren
Contributor Author

@boegelbot Please build @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'

@boegelbot
Collaborator

Got message " Please build @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'", but I don't know what to do with it, sorry...

- notification for comment with ID 1104186093 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@akesandgren
Contributor Author

@boegelbot Please test @ generoso
EB_ARGS='--include-easyblocks-from-pr 2710'

@boegelbot
Collaborator

@akesandgren: Request for testing this PR well received on login1

PR test command 'EB_PR=14919 EB_ARGS="--include-easyblocks-from-pr 2710" /opt/software/slurm/bin/sbatch --job-name test_PR_14919 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8390

Test results coming soon (I hope)...

- notification for comment with ID 1104201893 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2710
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/4025828c5fec4e9ba74622b4acc7743f for a full test report.

@bartoldeman
Contributor

I think I have a simpler solution, which is to include those parts of cuda.h that are necessary to build Open MPI.
This was done by other projects as well, e.g. GCC for nvptx offloading, see https://patchwork.ozlabs.org/project/gcc/patch/20170113181123.GA1867@tucnak/

I'm attaching my patch here, which can still use a bit of cleanup. The resulting Open MPI should be equivalent to one built with CUDA as a builddependency, so no special OMPI_mca environment variables need to be set, and there's just one Open MPI, with two UCXes, as before.
OpenMPI-4.1.1_build_without_cuda_header.patch.txt

The only concern left is then the potential performance issue for small transfers since there is always a runtime check for CUDA.

@Micket
Contributor

Micket commented May 4, 2022

This is unexpected to me. Isn't OpenMPI linking to libcudart.so (and calling functions from there)?

@bartoldeman
Contributor

bartoldeman commented May 4, 2022

This is unexpected to me. Isn't OpenMPI linking to libcudart.so (and calling functions from there)?

No, what happens is that the code in common_cuda.c dlopen()s libcuda.so.1, and imports various symbols from there. Nothing is linked directly to libcudart.so or libcuda.so.
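This can be verified from the outside: with the dlopen() approach, ldd shows no libcuda/libcudart entries for the installed plugins. A small sketch (the plugin path is hypothetical; point it at your actual install):

```shell
# Hypothetical path to an Open MPI CUDA-capable MCA plugin
MCA_LIB="/apps/software/OpenMPI/4.1.1-GCC-11.2.0/lib/openmpi/mca_btl_smcuda.so"

if [ -e "$MCA_LIB" ]; then
    # With the dlopen()-based build this grep finds no direct CUDA dependency
    ldd "$MCA_LIB" | grep -E 'libcuda(rt)?\.so' || echo "no direct CUDA linkage"
else
    echo "plugin not found: $MCA_LIB"
fi
```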

@Micket
Contributor

Micket commented May 4, 2022

I see. Well, ldd libmpi.so on the old fosscuda OpenMPI does show it linking to libcudart.so, but that's just OpenMPI silliness then

@bartoldeman
Contributor

Simplified patch attached
OpenMPI-4.1.1_build_with_internal_cuda_header.patch.txt

@bartoldeman
Contributor

I see. Well, ldd libmpi.so on the old fosscuda OpenMPI does show it linking to libcudart.so, but that's just OpenMPI silliness then

EasyBuild silliness really, since gcccuda toolchains and friends add -lrt -lcudart to LIBS.

@ocaisa
Member

ocaisa commented May 5, 2022

I think it's called sideloading or something, OpenMPI does that for lots of libraries that it may or may not use.

@Micket
Contributor

Micket commented May 6, 2022

@bartoldeman Will you open a PR for this approach?
Do we need a PR for the easyblock, to introduce the option?
disable_cuda_awareness = True
or some such that forcibly sets --without-cuda ?

@ocaisa
Member

ocaisa commented May 19, 2022

@Micket the patch introduces a new potential value for the --with-cuda option (--with-cuda=internal), but it looks like we would need a way of ensuring that the patch is not applied when CUDA is a dependency (since that may have a little less overhead). We could tweak the OpenMPI easyblock to take a cuda key that accepts internal and no (default), and forcefully set the value if CUDA is a dependency.
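The three-way choice could map onto configure flags roughly like this. This is only a sketch of the proposal above: the cuda key and its values are not an existing easyblock option.

```shell
# Hypothetical 'cuda' key values: "no" (default), "internal",
# or a CUDA install prefix when CUDA is an actual dependency
CUDA_MODE="internal"

case "$CUDA_MODE" in
    no)       CONFIG_FLAG="--without-cuda" ;;
    internal) CONFIG_FLAG="--with-cuda=internal" ;;  # needs the internal-header patch
    *)        CONFIG_FLAG="--with-cuda=$CUDA_MODE" ;;
esac

echo "$CONFIG_FLAG"   # → --with-cuda=internal
```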

@ocaisa
Member

ocaisa commented May 19, 2022

Looking at the easyblock, I don't think there's any need for a change to the easyblock. You just add the patch and set

configopts = '--with-cuda=internal'

@ocaisa ocaisa mentioned this pull request May 19, 2022
@Micket
Contributor

Micket commented May 19, 2022

@Micket the patch introduces a new potential value for the --with-cuda option (--with-cuda=internal), but it looks like we would need a way of ensuring that the patch is not applied when CUDA is a dependency (since that may have a little less overhead).

I didn't think any of the patches we have discussed would induce any more or less overhead than what we have always had in the old fosscuda toolchains. There is simply a price to pay for making the memory-copying routines check for device buffers.

The only downside of enabling this cuda-awareness is that it does induce an overhead even for those who aren't using CUDA at all due to the extra branch + function call.
Though, this is just for collectives, and just matters for smaller message sizes. Also, @bartoldeman wrote some patches to reduce the overhead significantly. We should apply those patches even for the old fosscuda toolchains (no reason not to).

quoting @bartoldeman from the slack thread:


I did some profiling and with a small patch I could reduce the difference to <1% for small alltoall.
Basically what happens is that an often called small memcpy, which is normally inlined in non-CUDA-enabled Open MPI gets transformed into a function pointer for CUDA-enabled Open MPI. If I use

#define MEMCPY_CUDA( DST, SRC, BLENGTH, CONVERTOR ) \
do { \
    if (CONVERTOR->flags & CONVERTOR_CUDA) \
        CONVERTOR->cbmemcpy( (DST), (SRC), (BLENGTH), (CONVERTOR) ); \
    else \
        MEMCPY( (DST), (SRC), (BLENGTH) ); \
} while(0)

instead of

#    define MEMCPY_CUDA(DST, SRC, BLENGTH, CONVERTOR) \
        CONVERTOR->cbmemcpy((DST), (SRC), (BLENGTH), (CONVERTOR))

and similar in 2 other places, the performance difference goes down to <1%.


At that point, I don't even think it's worth caring about disabling CUDA awareness. It might already be possible by simply using:

configopts = '--without-cuda'

if you don't care about CUDA and don't want to pay even the <1% overhead.

@boegel
Member

boegel commented Jun 8, 2022

@akesandgren This can be closed now that #15528 is merged?

@akesandgren
Contributor Author

Jup, that it can, closing.

@akesandgren akesandgren closed this Jun 8, 2022
@akesandgren akesandgren deleted the 20220203165712_new_pr_OpenMPI411 branch June 8, 2022 12:28