Added CUDA 11.0.2 and related recipes, incl. gompic/2020a and iccifortcuda/2020a #11295

Merged
boegel merged 16 commits into easybuilders:develop on Oct 2, 2020

bartoldeman
Contributor

Like #10935 this introduces a new CUDAcore easyconfig to share
CUDA between use of GCC and Intel compilers, but
unlike #10935 this uses a versionsuffix for UCX+CUDA so it does
not need any framework changes or MODULEPATH adjustments.
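
The effect of the versionsuffix can be illustrated with a simplified Python function that builds a flat, EasyBuild-style module name from name, version, toolchain and versionsuffix. This is a sketch under assumptions, not EasyBuild's own code, and `flat_module_name` is a hypothetical helper. Because the CUDA-enabled UCX gets its own module name, it can sit next to the plain UCX build at the same GCCcore level.

```python
from typing import Optional, Tuple

# A toolchain is (name, version); None stands for the SYSTEM toolchain.
Toolchain = Optional[Tuple[str, str]]


def flat_module_name(name: str, version: str,
                     toolchain: Toolchain = None,
                     versionsuffix: str = "") -> str:
    """Return '<name>/<version>[-<tc name>-<tc version>]<versionsuffix>'.

    Simplified sketch of a flat naming scheme: the toolchain part is
    left out for SYSTEM-level installations.
    """
    full_version = version
    if toolchain is not None:
        tc_name, tc_version = toolchain
        full_version += f"-{tc_name}-{tc_version}"
    return f"{name}/{full_version}{versionsuffix}"


gcccore = ("GCCcore", "9.3.0")
# Plain UCX and CUDA-enabled UCX become two distinct modules.
print(flat_module_name("UCX", "1.8.0", gcccore))
# -> UCX/1.8.0-GCCcore-9.3.0
print(flat_module_name("UCX", "1.8.0", gcccore, "-CUDA-11.0.2"))
# -> UCX/1.8.0-GCCcore-9.3.0-CUDA-11.0.2
# The shared CUDAcore installation lives at SYSTEM level.
print(flat_module_name("CUDAcore", "11.0.2"))
# -> CUDAcore/11.0.2
```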

@bartoldeman
Contributor Author

@boegelbot please test @ generoso

@easybuilders easybuilders deleted a comment from boegelbot Sep 16, 2020
@boegelbot
Collaborator

@bartoldeman: Request for testing this PR well received on generoso

PR test command 'EB_PR=11295 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11295 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7819

Test results coming soon (I hope)...

- notification for comment with ID 693637644 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in this PR)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/56e6bf3449538960d811354b1d47988f for a full test report.

@Micket
Contributor

Micket commented Sep 16, 2020

First, I agree with this approach compared to the other alternatives presented, though it got me thinking about the possibilities here:
If the MPI libraries only care about whether or not the UCX library has CUDA support, this would let us avoid the foss/fosscuda split. CUDA-enabled software would simply live under foss and depend on ('UCX', '1.8.0', '-CUDA-11.0.2') (presumably adding the same versionsuffix to its own name), i.e. something like this:

CUDA-11.0.2.eb # at SYSTEM level
UCX-1.8.0-GCCcore-9.3.0-CUDA-11.0.2.eb  # depends on CUDA-11.0.2
HPL-2.3-foss-2020a-CUDA-11.0.2.eb # depends on UCX/1.8.0-CUDA-11.0.2
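
The layout above depends on how a dependency tuple such as ('UCX', '1.8.0', '-CUDA-11.0.2') is resolved. The following sketch assumes the (name, version[, versionsuffix[, toolchain]]) ordering. It also assumes that, when no toolchain is given, the parent toolchain is searched first and then its subtoolchains, top-down. The hard-coded 2020a hierarchy (foss/2020a down to GCCcore/9.3.0), the `resolve_dep` helper and the `available` set are all hypothetical simplifications of what EasyBuild's robot does.

```python
from typing import Dict, Optional, Set, Tuple

Toolchain = Tuple[str, str]

# Hypothetical toolchain hierarchy for 2020a: toolchain -> direct subtoolchain.
SUBTOOLCHAIN: Dict[Toolchain, Optional[Toolchain]] = {
    ("foss", "2020a"): ("gompi", "2020a"),
    ("gompi", "2020a"): ("GCC", "9.3.0"),
    ("GCC", "9.3.0"): ("GCCcore", "9.3.0"),
    ("GCCcore", "9.3.0"): None,
}


def module_name(name: str, version: str, tc: Toolchain, suffix: str = "") -> str:
    # flat naming: <name>/<version>-<tc name>-<tc version><versionsuffix>
    return f"{name}/{version}-{tc[0]}-{tc[1]}{suffix}"


def resolve_dep(dep: tuple, parent_tc: Toolchain, available: Set[str]) -> str:
    """Resolve a (name, version[, versionsuffix[, toolchain]]) dependency.

    Without an explicit toolchain, walk from the parent toolchain down
    through its subtoolchains and return the first available module.
    """
    name, version = dep[0], dep[1]
    suffix = dep[2] if len(dep) > 2 else ""
    if len(dep) > 3:
        # explicit toolchain given as the 4th element
        return module_name(name, version, dep[3], suffix)
    tc: Optional[Toolchain] = parent_tc
    while tc is not None:
        candidate = module_name(name, version, tc, suffix)
        if candidate in available:
            return candidate
        tc = SUBTOOLCHAIN.get(tc)
    raise LookupError(f"no module found for {dep} under {parent_tc}")


available = {"UCX/1.8.0-GCCcore-9.3.0", "UCX/1.8.0-GCCcore-9.3.0-CUDA-11.0.2"}
print(resolve_dep(("UCX", "1.8.0", "-CUDA-11.0.2"), ("foss", "2020a"), available))
# -> UCX/1.8.0-GCCcore-9.3.0-CUDA-11.0.2
```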

A relatively minor thing, but if we ever want to do something like that, it would be nicer if the system-level CUDA got to keep using the CUDA name, and we used some other name (CUDA.GCC or whatever) for the module at the compiler level.
Simply adding an extra line to:
https://github.com/easybuilders/easybuild-framework/blob/7e1ec1c2903dfbe8afd27ca8e140ac58302e581c/easybuild/tools/module_naming_scheme/hierarchical_mns.py#L53-L69
should solve the HMNS issue (just as was done for the icc+ifort merge), and flat naming schemes don't care either way.
(Though if we were to actually opt for combining fosscuda into foss, we wouldn't need this change, as it would simply stop existing.)
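
The suggestion depends on a table in the hierarchical naming scheme that maps a combination of compiler-level components to a module subdirectory. The linked framework code is not reproduced here. The sketch below is a hypothetical, simplified stand-in, and the dictionary contents, the `CUDA.GCC` name and `compiler_subdir` are assumptions made for illustration. It shows why a single extra entry is enough to place a renamed compiler-level CUDA module.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the compiler-level mapping:
# sorted, comma-joined component names ->
# (module subdirectory name, version template over component versions).
COMP_TEMPLATES: Dict[str, Tuple[str, str]] = {
    # similar in spirit to the entry used for the icc+ifort merge
    "icc,ifort": ("intel", "%(icc)s"),
    # the "extra line": compiler-level CUDA renamed to CUDA.GCC
    "CUDA.GCC,GCC": ("GCC-CUDA", "%(GCC)s-%(CUDA.GCC)s"),
}


def compiler_subdir(components: Dict[str, str]) -> str:
    """Return the Compiler/... subdirectory for a set of components.

    `components` maps component name -> version, e.g. {"GCC": "9.3.0"}.
    """
    key = ",".join(sorted(components))
    if key in COMP_TEMPLATES:
        subdir, template = COMP_TEMPLATES[key]
        return f"Compiler/{subdir}/{template % components}"
    if len(components) == 1:
        # a single compiler needs no special mapping
        name, version = next(iter(components.items()))
        return f"Compiler/{name}/{version}"
    raise ValueError(f"no mapping for compiler components: {key}")


print(compiler_subdir({"GCC": "9.3.0"}))
# -> Compiler/GCC/9.3.0
print(compiler_subdir({"GCC": "9.3.0", "CUDA.GCC": "11.0.2"}))
# -> Compiler/GCC-CUDA/9.3.0-11.0.2
```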

While writing this, I realized there might be some issues for those who RPATH the UCX libraries into their OpenMPI installation: that would require multiple OpenMPI builds, forever forcing the use of separate foss and fosscuda toolchains. ☹️

@bartoldeman
Contributor Author

Test report by @bartoldeman
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in this PR)
build-node.computecanada.ca - Linux centos linux 7.8.2003, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.7.7
See https://gist.github.com/10cde85212b91406ae7f7bada4823d96 for a full test report.

@bartoldeman
Contributor Author

@Micket it's not clear to me from your comment what I should change in this PR; you said you were OK with the approach but did not approve it.

@boegel boegel added this to the next release (4.3.1) milestone Sep 25, 2020
@boegel boegel added the update label Sep 25, 2020
@boegel boegel dismissed their stale review September 25, 2020 14:54

Never mind, sticking to UCX 1.8.0 and CUDA 11.0.2 makes sense.

@boegel
Member

boegel commented Sep 25, 2020

To avoid the foss/fosscuda split, I think we should add support for "optional dependencies", where EasyBuild can be configured to include CUDA as a dep or not.

On GPU systems you would then configure EasyBuild with --with-optional-deps=CUDA or something similar, which would activate the optional CUDA dependency in UCX.

That's a good way forward to collapse foss and fosscuda I think...
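
`--with-optional-deps` is only a proposal in this thread and not an existing EasyBuild option. The following is a minimal sketch of the idea. It assumes a hypothetical split of an easyconfig's dependencies into required and optional lists, plus a set of enabled optional dependency names taken from the configuration. `parse_optional_deps`, `active_dependencies` and the example versions are illustrative only.

```python
from typing import List, Set, Tuple

Dep = Tuple[str, ...]


def parse_optional_deps(option_value: str) -> Set[str]:
    """Turn a hypothetical '--with-optional-deps=CUDA,...' value into a set."""
    return {name.strip() for name in option_value.split(",") if name.strip()}


def active_dependencies(required: List[Dep], optional: List[Dep],
                        enabled: Set[str]) -> List[Dep]:
    """Required deps are always kept; optional ones only if enabled by name."""
    return list(required) + [dep for dep in optional if dep[0] in enabled]


# Illustrative UCX-like dependency lists (versions are examples only).
required = [("numactl", "2.0.13")]
optional = [("CUDA", "11.0.2")]

print(active_dependencies(required, optional, set()))
# -> [('numactl', '2.0.13')]
print(active_dependencies(required, optional, parse_optional_deps("CUDA")))
# -> [('numactl', '2.0.13'), ('CUDA', '11.0.2')]
```

With this design a site builds each package only once, either with or without CUDA, so no foss/fosscuda split is needed. The price is that CPU-only and GPU installations can no longer coexist in one software stack.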

@mboisson
Contributor

Probably a good idea, but beyond this PR :P

@boegel boegel changed the title Added CUDA 11.0.2 and related recipes Added CUDA 11.0.2 and related recipes, incl. gompic/2020a and iccifortcuda/2020a Sep 25, 2020
@boegel
Member

boegel commented Sep 25, 2020

Test report by @boegel
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in this PR)
node3301.joltik.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/5edfd5e3b69a624567046276f0f19c8c for a full test report.

@boegel
Member

boegel commented Sep 25, 2020

> Probably a good idea, but beyond this PR :P

Yeah, I would like to go for an approach like that in the upcoming 2020b generation...

@boegel
Member

boegel commented Sep 25, 2020

Test report by @boegel
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in this PR)
node2706.swalot.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 2.7.5
See https://gist.github.com/79b212dd5b48add3fb9f1b85c4549fb6 for a full test report.

boegel
boegel previously approved these changes Sep 26, 2020
Member

@boegel boegel left a comment

I think this is ready to go...

@Micket @lexming @akesandgren Any last words on this before we merge it?

@lexming
Contributor

lexming commented Sep 26, 2020

I'll test this over the weekend

use OS_PKG_OPENSSL_DEV constant in UCX 1.8.0 easyconfig on top of CUDA 11.0.2
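
As far as I know, entries in an easyconfig's `osdependencies` can be a tuple of alternative package names, and any one of them satisfies the requirement. `OS_PKG_OPENSSL_DEV` bundles the OpenSSL development package names used by the common distribution families. The sketch below assumes the names `openssl-devel`, `libssl-dev` and `libopenssl-devel` for the constant. It uses a hypothetical `os_deps_satisfied` helper that checks against a given set of installed packages rather than querying a real package manager.

```python
from typing import Iterable, Set, Tuple, Union

# Assumed contents: RHEL/CentOS, Debian/Ubuntu and SUSE package names.
OS_PKG_OPENSSL_DEV = ("openssl-devel", "libssl-dev", "libopenssl-devel")

OsDep = Union[str, Tuple[str, ...]]


def os_deps_satisfied(osdeps: Iterable[OsDep], installed: Set[str]) -> bool:
    """Check that every OS dependency is met.

    A plain string must be installed as-is; a tuple lists alternatives,
    of which at least one must be installed.
    """
    for dep in osdeps:
        alternatives = (dep,) if isinstance(dep, str) else dep
        if not any(pkg in installed for pkg in alternatives):
            return False
    return True


osdependencies = [OS_PKG_OPENSSL_DEV]
print(os_deps_satisfied(osdependencies, {"libssl-dev"}))  # Ubuntu-style -> True
print(os_deps_satisfied(osdependencies, {"zlib-devel"}))  # -> False
```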
@boegel
Member

boegel commented Sep 28, 2020

> I'll test this over the weekend

@lexming Any results yet?

Add GDRCopy to UCX with CUDA
@bartoldeman
Contributor Author

@boegelbot please test @ generoso

@boegelbot
Collaborator

@bartoldeman: Request for testing this PR well received on generoso

PR test command 'EB_PR=11295 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11295 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7992

Test results coming soon (I hope)...

- notification for comment with ID 701567955 processed

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 6 out of 11 (11 easyconfigs in this PR)
generoso-x-3 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/ee2278700059a71300570325b1491bc4 for a full test report.

@bartoldeman
Contributor Author

Fails on generoso because of

ldconfig -n /tmp/boegelbot/GDRCopy/2.1/GCCcore-9.3.0-CUDA-11.0.2/gdrcopy-2.1/src
make[1]: ldconfig: Command not found

I'll check what can be done...

@bartoldeman
Contributor Author

I'll try something like prebuildopts="PATH=$PATH:/sbin && ", because ldconfig usually sits there and /sbin is not always in the PATH, but I need to be offline for a while.
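
`prebuildopts` is put in front of the build command, so the shell expands `$PATH` at build time and `make` (and the `ldconfig` call in the GDRCopy Makefile) sees the extended search path. The same idea can be sketched in Python. `path_with_sbin` and `find_ldconfig` are hypothetical helpers: the first appends `/sbin` and `/usr/sbin` to a PATH-style string only when they are missing, and the second looks up `ldconfig` with `shutil.which` on that extended path.

```python
import os
import shutil
from typing import Optional

# Directories where ldconfig usually lives but which may be missing from PATH.
SBIN_DIRS = ("/sbin", "/usr/sbin")


def path_with_sbin(path: str) -> str:
    """Append the sbin directories to a PATH-style string if they are missing."""
    entries = [p for p in path.split(os.pathsep) if p]
    for sbin in SBIN_DIRS:
        if sbin not in entries:
            entries.append(sbin)
    return os.pathsep.join(entries)


def find_ldconfig() -> Optional[str]:
    """Locate ldconfig, also searching sbin dirs that may not be on PATH."""
    return shutil.which("ldconfig", path=path_with_sbin(os.environ.get("PATH", "")))


print(path_with_sbin("/usr/bin:/bin"))  # on POSIX -> /usr/bin:/bin:/sbin:/usr/sbin
print(find_ldconfig())  # e.g. /usr/sbin/ldconfig, or None if not installed
```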

@boegel
Copy link
Member

boegel commented Sep 30, 2020

> I'll try something like prebuildopts="PATH=$PATH:/sbin && ", because ldconfig usually sits there and /sbin is not always in the PATH, but I need to be offline for a while.

ldconfig is indeed at /usr/sbin/ldconfig on generoso

@boegel
Copy link
Member

boegel commented Sep 30, 2020

Test report by @boegel
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
node3408.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/9cce610aa7571502d5ae5f1c8595cbb0 for a full test report.

@lexming
Contributor

lexming commented Sep 30, 2020

Test report by @lexming
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
node154.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/076daad50492f360566c6603017539da for a full test report.

lexming
lexming previously approved these changes Sep 30, 2020
Contributor

@lexming lexming left a comment

LGTM

It's not in the standard PATH on generoso.
@bartoldeman
Contributor Author

@boegelbot please test @ generoso

@boegelbot
Collaborator

@bartoldeman: Request for testing this PR well received on generoso

PR test command 'EB_PR=11295 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11295 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7993

Test results coming soon (I hope)...

- notification for comment with ID 701712682 processed

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
generoso-x-3 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/4aef6122d11c00f0cb6a9db71a40a993 for a full test report.

@bartoldeman
Contributor Author

Test report by @bartoldeman
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
build-node.computecanada.ca - Linux centos linux 7.8.2003, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.7.7
See https://gist.github.com/0d91a9465733179f6fda3926dda1714f for a full test report.

@lexming
Contributor

lexming commented Oct 1, 2020

Test report by @lexming
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
node154.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/b939306feb9a8364a30104fef72f4706 for a full test report.

Contributor

@lexming lexming left a comment

LGTM

@boegel
Member

boegel commented Oct 1, 2020

Test report by @boegel
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
node3404.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/6e437eed7ddba02daa96a7badf2e04ec for a full test report.

@boegel
Member

boegel commented Oct 1, 2020

Test report by @boegel
SUCCESS
Build succeeded for 12 out of 12 (11 easyconfigs in this PR)
node3502.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7302P 16-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/f45264ebb3d1328c78a0b79dd6e4f509 for a full test report.

@boegel
Member

boegel commented Oct 1, 2020

Test report by @boegel
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
node2424.golett.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (haswell), Python 2.7.5
See https://gist.github.com/97c627d65aa9b3895e55449e50e26eaa for a full test report.

@zao
Contributor

zao commented Oct 1, 2020

Test report by @zao
SUCCESS
Build succeeded for 11 out of 11 (11 easyconfigs in this PR)
freja - Linux Ubuntu 20.04, x86_64, Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, Python 3.8.2
See https://gist.github.com/c9fdbe7ed70ed6a0f989b6f12a48bd6f for a full test report.

@branfosj
Member

branfosj commented Oct 1, 2020

Test report by @branfosj
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in this PR)
bear-pg0306u19a.bear.cluster - Linux RHEL 8.2, POWER, 8335-GTX (power9le), Python 3.6.8
See https://gist.github.com/7ddd170bfa042bbffd88aa27b610ca89 for a full test report.

@boegel
Member

boegel commented Oct 2, 2020

Going in, thanks @bartoldeman!

@boegel boegel merged commit 9fef154 into easybuilders:develop Oct 2, 2020

9 participants