switch to NCCL 2.8.3 built from source for CuPy, Horovod, libgpuarray, PyTorch and TensorFlow using fosscuda/2020b (+ add missing astor extension for TensorFlow) #13103

Merged

@Flamefire
Contributor

Flamefire commented Jun 10, 2021

(created using eb --new-pr)

edit: requires #13071 (NCCL 2.8.3 from source)

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 18 out of 21 (8 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/e9a292f37996f8d77d9d86a733335d92 for a full test report.

boegel changed the title from "Use updated NCCL" to "switch to NCCL 2.8.3 built from source for CuPy, Horovod, libgpuarray, PyTorch and TensorFlow using fosscuda/2020b (+ add missing astor extension for TensorFlow)" on Jun 11, 2021

@Micket
Contributor

Micket left a comment

LGTM (I hope to have a test report coming in after... many hours)

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 8 out of 8 (8 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/80508e6a75fd63dcca52f793c893b9d0 for a full test report.

@boegel
Member

boegel commented Jun 12, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=13103 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_13103 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 17447

Test results coming soon (I hope)...

- notification for comment with ID 860071762 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 4 out of 9 (8 easyconfigs in total)
generoso-x-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/fe945f39397bc1e2a2356d0ef3ed1f1f for a full test report.

@Micket
Contributor

Micket commented Jun 13, 2021

Test report by @Micket
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-02 - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, Python 3.6.8
See https://gist.github.com/de39519e6390888866d1837df08a7aa5 for a full test report.

@boegel
Member

boegel commented Jun 13, 2021

Test report by @boegel
SUCCESS
Build succeeded for 10 out of 10 (8 easyconfigs in total)
node3307.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/2fb7767f4d1384ec7fb54c6ae161ce2d for a full test report.

@Micket
Contributor

Micket commented Jun 13, 2021

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis1-02 - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, Python 3.6.8
See https://gist.github.com/8a1260556e32023ab135bc8d527d0b53 for a full test report.

@boegel
Member

boegel commented Jun 13, 2021

Going in, thanks @Flamefire!

boegel merged commit e78c833 into easybuilders:develop on Jun 13, 2021

@Micket
Contributor

Micket commented Jun 13, 2021

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
alvis1-02 - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, Python 3.6.8
See https://gist.github.com/a239d2b379ede6c1f81f37a610d871d5 for a full test report.

Flamefire deleted the 20210610150123_new_pr_TensorFlow241 branch on June 14, 2021 06:33
4 participants