{tools}[gfbf/2023a] jax v0.4.25 w/ CUDA 12.1.1 #20119

Merged


Contributor

@ThomasHoffmann77 ThomasHoffmann77 commented Mar 14, 2024

(created using eb --new-pr)
requires:

edit: requires bug fix in framework for "cp %s %(builddir)s/archives" to work as extract command:

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/43d87811306655a013126860c0bb6777 for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
FAILED
Build succeeded (with --ignore-test-failure) for 1 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/c51c43986eae5a7afe56f715d7c5c38c for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/b2b075b38d9d9d5c6fe4b0503dab7279 for a full test report.

@branfosj
Member

branfosj commented Mar 14, 2024

Test report by @branfosj
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
bear-pg0208u15a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8
See https://gist.github.com/branfosj/83b07adf11f9a9eea619d5b7e45eddb5 for a full test report.

Same three failures as #19841 (comment)

@branfosj
Member

branfosj commented Mar 15, 2024

Test report by @branfosj
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
bear-pg0208u31a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.154.05, Python 3.6.8
See https://gist.github.com/branfosj/bec290f9c00aa6309ee649e8ff185675 for a full test report.

Same three failures as #19841 (comment)

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded (with --ignore-test-failure) for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/59e7a52712f520a524e93b5b5210551b for a full test report.

@verdurin
Member

I don't have a build node set up to upload test reports, but I did see this test error:

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted

@verdurin
Member

I see you're all building with --ignore-test-failure - is that expected with jax?

@ThomasHoffmann77
Contributor Author

I don't have a build node set up to upload test reports, but I did see this test error:

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_1x4_float32_float64 PASSED [ 55%]
tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 Fatal Python error: Aborted
#16:38 thoffman@srv-mahamid-01#NVIDIA_TF32_OVERRIDE=0 CUDA_VISIBLE_DEVICES=0 XLA_PYTHON_CLIENT_ALLOCATOR=platform JAX_ENABLE_X64=true pytest -vv tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32
============================= test session starts ==============================
platform linux -- Python 3.11.3, pytest-7.4.2, pluggy-1.2.0 -- /g/easybuild/x86_64/Rocky/8/rome/software/Python/3.11.3-GCCcore-12.3.0/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/tmp/jax-jax-v0.4.25/.hypothesis/examples'))
rootdir: /tmp/jax-jax-v0.4.25
configfile: pyproject.toml
plugins: xdist-3.3.1, hypothesis-6.88.1
collected 1 item                                                               

tests/lax_scipy_special_functions_test.py::LaxScipySpcialFunctionsTest::testScipySpecialFun_gammainc_s_2x1x4_float32_float32 PASSED [100%]
============================== 1 passed in 3.21s ===============================
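
The isolated rerun above can be scripted so that the same environment is applied every time. This is a minimal sketch, assuming only the environment variables and pytest node ID shown in the command above; `build_pytest_command` and `run_single_test` are hypothetical helper names, not part of EasyBuild or jax, and the sketch invokes pytest as `python -m pytest` rather than the bare `pytest` executable.

```python
import os
import subprocess
import sys

# Environment overrides copied from the command shown above.
TEST_ENV = {
    "NVIDIA_TF32_OVERRIDE": "0",
    "CUDA_VISIBLE_DEVICES": "0",
    "XLA_PYTHON_CLIENT_ALLOCATOR": "platform",
    "JAX_ENABLE_X64": "true",
}


def build_pytest_command(test_id):
    """Return the argv for running one pytest node ID verbosely."""
    return [sys.executable, "-m", "pytest", "-vv", test_id]


def run_single_test(test_id, cwd="."):
    """Run one test in a fresh process and return its exit code.

    Hypothetical helper: cwd is assumed to be the unpacked jax source tree.
    """
    env = dict(os.environ, **TEST_ENV)
    return subprocess.run(build_pytest_command(test_id), cwd=cwd, env=env).returncode
```

A test that passes this way but aborts inside the full run suggests the failure depends on state left behind by earlier tests or on resource pressure, rather than on the test itself.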

@boegel boegel added the update label Mar 17, 2024
@boegel boegel added this to the 4.x milestone Mar 17, 2024
@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
srv-mahamid-01.embl.de - Linux AlmaLinux 8.8, x86_64, AMD EPYC 7513 32-Core Processor, 2 x NVIDIA NVIDIA GeForce RTX 3090, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/37e79d6b1006b4e8bee5438a97ef2ccd for a full test report.

@ThomasHoffmann77
Contributor Author

Test report by @ThomasHoffmann77
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
proline - Linux AlmaLinux 8.8, x86_64, 12th Gen Intel(R) Core(TM) i7-12700, 1 x NVIDIA NVIDIA RTX A4000, 535.113.01, Python 3.6.8
See https://gist.github.com/ThomasHoffmann77/61812c0e50d74c911c9d72e03155eac6 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (2 easyconfigs in total)
n1438 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/ee41d9059916ce8b1f93b9267d0c847f for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
i8002 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/327109d42642f3d3ed5c28565c08f20b for a full test report.

@Flamefire
Contributor

In both cases the failure is:

external/upb/upb/table.c: In function upb_inttable_pop:
external/upb/upb/table.c:588:10: error: val.val may be used uninitialized [-Werror=maybe-uninitialized]
  588 |   return val;
      |          ^~~
external/upb/upb/table.c:585:13: note: val.val was declared here
  585 |   upb_value val;
      |             ^~~

Due to -Werror added here

XLA comes with even more dependencies (workspace*.bzl). Can we add them as local repositories too? Maybe we could even auto-generate those lists via a Python script or so, similar to e.g. findPythonDeps, which outputs a list of Python packages for use in an EC. That script is bundled with EasyBuild, so it is readily available.
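
As a rough illustration of the auto-generation idea, the sketch below pulls archive names and URLs out of Bazel-style repository declarations. Everything here is an assumption: the sample text is made up, the real workspace*.bzl files in XLA may be laid out differently, and `list_archives` is a hypothetical helper, not an existing EasyBuild script.

```python
import re

# Hypothetical excerpt in the style of a Bazel workspace*.bzl file.
SAMPLE_BZL = '''
tf_http_archive(
    name = "example_dep",
    sha256 = "0000",
    urls = ["https://example.org/example_dep-1.0.tar.gz"],
)
http_archive(
    name = "other_dep",
    urls = ["https://example.org/other_dep-2.3.zip"],
)
'''

# One match per archive call; the body runs up to the closing parenthesis
# that sits alone at the start of a line.
CALL_RE = re.compile(r'\b(?:tf_)?http_archive\((.*?)\n\)', re.DOTALL)
NAME_RE = re.compile(r'name\s*=\s*"([^"]+)"')
URL_RE = re.compile(r'"(https?://[^"]+)"')


def list_archives(bzl_text):
    """Return (name, first_url) pairs for each archive declaration found."""
    archives = []
    for body in CALL_RE.findall(bzl_text):
        name = NAME_RE.search(body)
        urls = URL_RE.findall(body)
        if name and urls:
            archives.append((name.group(1), urls[0]))
    return archives
```

A regex pass like this would break on declarations that build their URLs through helper functions or variables, so a real script would likely need to evaluate the files rather than pattern-match them.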

@Flamefire
Contributor

Flamefire commented Mar 26, 2024

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
n1265 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/8cbb16221ab8da073cee85c97c0dd911 for a full test report.

This is caused by a crash. It isn't clear why it fails or in which test, because the crashing test file passes when I run it manually. Attaching GDB shows ~LogMessageFatal() as the cause. This needs more investigation into what the fatal error actually is, but it looks serious...

@casparvl
Contributor

It fails on our H100 80GB cards, not our A100 80GB cards :-)

Ah, I'm blind. So used to *100 being A100, not used to H100 yet (even though we should be having them in production in the next 3 or so weeks ;-)).

It's a bit of a long shot, but what if you only build for a single compute capability, i.e. only the one for H100 (that's 9.0 I believe, right)?
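
For reference, restricting a build to one compute capability can be done with EasyBuild's `--cuda-compute-capabilities` option. The sketch below only assembles that option string from GPU model names that appear in this thread; `cuda_cc_option` is a hypothetical helper, and the lookup table covers only these models.

```python
# Compute capabilities of GPU models that appear in this thread.
COMPUTE_CAPABILITY = {
    "V100": "7.0",
    "T4": "7.5",
    "A100": "8.0",
    "A40": "8.6",
    "RTX 3090": "8.6",
    "RTX A4000": "8.6",
    "H100": "9.0",
}


def cuda_cc_option(*gpus):
    """Build the eb option string for the given GPU models (hypothetical helper)."""
    # De-duplicate, then sort numerically so the output is stable.
    ccs = sorted({COMPUTE_CAPABILITY[gpu] for gpu in gpus}, key=float)
    return "--cuda-compute-capabilities=" + ",".join(ccs)
```

For example, `cuda_cc_option("H100")` yields the single-capability form suggested above, while passing several models yields a comma-separated list.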

@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn159.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 9334 32-Core Processor, 4 x NVIDIA NVIDIA H100, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/1f92447b1dfb9a8ef9cd7e0d72e739e0 for a full test report.

@akesandgren
Contributor

Nope, using single compute capability for the H100 (9.0) also fails in the same way.

@ThomasHoffmann77
Contributor Author

Nope, using single compute capability for the H100 (9.0) also fails in the same way.

0.4.29 has been released in the meantime. It might be worth trying this version.

@akesandgren
Contributor

First attempt at using 0.4.29 with this toolchain failed:

tests/sparse_nm_test.py::SpmmTest::test_shapes0 Fatal Python error: Aborted

Thread 0x000014a919150000 (most recent call first):
  File "/dev/shm/ake/build/jax/0.4.29/gfbf-2023a-CUDA-12.1.1/jax/jax-jax-v0.4.29/jax/_src/compiler.py", line 238
 in backend_compile
  File "/dev/shm/ake/build/jax/0.4.29/gfbf-2023a-CUDA-12.1.1/jax/jax-jax-v0.4.29/jax/_src/profiler.py", line 335
 in wrapper
  File "/dev/shm/ake/build/jax/0.4.29/gfbf-2023a-CUDA-12.1.1/jax/jax-jax-v0.4.29/jax/_src/compiler.py", line 608
 in _compile_and_write_cache
  File "/dev/shm/ake/build/jax/0.4.29/gfbf-2023a-CUDA-12.1.1/jax/jax-jax-v0.4.29/jax/_src/compiler.py", line 378
 in compile_or_get_cached
  File "/dev/shm/ake/build/jax/0.4.29/gfbf-2023a-CUDA-12.1.1/jax/jax-jax-v0.4.29/jax/_src/interpreters/pxla.py",
 line 2779 in _cached_compilation
...

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 4 out of 4 (1 easyconfigs in total)
alvis4-41 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/d5fe8ca36746f898d5d1defc484c686e for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 4 out of 4 (1 easyconfigs in total)
alvis3-16 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/332751b5d6100151342f5f2ccbcc7608 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 4 out of 4 (1 easyconfigs in total)
alvis9-07 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 1 x NVIDIA NVIDIA A40, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/a7488643356322088e5d131a18309892 for a full test report.

@akesandgren
Contributor

To get this to run on H100 one needs a newer CUDA: 0.4.29 with foss/2023a and CUDA/12.5.0 passes all but a single test, and that test itself is broken.

Contributor

@akesandgren akesandgren left a comment


LGTM

@ThomasHoffmann77
Contributor Author

🎉

fix local_extract_cmd according to @akesandgren's suggestion
@VRehnberg
Contributor

VRehnberg commented Jul 9, 2024

Edit: Recent change in global pip.conf made build fail. Unrelated to this PR.

@Flamefire
Contributor

Edit: Recent change in global pip.conf made build fail. Unrelated to this PR.

Which one exactly and what was the error? Might be worth addressing in framework

@VRehnberg
Contributor

VRehnberg commented Jul 9, 2024

Edit: Recent change in global pip.conf made build fail. Unrelated to this PR.

Which one exactly and what was the error? Might be worth addressing in framework

ERROR: Could not find an activated virtualenv (required).

https://gist.github.com/VRehnberg/5be54199260e8a478002d18dd986725c

[install]
require-virtualenv = true

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-11 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/141b0e91571ee25d1ea81ddb0e05c169 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis2-02 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 1 x NVIDIA Tesla T4, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/a10d64ffec054c2de2cef9a0960a1fbd for a full test report.

@Flamefire
Contributor

ERROR: Could not find an activated virtualenv (required).

Ah I remember that. There is a fix in the easyblocks: easybuilders/easybuild-easyblocks#3374
If that isn't used in jax (custom easyblocks might skip or replace that step), we need to include it there. I haven't checked yet whether it does.
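
To make the failure mode concrete: pip reads `require-virtualenv` from its config file, and the corresponding environment variable `PIP_REQUIRE_VIRTUALENV` takes precedence over the file. The sketch below detects the setting and builds an environment that overrides it. It is an illustration only; whether the referenced easyblocks fix works this way is an assumption, and `requires_virtualenv` and `pip_env` are hypothetical helper names.

```python
import configparser
import os

# The pip.conf fragment quoted earlier in this thread.
PIP_CONF = """
[install]
require-virtualenv = true
"""


def requires_virtualenv(conf_text):
    """Return True if the given pip.conf text enables require-virtualenv."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    for section in ("global", "install"):
        if parser.has_option(section, "require-virtualenv"):
            if parser.getboolean(section, "require-virtualenv"):
                return True
    return False


def pip_env(base_env=None):
    """Return a copy of the environment with the virtualenv requirement disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["PIP_REQUIRE_VIRTUALENV"] = "false"
    return env
```

Passing the result of `pip_env()` as the environment of the pip subprocess leaves the site-wide pip.conf untouched while still allowing an install outside a virtualenv.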

@VRehnberg
Contributor

Ah I remember that.

Thanks, I wasn't using that easyblock; I'm still on the 4.9.2 easyblocks.

It seems like the jax test suite is very VRAM hungry, since they only pass on A100 for some reason?

So the tests are not VRAM hungry. I never saw more than 1 GB used, and the T4 only has 16 GB in total, so that's a strict limit even if our monitoring missed a short spike. The tests do use about 27 GB of regular RAM though, in case that could be an issue.
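
A measurement like this can be reproduced by polling `nvidia-smi` while the test suite runs. The sketch below is one possible way to do it, not the monitoring actually used here; `parse_memory` and `sample_gpu_memory` are hypothetical helper names, and polling at intervals can miss short spikes, as noted above.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, in MiB, without headers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory():
    """Take one sample; requires nvidia-smi on the PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)
```

Calling `sample_gpu_memory()` in a loop and keeping the maximum of the first field gives a lower bound on peak VRAM use per GPU.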

@lexming
Contributor

lexming commented Jul 23, 2024

Test report by @lexming
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node409.hydra.os - Linux Rocky Linux 8.10, x86_64, AMD EPYC 7282 16-Core Processor (zen2), 1 x NVIDIA NVIDIA A100-PCIE-40GB, 550.90.07, Python 3.6.8
See https://gist.github.com/lexming/650ff1e0bf428372c2ea23fc00fbba4e for a full test report.

Contributor

@lexming lexming left a comment


LGTM

@lexming lexming dismissed akesandgren’s stale review July 24, 2024 05:22

review addressed

@lexming
Contributor

lexming commented Jul 24, 2024

Merging, thanks a lot for keeping up with all the issues @ThomasHoffmann77 !

@lexming lexming merged commit 3ea4e45 into easybuilders:develop Jul 24, 2024
9 checks passed