Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

{chem}[GCCcore/11.2.0,foss/2021b] LAMMPS v29Sep2021 (with and without cuda) #14815

Closed

Conversation

arkdavy
Copy link

@arkdavy arkdavy commented Jan 21, 2022

(created using eb --new-pr)
requires easybuilders/easybuild-easyblocks#2213

@arkdavy
Copy link
Author

arkdavy commented Jan 21, 2022

Hi, I wanted to test the easyconfigs at my laptop, but running python3 -m test.easyconfigs.suite from easybuild-dev/easybuild-easyconfigs from just pulled repository (i.e. clean development branch) did not pass all tests, so I could not continue. The following errors appear

AIL: test_style_conformance (test.easyconfigs.styletests.StyleTest)
Check the easyconfigs for style
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/adavydov/easybuild-dev/easybuild_devcompile-dUIPaZ/easybuild-framework/easybuild/base/testing.py", line 94, in assertEqual
    super(TestCase, self).assertEqual(a, b)
  File "/usr/lib/python3.8/unittest/case.py", line 912, in assertEqual
    assertion_func(first, second, msg=msg)
AssertionError: 301 != 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/adavydov/easybuild-dev/easybuild-easyconfigs/test/easyconfigs/styletests.py", line 68, in test_style_conformance
    self.assertEqual(result, 0, error_msg)
  File "/home/adavydov/easybuild-dev/easybuild_devcompile-dUIPaZ/easybuild-framework/easybuild/base/testing.py", line 116, in assertEqual
    raise AssertionError("%s:\nDIFF%s:\n%s" % (msg, limit, ''.join(diff[:self.ASSERT_MAX_DIFF])))
AssertionError: There shouldn't be any code style errors (and/or warnings), found 301:

And there are plenty of lines below, eg.,

libwebp-1.1.0-GCCcore-10.2.0.eb:31:37: E741 ambiguous variable name 'l'

I have loaded the development version of easybuild via a module. Any ideas why is that so?

Best Regards,
Arkadiy

@arkdavy
Copy link
Author

arkdavy commented Jan 31, 2022

One fail, eg., test-suite (3.5, Lmod-7.8.22, Lua), "Error: Version 3.5 with arch x64 "not found is not clear to me. Does not look like it's the submitted easyconfings files

@akesandgren
Copy link
Contributor

I will produce some suggestions to this in a while, currently testing a build of a slightly different CUDA version

So don't merge just yet.

@arkdavy
Copy link
Author

arkdavy commented Apr 7, 2022

Thanks, don't worry, there is a long way before merge because easybuilders/easybuild-easyblocks#2213 has to be submitted first.

]

# hardware-specific option
cuda_compute_capabilities = ['8.0']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be included here, it should be set by the site

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but you can comment it out and leave it as a guide)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replacing it with a comment about cuda_compute_capabilities would be good. Especially as this can highlight that the LAMMPS easyblock builds for the highest item in the list, as opposed to some other software that produces a fat-binary compatible with all items in that list.

@ocaisa
Copy link
Member

ocaisa commented May 13, 2022

There are quite a few options being given that don't seem to be understood by CMake:

  Manually-specified variables were not used by the project:

    BOOST_ROOT
    BUILD_EXE
    BUILD_LIB
    Boost_NO_SYSTEM_PATHS
    DOWNLOAD_EIGEN3
    EIGEN3_INCLUDE_DIR
    Eigen3_DIR
    PKG_USER-INTEL
    PKG_USER-OMP

@ocaisa
Copy link
Member

ocaisa commented May 13, 2022

The docs seem to want to do a git clone:

  error: subprocess-exited-with-error
  
   git clone --filter=blob:none --quiet git://github.com/akohlmey/sphinx-fortran /tmp/eb-zygtj2ct/pip-req-build-w6ehr51x did not run successfully.
   exit code: 128
  > See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

 git clone --filter=blob:none --quiet git://github.com/akohlmey/sphinx-fortran /tmp/eb-zygtj2ct/pip-req-build-w6ehr51x did not run successfully.
 exit code: 128
> See above for output.

I found a requirements.txt file which contains:

Sphinx==4.0.3
sphinxcontrib-spelling==7.2.1
git+git://github.com/akohlmey/sphinx-fortran@parallel-read
sphinx_tabs==3.2.0
breathe==4.31.0
Pygments==2.10.0
six==1.16.0

@arkdavy
Copy link
Author

arkdavy commented May 13, 2022

Ah, I was fixing the git issue with either a patch or local configuration. Not sure what should we do here. Perhaps just abandon documentation and the corresponding patch would be easier

The point above (CMake)...

The messages about the configured libraries

libraries configured within
3Mar2020:
['BLAS', 'CURL', 'Eigen3', 'FFTW3', 'GSL', 'Git', 'HDF5', 'JPEG', 'KIM-API (required version >= 2.1)', 'LAPACK', 'MPI', 'NetCDF', 'OpenMP', 'PNG', 'PkgConfig', 'PythonInterp', 'PythonLibs', 'Threads', 'VORO', 'VTK', 'ZLIB']

29Sep2021:
['BLAS', 'CURL', 'Doxygen (required version >= 1.8.10)', 'FFTW3', 'GSL', 'Git', 'HDF5', 'JPEG', 'KIM-API (required version >= 2.2.0)', 'LAPACK', 'MPI', 'NetCDF', 'OpenMP', 'PNG', 'PkgConfig', 'Python', 'Python3', 'TPLLIBDL', 'Threads', 'VORO', 'ZLIB']

so, Eigen is missing in the second one, Boost is not present in both because it is actually not in the dependencies list. I guess it is possible to fix Eigen (in easyblock), or just remove it (easier), and ignore Boost warning.

It is harder to work through packages because previous lammps version did not report the packages it is build with in the log. But

PKG_USER-INTEL
PKG_USER-OMP

are set by the easyblock. PKG_USER-OMP was configuring what is OPENMP now as far as I understood, and it is configured. I will try also adding INTEL to the package list, which is not configured in the new build.

I definitely have exe file (lmp). And I do have liblammps.so in the lib64 directory, so BUILD_EXE and BUILD_LIB can be also ignored since the corresponding targets are anyway present.

Best,
Arkadiy

@ocaisa
Copy link
Member

ocaisa commented May 13, 2022

Looks like the easyblock needs some more tweaking I think to handle the newer scenarios.

@ocaisa
Copy link
Member

ocaisa commented May 13, 2022

With the option

configopts = '-DBUILD_DOC=off'

this builds ok for me (but I built it on a node with no GPUs, so it failed the tests!).

@arkdavy
Copy link
Author

arkdavy commented May 14, 2022

I see, I will try to work on EasyBlock next week. It does not look that complicated

@ocaisa
Copy link
Member

ocaisa commented Jul 20, 2022

@arkdavy I suggest we close this in favour of #15877

The one thing that is missing there is forcing CUDA awareness in Kokkos, I'm not sure we actually need to do that...does the detection fail for our (unusual) OpenMPI setup? If we do need it, we have to include UCX-CUDA or it won't work.

@ocaisa
Copy link
Member

ocaisa commented Jul 20, 2022

Ah, according to #14517 it looks like we do need to force-enable the MPI CUDA support

@arkdavy
Copy link
Author

arkdavy commented Jul 21, 2022

Hi @ocaisa, I was working on it, and cuda-awareness is on via NCCL(has UCX-CUDA inside) and LAMMPS-29Sep2021-force-cudaaware-kokkos.patch

@ocaisa
Copy link
Member

ocaisa commented Jul 21, 2022

I came up with another patch yesterday that I think is a little safer (will still work on non-GPU nodes): lammps/lammps#3140 (comment)

@arkdavy
Copy link
Author

arkdavy commented Jul 21, 2022

Ok, let me test/think about it today to see if it does the job for the submitted easyconfig

@arkdavy
Copy link
Author

arkdavy commented Jul 21, 2022

I've got an error

nvlink fatal : Could not open input file '/home/adavydov/easybuild/local_installpath/software/tbb/2020.3-GCCcore-11.2.0/lib/libtbbmalloc.so'

Although the mentioned file is present and has "INPUT (libtbbmalloc.so.2)" inside. Any ideas?

@branfosj
Copy link
Member

INPUT (libtbbmalloc.so.2)

It is failing to correctly deal with that file (see https://sourceware.org/binutils/docs/ld/Implicit-Linker-Scripts.html for information about these type of files), which are an alternative method to having a symlink of libtbbmalloc.so -> libtbbmalloc.so.2.

@arkdavy
Copy link
Author

arkdavy commented Jul 21, 2022

Does it mean that one has to change the way tbb is setting up the libraries? I don't see how to compile LAMMPS without it for a moment, besides ignoring tbb... Although #15876 seems to work with it.

@ocaisa
Copy link
Member

ocaisa commented Jul 21, 2022

Only one package needs TBB, but I can't remember which now. If you exclude that you can move on with your life!

@ocaisa
Copy link
Member

ocaisa commented Jul 22, 2022

From https://docs.lammps.org/Build_extras.html, it looks like only the INTEL package needs TBB

@ocaisa
Copy link
Member

ocaisa commented Jul 25, 2022

@arkdavy I see the same issue for #15900 (see https://gist.github.com/boegelbot/99fd77113c6b64c588539ac40def2884#file-lammps-23jun2022-foss-2021a-kokkos-cuda-11-3-1_partial-log-L484). The issue is with nvlink and it's likely fixed by bumping the CUDA dep, but that is not easy for us. A way around it is to disable the Intel package, or put a symlink in place that the linker encounters first and points to the "real" libtbbmalloc.so.2

@ocaisa
Copy link
Member

ocaisa commented Jul 25, 2022

I did a bit of digging around and this is triggered by the INTEL package which uses TBB. That package can be built without TBB, so we could just comment out the TBB dep.

When TBB is there you get

-- Found TBB_MALLOC: /project/boegelbot/Rocky8/haswell/software/tbb/2020.3-GCCcore-10.3.0/lib/libtbbmalloc.so

So probably we could skirt around this by providing the path to the versioned .so explicitly

@ocaisa
Copy link
Member

ocaisa commented Jul 25, 2022

The "fix" is in 920d48e (alternatively one could define CMake variables to the library but it didn't seem worth the effort)

@arkdavy
Copy link
Author

arkdavy commented Jul 25, 2022

Ok, I have removed the tbb, and could build the code. I will update the branch but still will have to test if the new cudaaware works. Meanwhile, having --update-pr broken because EB does not see GitPython and I have to go now. I will write here as soon as cuda aware tests are done (I do not see how to definitely check it without running a couple of calculations and a cluster).

@arkdavy
Copy link
Author

arkdavy commented Jul 26, 2022

Hi @ocaisa . I have tried the new patch (in the last commit), and it didn't work apparently:

WARNING: Turning off GPU-aware MPI since it is not detected, use '-pk kokkos gpu/aware on' to override (src/KOKKOS/kokkos.cpp:277)

The same is concluded by comparing the timing of my tests between 1-GPU and 2-GPU calculations

@ocaisa
Copy link
Member

ocaisa commented Jul 26, 2022

@arkdavy is that with a rebuild of the OpenMPI module?

@arkdavy
Copy link
Author

arkdavy commented Jul 27, 2022

Ok, I have rebuilt the OpenMPI and see that there are CUDA-related patches and config option, but the result is still the same

Copy link
Member

@jfgrimm jfgrimm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As decided in issue #16330, we have deprecated the use of True to signify a system-toolchain dependency (#16384), in favour of the more intuitive SYSTEM template constant. Due to the change in the test suite, please run eb --sync-pr-with-develop 14815 and update the PR to use SYSTEM instead.

@arkdavy arkdavy closed this Oct 27, 2022
@boegel boegel added the update label Oct 27, 2022
@boegel boegel added this to the 4.x milestone Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants