
Framework: PR test machine problem? #12732

Closed
jhux2 opened this issue Feb 9, 2024 · 21 comments
Labels: autotester (Issues related to the autotester), PA: Framework (Issues that fall under the Trilinos Framework Product Area), type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@jhux2
Member

jhux2 commented Feb 9, 2024

Bug Report

@trilinos/framework @sebrowne

PR testing for #12726 and #12722 is failing on ascic164 with similar errors in unrelated packages. Memory or disk space issue?

Standard Error:

collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.
[CTest: warning matched] /usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.5
@jhux2 jhux2 added the type: bug The primary issue is a bug in Trilinos code or tests label Feb 9, 2024
@jhux2 jhux2 changed the title PackageName: General Summary of the Bug Framework: PR test failures Feb 9, 2024
@jhux2 jhux2 added PA: Framework Issues that fall under the Trilinos Framework Product Area autotester Issues related to the autotester. labels Feb 9, 2024
@jhux2 jhux2 changed the title Framework: PR test failures Framework: PR test machine problem? Feb 9, 2024
@jhux2
Member Author

jhux2 commented Feb 10, 2024

And #12731.

@jhux2 jhux2 pinned this issue Feb 10, 2024
@sebrowne
Contributor

This is effectively the same issue as was first seen in #12707 (comment). I'm monitoring and can confirm that some of the machines are reliably running out of disk space, and it appears to be just from our builds/tests. Some of them have over 1TB of disk, so I think we have a problem, unless it REALLY does take that much space to build and test OpenMP + GCC + OpenMPI Trilinos.
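A simple way to confirm this kind of fill-up is to sample free space on the build filesystem while a PR build runs. A minimal sketch; the function name and log format are my own, not the actual autotester tooling:

```shell
# monitor_disk LOGFILE DIR SAMPLES INTERVAL
# Append "epoch-seconds available-KB" for DIR's filesystem to LOGFILE,
# taking SAMPLES readings spaced INTERVAL seconds apart.
monitor_disk() {
    log=$1
    dir=$2
    samples=$3
    interval=$4
    i=0
    while [ "$i" -lt "$samples" ]; do
        # df -P guarantees one POSIX-format line per filesystem;
        # field 4 of the data line is available space in 1K blocks.
        avail=$(df -P "$dir" | awk 'NR==2 {print $4}')
        printf '%s %s\n' "$(date +%s)" "$avail" >> "$log"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

Plotting the second column over the course of a build makes a steady fill-up (and the offending build) easy to spot.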

@cgcgcg
Contributor

cgcgcg commented Feb 12, 2024

Is there something in particular that uses a lot of space? Or is the increase across the board on all object files?

@sebrowne
Contributor

The builds that are particularly problematic are (now that I notice it) the statically linked ones. So my guess is that the conclusion will be "it's just that expensive to link Trilinos statically"; more specifically, that it's that expensive to link the tests statically. I'm running a build to get a feel for how much space it takes, and will report back. Most likely some code change pushed the build sizes up just a little, and they're now running up against the hard disk sizes on some of the machines.

@sebrowne
Contributor

sebrowne commented Feb 12, 2024

Yep, we are in the unfortunate position of "that build really does take over 200GB of disk space". Static linking is a bear, to be sure. See the attached disk-usage plot from replicating the PR build on one of the machines that has been running out of space. I will need to discuss the priority of a statically linked Trilinos (with all of its tests) with the Operational Leadership Team. I can try to place that build only on machines with larger disks for now, but no real promises that it will get better before we discuss it and decide on a longer-term path forward.

[attached image: disk usage over time during the replicated PR build]

For posterity, here is the breakdown by package. I only show the ones taking more than 5GB here (but that build was incomplete so there may be others as well).

6.8G    amesos2
9.2G    anasazi
28G     belos
21G     ifpack2
5.4G    intrepid
23G     intrepid2
9.8G    ml
15G     stratimikos
35G     teko
5.2G    teuchos
5.5G    thyra
56G     tpetra
6.8G    xpetra
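A breakdown like the one above can be generated with a du pass over the build tree's package directories. A minimal sketch; the helper name, $BUILD_DIR layout, and 5 GB cutoff are illustrative assumptions, not the command actually used:

```shell
# big_packages DIR THRESHOLD_KB
# List the immediate subdirectories of DIR whose on-disk size is at
# least THRESHOLD_KB, largest first. For a table like the one above,
# this would be run roughly as:
#   big_packages "$BUILD_DIR/packages" $((5 * 1024 * 1024))
big_packages() {
    du -sk "$1"/*/ 2>/dev/null | awk -v t="$2" '$1 >= t' | sort -rn
}
```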

@jhux2
Member Author

jhux2 commented Feb 12, 2024

(quoting the package-size breakdown above)

Is that all? ;p

@jhux2
Member Author

jhux2 commented Feb 12, 2024

Would it be ok to temporarily disable the static PR build?

@sebrowne
Contributor

Would it be ok to temporarily disable the static PR build?

I'd like to avoid that at all costs if we intend to keep caring about it. I'm going to try to assign those builds to a specific machine to keep them from running out of space temporarily.

@rppawlo
Contributor

rppawlo commented Feb 12, 2024

Is that big memory jump due to the kokkos update? If so I think they would care about a regression of that magnitude.

@sebrowne
Contributor

I can try to bisect it. My gut feeling is no, but it may have been the straw so to speak. I'll check out a version from a month ago and compare the on-disk size and report back tomorrow.

@ndellingwood
Contributor

Is that big memory jump due to the kokkos update? If so I think they would care about a regression of that magnitude.

@rppawlo the kokkos update was a patch release consisting primarily of bug fixes and patches matching PRs already made to Trilinos. The changes were pretty minimal compared to what was already in Trilinos with 4.2.00, so it wouldn't make sense to me for the patch release to trigger a large spike in memory usage.

@ndellingwood
Contributor

@sebrowne this was the PR with the kokkos patch release: #12707. The sha after merging that PR was 9f40ed4; the sha of the merge prior to that was 9b3eff1, in case these are helpful as comparison points.

@sebrowne
Contributor

Aaaaand one of them is running out again (will likely affect #12739)

@sebrowne
Contributor

New news: unless I somehow did something to cause it, this configuration (GCC + OpenMPI + OpenMP + static linking) takes 1.1TB to build with all packages and tests enabled. 180GB of it is in ROL examples. Continuing to poke around building different SHA1s (including those provided by @ndellingwood).

@ndellingwood
Contributor

@sebrowne I tried some smaller-scale testing, building just the Tpetra tests before and after the Kokkos patch release (i.e. 9b3eff1 / 9f40ed4) using the sems-compilers and cmake fragments from one of the failing gcc/8.3.0 + openmpi/1.10 openmp builds. The Kokkos release was definitely the change that introduced the build-size blowup.

Here is a comparison of the tpetra tests (du -hs Trilinos/packages/tpetra/core/tests):

Before: 244M
After:  2.5G

Ouch..

A few notes on what I tried and learned so far:

  • One change that went in with the Kokkos PR was to remove Kokkos' setting of Trilinos cxx flags, matching the changes prototyped and early-tested in Kokkos: Don't let Kokkos set CMAKE_CXX_FLAGS #12572. That PR also had some failures related to space issues that were assumed spurious, but now the correlation makes me suspect something related to those changes is the culprit.

  • I looked at verbose compile lines from a few selected files in the before/after builds. The primary difference was that after the Kokkos 4.2.1 release (Kokkos + KokkosKernels Promotion To 4.2.1 #12707), the debug-symbols flag -g appeared on compile lines where it wasn't present previously, which can definitely lead to larger build sizes (though I'm not claiming at this point that it accounts for the full blowup). I don't know where the -g flag comes from (this was a release build); I'm not sure whether it was set somewhere in Trilinos but was previously being clobbered by Kokkos' setting of CMake flags, or comes from elsewhere.

  • Additional flags appearing on some compilation lines (after the Kokkos PR Kokkos + KokkosKernels Promotion To 4.2.1 #12707 merged) were -pedantic -Wall -Wno-long-long -Wwrite-strings

  • The failing PR build is a release build, but I didn't see -g added to CMAKE_CXX_FLAGS in the cmake fragment. I tried to configure a clean build that forces removal of the -g flag, to see the difference in build size (for example, adding explicit cmake options -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_CXX_FLAGS="" -DCMAKE_C_FLAGS="" to my cmake line), but each time the -g flag is still present. I'm not sure where it is being appended to the flags; any suggestions for an easier/faster way to trace this?

  • I tried dropping this chunk of cmake code

    IF (KokkosEnable)
      MESSAGE("-- " "Skip adding flags for OpenMP because Kokkos flags does that ...")
      SET(OpenMP_CXX_FLAGS_OVERRIDE " ")
      # There is a top-level CMAKE_CXX_FLAGS. Warnings and other flags get added
      # in the sub-scope of each individual package. Grab those variables here
      # so we know what the original top-level flags were and what
      # CMAKE_CXX_FLAGS are added afterwards for an individual package.
      SET(TRILINOS_TOPLEVEL_CXX_FLAGS ${CMAKE_CXX_FLAGS})
      # NOTE: Setting TRILINOS_TOPLEVEL_CXX_FLAGS only has any impact if Kokkos
      # is being treated as an internal package.
    ENDIF()

    (which I think should be unnecessary after the changes matching Kokkos: Don't let Kokkos set CMAKE_CXX_FLAGS #12572?) and reconfigured as above to try to get rid of the stubborn -g flag, but no luck, and the test build size was the same

My next action item is to bisect through the kokkos 4.2.1 changes to see which ones contributed to the build-size blowup (I'm guessing the #12572-matching changes are the driver, but we'll see)

If the build-size blowup follows from removing Kokkos' setting of CMAKE_CXX_FLAGS, we'll need to keep investigating how to handle this. One important driver of that change was HIP builds on Cray hardware, e.g. #12697, which apps will need us to support. Hopefully it is as simple as compiling without debug symbols, but at the moment I'm uncertain how to drop the stubborn -g flag
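For the "where does -g come from" question, two generic CMake tricks can narrow it down; a sketch under my own assumptions (the helper name is made up, and I'm not claiming this is how the fix was ultimately found):

```shell
# find_flag CACHEFILE FLAG
# Print every *FLAGS* entry in a configured CMakeCache.txt whose value
# contains FLAG -- a quick check of whether the flag is baked into the
# cache at configure time.
find_flag() {
    grep 'FLAGS' "$1" | grep -- "$2"
}

# If the cache is clean, the flag is being injected during configure.
# Re-running configure with
#   cmake --trace-expand <src-dir> 2>&1 | grep 'CMAKE_CXX_FLAGS'
# logs every set() that touches the variable, including ones made inside
# individual packages (very verbose, but it pinpoints the file and line).
```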

@sebrowne
Contributor

I did some VERY brief additional triage for the -g flag, and it appears we enable Trilinos_ENABLE_DEBUG_SYMBOLS in all PR builds, which is (in my opinion) completely inappropriate. My guess is that the prior behavior in Kokkos actually fixed this issue (or at least hid it). I will triple-check tomorrow morning, and if turning it off shrinks the builds enough, I'll make the change immediately. Thank you so much for the triage help @ndellingwood!

@ndellingwood
Contributor

@sebrowne excellent, thanks for the triage and update!

I tested a build with the kokkos master branch (i.e. 4.2.01) and reverted the change that dropped Kokkos' setting of Trilinos CMAKE_CXX_FLAGS:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 4a4e7a5..d20b3c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -252,6 +252,7 @@ ENDIF()
 # subpackages
 
 ## This restores the old behavior of ProjectCompilerPostConfig.cmake
+# It sets the CMAKE_CXX_FLAGS globally to those used by Kokkos
 # We must do this before KOKKOS_PACKAGE_DECL
 IF (KOKKOS_HAS_TRILINOS)
   # Overwrite the old flags at the top-level
@@ -279,13 +280,21 @@ IF (KOKKOS_HAS_TRILINOS)
     SET(KOKKOSCORE_XCOMPILER_OPTIONS "${KOKKOSCORE_XCOMPILER_OPTIONS} -Xcompiler ${XCOMP_FLAG}")
     LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcompiler ${XCOMP_FLAG})
   ENDFOREACH()
+  SET(KOKKOSCORE_CXX_FLAGS "${KOKKOSCORE_COMPILE_OPTIONS} ${KOKKOSCORE_XCOMPILER_OPTIONS}")
   IF (KOKKOS_ENABLE_CUDA)
     STRING(REPLACE ";" " " KOKKOSCORE_CUDA_OPTIONS    "${KOKKOS_CUDA_OPTIONS}")
     FOREACH(CUDAFE_FLAG ${KOKKOS_CUDAFE_OPTIONS})
       SET(KOKKOSCORE_CUDAFE_OPTIONS "${KOKKOSCORE_CUDAFE_OPTIONS} -Xcudafe ${CUDAFE_FLAG}")
       LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcudafe ${CUDAFE_FLAG})
     ENDFOREACH()
+    SET(KOKKOSCORE_CXX_FLAGS "${KOKKOSCORE_CXX_FLAGS} ${KOKKOSCORE_CUDA_OPTIONS} ${KOKKOSCORE_CUDAFE_OPTIONS}")
   ENDIF()
+  # Both parent scope and this package
+  # In ProjectCompilerPostConfig.cmake, we capture the "global" flags Trilinos wants in
+  # TRILINOS_TOPLEVEL_CXX_FLAGS
+  SET(CMAKE_CXX_FLAGS "${TRILINOS_TOPLEVEL_CXX_FLAGS} ${KOKKOSCORE_CXX_FLAGS}" PARENT_SCOPE)
+  SET(CMAKE_CXX_FLAGS "${TRILINOS_TOPLEVEL_CXX_FLAGS} ${KOKKOSCORE_CXX_FLAGS}")
+  #CMAKE_CXX_FLAGS will get added to Kokkos and Kokkos dependencies automatically here
   #These flags get set up in KOKKOS_PACKAGE_DECL, which means they
   #must be configured before KOKKOS_PACKAGE_DECL
   SET(KOKKOS_ALL_COMPILE_OPTIONS

The tpetra/core/tests build size dropped from 2.5G back down to 224M, so some combination of that change and allowing the -g flag into compilation caused the build-size blowup.

@sebrowne Hopefully the change to Trilinos_ENABLE_DEBUG_SYMBOLS to discontinue use of -g in the OpenMP build is sufficient to get things back on track, fingers crossed!

@ndellingwood
Contributor

@sebrowne one more bit of good news: adding the cmake option you suggested, -DTrilinos_ENABLE_DEBUG_SYMBOLS=OFF, reduced the tpetra/core/tests build size back down to 224M (with no changes to Trilinos or Kokkos), so that change should get the OpenMP PR build back to normal.

@sebrowne
Contributor

I feel comfortable declaring that this is what was compounding to cause the size explosion. See #12742

@jhux2
Member Author

jhux2 commented Feb 14, 2024

@sebrowne and @ndellingwood Thank you both for tracking this down!

@sebrowne
Contributor

Confirmed fixed

@sebrowne sebrowne unpinned this issue Feb 15, 2024