Framework: PR test machine problem? #12732
Comments
And #12731.
This is effectively the same issue as was first seen in #12707 (comment). I'm monitoring and can confirm that some of the machines are reliably running out of disk space, and it appears to be just from our builds/tests. Some of them have over 1TB of disk, so I think we have a problem, unless it REALLY does take that much space to build and test OpenMP + GCC + OpenMPI Trilinos.
Is there something in particular that uses a lot of space? Or is the increase across the board on all object files?
The builds that are particularly problematic are (now that I notice it) the statically-linked ones. So my guess is that my conclusion will be "It's just that expensive to link Trilinos statically". More specifically, I think it's that expensive to link the tests statically. I'm running a build to get a feel for how much space that build takes, and will report back. Most likely some code change pushed the build sizes up just a little bit, and it's now running up against our hard disk sizes on some of the machines.
Yep, we are in the unfortunate position of "that build really does take over 200GB of disk space". Static linking is a bear, to be sure. See attached disk usage plot of my replicating PR build on one of the machines that has been running out of space. Will need to discuss the priority of a statically-linked Trilinos (with all of its tests) with the Operational Leadership Team. I can try to get it to place that build on only machines with larger disks for now, but no real promises that it will get better prior to discussing it and deciding on a longer-term path forwards. For posterity, here is the breakdown by package. I only show the ones taking more than 5GB here (but that build was incomplete so there may be others as well).
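For anyone wanting to reproduce a per-package breakdown like this, here is a minimal sketch in Python. It assumes the build tree has a `packages/` subdirectory with one directory per package; that layout and the function names are illustrative assumptions, not the tooling used for the numbers in this thread. It sums apparent file sizes rather than allocated disk blocks, so the totals can differ slightly from what `du` reports.

```python
import os
from collections import defaultdict

GIB = 1024 ** 3


def usage_by_package(build_root, packages_dir="packages"):
    """Return {package_name: total_bytes} for each directory under
    build_root/packages_dir (hypothetical layout, adjust as needed)."""
    totals = defaultdict(int)
    base = os.path.join(build_root, packages_dir)
    for entry in os.scandir(base):
        if not entry.is_dir(follow_symlinks=False):
            continue
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # lstat so symlinks are counted as links, not targets
                    totals[entry.name] += os.lstat(path).st_size
                except OSError:
                    # file vanished mid-build; ignore it
                    pass
    return dict(totals)


def large_packages(totals, threshold_bytes=5 * GIB):
    """Packages strictly above the threshold, largest first."""
    big = [(name, size) for name, size in totals.items() if size > threshold_bytes]
    return sorted(big, key=lambda item: item[1], reverse=True)
```

Calling `large_packages(usage_by_package("/path/to/build"))` (the path is a placeholder) would list only the packages above 5 GiB, mirroring the cutoff used above.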
Is that all? ;p
Would it be ok to temporarily disable the static PR build?
I'd like to avoid that at all costs if we intend to keep caring about it. I'm going to try and assign those builds to a specific machine to keep them from running out of space temporarily.
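A complementary guard, independent of how builds are assigned to machines, is to check free space before a large build starts. The following is only a sketch of that idea using Python's standard `shutil.disk_usage`; the function name and the 250 GB figure in the usage note are assumptions for illustration, not part of the actual PR testing framework.

```python
import shutil

GB = 10 ** 9


def has_room_for_build(path, required_bytes):
    """Return True if the filesystem containing `path` has at least
    `required_bytes` of free space available."""
    return shutil.disk_usage(path).free >= required_bytes
```

For example, a job wrapper could call `has_room_for_build(workspace, 250 * GB)` and skip or reschedule the static build when it returns False, where `workspace` is whatever directory the build writes to.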
Is that big memory jump due to the Kokkos update? If so I think they would care about a regression of that magnitude.
I can try to bisect it. My gut feeling is no, but it may have been the last straw, so to speak. I'll check out a version from a month ago and compare the on-disk size and report back tomorrow.
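If the bisection were automated, the size comparison could be turned into a predicate for `git bisect run`, which treats exit code 0 as "good" and 1 as "bad". The sketch below covers only the size check; a real run would also have to configure and rebuild at each commit before measuring, and the script name, arguments, and size limit are all hypothetical.

```python
import os
import sys


def tree_size(root):
    """Total apparent size in bytes of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # file disappeared while walking; ignore it
                pass
    return total


def bisect_exit_code(size_bytes, limit_bytes):
    """Exit code for `git bisect run`: 0 (good) if within the limit, 1 (bad) otherwise."""
    return 0 if size_bytes <= limit_bytes else 1


# Hypothetical invocation: python size_check.py <build_dir> <limit_bytes>
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(bisect_exit_code(tree_size(sys.argv[1]), int(sys.argv[2])))
```

Picking the limit somewhere between the known-good and known-bad build sizes keeps the predicate from being sensitive to small fluctuations.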
@rppawlo the Kokkos update was a patch release consisting primarily of bug fixes and patches matching PRs made to Trilinos. The changes were pretty minimal compared to what was already in Trilinos with 4.2.00, so it wouldn't make sense to me that the patch release would trigger a large spike in memory usage.
Aaaaand one of them is running out again (will likely affect #12739)
New news: unless I did something to cause it somehow, this configuration (GCC + OpenMPI + OpenMP + static linking) takes 1.1TB to build with all packages and tests enabled. 180GB of it is in ROL examples. Continuing to poke around building different SHA1s (including those provided by @ndellingwood).
@sebrowne I tried some smaller-scale testing, building just the Tpetra tests before and after the Kokkos patch release (i.e. 9b3eff1 / 9f40ed4) using the sems-compilers and cmake fragments from one of the failing gcc/8.3.0+openmpi/1.10 openmp builds. The Kokkos release was definitely the change that introduced the build size blow-up. Here is a comparison of the tpetra tests:
Ouch.. A few notes on what I tried and learned so far:
My next action item is to bisect through the Kokkos 4.2.01 changes to see which ones contributed to the build size blow-up (I'm guessing the changes matching #12572 are the driver, but we'll see). If the blow-up follows from removing Kokkos' setting of CMAKE_CXX_FLAGS, we'll need to continue investigating to figure out how to handle this. One important driver of that change was HIP builds on Cray hardware, e.g. #12697, which apps will need us to support. Hopefully it is as simple as compiling without debug symbols, but at the moment I'm uncertain how to drop the stubborn
I did some VERY brief additional triage for the |
@sebrowne excellent, thanks for the triage and update! I tested a build with the kokkos master branch (i.e. 4.2.01) and reverted the change that dropped Kokkos' setting of Trilinos CMAKE_CXX_FLAGS:
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 4a4e7a5..d20b3c8 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -252,6 +252,7 @@ ENDIF()
# subpackages
## This restores the old behavior of ProjectCompilerPostConfig.cmake
+# It sets the CMAKE_CXX_FLAGS globally to those used by Kokkos
# We must do this before KOKKOS_PACKAGE_DECL
IF (KOKKOS_HAS_TRILINOS)
# Overwrite the old flags at the top-level
@@ -279,13 +280,21 @@ IF (KOKKOS_HAS_TRILINOS)
SET(KOKKOSCORE_XCOMPILER_OPTIONS "${KOKKOSCORE_XCOMPILER_OPTIONS} -Xcompiler ${XCOMP_FLAG}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcompiler ${XCOMP_FLAG})
ENDFOREACH()
+ SET(KOKKOSCORE_CXX_FLAGS "${KOKKOSCORE_COMPILE_OPTIONS} ${KOKKOSCORE_XCOMPILER_OPTIONS}")
IF (KOKKOS_ENABLE_CUDA)
STRING(REPLACE ";" " " KOKKOSCORE_CUDA_OPTIONS "${KOKKOS_CUDA_OPTIONS}")
FOREACH(CUDAFE_FLAG ${KOKKOS_CUDAFE_OPTIONS})
SET(KOKKOSCORE_CUDAFE_OPTIONS "${KOKKOSCORE_CUDAFE_OPTIONS} -Xcudafe ${CUDAFE_FLAG}")
LIST(APPEND KOKKOS_ALL_COMPILE_OPTIONS -Xcudafe ${CUDAFE_FLAG})
ENDFOREACH()
+ SET(KOKKOSCORE_CXX_FLAGS "${KOKKOSCORE_CXX_FLAGS} ${KOKKOSCORE_CUDA_OPTIONS} ${KOKKOSCORE_CUDAFE_OPTIONS}")
ENDIF()
+ # Both parent scope and this package
+ # In ProjectCompilerPostConfig.cmake, we capture the "global" flags Trilinos wants in
+ # TRILINOS_TOPLEVEL_CXX_FLAGS
+ SET(CMAKE_CXX_FLAGS "${TRILINOS_TOPLEVEL_CXX_FLAGS} ${KOKKOSCORE_CXX_FLAGS}" PARENT_SCOPE)
+ SET(CMAKE_CXX_FLAGS "${TRILINOS_TOPLEVEL_CXX_FLAGS} ${KOKKOSCORE_CXX_FLAGS}")
+ #CMAKE_CXX_FLAGS will get added to Kokkos and Kokkos dependencies automatically here
#These flags get set up in KOKKOS_PACKAGE_DECL, which means they
#must be configured before KOKKOS_PACKAGE_DECL
SET(KOKKOS_ALL_COMPILE_OPTIONS
The tpetra/core/tests build size dropped from 2.5G back down to 224M, so some combo of that change along with allowing the
@sebrowne Hopefully the change to
@sebrowne one more bit of good news, adding the cmake option you suggested |
I feel comfortable declaring that this is what was compounding to cause the size explosion. See #12742
@sebrowne and @ndellingwood Thank you both for tracking this down!
Confirmed fixed.
Bug Report
@trilinos/framework @sebrowne
PR testing for #12726 and #12722 is failing on ascic164 with similar errors but in unrelated packages. Memory or disk space issue?