
Upgrade the GSI to Spack-Stack version 1.6.0 #684

Merged: 20 commits into develop, Feb 14, 2024

Conversation

@DavidHuber-NOAA (Collaborator) commented Jan 18, 2024:

DUE DATE for merger of this PR into develop is 2/29/2024 (six weeks after PR creation).

Description
This upgrades spack-stack to version 1.6.0, which also upgrades:

netCDF-Fortran 4.6.0 -> 4.6.1
sp 2.3.3 -> 2.5.0
CRTM 2.4.0 -> 2.4.0.1
prod_util 1.2.2 -> 2.1.1

Depends on #672
Resolves #674

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?
Built on Hera and Orion. Regression tests performed on Hera. Four tests currently crash (those running CRTM) due to an issue with the way HIRS is versioned in the CRTM. This will be retested once #672 is merged.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@DavidHuber-NOAA (Collaborator Author):

Reran regression tests on Hera. All tests pass except the following:

  • global_4denvar_loproc_updat and global_4denvar_loproc_contrl produced step sizes that differ at the 6th iteration, starting around the 16th decimal place (~10^-16).
    • Additionally, the log shows ~60,000 Remove_AntCorr errors (Remove_AntCorr(FAILURE) : Input iFOV inconsistent with AC data), which also appear to come from the CRTM.
    • @RussTreadon-NOAA @ADCollard do you know the cause of these errors? Are the differences cause for concern? All J terms were identical to the precision available in the logs.
  • rrfs_3denvar_glbens_hiproc_updat exceeded its time threshold of 91.64 s, finishing in 92.87 s
  • hafs_4denvar_glbens_loproc_updat exceeded its time threshold of 353.55 s, finishing in 358.36 s

@DavidHuber-NOAA (Collaborator Author):

@ADCollard @RussTreadon-NOAA I am seeing copious warning messages from the CRTM v2.4.0.1 (~60,000 lines of Remove_AntCorr(FAILURE) : Input iFOV inconsistent with AC data). The CRTM output seems to be nearly identical for the global_4denvar test case (see above for details), so it doesn't seem to be a significant problem numerically. Do you know what may be causing this or who I can ask about it? And should this be resolved before attempting to upgrade to 2.4.0.1?
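For reference, a quick way to quantify these warnings in a run log; the log filename below is illustrative:

# Count the antenna-correction failures reported by the CRTM in a GSI run log.
# "stdout.global_4denvar_loproc_updat" is an illustrative filename.
grep -c 'Remove_AntCorr(FAILURE)' stdout.global_4denvar_loproc_updat
# Show the distinct failure messages and how often each occurs.
grep 'Remove_AntCorr(FAILURE)' stdout.global_4denvar_loproc_updat | sort | uniq -c | sort -rn | head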

@ADCollard (Contributor):

@DavidHuber-NOAA Sorry, I just saw this. I will check out your branch and run it myself.

@RussTreadon-NOAA (Contributor):

I ran global_4denvar on Hera and WCOSS2 (Cactus). The 60K+ lines of Remove_AntCorr(FAILURE) appear, as @DavidHuber-NOAA reports, only on Hera. No such output is generated from the Cactus run.

Comparison of the amsua coefficients in the runtime crtm_coeffs directories on Hera and Cactus shows differences for amsua_metop-c.SpcCoeff.bin.

Hera

-rw-r--r-- 1 role.epic nems  1252 Jan  4 21:17 amsua_aqua.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_aqua.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_metop-a.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_metop-a.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_metop-a_v2.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_metop-b.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97532 Jan  4 21:17 amsua_metop-b.TauCoeff.bin
-rw-r--r-- 1 role.epic nems  1252 Jan  4 21:17 amsua_metop-c.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_metop-c.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_n15.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_n15.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_n18.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_n18.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 12196 Jan  4 21:17 amsua_n19.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 amsua_n19.TauCoeff.bin

Cactus

-rw-r--r-- 1 ops.prod prod  1252 Nov 13 15:03 amsua_aqua.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_aqua.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_metop-a.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_metop-a.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_metop-a_v2.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_metop-b.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97532 Nov 13 15:03 amsua_metop-b.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_metop-c.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_metop-c.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_n15.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_n15.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_n18.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_n18.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 amsua_n19.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 amsua_n19.TauCoeff.bin

The Hera amsua_metop-c.SpcCoeff.bin is too small. The Hera spack-stack-1.6.0 crtm/2.4.0.1 fix file directory is missing the amsua_metop-c.SpcCoeff.bin with the antenna correction.

The crtm/2.4.0.1 fix directory on Cactus has five amsua_metop-c files:

-rw-r--r-- 1 ops.prod prod 12196 Nov 13 15:03 /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0.1/fix/amsua_metop-c.SpcCoeff.bin
-rw-r--r-- 1 ops.prod prod  1252 Nov 13 15:03 /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0.1/fix/amsua_metop-c.SpcCoeff.noACC.bin
-rw-r--r-- 1 ops.prod prod 97092 Nov 13 15:03 /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0.1/fix/amsua_metop-c.TauCoeff.bin
-rw-r--r-- 1 ops.prod prod 20109 Nov 13 15:03 /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0.1/fix/amsua_metop-c_v2.ACCoeff.nc
-rw-r--r-- 1 ops.prod prod  4576 Nov 13 15:03 /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0.1/fix/amsua_metop-c_v2.SpcCoeff.nc

whereas on Hera there are only four:

-rw-r--r-- 1 role.epic nems  1252 Jan  4 21:17 /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0.1_emc-sl2g2hw/fix/amsua_metop-c.SpcCoeff.bin
-rw-r--r-- 1 role.epic nems 97092 Jan  4 21:17 /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0.1_emc-sl2g2hw/fix/amsua_metop-c.TauCoeff.bin
-rw-r--r-- 1 role.epic nems 20109 Jan  4 21:17 /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0.1_emc-sl2g2hw/fix/amsua_metop-c_v2.ACCoeff.nc
-rw-r--r-- 1 role.epic nems  4576 Jan  4 21:17 /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/intel/2021.5.0/crtm-fix-2.4.0.1_emc-sl2g2hw/fix/amsua_metop-c_v2.SpcCoeff.nc

The Cactus amsua_metop-c.SpcCoeff.noACC.bin is the same size as the Hera amsua_metop-c.SpcCoeff.bin.
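For reference, a quick shell check that flags coefficient files that look too small to contain the antenna-correction data; the fix-directory path is a placeholder and the ~12 KB threshold is a heuristic taken from the sizes listed above:

# Flag AMSU-A SpcCoeff.bin files that look too small to contain antenna-correction data.
# CRTM_FIX is a placeholder for the local crtm-fix install; the ~12 KB threshold is a
# heuristic based on the file sizes listed above.
CRTM_FIX=${CRTM_FIX:-/path/to/crtm-fix-2.4.0.1/fix}
for f in "$CRTM_FIX"/amsua_*.SpcCoeff.bin; do
  size=$(wc -c < "$f")
  if [ "$size" -lt 12000 ]; then
    echo "WARNING: $f is only $size bytes; antenna-correction data likely missing"
  fi
done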

@DavidHuber-NOAA (Collaborator Author):

@RussTreadon-NOAA Thank you for the detailed investigation. I will report this to the spack-stack team.

@ADCollard (Contributor):

@DavidHuber-NOAA @RussTreadon-NOAA I also ran the same test and noted the difference between Cactus and Hera. I also checked the coefficient files in the GitHub-tagged and tarred releases of the CRTM fix files.

These are:

https://github.com/JCSDA/crtm (tag v2.4.0_emc.3)

and

https://bin.ssec.wisc.edu/pub/s4/CRTM/fix_REL-2.4.0_emc_07112023.tgz

In both cases the metop-c coefficient files are small (lacking the antenna correction information). So the WCOSS2 coefficient files appear inconsistent with the official release, but are in fact the correct ones to use.

@Hang-Lei-NOAA worked with NCO to install the coefficient files on the WCOSS machines, so maybe he can explain the difference.

My guess is that, since this was supposed to be only an incremental change (only six files needed to be added), only those files were updated on WCOSS2 (which is definitely the safest way to manage this change).

@jswhit (Contributor) commented Jan 23, 2024:

Can this PR support Gaea C5?

@DavidHuber-NOAA (Collaborator Author):

@jswhit First, we will need to resolve this CRTM fix file issue. If that can be resolved, then I'm fine with updating the Gaea module to point to the C5/F5 stack -- I think it was just installed on F5 yesterday. Would you be willing to run the regression tests there?

@ulmononian:

Just wanted to leave a note here that several versions of spack-stack have been installed in a new location to accommodate the decommissioning of C4 and the transition to the F5 filesystem. The available versions and their locations are as follows:

/ncrc/proj/epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.5.0/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
/ncrc/proj/epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core

There is a PR in the ufs-wm to update Gaea to use /ncrc/proj/epic/spack-stack/spack-stack-1.5.1, but testing is not yet completed for this (ufs-community/ufs-weather-model#2115). For more information on the F2/F5 transition on the ufs-wm side, please see ufs-community/ufs-weather-model#2101.
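For reference, a sketch of how the 1.6.0 installation listed above could be picked up on Gaea C5; the metamodule names are assumptions and should be checked against module avail:

# Point the module system at the spack-stack 1.6.0 unified environment listed above.
module use /ncrc/proj/epic/spack-stack/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core
# Metamodule names/versions are assumptions; confirm with `module avail stack-`.
module load stack-intel
module load stack-cray-mpich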

@DavidHuber-NOAA (Collaborator Author):

The CRTM fix files were corrected on Orion and regression tests were run. All tests passed except RTMA, which encountered a non-fatal time-exceedance failure:

The runtime for rtma_loproc_updat is 240.516202 seconds.  This has exceeded maximum allowable threshold time of 237.953284 seconds,
resulting in Failure time-thresh of the regression test.

When the fix files are corrected on Hera, Hercules, and Jet, I will run RTs on each machine and then mark this ready for review. I will also update the Gaea modulefiles to point to those suggested by @ulmononian so they can be tested by @jswhit.

@RussTreadon-NOAA (Contributor):

@DavidHuber-NOAA: I cloned DavidHuber-NOAA:feature/ss_160 on Hera. It's three commits behind the head of NOAA-EMC/GSI:develop. A manual merge of the upstream develop into the cloned ss_160 did not encounter any merge conflicts. I did this in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr684.

@DavidHuber-NOAA (Collaborator Author):

@RussTreadon-NOAA Apologies, I had pulled develop in but not pushed it back to GitHub. It has been pushed now. Thanks.
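For reference, the usual sequence for bringing the feature branch current; the remote names (origin for the fork, upstream for NOAA-EMC/GSI) are assumptions:

# Merge the latest NOAA-EMC/GSI develop into the feature branch and publish it.
git checkout feature/ss_160
git fetch upstream
git merge upstream/develop
git push origin feature/ss_160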

@DavidHuber-NOAA (Collaborator Author):

@jswhit I have taken a stab at updating the module file and the regression-test variables and parameters for Gaea-C5. I verified that the GSI builds. The CRTM fix files have also been corrected on Gaea. Would you mind running the RTs to verify everything is OK?

@RussTreadon-NOAA (Contributor):

GSI PR #691, which entered develop this morning (1/31/2024), has not yet been merged into DavidHuber-NOAA:feature/ss_160.

@DavidHuber-NOAA (Collaborator Author):

@RussTreadon-NOAA @edwardhartnett @TingLei-NOAA @climbfuji I now know what is causing Hercules to fail with netCDF errors. The errors only appear when the flag I_MPI_EXTRA_FILESYSTEM=1 is set. Spack-stack version 1.6.0 sets this flag in /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/gsi-addon-env/install/modulefiles/Core/stack-intel/2021.9.0.lua, which is why these failures started to appear when switching from v1.5.1 to v1.6.0. They continued to appear after switching back to v1.5.1 because I had added this flag to sub_hercules. Removing the flag from sub_hercules allows the tests to run to completion.

That said, I believe this still indicates an issue with the way the regional GSI implements its parallel I/O, as this flag allows Intel MPI to distribute I/O across a parallel filesystem. If it is disabled, then native support for parallel I/O is disabled.

My next question is: should I upgrade to v1.6.0 and add unset I_MPI_EXTRA_FILESYSTEM to sub_hercules? This will (continue to) hide the problem, and regional workflows using spack-stack v1.6.0 will also need to unset this variable when running the GSI.
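For reference, a minimal sketch of the sub_hercules change being discussed; exactly where it belongs in the script is illustrative:

# Override the setting inherited from the spack-stack 1.6.0 stack-intel module
# before launching gsi.x.
unset I_MPI_EXTRA_FILESYSTEM
# An equivalent explicit disable:
# export I_MPI_EXTRA_FILESYSTEM=0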

@RussTreadon-NOAA (Contributor):

Excellent detective work @DavidHuber-NOAA!

I agree that we need to re-examine and likely refactor the netCDF parallel I/O. You opened GSI issue #694 for this purpose; let's capture your findings there. @TingLei-NOAA, issue #694 has been assigned to you. I also added @ShunLiu-NOAA and @hu5970 to the issue. We need to resolve this problem.

Since ctests pass on Hercules from a spack-stack/1.6.0 build with I_MPI_EXTRA_FILESYSTEM unset, I recommend adding this to sub_hercules. As @DavidHuber-NOAA notes, this hides the problem, and hidden problems are easily forgotten or ignored. That noted, my preference is to see the authoritative GSI move to spack-stack/1.6.0 on Hercules now instead of holding at spack-stack/1.5.1 until regional parallel netCDF I/O is refactored and/or we jump to a later spack-stack release.

@TingLei-NOAA (Contributor):

@DavidHuber-NOAA That is great! Yes, "export I_MPI_EXTRA_FILESYSTEM=0" makes my previously failing test succeed now. I have two questions: first, will this make the previously failing HDF5 tests pass now? Second, is this setting needed for the same packages on other machines?

@DavidHuber-NOAA (Collaborator Author):

@TingLei-NOAA I'll respond to this in #694.

@DavidHuber-NOAA (Collaborator Author):

Tests were successful on Hercules with the exception of hafs_3denvar_glbens, which failed to produce identical fv3_dynvars files between the hafs_3denvar_glbens_loproc_updat and hafs_3denvar_glbens_hiproc_updat test cases. I repeated this test on the current develop branch on Hercules and found the same result, so this is not a problem with spack-stack v1.6.0 but with Hercules/HAFS DA. I will create a separate issue to track it, and I think this PR is ready to go.
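For reference, a quick way to confirm whether the two runs produced identical output; the directory names below are illustrative of the regression-test work area:

# Compare the fv3_dynvars files written by the low- and high-task-count runs.
# Directory names are illustrative.
cmp hafs_3denvar_glbens_loproc_updat/fv3_dynvars \
    hafs_3denvar_glbens_hiproc_updat/fv3_dynvars \
  && echo "fv3_dynvars identical" || echo "fv3_dynvars differ"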

@RussTreadon-NOAA (Contributor):

@DavidHuber-NOAA , I found similar behavior on Hercules while running ctests for PR #695. PR #695 builds gsi.x and enkf.x using spack-stack/1.5.1. Perhaps resolution of GSI issue #694 will remove this behavior.

@RussTreadon-NOAA (Contributor) left a review:

Changes look good to me. Thank you @DavidHuber-NOAA for persistently working through various problems.

@RussTreadon-NOAA (Contributor):

@DavidHuber-NOAA, we need two peer reviews for this PR. I signed off as the Handling Reviewer.

@RussTreadon-NOAA (Contributor):

@TingLei-NOAA, since you have already commented on this PR, can you serve as a peer reviewer?

@RussTreadon-NOAA (Contributor):

@DavidHuber-NOAA, since this is a spack-stack PR, can we get someone from the g-w or spack-stack teams to serve as a peer reviewer?

@AlexanderRichert-NOAA:

@RussTreadon-NOAA this looks good to me from the spack-stack perspective. If it's been tested on the systems in question then those configs look good.

@RussTreadon-NOAA (Contributor):

Thank you @AlexanderRichert-NOAA for your comment. I added you as a reviewer so we can formally capture your input. Your review will help move this PR forward.

@AlexanderRichert-NOAA commented on a modulefile diff:

 local stack_gnu_ver=os.getenv("stack_gnu_ver") or "9.2.0"
 local stack_openmpi_ver=os.getenv("stack_openmpi_ver") or "4.1.5"
 local cmake_ver=os.getenv("cmake_ver") or "3.23.1"
-local prod_util_ver=os.getenv("prod_util_ver") or "1.2.2"
+local prod_util_ver=os.getenv("prod_util_ver") or "2.1.1"
 local openblas_ver=os.getenv("openblas_ver") or "0.3.19"

I believe the openblas version should be 0.3.24; @DavidHuber-NOAA, can you confirm?

@DavidHuber-NOAA (Collaborator Author):

@AlexanderRichert-NOAA You are correct, it should be 0.3.24. I updated it and ran a test build. Thanks!

@TingLei-NOAA (Contributor) commented Feb 14, 2024 via email.

@RussTreadon-NOAA (Contributor) commented Feb 14, 2024:

@TingLei-NOAA, like you, I am not a spack-stack expert. We are both GSI, and now JEDI, developers. Can you review this PR as someone knowledgeable about the GSI? @AlexanderRichert-NOAA is reviewing it as someone knowledgeable about spack-stack.

@TingLei-NOAA (Contributor) commented Feb 14, 2024 via email.

@AlexanderRichert-NOAA left a review:

spack-stack paths and versions look good.

@TingLei-NOAA (Contributor) left a review:

Thanks to @DavidHuber-NOAA for revealing the issues related to fv3reg GSI I/O and, with help from other colleagues, demonstrating that they are not caused by the current PR. Dave's work in this PR also helped a lot in identifying the culprit in the parallel I/O issue on Hercules by showing the impact of I_MPI_EXTRA_FILESYSTEM. As a GSI developer, albeit not a spack-stack expert, I find this PR to be good and approve it.

@RussTreadon-NOAA (Contributor) left a review:

Approve.

@RussTreadon-NOAA merged commit 74ac594 into NOAA-EMC:develop on Feb 14, 2024. 4 checks passed.