Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for reproducibility, failed builds, and CI. More hrrr tests, shorter rap&hrrr forecast lengths. #1257

Merged

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Jun 7, 2022

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below. A new test, hrrr_control_debug, needs a baseline. Many hrrr and rap tests have shorter forecast lengths, and will need baselines as well. Any test using the GF scheme will need a new baseline. Other PRs merged into this one add new tests which will need baselines.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Description

This PR combines several bugfix PRs into one.

Original changes:

  1. Fixes a bug in the GF scheme that broke the rap_decomp regression test.
  2. Re-enables the rap_decomp regression test.
  3. Adds decomp, 2threads, restart, and debug variants of hrrr_control.
  4. Reduces the forecast length of rap and hrrr tests from 24 hours to 12, to compensate for new tests. Restarts happen at hour 6 instead of 12.

Merged from #1263:

Some fixes are made to dycore to resolve the issue with the regional decomposition reproducibility

Merged from #1253:

Fixes broken CI: 1) Update fms library in docker image; 2) Add cpld noaero nowave test back to rt_gnu.conf. This is to reduce the number of cores used by CI on 1-node AWS EC2 instance.

New baselines are needed for the 3 new tests added to rt.conf:
cpld_control_noaero_p8
cpld_control_nowave_noaero_p8
cpld_debug_noaero_p8

New baselines are needed for the 1 new test added to rt_gnu.conf:
cpld_control_nowave_noaero_p8

Merged from #1254:

A recent addition of "module" statements at the top of some files broke moninshoc. This adds a dependency so moninshoc will compile. There are three changes:

  1. moninshoc.meta has the necessary dependency
  2. rt.conf will compile the FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp suite to ensure moninshoc builds
  3. rt_gnu.conf will compile the FV3_GFS_cpld_rasmgshocnsstnoahmp_ugwp suite to ensure moninshoc builds

Issue(s) addressed

Fixes #1249 due to merged PR #1253

Fixes NCAR/ccpp-physics#935 due to merged PR #1254

Fixes #649 due to merged PR #1263

Fixes #953

This does not fix, but does relate to, #1222, which points out several other FV3_HRRR suite tests do not pass in restart, decomp, 2threads, or debug mode. Apparently the original hrrr_control does pass those tests.

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

Dependencies

NCAR/ccpp-physics#942
NOAA-EMC/fv3atm#553
NOAA-GFDL/GFDL_atmos_cubed_sphere#194

@SamuelTrahanNOAA
Copy link
Collaborator Author

Ideally, this should be combined with #953 which enables all rap_control tests.

@climbfuji
Copy link
Collaborator

Ideally, this should be combined with #953 which enables all rap_control tests.

@SamuelTrahanNOAA Feel free to pull my changes for ufs-weather-model into your PR, or we can do it the other way round.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Ideally, this should be combined with #953 which enables all rap_control tests.

@SamuelTrahanNOAA Feel free to pull my changes for ufs-weather-model into your PR, or we can do it the other way round.

I will merge it to my branch shortly.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@climbfuji Your FV3 is out of date. Do you want me to copy your submodule pull requests too?

@climbfuji
Copy link
Collaborator

climbfuji commented Jun 8, 2022 via email

@SamuelTrahanNOAA
Copy link
Collaborator Author

It won't be urgent until the day this is scheduled for commit. If you won't be available on demand that day, it would be best if I submit replacements for your PRs.

@climbfuji
Copy link
Collaborator

climbfuji commented Jun 8, 2022 via email

@MinsukJi-NOAA
Copy link
Contributor

MinsukJi-NOAA commented Jun 16, 2022

@SamuelTrahanNOAA hrrr_control_2threads failed twice on wcoss cray with the same error message:

+ aprun -j 1 -n 78 -N 12 -d 2 -cc depth ./fv3.exe
apsched: claim exceeds reservation's node-count

Is the run configuration (e.g. number of tasks, and etc.) correct?

@SamuelTrahanNOAA
Copy link
Collaborator Author

@SamuelTrahanNOAA hrrr_control_2threads failed twice on wcoss cray with the same error message:

+ aprun -j 1 -n 78 -N 12 -d 2 -cc depth ./fv3.exe
apsched: claim exceeds reservation's node-count

Is the run configuration (e.g. number of tasks, and etc.) correct?

All of the other 2threads tests are disabled on wcoss.cray, so I just updated rt.conf to disabled hrrr_control_2threads on wcoss.cray as well.

@BrianCurtis-NOAA
Copy link
Collaborator

Automated RT Failure Notification
Machine: jet
Compiler: intel
Job: BL
[BL] Repo location: /lfs4/HFIP/h-nems/emc.nemspara/autort/pr/959879766/20220616204517/ufs-weather-model
[BL] Error: Test control_p8_rrtmgp 022 failed in run_test failed
Please make changes and add the following label back: jet-intel-BL

@SamuelTrahanNOAA
Copy link
Collaborator Author

All of the Jet tests worked for me. There are frequent system issues on Jet, so you may have to submit a job or two in every workflow execution.

@jkbk2004
Copy link
Collaborator

All of the Jet tests worked for me. There are frequent system issues on Jet, so you may have to submit a job or two in every workflow execution.

sure!

@jkbk2004
Copy link
Collaborator

@SamuelTrahanNOAA All tests pass and we are good to start merging in dependencies: ccpp -> atmos_cubed_sphere -> fv3atm. Can you start with ccpp first?

@SamuelTrahanNOAA
Copy link
Collaborator Author

I have merged ccpp, but I lack the ability to merge atmos_cubed_sphere.

@jkbk2004
Copy link
Collaborator

I have merged ccpp, but I lack the ability to merge atmos_cubed_sphere.

I left a note to atmos_cubed_sphere.

@SamuelTrahanNOAA
Copy link
Collaborator Author

The fv3atm PR is ready for merge. A few minutes ago, someone merged atmos_cubed_sphere. I've updated the fv3atm branch to point to the top of atmos_cubed_sphere and ccpp-physics.

@jkbk2004
Copy link
Collaborator

@SamuelTrahanNOAA Everything ready upto fv3atm. We need hash update and revert pointer.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Done.

@jkbk2004
Copy link
Collaborator

@DusanJovic-NOAA The PR is ready for final review to approve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Baseline Updates Current baselines will be updated.
Projects
None yet
8 participants