Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Use templates for METplus config files #683

Merged

Conversation

gsketefian
Copy link
Collaborator

@gsketefian gsketefian commented Mar 20, 2023

DESCRIPTION OF CHANGES:

This PR converts the METplus configuration files into Jinja templates form which the config files needed by the vx tasks can be generated. These config files are placed in the experiment directory. This prevents accidental changes to the METplus config files when multiple experiments are running simultaneously. This PR also reduces the number of J-jobs and ex-scripts used by the vx tasks by combining ones that are similar. Details:

  1. Convert the METplus configuration files into Jinja templates. In the process, to avoid repetition, combine multiple files for APCP hours greater than 01 (03, 06, and 24) into a single file having the string _APCPgt01h in its name.
  2. Modify the ex-scripts so that they use the fill_jinja_template.py script to generate the METplus config files from the templates.
  3. Combine similar j-jobs and ex-scripts into ones that can be used by multiple tasks, e.g. combine the ex-script that calls METplus's GridStat and PointStat tools for deterministic verification into one ex-script with the string gridstat_or_pointstat in its name.
  4. Modify the ROCOTO template XML to use the new J-job and ex-script names and pass in several new variables to the vx tasks (OBTYPE, ACCUM_HH (if necessary), and MET_TOOL; also, remove FHR).
  5. Add a new WE2E test (MET_ensemble_verification_only_vx) that tests use of staged obs and forecast files to run ensemble verification.
  6. Fix/clean up running of vx tasks in NCO mode.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

Ran the fundamental tests on Hera with intel (see below); all passed.

MET_verification
MET_verification_only_vx
community_ensemble_2mems
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
MET_ensemble_verification
MET_ensemble_verification_only_vx
community_ensemble_2mems_stoch
pregen_grid_orog_sfc_climo

DOCUMENTATION:

Documentation is needed but will be added once the whole set of vx PRs has been merged (see Issue #630).

ISSUE:

Resolves #382

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

… for Hera and modify run_WE2E_tests.py so that this location is obtained in a platform-independent way.
…ation of staged forecast files; modify code to correctly set VX_FCST_INPUT_BASEDIR in WE2E experiments.
…o that it is exactly consistent with the one used in run_WE2E_tests.py.
…t each task has its own values for NNODES, PPN, MEM, WTIME, MAXTRIES.
…uired by NCO); also rename vx task top-level sections in config_defaults.yaml so that they're all lower case (to match names of ex-scripts).
…ks; Add ensemble member name (if running ensemble vx) to the names of the vx tasks in the ROCOTO xml.
…t convention; fix bugs in the two ex-scripts exregional_run_met_pointstat_ens[mean|prob].sh that call METplus twice.
…output. Details below.

In the ex-scripts for EnsembleStat: (1) generate a new variable FCST_INPUT_FN_TEMPLATE that contains a list of forecast input template files for each of the ensemble members and use this in the METplus configuration files, and (2) change the input base directory (INPUT_BASE) to use VX_FCST_INPUT_BASEDIR, which depending on whether forecasts are being run is set either to the location of the UPP output from the forecasts or to the staged forecast UPP output (this allows staged ensemble staged files to be used to perform ensemble vx).

In config_defaults.yaml, introduce new template variables (FCST_SUBDIR_TEMPLATE, FCST_FN_TEMPLATE, and FCST_FN_METPROC_TEMPLATE) that are/will be used in forming the variable FCST_INPUT_FN_TEMPLATE in the EnsembleStat ex-scripts.
…leStat for point-obs, ensure that METplus is called only once and for the correct variable (instead of for both SFC and UPA).
… This should later be replaced with a python version.
… This should later be replaced with a python version.
…able the ensemble member name (if doing an ensemble forecast). This makes it easier in METviewer to identify curves corresponding to different members.
…_LOCAL_MODULE_FN and VX_LOCAL_MODULE_FN; fix spacing.
@JeffBeck-NOAA
Copy link
Collaborator

@MichaelLueken, I think this PR is ready. The only problem is that @gsketefian is on leave this week and there is a minor conflict with the hera.yaml machine file (change in default paths). Are you comfortable with me resolving this conflict and then proceeding with the merge?

@MichaelLueken
Copy link
Collaborator

@JeffBeck-NOAA I certainly don't mind if you went ahead and corrected the conflict. The issue is that PRs that change the FV3LAM_wflow.xml file are on hold until @christinaholtNOAA's PR #676 has been merged.

@christinaholtNOAA Since both this PR and @chan-hoo's PR #682 have been approved and automated tests have been run and passed, can we move forward with these two PRs? Unfortunately, it would mean needing to apply these two updates to your PR before your work can be merged. Thank you very much for your time and consideration.

@chan-hoo
Copy link
Collaborator

@MichaelLueken, please merge my PR #682 before the PR #676 if it passed the tests :) I am going to create the production/aqmv7 branch for production of AQMv7 once my PR is merged. It (without the huge change) will make my life much easier :)

@MichaelLueken
Copy link
Collaborator

@JeffBeck-NOAA Speaking with @christinaholtNOAA, the final changes should be in place and we are hoping that the Jenkins tests will be submitted today on PR #676. So long as there are no failures with the Jenkins tests and manual testing on Jet, then her PR will be merged either tomorrow or first thing Friday. If there are issues, then I will be able to move forward with the changes in this PR. Thanks!

@JeffBeck-NOAA
Copy link
Collaborator

@MichaelLueken, thanks. I've resolved the conflicts for this PR and merged the commit.

mkavulich and others added 13 commits May 4, 2023 10:35
* [develop] Adds a YAML interface for creating a Rocoto XML. (ufs-community#676)

Refactors the creation of a Rocoto XML to use a very generic Jinja2 template that is flexible enough to meet the needs of various workflow configurations supported by SRW. Specifically, it allows for a completely arbitrary workflow to be created under SRW, which includes the addition of completely arbitrary tasks on top of the predefined ones.

---------

Co-authored-by: Michael Kavulich <[email protected]>

* [develop] Change the build log output file extension from log to txt (ufs-community#690)

When pipeline files are archived to s3 bucket, retrieving the file via a browser attempts to render/display files of known extensions. A browser doesn't generally understand what to do with a .log extension (e.g. build.log). For ease of use in the CI Dashboard, which is a static HTML page, the s3 archived build log needs a .txt extension (e.g. build.txt).

* Add "MET_TOOL" definitions to new XML definition YAMLs

* Fix incorrect YAML if block in config_defaults, remove non-needed "USCORE_ENSMEM_NAME_OR_NULL" variable

* - Convert new test "MET_ensemble_verification_only_vx" to new YAML format
 - Fix f-string for utils.py error message

* Fixing more failures (still more to go)

* More fixes, got stand-alone verification test to pass!

 - Fix copy-paste errors in parm/workflow yamls
 - Update corrected variables for new names in exscripts

* Improvement for monitor jobs script: if in debug mode, print the number of tasks that succeeded and failed for failed experiments

* Forgot to include VX_FCST_INPUT_DIR definition for MET_ensemble_verification_only_vx test

* Correct script for task_run_MET_EnsembleStat_vx_APCP

* Pull out CATE and ENSMEM_INDEX from default VX_FCST_INPUT_DIR. My naive attempt to simplify things was the root of all my problems!

* Everything working! Just need to solve problem of non-existent metatask dependencies!

* Fix last failing ensemble test, fundamental tests and all verification tests now pass on Hera!

---------

Co-authored-by: Christina Holt <[email protected]>
Co-authored-by: Bruce Kropp - Raytheon <[email protected]>
…to be launched only after the previous task has completed even when there is no dependency of the task on the previous one.
…eads in the netcdf versions of the NDAS obs.
…e dependency section of the PointStat tasks, empty out the dependencies on the post and get_obs tasks since those are not run for this test.
…place it only ub tasks that need it; define SLASH_ENSMEM_SUBDIR_OR_NULL locally in a loop when we loop over members in EnsembleStat vx task.
…PointStat task on get_obs_... and do_post tasks.
…umber from name of METplus config and task log files.
@gsketefian
Copy link
Collaborator Author

@MichaelLueken I updated this PR and reran (on Hera) the four verification tests:

MET_verification_only_vx
MET_ensemble_verification_only_vx
MET_verification
MET_ensemble_verification

In addition, I ran the fundamental tests, which currently are:

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

All tests were successful. I think this PR is ready for the Jenkins tests.

@gsketefian
Copy link
Collaborator Author

@MichaelLueken I see the error message "This commit cannot be buiild". I'd like the see the logs, but clicking on "Details" doesn't take me there (browser hangs). Are the logs not public?

@MichaelLueken
Copy link
Collaborator

@gsketefian Unfortunately, the Jenkins logs are no longer public due to security concerns. I'll look through the failures now, but the due to the upgrading of PrgEnv on Gaea this week, Gaea is having issues. I'll let you know if there are any failures that will need to be addressed before this work can be merged.

@MichaelLueken
Copy link
Collaborator

@gsketefian

On Cheyenne GNU, the nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km test failed intermittently in the post. This is an ongoing issue that has been seen in all PRs since the test was added to the coverage suite and might be due to issue #652. This will not keep your PR from being merged.

On Hera GNU, both the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200 and get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS tests failed. The get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS test has been failing since it was introduced into the coverage tests. A rerun of the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200 test showed that it passes. Can safely move forward.

On Hera Intel, the grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta test failed in the run_fcst task. This is known issue (#731).

Since the failures are expected and the rerun of the one unexpected failure successfully passed, I will now move forward with merging this work.

@MichaelLueken MichaelLueken merged commit 1249345 into ufs-community:develop May 12, 2023
@gsketefian
Copy link
Collaborator Author

@MichaelLueken Great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Separate out ensemble verification (gen-ens-prod and ensemble-stat)
7 participants