
[develop] Fixes for Issue #608, #610, and #616 #609

Merged

Conversation

gsketefian
Collaborator

@gsketefian commented Feb 11, 2023

DESCRIPTION OF CHANGES:

  1. Fix the bad WE2E test configuration file for MET_verification_only_vx (Issue [develop] Fix WE2E yaml configuration file for MET_verification_only_vx #608).
  2. Make creation of symlinks to pregenerated files depend on whether downstream tasks need those symlinks (Issue [develop] Create symlinks to pregenerated grid/orog/sfc_climo files only if downstream tasks need them #610); see the sketch after this list.
  3. Set default value of FIXdir to HOMEdir/fix only when RUN_ENVIR="nco", not when RUN_TASK_MAKE_GRID=False; otherwise, set FIXdir to EXPTDIR (Issue [develop] Decide on location of symlinks to fix files depending on RUN_ENVIR, not RUN_TASK_MAKE_GRID #616).
  4. Add a flag to the script get_expts_status.sh so that if an experiment hasn't been launched yet, it calls the launch script launch_FV3LAM_wflow.sh to launch it instead of only outputting a message that it's not yet launched.

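The gating pattern referenced in item 2 can be sketched roughly as follows. This is a hypothetical illustration only, not the actual SRW code: aside from RUN_TASK_MAKE_GRID and FIXdir, the variable and flag names below are assumptions.

# Hypothetical sketch of the item-2 gating pattern; only RUN_TASK_MAKE_GRID and
# FIXdir appear in the document, the other names are placeholders.
if [ "${RUN_TASK_MAKE_GRID}" = "FALSE" ]; then
  # Link pregenerated grid files only if a downstream task will actually read them.
  if [ "${RUN_TASK_MAKE_OROG}" = "TRUE" ] || \
     [ "${RUN_TASK_MAKE_SFC_CLIMO}" = "TRUE" ] || \
     [ "${RUN_TASK_MAKE_ICS}" = "TRUE" ]; then
    ln -fsn "${PREGEN_GRID_DIR}"/* "${FIXdir}/"
  fi
fi
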
Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

The following fundamental tests were conducted on Hera with Intel and passed:

  • MET_verification
  • MET_verification_only_vx
  • community_ensemble_2mems
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • MET_ensemble_verification
  • community_ensemble_2mems_stoch
  • pregen_grid_orog_sfc_climo

In addition, other tests were conducted that run some pre-processing tasks but not others (e.g. running make_orog but not make_grid or make_sfc_climo), and they also passed. Those are not included as new WE2E tests in order to keep the number of tests low.

DOCUMENTATION:

No changes to documentation are necessary as far as I can tell.

ISSUE:

This resolves Issues #608, #610, and #616.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published
    NA

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

… for Hera and modify run_WE2E_tests.py so that this location is obtained in a platform-independent way.
…owing up in SRW clone); Make creation of symlinks to pregenerated files depend on whether downstream tasks need those symlinks.
…t been launched yet, use the launch script to launch it.
@@ -427,7 +427,8 @@ The last ${num_log_lines} lines of the workflow launch log file
 print_info_msg "$msg" >> "${expts_status_fp}"
 tail -n ${num_log_lines} ${launch_wflow_log_fn} >> "${expts_status_fp}"
 else
-wflow_status="Workflow status: NOT LAUNCHED YET"
+wflow_status="Workflow status: NOT LAUNCHED YET
+Launching workflow using script \"${launch_wflow_fn}\"..."
Collaborator

I don't think this is a good idea, especially for the Jenkins tests. It used to work this way (i.e., launch the job after checking status), but I had issues with the Jenkins tests, which call this script every minute to check the status of experiments, so I changed it. Moreover, I believe the behavior of this script should be consistent with its name: it should simply check the status of experiments without launching them. That way it will also respect the launch time assigned in each cron job.

Collaborator Author

@danielabdi-noaa The original purpose of this script was to get the updated status of each experiment in a given directory from the command line without having to go into each directory to call launch_FV3LAM_wflow.sh. I did not realize it had been modified and was being used in the Jenkins tests, since I've been working on a different branch the past few months. I'm finding that I can no longer use it for its original purpose, since the new run_WE2E_tests.py script does not call launch_FV3LAM_wflow.sh to launch the experiments (so there is no log.launch_FV3LAM_wflow log file even if the experiments are running). I suppose it will once the crontab capability is added back in, but it would still be nice to be able to relaunch a bunch of jobs manually to get their updated workflow statuses. I'm thinking an easy solution is to add an argument that determines this behavior, say launch_wflows=[true|false], with a default value of false so it normally behaves as you expect. Does that work for you?

Collaborator

@gsketefian I disagree on this one. I think it is best to keep modularity and have one script/function that just checks experiment status and another that launches experiments, etc. Having been burned by the surprise existence of "launch_wflow" in this script, which was launching the jobs on Cheyenne after a minute, I am hesitant to bring back that functionality, especially without changing the name of the script. Also, the way it is now, you have no control over when to launch the job (it is launched immediately if it doesn't exist), so I doubt run_WE2E_tests.py would use it once it matures. It will probably want to control when to launch jobs and how frequently to check status, using either cron or something else.

Collaborator Author

@danielabdi-noaa OK, we can rename it to something like launch_expts_get_status.sh and add that argument (this time a default of "true" makes more sense). I can also add another argument for a wait period before launching tasks. run_WE2E_tests.py doesn't have to use it. Will that work?

Contributor

@gsketefian I still don't think it is a good idea, but I will defer to other reviewers. Although the naming change helps, passing flags to make one script do everything is not modular programming. It would be odd if I passed a flag to launch_workflow.sh to make it also check the status of an experiment, wouldn't you agree? In an ideal scenario, there should be one script/function that launches a single workflow, another that just checks the status of that single experiment, another that iterates over multiple workflows and launches them, and another that checks the status of multiple workflows. Especially since launching workflows involves more logic (launch time interval, whether cron is used, etc.), I believe it should be its own function.

Collaborator

@danielabdi-noaa, @gsketefian, is there a partial solution to this problem that could be introduced in this PR and then fully implemented through a subsequent PR? In other words, @gsketefian, is it possible to make a few changes to get your feature working again, but also begin what @danielabdi-noaa has requested? @danielabdi-noaa, would you be OK with an initial change here, and then a future PR to fully introduce a set of scripts to completely modularize the behavior being discussed?

Collaborator Author

@JeffBeck-NOAA I was waiting to hear back on slack from @mkavulich on whether he has plans to modify this script as part of his work on the WE2E tests, in which case it would not really make sense for me to do a whole rewrite and modularize it. But I think Mike is out of the office. It would certainly be ok with me to make the changes I suggested above and make an issue to do the bigger changes (modularizing it) at a later time.

Collaborator Author

@danielabdi-noaa I added the flag launch_wflows as mentioned above. Adding a flag for a wait time before launching is more difficult; that is really determined by the CRON_RELAUNCH_INTVL_MNTS variable in the workflow and/or an argument to run_WE2E_tests.sh (and soon its Python version), and trying to work it into this script would take more effort than is probably justified, as @mkavulich pointed out. Please let me know if this is a satisfactory solution for the time being. @mkavulich, any thoughts?
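For reference, a minimal sketch of how the new launch_wflows flag gates the behavior (illustrative only, not the actual get_expts_status.sh code; the script and log file names follow the discussion above, but the surrounding logic is simplified):

# Sketch only: launch an experiment from get_expts_status.sh when it has not
# been launched yet and the (new) launch_wflows flag is set to true.
launch_wflows=${launch_wflows:-"false"}   # default preserves status-check-only behavior
launch_wflow_fn="launch_FV3LAM_wflow.sh"
launch_wflow_log_fn="log.launch_FV3LAM_wflow"

if [ -f "${launch_wflow_log_fn}" ]; then
  wflow_status="Workflow status: see ${launch_wflow_log_fn}"
elif [ "${launch_wflows}" = "true" ]; then
  wflow_status="Workflow status: NOT LAUNCHED YET
Launching workflow using script \"${launch_wflow_fn}\"..."
  ./"${launch_wflow_fn}"
else
  wflow_status="Workflow status: NOT LAUNCHED YET"
fi

echo "${wflow_status}"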

@@ -653,7 +653,7 @@ workflow:
 #
 #-----------------------------------------------------------------------
 #
-FIXdir: '{{ EXPTDIR if workflow_switches.RUN_TASK_MAKE_GRID else [user.HOMEdir, "fix"]|path_join }}'
+FIXdir: '{{ EXPTDIR }}'
Collaborator

NCO requires that fix files be placed under $HOMEdir/fix, which is why we place the fix files there in NCO mode and in EXPTDIR in "community" mode. The logic above is not straightforward since it tests "RUN_TASK_MAKE_GRID", which is always set to false in NCO mode.

Collaborator Author

@danielabdi-noaa OK, I didn't realize NCO requires the fix files to be in the directory of the clone. In that case, why not test the value of RUN_ENVIR instead of RUN_TASK_MAKE_GRID? Many researchers (i.e., non-NCO users) will be running in community mode but will turn off the make_grid task once their grid is all set up. The line would then be changed to:

FIXdir: '{{ EXPTDIR if user.RUN_ENVIR == "community" else [user.HOMEdir, "fix"]|path_join }}'

I tested this with both RUN_ENVIR = "nco" and RUN_ENVIR = "community" and it works as expected in each case. Are you ok with that solution?
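For clarity, the shell-level equivalent of that Jinja default (illustrative only; the actual default is set by the Jinja expression in the configuration file, not by shell code) is:

# Illustrative shell equivalent of the proposed FIXdir default (not actual SRW code).
if [ "${RUN_ENVIR}" = "community" ]; then
  FIXdir="${EXPTDIR}"        # community mode: fix-file symlinks go in the experiment directory
else
  FIXdir="${HOMEdir}/fix"    # nco mode: NCO requires fix files under the clone's fix directory
fi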

Collaborator

@gsketefian Sounds good. Note that this logic came about when disentangling NCO mode from the "grid being pre-generated" case so that we can run all WE2E tests in either nco or community mode. It was easier to base much of the logic on RUN_TASK_MAKE_GRID, but some behavior, like symlinking/copying fix files, has since become its own option; NCO mode always symlinked fix files while community mode copied them. Symlinking is now actually the default, to save space in the experiment directory.

Collaborator Author

@danielabdi-noaa Ok this one is updated now to check RUN_ENVIR instead of RUN_TASK_MAKE_GRID.

@MichaelLueken added the "bug" label Feb 13, 2023
…NCO requires fix files to be in the SRW clone's home directory, but in community mode we want them to be in the experiment directory).
@gsketefian changed the title from "[develop] Fixes for Issue #608 and #610" to "[develop] Fixes for Issue #608, #610, and #616" Feb 16, 2023
…ipt launches an unlaunched workflow when it encounters one.
@gsketefian
Collaborator Author

@MichaelLueken I merged the latest develop into my branch and reran all the fundamental tests listed above on Hera. They all passed. Now just waiting to get some approvals!

Collaborator

@danielabdi-noaa left a comment

Thanks for addressing my suggestions!

Collaborator

@MichaelLueken left a comment

@gsketefian Thanks for working with @danielabdi-noaa to address his concerns! I have run tests on Jet and the modifications to fix are working as expected. I will now approve these changes and submit the Jenkins tests.

@MichaelLueken added the "ci-hera-intel-WE" (Kicks off automated workflow test on hera with intel) and "run_we2e_coverage_tests" (Run the coverage set of SRW end-to-end tests) labels Feb 16, 2023
@venitahagerty removed the "ci-hera-intel-WE" label Feb 16, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 16, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1237470278/20230216215009/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2023-02-16 22:32:11 +0000 :: hfe04 :: Task make_ics, jobid=42067013, in state DEAD (FAILED), ran for 39.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2023-02-16 22:32:05 +0000 :: hfe05 :: Task make_ics, jobid=42067002, in state DEAD (FAILED), ran for 40.0 seconds, exit status=256, try=2 (of 2)
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
Experiment Succeeded on hera: MET_ensemble_verification
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
All experiments completed

@gsketefian
Collaborator Author

Thanks for addressing my suggestions!
@danielabdi-noaa You're welcome. I hope that in future PRs, if you have concerns, I can address them more thoroughly.

@gsketefian
Collaborator Author

@MichaelLueken Thanks for shepherding this through :)

@MichaelLueken added the "ci-hera-intel-WE" (Kicks off automated workflow test on hera with intel) label Feb 17, 2023
@venitahagerty removed the "ci-hera-intel-WE" label Feb 17, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 17, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1237470278/20230217145008/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Failed on hera: MET_ensemble_verification
2023-02-17 15:32:12 +0000 :: hfe02 :: Task run_fcst_mem001, jobid=42086698, in state DEAD (FAILED), ran for 110.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: MET_ensemble_verification
2023-02-17 15:32:12 +0000 :: hfe02 :: Task run_fcst_mem002, jobid=42086699, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2023-02-17 15:32:13 +0000 :: hfe12 :: Task run_fcst, jobid=42086697, in state DEAD (FAILED), ran for 137.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2023-02-17 15:32:09 +0000 :: hfe08 :: Task run_fcst, jobid=42086711, in state DEAD (FAILED), ran for 106.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: pregen_grid_orog_sfc_climo
2023-02-17 15:28:06 +0000 :: hfe03 :: Task run_fcst, jobid=42086616, in state DEAD (FAILED), ran for 108.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2023-02-17 15:32:14 +0000 :: hfe10 :: Task run_fcst, jobid=42086714, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2023-02-17 15:32:09 +0000 :: hfe05 :: Task run_fcst, jobid=42086700, in state DEAD (FAILED), ran for 106.0 seconds, exit status=256, try=1 (of 1)
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
2023-02-17 15:32:05 +0000 :: hfe05 :: Task run_fcst, jobid=42086703, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
All experiments completed

@MichaelLueken
Collaborator

@gsketefian In the first run of the ci-hera-intel-WE tests, two tests failed. The rerun showed that the two tests that previously failed now pass. The Jenkins tests have all completed successfully without issue. I will now move forward with merging this work. Thanks!

@MichaelLueken merged commit 532ed1c into ufs-community:develop Feb 17, 2023
@gsketefian
Collaborator Author

@MichaelLueken Thanks for merging and closing the accompanying issues. Next vx PR coming soon!

@gsketefian deleted the bugfix/preproc_task_pregen branch February 27, 2023 21:42
MichaelLueken pushed a commit that referenced this pull request Mar 14, 2023
The way the fix directory is set was changed in PR #609, specifically item number 3.
I think this has been causing an issue with forced run_envir=nco mode. I don't have a full explanation for this at the moment, but reverting to the old way of setting FIXdir seems to at least partially fix the issue. I was able to run the fundamental tests successfully on Hera using run_envir=nco.