
[develop] Fixes for Issue #608, #610, and #616 #609

Merged

Conversation

gsketefian
Collaborator

@gsketefian commented Feb 11, 2023

DESCRIPTION OF CHANGES:

  1. Fix the bad WE2E test configuration file for MET_verification_only_vx (Issue [develop] Fix WE2E yaml configuration file for MET_verification_only_vx #608).
  2. Make creation of symlinks to pregenerated files depend on whether downstream tasks need those symlinks (Issue [develop] Create symlinks to pregenerated grid/orog/sfc_climo files only if downstream tasks need them #610); see the sketch after this list.
  3. Set default value of FIXdir to HOMEdir/fix only when RUN_ENVIR="nco", not when RUN_TASK_MAKE_GRID=False; otherwise, set FIXdir to EXPTDIR (Issue [develop] Decide on location of symlinks to fix files depending on RUN_ENVIR, not RUN_TASK_MAKE_GRID #616).
  4. Add a flag to the script get_expts_status.sh so that if an experiment hasn't been launched yet, it calls the launch script launch_FV3LAM_wflow.sh to launch it instead of only outputting a message that it's not yet launched.

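The gating pattern referenced in item 2 can be sketched roughly as follows. This is a hypothetical illustration only, not the actual SRW code: aside from RUN_TASK_MAKE_GRID and FIXdir, the variable and flag names below are assumptions.

# Hypothetical sketch of the item-2 gating pattern; only RUN_TASK_MAKE_GRID and
# FIXdir appear in the document, the other names are placeholders.
if [ "${RUN_TASK_MAKE_GRID}" = "FALSE" ]; then
  # Link pregenerated grid files only if a downstream task will actually read them.
  if [ "${RUN_TASK_MAKE_OROG}" = "TRUE" ] || \
     [ "${RUN_TASK_MAKE_SFC_CLIMO}" = "TRUE" ] || \
     [ "${RUN_TASK_MAKE_ICS}" = "TRUE" ]; then
    ln -fsn "${PREGEN_GRID_DIR}"/* "${FIXdir}/"
  fi
fi
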
Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

The following fundamental tests were conducted on Hera with Intel and passed:

  • MET_verification
  • MET_verification_only_vx
  • community_ensemble_2mems
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
  • grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
  • MET_ensemble_verification
  • community_ensemble_2mems_stoch
  • pregen_grid_orog_sfc_climo

In addition, other tests were conducted that run some pre-processing tasks but not others (e.g. running make_orog but not make_grid or make_sfc_climo), and they also passed. Those are not included as new WE2E tests in order to keep the number of tests low.

DOCUMENTATION:

No changes to documentation are necessary as far as I can tell.

ISSUE:

This resolves Issues #608, #610, and #616.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published
    NA

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

… for Hera and modify run_WE2E_tests.py so that this location is obtained in a platform-independent way.
…owing up in SRW clone); Make creation of symlinks to pregenerated files depend on whether downstream tasks need those symlinks.
…t been launched yet, use the launch script to launch it.
@@ -427,7 +427,8 @@ The last ${num_log_lines} lines of the workflow launch log file
 print_info_msg "$msg" >> "${expts_status_fp}"
 tail -n ${num_log_lines} ${launch_wflow_log_fn} >> "${expts_status_fp}"
 else
-wflow_status="Workflow status: NOT LAUNCHED YET"
+wflow_status="Workflow status: NOT LAUNCHED YET
+Launching workflow using script \"${launch_wflow_fn}\"..."
Collaborator

I don't think this is a good idea, especially for the Jenkins tests. It used to work this way (i.e., launch the job after checking status), but I had issues with the Jenkins tests, which call this script every minute to check the status of experiments, so I changed it. Moreover, I believe the behavior of this script should be consistent with its name: it should simply check the status of experiments without launching them. That way it will also respect the launch time assigned in each cron job.

Collaborator Author

@danielabdi-noaa The original purpose of this script was to get the updated status of each experiment in a given directory from the command line without having to go into each directory to call launch_FV3LAM_wflow.sh. I did not realize it had been modified and was being used in the Jenkins tests, since I've been working on a different branch the past few months. I'm finding that I can no longer use it for its original purpose, since the new run_WE2E_tests.py script does not call launch_FV3LAM_wflow.sh to launch the experiments (so there is no log.launch_FV3LAM_wflow log file even if the experiments are running). I suppose it will once the crontab capability is added back in, but it would still be nice to be able to relaunch a bunch of jobs manually to get their updated workflow statuses. I'm thinking an easy solution is to add an argument that determines this behavior, say launch_wflows=[true|false], with a default value of false so it normally behaves as you expect. Does that work for you?

Collaborator

@gsketefian I disagree on this one. I think it is best to keep modularity and have one script/function that just checks experiment status and another that launches experiments, etc. Having been burned by the surprise existence of "launch_wflow" in this script, which was launching the jobs on Cheyenne after a minute, I am hesitant to bring back that functionality, especially without changing the name of the script. Also, the way it is now, you have no control over when to launch the job (it is launched immediately if it doesn't exist), so I doubt run_WE2E_tests.py would use it once it matures. It will probably want to control when to launch jobs and how frequently to check status, using either cron or something else.

Collaborator Author

@danielabdi-noaa OK, we can rename it to something like launch_expts_get_status.sh and add that argument (this time a default of "true" makes more sense). I can also add another argument for a wait period before launching tasks. run_WE2E_tests.py doesn't have to use it. Will that work?

Contributor

@gsketefian I still don't think it is a good idea, but I will defer to other reviewers. Although the naming change helps, passing flags to make one script do everything is not modular programming. It would be odd if I passed a flag to launch_workflow.sh to make it also check the status of an experiment, wouldn't you agree? In an ideal scenario, there should be one script/function that launches a single workflow, another that just checks the status of that single experiment, another that iterates over multiple workflows and launches them, and another that checks the status of multiple workflows. Especially since launching workflows involves more logic (launch time interval, whether cron is used, etc.), I believe it should be its own function.

Collaborator

@danielabdi-noaa, @gsketefian, is there a partial solution to this problem that could be introduced in this PR and then fully implemented through a subsequent PR? In other words, @gsketefian, is it possible to make a few changes to get your feature working again, but also begin what @danielabdi-noaa has requested? @danielabdi-noaa, would you be OK with an initial change here, and then a future PR to fully introduce a set of scripts to completely modularize the behavior being discussed?

Collaborator Author

@JeffBeck-NOAA I was waiting to hear back on slack from @mkavulich on whether he has plans to modify this script as part of his work on the WE2E tests, in which case it would not really make sense for me to do a whole rewrite and modularize it. But I think Mike is out of the office. It would certainly be ok with me to make the changes I suggested above and make an issue to do the bigger changes (modularizing it) at a later time.

Collaborator Author

@danielabdi-noaa I added the flag launch_wflows as mentioned above. Adding a flag for a wait time before launching is more difficult; that is really determined by the CRON_RELAUNCH_INTVL_MNTS variable in the workflow and/or an argument to run_WE2E_tests.sh (and soon its Python version), and trying to work it into this script would take more effort than is probably justified, as @mkavulich pointed out. Please let me know if this is a satisfactory solution for the time being. @mkavulich, any thoughts?
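For reference, a minimal sketch of how the new launch_wflows flag gates the behavior (illustrative only, not the actual get_expts_status.sh code; the script and log file names follow the discussion above, but the surrounding logic is simplified):

# Sketch only: launch an experiment from get_expts_status.sh when it has not
# been launched yet and the (new) launch_wflows flag is set to true.
launch_wflows=${launch_wflows:-"false"}   # default preserves status-check-only behavior
launch_wflow_fn="launch_FV3LAM_wflow.sh"
launch_wflow_log_fn="log.launch_FV3LAM_wflow"

if [ -f "${launch_wflow_log_fn}" ]; then
  wflow_status="Workflow status: see ${launch_wflow_log_fn}"
elif [ "${launch_wflows}" = "true" ]; then
  wflow_status="Workflow status: NOT LAUNCHED YET
Launching workflow using script \"${launch_wflow_fn}\"..."
  ./"${launch_wflow_fn}"
else
  wflow_status="Workflow status: NOT LAUNCHED YET"
fi

echo "${wflow_status}"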

@@ -653,7 +653,7 @@ workflow:
 #
 #-----------------------------------------------------------------------
 #
-FIXdir: '{{ EXPTDIR if workflow_switches.RUN_TASK_MAKE_GRID else [user.HOMEdir, "fix"]|path_join }}'
+FIXdir: '{{ EXPTDIR }}'
Collaborator

NCO requires that fix files be placed under $HOMEdir/fix, which is why we place the fix files there in NCO mode and in EXPTDIR in "community" mode. The logic above is not straightforward since it tests "RUN_TASK_MAKE_GRID", which is always set to false in NCO mode.

Collaborator Author

@danielabdi-noaa OK, I didn't realize NCO requires the fix files to be in the directory of the clone. In that case, why not test the value of RUN_ENVIR instead of RUN_TASK_MAKE_GRID? Many researchers (i.e., non-NCO users) will be running in community mode but will turn off the make_grid task once their grid is all set up. The line would then be changed to:

FIXdir: '{{ EXPTDIR if user.RUN_ENVIR == "community" else [user.HOMEdir, "fix"]|path_join }}'

I tested this with both RUN_ENVIR = "nco" and RUN_ENVIR = "community" and it works as expected in each case. Are you ok with that solution?
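For clarity, the shell-level equivalent of that Jinja default (illustrative only; the actual default is set by the Jinja expression in the configuration file, not by shell code) is:

# Illustrative shell equivalent of the proposed FIXdir default (not actual SRW code).
if [ "${RUN_ENVIR}" = "community" ]; then
  FIXdir="${EXPTDIR}"        # community mode: fix-file symlinks go in the experiment directory
else
  FIXdir="${HOMEdir}/fix"    # nco mode: NCO requires fix files under the clone's fix directory
fi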

Collaborator

@gsketefian Sounds good. Note that this logic came about when disentangling NCO mode from the "grid being pre-generated" case so that we can run all WE2E tests in either nco or community mode. It was easier to base much of the logic on RUN_TASK_MAKE_GRID, but some behavior, like symlinking/copying fix files, has since become its own option; NCO mode always symlinked fix files while community mode copied them. Symlinking is now actually the default, to save space in the experiment directory.

Collaborator Author

@danielabdi-noaa Ok this one is updated now to check RUN_ENVIR instead of RUN_TASK_MAKE_GRID.

@MichaelLueken added the "bug" label Feb 13, 2023
…NCO requires fix files to be in the SRW clone's home directory, but in community mode we want them to be in the experiment directory).
@gsketefian changed the title from "[develop] Fixes for Issue #608 and #610" to "[develop] Fixes for Issue #608, #610, and #616" Feb 16, 2023
…ipt launches an unlaunched workflow when it encounters one.
@gsketefian
Collaborator Author

@MichaelLueken I merged the latest develop into my branch and reran all the fundamental tests listed above on Hera. They all passed. Now just waiting to get some approvals!

Collaborator

@danielabdi-noaa left a comment

Thanks for addressing my suggestions!

Collaborator

@MichaelLueken left a comment

@gsketefian Thanks for working with @danielabdi-noaa to address his concerns! I have run tests on Jet and the modifications to fix are working as expected. I will now approve these changes and submit the Jenkins tests.

@MichaelLueken added the "ci-hera-intel-WE" (Kicks off automated workflow test on hera with intel) and "run_we2e_coverage_tests" (Run the coverage set of SRW end-to-end tests) labels Feb 16, 2023
@venitahagerty removed the "ci-hera-intel-WE" label Feb 16, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 16, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1237470278/20230216215009/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2023-02-16 22:32:11 +0000 :: hfe04 :: Task make_ics, jobid=42067013, in state DEAD (FAILED), ran for 39.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2023-02-16 22:32:05 +0000 :: hfe05 :: Task make_ics, jobid=42067002, in state DEAD (FAILED), ran for 40.0 seconds, exit status=256, try=2 (of 2)
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
Experiment Succeeded on hera: MET_ensemble_verification
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
All experiments completed

@gsketefian
Collaborator Author

Thanks for addressing my suggestions!
@danielabdi-noaa You're welcome. I hope that in future PRs, if you have concerns, I can address them more thoroughly.

@gsketefian
Collaborator Author

@MichaelLueken Thanks for shepherding this through :)

@MichaelLueken added the "ci-hera-intel-WE" (Kicks off automated workflow test on hera with intel) label Feb 17, 2023
@venitahagerty removed the "ci-hera-intel-WE" label Feb 17, 2023
@venitahagerty
Collaborator

venitahagerty commented Feb 17, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1237470278/20230217145008/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Failed on hera: MET_ensemble_verification
2023-02-17 15:32:12 +0000 :: hfe02 :: Task run_fcst_mem001, jobid=42086698, in state DEAD (FAILED), ran for 110.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: MET_ensemble_verification
2023-02-17 15:32:12 +0000 :: hfe02 :: Task run_fcst_mem002, jobid=42086699, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2023-02-17 15:32:13 +0000 :: hfe12 :: Task run_fcst, jobid=42086697, in state DEAD (FAILED), ran for 137.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2023-02-17 15:32:09 +0000 :: hfe08 :: Task run_fcst, jobid=42086711, in state DEAD (FAILED), ran for 106.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: pregen_grid_orog_sfc_climo
2023-02-17 15:28:06 +0000 :: hfe03 :: Task run_fcst, jobid=42086616, in state DEAD (FAILED), ran for 108.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2023-02-17 15:32:14 +0000 :: hfe10 :: Task run_fcst, jobid=42086714, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2023-02-17 15:32:09 +0000 :: hfe05 :: Task run_fcst, jobid=42086700, in state DEAD (FAILED), ran for 106.0 seconds, exit status=256, try=1 (of 1)
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
2023-02-17 15:32:05 +0000 :: hfe05 :: Task run_fcst, jobid=42086703, in state DEAD (FAILED), ran for 104.0 seconds, exit status=256, try=1 (of 1)
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
All experiments completed

@MichaelLueken
Collaborator

@gsketefian In the first run of the ci-hera-intel-WE tests, two tests failed. The rerun showed that the two tests that previously failed now pass. The Jenkins tests have all completed successfully without issue. I will now move forward with merging this work. Thanks!

@MichaelLueken merged commit 532ed1c into ufs-community:develop Feb 17, 2023
@gsketefian
Collaborator Author

@MichaelLueken Thanks for merging and closing the accompanying issues. Next vx PR coming soon!

@gsketefian deleted the bugfix/preproc_task_pregen branch February 27, 2023 21:42
MichaelLueken pushed a commit that referenced this pull request Mar 14, 2023
The way the fix directory is set was changed in PR #609, specifically item number 3.
I think this has been causing an issue with forced run_envir=nco mode. I don't have a full explanation for this at the moment, but reverting to the old way of setting FIXdir seems to at least partially fix the issue. I was able to run the fundamental tests successfully on Hera using run_envir=nco.