NCO mode (`run_envir="nco"`) results in random failures for WE2E tests #652

mkavulich · 2023-03-08T04:40:54Z

Expected behavior

The run_envir capability was included in the run_WE2E_tests.sh script (and its replacement run_WE2E_tests.py) in order to be able to force tests to run with either run_envir=community or run_envir=nco, regardless of what setting was included in the test config yaml file. Calling the WE2E run script with run_envir=nco will force all tests to run in NCO mode. Ideally this should not be a problem, as even though the nco_dirs directory is shared among the various tests, conflicts should be avoided by running each task in its own subdirectory.
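As a rough illustration of the isolation idea described above, the sketch below gives each task its own uniquely named working subdirectory under a shared parent directory. The helper name `make_task_workdir` and the naming scheme are hypothetical and are not taken from the WE2E scripts.

```python
import tempfile
from pathlib import Path


def make_task_workdir(shared_root: Path, expt_name: str, task_name: str) -> Path:
    """Create a unique working directory for one task of one experiment.

    Hypothetical helper: the directory naming scheme is an assumption
    made for illustration only.
    """
    shared_root.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a directory whose name is not used by any other caller,
    # so two experiments running the same task never share a working directory.
    workdir = tempfile.mkdtemp(prefix=f"{expt_name}.{task_name}.", dir=shared_root)
    return Path(workdir)
```

With a scheme like this, two experiments that both run `run_fcst` against the same shared parent would still write into different subdirectories.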

Current behavior

Currently, running several experiments in parallel in nco mode reveals some problems with the system. Tasks seem to fail randomly -- often without a descriptive error message -- and will work upon re-running.

I confirmed that this behavior is random by running the same set of experiments twice, and seeing a completely different set of failures in each run. Running this same set of tests without the run_envir="nco" option, or running each task serially so that no two tests were running at the same time, resulted in all successes.
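A minimal sketch of the comparison described above, assuming the failed test names from each run have already been collected into sets (how they are collected is not shown, and the function name `compare_failures` is hypothetical). If the failures were systematic, most names would land in `both`; a mostly empty `both` points to random behavior.

```python
from typing import Dict, Set


def compare_failures(run_a: Set[str], run_b: Set[str]) -> Dict[str, Set[str]]:
    """Split the failed tests from two runs of the same suite into three groups."""
    return {
        "both": run_a & run_b,          # failed in both runs
        "only_first": run_a - run_b,    # failed only in the first run
        "only_second": run_b - run_a,   # failed only in the second run
    }
```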

Examples of these failures can be found on Hera in /scratch2/BMC/fv3lam/kavulich/UFS/workdir/nco_tests/expt_dirs

Machines affected

All that I have tested so far (Hera and Jet). I assume this will affect all platforms.

Steps To Reproduce

Run the fundamental suite of WE2E tests using either the shell- or python-based run script:

./run_WE2E_tests.sh test_type=fundamental machine=hera account=ACCOUNT run_envir=nco

or

./run_WE2E_tests.py -m=hera -a=ACCOUNT --tests=fundamental --run_envir=nco

Note: Due to the error described in #571, the two commands above will not run the same set of tests on Hera. The error should still occur regardless (though due to its random/inconsistent nature, it may take a few tries to replicate).


mkavulich · 2023-03-09T00:09:11Z

Update: I was able to replicate this error without explicitly passing the run_envir=nco option, simply by specifying 4 nco-mode tests in my test list:

nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR

From what I can tell, the issue is due to some sort of problem that occurs when the same task runs at the same time for two different experiments.

danielabdi-noaa · 2023-03-14T10:20:02Z

@mkavulich Have you found a solution for this? I think I am encountering this issue in PR #647, where re-running the tests in community mode works but not in NCO mode. Any idea which PR introduced this problem?

mkavulich · 2023-03-14T19:54:42Z

@danielabdi-noaa This issue appears to be fixed for some tasks, but I am still seeing task failures in run_fcst. I also have not had time to understand the failure but this likely was introduced in a different commit than the one you found.

Doing a bit more digging, it appears as if there are at least two different failure modes currently. The first is a failure with no helpful error message, which seems to resolve just by rewinding and re-submitting the run_fcst task. The second appears to be an un-caught failure in the make_lbcs task, where one or more files are either not created or accidentally deleted somehow. I saved the output for this one on Hera: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/expt_dirs/fundamental_nco_fix/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR_old_20230314_194720

danielabdi-noaa · 2023-03-14T20:07:49Z

@mkavulich The way you described the problem, there seems to be some non-deterministic behaviour, so I won't be surprised if there are additional issues. That is why I was careful with the description of my PR that got merged. I made it clear that it is a partial fix, quoted here again:

Potentially addresses issue #652. The way fix directory is set is changed in PR #609, specifically item number 3.
I think this has been causing an issue with forced run_envir=nco mode. I don't have a full explanation for this at the moment, but reverting back to the old way of setting FIXdir seems to at least partially fix the issue. I was able to run fundamental tests successfully on Hera using run_envir=nco.

Having said that, the fact that undoing the change regarding FIXdir in PR #652 makes all 10 fundamental tests pass on Hera, while previously almost all of them failed (at least for me), indicates the PR at the very least exacerbated the problem.

mkavulich · 2023-03-14T20:43:26Z

@danielabdi-noaa Thank you for your partial fix. I didn't mean to imply that you were responsible for fixing this problem. The issue was automatically closed because your PR referenced this issue, so I re-opened it to clarify that the problem still remains in part.

gsketefian · 2023-05-19T17:01:25Z

@mkavulich @danielabdi-noaa FYI those tests involving verification fail in NCO mode, and I'm fixing that in my PR #695. Not sure if it's directly related to your issues though.

MichaelLueken · 2024-08-27T15:03:36Z

The NCO sample configuration and NCO WE2E tests were removed in PR #1060. Before their removal, the random failures for the NCO WE2E tests occurred for two reasons:

Incorrect fix files being available for the given test. Depending on which files a given WE2E test required, these files could be overwritten depending on where tests were located in the run queue. This was corrected when each experiment directory was given its own fix directory in PR [develop] Fixing bug: moved placing fix_lam tests' directories from common place (ufs-srweather-app) to each tests' run directory. #977.
In NCO mode, there is only a single output for each task for a given cycle date. If multiple WE2E tests have the same DATE_FIRST_CYCL, they will share the same output for each task in the workflow. Issues arose because several tests used the same DATE_FIRST_CYCL, but one of the tests used a different grid. This resulted in NCO mode failures. The tests with different grids were moved in PR [develop] Fixing bug: moved placing fix_lam tests' directories from common place (ufs-srweather-app) to each tests' run directory. #977.
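The second cause can be checked for ahead of time. The sketch below is only an illustration under stated assumptions: each test is represented as a plain dict with a `DATE_FIRST_CYCL` entry (the name from the comment above) and a `grid` entry (a hypothetical key name), and the function `find_cycle_conflicts` is not part of the WE2E scripts. It reports cycle dates shared by tests that use more than one grid.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_cycle_conflicts(tests: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return cycle dates shared by tests that do not all use the same grid.

    `tests` maps a test name to a dict with the keys "DATE_FIRST_CYCL"
    and "grid" (an assumed layout, for illustration only).
    """
    names_by_cycle: Dict[str, List[str]] = defaultdict(list)
    grids_by_cycle: Dict[str, Set[str]] = defaultdict(set)
    for name, cfg in tests.items():
        cycle = cfg["DATE_FIRST_CYCL"]
        names_by_cycle[cycle].append(name)
        grids_by_cycle[cycle].add(cfg["grid"])
    # Only cycles used with more than one grid can produce clashing output.
    return {
        cycle: sorted(names)
        for cycle, names in names_by_cycle.items()
        if len(grids_by_cycle[cycle]) > 1
    }
```

An empty result means no two tests in the list combine the same first cycle date with different grids.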

From December 14, 2023, the NCO WE2E tests were running as expected, until their removal in PR #1060, on March 27, 2024.

Closing issue now.

mkavulich added the bug (Something isn't working) and Priority: HIGH labels Mar 8, 2023

mkavulich mentioned this issue Mar 8, 2023

[develop] Replace shell-based WE2E scripts with python versions #637

Merged

19 tasks

danielabdi-noaa mentioned this issue Mar 14, 2023

[develop] Add new RRFS variables such as NWGES and workflow control variables #647

Closed

37 tasks

danielabdi-noaa mentioned this issue Mar 14, 2023

[develop] Potential bugfix for run_envir=nco issue #670

Merged

37 tasks

danielabdi-noaa linked a pull request Mar 14, 2023 that will close this issue

[develop] Potential bugfix for run_envir=nco issue #670

Merged

37 tasks

danielabdi-noaa self-assigned this Mar 14, 2023

MichaelLueken closed this as completed in #670 Mar 14, 2023

mkavulich reopened this Mar 14, 2023

MichaelLueken mentioned this issue Mar 29, 2023

[develop] Update ufs-weather-model and UPP hash to correct post control file issue #699

Merged

23 tasks

mkavulich mentioned this issue Apr 17, 2023

[develop] Round 2 of overhaul to WE2E test suites (and other test improvements!) #732

Merged

22 tasks

This was referenced Apr 28, 2023

[develop] Bugfix find task function in setup.py #745

Merged

[develop] Remove redundancies in loading the run-time python environment. #761

Merged

This was referenced May 8, 2023

[develop] Move all unittest tests to a common area. #728

Merged

[develop] Use templates for METplus config files #683

Merged

MichaelLueken mentioned this issue May 16, 2023

[develop] Update WM and UPP hashes and minor rearrangement of WE2E coverage tests that fail on certain platforms #799

Merged

19 tasks

christinaholtNOAA mentioned this issue May 23, 2023

[develop] Fix retrieve data. #810

Merged

23 tasks

MichaelLueken mentioned this issue Aug 21, 2023

[develop] Update UFS-WM, UFS_UTILS, and UPP hashes. #892

Closed

17 tasks

mkavulich mentioned this issue Sep 7, 2023

[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains #871

Merged

21 tasks

MichaelLueken closed this as completed Aug 27, 2024
