NCO mode (run_envir="nco") results in random failures for WE2E tests #652

Closed
mkavulich opened this issue Mar 8, 2023 · 7 comments · Fixed by #670
Labels: bug, Priority: HIGH

Comments

@mkavulich
Collaborator

mkavulich commented Mar 8, 2023

Expected behavior

The run_envir capability was included in the run_WE2E_tests.sh script (and its replacement run_WE2E_tests.py) in order to be able to force tests to run with either run_envir=community or run_envir=nco, regardless of what setting was included in the test config yaml file. Calling the WE2E run script with run_envir=nco will force all tests to run in NCO mode. Ideally this should not be a problem, as even though the nco_dirs directory is shared among the various tests, conflicts should be avoided by running each task in its own subdirectory.
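
As a minimal sketch of the "each task in its own subdirectory" idea, the snippet below creates a working directory that cannot be shared by two experiments, even under a common root. The function name `make_task_workdir` and its arguments are hypothetical and are not taken from the SRW App code; the real workflow may build its task directories differently.

```python
import os
import tempfile


def make_task_workdir(shared_root, expt_name, task_name, cycle):
    """Create and return a fresh working directory under shared_root.

    tempfile.mkdtemp appends a random suffix and creates the directory
    atomically, so two experiments running the same task for the same
    cycle never receive the same directory.
    """
    os.makedirs(shared_root, exist_ok=True)
    prefix = f"{expt_name}.{task_name}.{cycle}."
    return tempfile.mkdtemp(prefix=prefix, dir=shared_root)


# Usage: even identical arguments yield distinct directories.
root = tempfile.mkdtemp()
dir_a = make_task_workdir(root, "expt1", "run_fcst", "2019061500")
dir_b = make_task_workdir(root, "expt1", "run_fcst", "2019061500")
```

Isolation of this kind only protects scratch space; any input or output location that is still keyed by task and cycle alone remains shared between experiments.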

Current behavior

Currently, running several experiments in parallel in NCO mode reveals some problems with the system. Tasks fail seemingly at random -- often without a descriptive error message -- and succeed upon re-running.

I confirmed that this behavior is random by running the same set of experiments twice, and seeing a completely different set of failures in each run. Running this same set of tests without the run_envir="nco" option, or running each task serially so that no two tests were running at the same time, resulted in all successes.

Examples of these failures can be found on Hera in /scratch2/BMC/fv3lam/kavulich/UFS/workdir/nco_tests/expt_dirs

Machines affected

All that I have tested so far (Hera and Jet). I assume this will affect all platforms.

Steps To Reproduce

Run the fundamental suite of WE2E tests using either the shell- or python-based run script:

./run_WE2E_tests.sh test_type=fundamental machine=hera account=ACCOUNT run_envir=nco

or

./run_WE2E_tests.py -m=hera -a=ACCOUNT --tests=fundamental --run_envir=nco

Note: Due to the error described in #571, the two commands above will not run the same set of tests on Hera. The error should still occur in either case (though due to its random/inconsistent nature, it may take a few tries to replicate).

@mkavulich
Collaborator Author

Update: I was able to replicate this error without explicitly passing the run_envir=nco option, simply by specifying 4 nco-mode tests in my test list:

  • nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
  • nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
  • nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km
  • nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR

From what I can tell, the problem arises when the same task runs at the same time for two different experiments.

@danielabdi-noaa
Collaborator

@mkavulich Have you found a solution for this? I think I am encountering this issue in PR #647, where re-running the tests in community mode works but not in NCO mode. Any idea which PR introduced this problem?

@mkavulich
Collaborator Author

@danielabdi-noaa This issue appears to be fixed for some tasks, but I am still seeing task failures in run_fcst. I also have not had time to understand the failure but this likely was introduced in a different commit than the one you found.

Doing a bit more digging, it appears as if there are at least two different failure modes currently. The first is a failure with no helpful error message, which seems to resolve just by rewinding and re-submitting the run_fcst task. The second appears to be an un-caught failure in the make_lbcs task, where one or more files are either not created or accidentally deleted somehow. I saved the output for this one on Hera: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/expt_dirs/fundamental_nco_fix/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR_old_20230314_194720
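
One way to turn the second failure mode into a caught error is to verify a task's outputs before the workflow marks it complete. The following is only a sketch under stated assumptions: `find_missing_outputs` is a hypothetical helper, and the file names are placeholders rather than the actual make_lbcs output names.

```python
import tempfile
from pathlib import Path


def find_missing_outputs(out_dir, expected_names):
    """Return the expected file names that are absent or empty in out_dir."""
    out_dir = Path(out_dir)
    missing = []
    for name in expected_names:
        path = out_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Usage with placeholder file names (not the real make_lbcs outputs).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "lbcs_000.nc").write_bytes(b"data")
(demo_dir / "lbcs_006.nc").write_bytes(b"")  # an empty file counts as missing
expected = ["lbcs_000.nc", "lbcs_006.nc", "lbcs_012.nc"]
missing = find_missing_outputs(demo_dir, expected)
if missing:
    print(f"make_lbcs outputs missing or empty: {missing}")
```

A check like this does not explain why the files disappear, but it would make the task fail at the point of loss instead of in a downstream task.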

@danielabdi-noaa
Collaborator

@mkavulich From the way you described the problem, there seems to be some non-deterministic behaviour, so I won't be surprised if there are additional issues. That is why I was careful with the description of my PR that got merged. I made it clear that it is a partial fix, quoted here again:

Potentially addresses issue #652. The way the fix directory is set was changed in PR #609, specifically item number 3.
I think this has been causing an issue with forced run_envir=nco mode. I don't have a full explanation for this at the moment, but reverting to the old way of setting FIXdir seems to at least partially fix the issue. I was able to run the fundamental tests successfully on Hera using run_envir=nco.

Having said that, undoing the FIXdir change from PR #609 makes all 10 fundamental tests pass on Hera, whereas previously almost all of them failed, at least for me. That indicates the PR at the very least exacerbated the problem.

@mkavulich
Collaborator Author

@danielabdi-noaa Thank you for your partial fix. I didn't mean to imply that you were responsible for fixing this problem. The issue was automatically closed because your PR referenced it, so I re-opened it to clarify that the problem still remains in part.

@gsketefian
Collaborator

@mkavulich @danielabdi-noaa FYI those tests involving verification fail in NCO mode, and I'm fixing that in my PR #695. Not sure if it's directly related to your issues though.

@MichaelLueken
Collaborator

The NCO sample configuration and NCO WE2E tests were removed in PR #1060. Before their removal, the random failures in the NCO WE2E tests had two causes:

  1. Incorrect fix files being available for the given test. Depending on which files a given WE2E test required, these files could be overwritten depending on where tests were located in the run queue. This was corrected when each experiment directory was given its own fix directory in PR #977 ("[develop] Fixing bug: moved placing fix_lam tests' directories from common place (ufs-srweather-app) to each tests' run directory").
  2. In NCO mode, there is only a single output for each task for a given cycle date. If multiple WE2E tests have the same DATE_FIRST_CYCL, they share the same output for each task in the workflow. Failures arose when several tests used the same DATE_FIRST_CYCL but one of them used a different grid. The tests with different grids were moved in PR #977.
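
The second cause can be checked for ahead of time. The sketch below groups tests by DATE_FIRST_CYCL and reports any cycle date shared by tests on different grids. It assumes each test's settings are already loaded into a dictionary; the `grid` key, the test names, and the dates are made up for illustration and do not reflect the real WE2E config layout.

```python
from collections import defaultdict


def find_cycle_conflicts(tests):
    """Map each shared cycle date to the tests that would collide on it.

    A cycle date is reported only when the tests using it do not all
    run on the same grid.
    """
    by_cycle = defaultdict(list)
    for name, cfg in tests.items():
        by_cycle[cfg["DATE_FIRST_CYCL"]].append((name, cfg["grid"]))

    conflicts = {}
    for cycle, entries in by_cycle.items():
        grids = {grid for _, grid in entries}
        if len(grids) > 1:
            conflicts[cycle] = sorted(name for name, _ in entries)
    return conflicts


# Hypothetical test settings, for illustration only.
tests = {
    "test_a": {"DATE_FIRST_CYCL": "2019061500", "grid": "RRFS_CONUS_25km"},
    "test_b": {"DATE_FIRST_CYCL": "2019061500", "grid": "RRFS_CONUS_13km"},
    "test_c": {"DATE_FIRST_CYCL": "2019070100", "grid": "RRFS_CONUS_25km"},
}
conflicts = find_cycle_conflicts(tests)
```

Here `test_a` and `test_b` are flagged because they share a cycle date but differ in grid, while `test_c` has a cycle date to itself.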

From December 14, 2023, the NCO WE2E tests were running as expected, until their removal in PR #1060, on March 27, 2024.

Closing issue now.
