Add dependency in restart test case #95

xylar · 2023-07-19T14:03:22Z

Because we don't know the filename of the restart file at setup time, we need to instead make the `full_run` step an explicit dependency of the `restart_run` step.
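A minimal sketch of the idea, for illustration only: the `Step` class below is a hypothetical stand-in and not the actual Polaris class, and its `add_dependency` method is an assumption about how such a dependency might be recorded. The marker filename `step_after_run.pickle` is taken from the error message later in this thread. The point is that the dependent step waits on a file with a fixed, known name instead of a restart file whose name is only known after the run.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Step:
    """Hypothetical stand-in for a test-case step (not the Polaris class)."""
    name: str
    work_dir: Path
    inputs: list = field(default_factory=list)
    dependencies: dict = field(default_factory=dict)

    def add_dependency(self, step):
        # Depend on the step itself rather than on an output file whose
        # name cannot be known at setup time.
        self.dependencies[step.name] = step
        # A marker file written when the dependency finishes has a fixed
        # name, so it can be listed as an input at setup time.
        self.inputs.append(step.work_dir / 'step_after_run.pickle')

    def missing_inputs(self):
        # Inputs that do not exist yet, as strings for an error message.
        return [str(path) for path in self.inputs if not path.exists()]


full_run = Step('full_run', Path('restart/full_run'))
restart_run = Step('restart_run', Path('restart/restart_run'))
restart_run.add_dependency(full_run)
```

With this arrangement, `restart_run.missing_inputs()` reports the marker file from `full_run` until that step has completed, whatever the restart file ends up being called.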

Checklist

Testing comment in the PR documents testing used to verify the changes

xylar · 2023-07-19T14:12:54Z

Testing

The restart_test and the rest of the PR suite passed on Chrysalis and are bit-for-bit (BFB) identical to a baseline using main.

xylar · 2023-07-20T09:14:46Z

This will need to be rebased and conflicts fixed after #96 goes in.

altheaden

I gave this another test with the PR suite against a baseline, and everything passed. I then ran the baroclinic channel restart test by manually running the init step, then the restart step, skipping the full run, and it crashed as expected:

Traceback (most recent call last):
  File "/home/ac.althea/miniconda3/envs/polaris-test-2/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 499, in _run_step
    raise OSError(
OSError: output file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/ac.althea/ac.althea/polaris_tests/baroclinic/fix-restart-test-inputs-outputs/ocean/baroclinic_channel/10km/restart/restart_run/output.nc']

Everything looks as it should, as far as I can tell.

xylar · 2023-07-26T10:38:38Z

@altheaden, that's odd. The error you see isn't what I expected or what I see when I try the same. I see:

$ cd init/
$ polaris serial
...
$ cd ../restart_run
$ polaris serial
polaris calling: polaris.run.serial._run_test()
  in /home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py

Traceback (most recent call last):
  File "/home/xylar/mambaforge/envs/polaris_test/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 431, in _run_step
    raise OSError(
OSError: input file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/xylar/data/polaris_0.1/test_20230726/fix-bcn-restart/ocean/baroclinic_channel/10km/restart/full_run/step_after_run.pickle']

That's what I was expecting to see -- it's complaining about an input rather than an output file.
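The difference between the two errors can be sketched as follows. This is a hypothetical illustration and not the code in `polaris/run/serial.py`; the function name `run_step` and its arguments are assumptions. The line numbers in the two tracebacks (431 versus 499 in `_run_step`) suggest that inputs are checked before the step runs and outputs afterwards, which is the order assumed here.

```python
import os


def run_step(name, inputs, outputs, run):
    """Hypothetical sketch of the check order (not polaris.run.serial)."""
    missing = [path for path in inputs if not os.path.exists(path)]
    if missing:
        # Raised before the model is launched at all.
        raise OSError(f'input file(s) missing in step {name}: {missing}')

    run()

    missing = [path for path in outputs if not os.path.exists(path)]
    if missing:
        # Only reachable after the run was attempted.
        raise OSError(f'output file(s) missing in step {name}: {missing}')
```

Under this assumption, an "output file(s) missing" error means the input check passed (or listed nothing that was missing) and the model itself was started, whereas an "input file(s) missing" error means the step stopped before launching the model.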

xylar · 2023-07-26T10:39:36Z

I'm going to go ahead and merge but it would be good to know what the workflow was that produced the results you saw.

altheaden · 2023-07-26T17:57:16Z

I can recreate it today and see what the results are. As far as I can tell, I did the same process that you did, but let me see if my results are different this time around.

altheaden · 2023-07-26T18:15:41Z

@xylar Here is a longer version of the error message I get (not sure how much is useful for you to see), still ending in the same error. Not sure what I'm doing differently.

(polaris-test-2) [ac.althea@chr-0245 fix-restart-test-inputs-outputs]$ cd ocean/baroclinic_channel/10km/restart/init
(polaris-test-2) [ac.althea@chr-0245 init]$ polaris serial
...
(polaris-test-2) [ac.althea@chr-0245 init]$ cd ../restart_run/
(polaris-test-2) [ac.althea@chr-0245 restart_run]$ polaris serial
...
Bypassing step's run() method and running with command line args

polaris calling: polaris.parallel.run_command()
  in /gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/parallel.py

Running: srun -c 1 -N 1 -n 4 ./ocean_model -n namelist.ocean -s streams.ocean
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/ac.althea/miniconda3/envs/polaris-test-2/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 499, in _run_step
    raise OSError(
OSError: output file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/ac.althea/ac.althea/polaris_tests/baroclinic/fix-restart-test-inputs-outputs/ocean/baroclinic_channel/10km/restart/restart_run/output.nc']

xylar · 2023-07-26T18:42:39Z

@altheaden, is this in a directory where you already ran the command successfully once? Even if so, it's weird that it doesn't just run successfully and instead has errors. We would probably need to look at log.ocean.0000.err to see what the issue was that led to the MPI_ABORT.

But it seems like you're seeing a rather different and more unexpected behavior than I was seeing. Maybe let's let it be for now. If we see this again, we can investigate further.

altheaden · 2023-07-26T18:45:39Z

@xylar I actually just made sure to update the submodules and re-make before setting up the test again. Every time, I have been setting up a new directory and just doing the workflow I showed (cd init, polaris serial, cd restart_run, polaris serial). I just did it again and got the same results. Then, I went and manually ran the full run step before running the restart run step and they were both successful.

altheaden · 2023-07-26T18:46:16Z

I just checked the error files from my restart_run test, and they all just say that the files in the restarts directory don't exist.

altheaden · 2023-07-26T18:46:36Z

(polaris-test-2) [ac.althea@chr-0245 restart_run]$ cat *.err
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       0 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
ERROR: Error reading initial state in init
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       1 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       2 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       3 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20

xylar · 2023-07-26T19:00:30Z

@altheaden, those all look like errors I would have expected to see before this branch. Any chance you were accidentally testing from a different branch (e.g. an earlier version of main) rather than my fix-restart-test-inputs-outputs? That branch is now gone but you could test with the latest main and it should behave like my test.

But also, like I said, it's not critical to figure this out if you'd rather let it go.

altheaden · 2023-07-26T19:28:33Z

@xylar I just did as you asked and now I'm getting the same error you were, a missing input file. No idea why it was different for me before...

xylar added the bug (Something isn't working) and ocean (Related to ocean tests or analysis) labels Jul 19, 2023

xylar requested a review from altheaden July 19, 2023 14:03

xylar self-assigned this Jul 19, 2023

Add dependency in restart test case

4f53281

Because we don't know the filename of the restart file at setup time, we need to instead make the `full_run` step an explicit dependency of the `restart_run` step.

xylar force-pushed the fix-restart-test-inputs-outputs branch from c88f20e to 4f53281 Compare July 25, 2023 11:42

altheaden approved these changes Jul 25, 2023

xylar merged commit 38a1803 into E3SM-Project:main Jul 26, 2023
5 checks passed

xylar deleted the fix-restart-test-inputs-outputs branch July 26, 2023 10:39
