Add dependency in restart test case #95

Merged
merged 1 commit into from
Jul 26, 2023

Conversation

xylar
Collaborator

@xylar xylar commented Jul 19, 2023

Because we don't know the filename of the restart file at setup time, we need to instead make the full_run step an explicit dependency of the restart_run step.
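As a rough illustration of the idea (not the Polaris implementation), the sketch below uses a hypothetical `Step` class. Instead of listing the restart file as an input, `restart_run` depends on a marker file that `full_run` writes when it finishes. The marker name `step_after_run.pickle` is taken from the traceback later in this thread; the class, its attributes, and its methods are assumptions made for this example.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Step:
    """Toy stand-in for a workflow step (hypothetical, not the Polaris class)."""
    name: str
    work_dir: Path
    inputs: list = field(default_factory=list)

    def add_dependency(self, other):
        # Depend on a marker file the other step writes on completion,
        # rather than on an output whose name is unknown at setup time.
        marker = other.work_dir / "step_after_run.pickle"
        self.inputs.append(marker)

    def missing_inputs(self):
        # Return the inputs that do not exist yet, as strings.
        return [str(path) for path in self.inputs if not path.exists()]
```

With this arrangement, `restart_run.add_dependency(full_run)` makes a skipped `full_run` show up as a missing input before the model is ever launched.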

Checklist

  • Testing comment in the PR documents testing used to verify the changes

@xylar xylar added the bug (Something isn't working) and ocean (Related to ocean tests or analysis) labels Jul 19, 2023
@xylar xylar requested a review from altheaden July 19, 2023 14:03
@xylar xylar self-assigned this Jul 19, 2023
@xylar
Collaborator Author

xylar commented Jul 19, 2023

Testing

The restart_test and the rest of the PR suite passed on Chrysalis and are BFB with a baseline using main.

@xylar
Collaborator Author

xylar commented Jul 20, 2023

This will need to be rebased and conflicts fixed after #96 goes in.

Because we don't know the filename of the restart file at
setup time, we need to instead make the `full_run` step an
explicit dependency of the `restart_run` step.
@xylar xylar force-pushed the fix-restart-test-inputs-outputs branch from c88f20e to 4f53281 Compare July 25, 2023 11:42
Collaborator

@altheaden altheaden left a comment

I gave this another test with the PR suite against a baseline, and everything passed. I then ran the baroclinic channel restart test by manually running the init step, then the restart step, skipping the full run, and it crashed as expected:

Traceback (most recent call last):
  File "/home/ac.althea/miniconda3/envs/polaris-test-2/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 499, in _run_step
    raise OSError(
OSError: output file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/ac.althea/ac.althea/polaris_tests/baroclinic/fix-restart-test-inputs-outputs/ocean/baroclinic_channel/10km/restart/restart_run/output.nc']

Everything looks as it should, as far as I can tell.

@xylar
Collaborator Author

xylar commented Jul 26, 2023

@altheaden, that's odd. The error you see isn't what I expected or what I see when I try the same. I see:

$ cd init/
$ polaris serial
...
$ cd ../restart_run
$ polaris serial
polaris calling: polaris.run.serial._run_test()
  in /home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py

Traceback (most recent call last):
  File "/home/xylar/mambaforge/envs/polaris_test/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/home/xylar/code/e3sm/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 431, in _run_step
    raise OSError(
OSError: input file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/xylar/data/polaris_0.1/test_20230726/fix-bcn-restart/ocean/baroclinic_channel/10km/restart/full_run/step_after_run.pickle']

That's what I was expecting to see -- it's complaining about an input rather than an output file.
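The difference between the two errors can be sketched as follows, assuming the runner checks inputs before launching a step and outputs afterwards (the two tracebacks raise from different lines of `_run_step`, which is consistent with that ordering). The `run_step` helper and its parameters are hypothetical and only illustrate the order of the checks.

```python
import os


def run_step(name, inputs, outputs, run):
    """Run a step, checking inputs first and outputs last (hypothetical helper)."""
    # With an explicit dependency, a skipped upstream step is caught here,
    # before the model is launched.
    missing = [path for path in inputs if not os.path.exists(path)]
    if missing:
        raise OSError(f"input file(s) missing in step {name}: {missing}")

    run()

    # Without it, the model starts, fails, and only the absent output
    # is reported.
    missing = [path for path in outputs if not os.path.exists(path)]
    if missing:
        raise OSError(f"output file(s) missing in step {name}: {missing}")
```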

@xylar
Collaborator Author

xylar commented Jul 26, 2023

I'm going to go ahead and merge but it would be good to know what the workflow was that produced the results you saw.

@xylar xylar merged commit 38a1803 into E3SM-Project:main Jul 26, 2023
5 checks passed
@xylar xylar deleted the fix-restart-test-inputs-outputs branch July 26, 2023 10:39
@altheaden
Collaborator

I can recreate it today and see what the results are. As far as I can tell, I did the same process that you did, but let me see if my results are different this time around.

@altheaden
Collaborator

@xylar Here is a longer version of the error message I get (not sure how much is useful for you to see), still ending in the same error. Not sure what I'm doing differently.

(polaris-test-2) [ac.althea@chr-0245 fix-restart-test-inputs-outputs]$ cd ocean/baroclinic_channel/10km/restart/init
(polaris-test-2) [ac.althea@chr-0245 init]$ polaris serial
...
(polaris-test-2) [ac.althea@chr-0245 init]$ cd ../restart_run/
(polaris-test-2) [ac.althea@chr-0245 restart_run]$ polaris serial
...
Bypassing step's run() method and running with command line args

polaris calling: polaris.parallel.run_command()
  in /gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/parallel.py

Running: srun -c 1 -N 1 -n 4 ./ocean_model -n namelist.ocean -s streams.ocean
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Traceback (most recent call last):
  File "/home/ac.althea/miniconda3/envs/polaris-test-2/bin/polaris", line 33, in <module>
    sys.exit(load_entry_point('polaris', 'console_scripts', 'polaris')())
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/__main__.py", line 62, in main
    commands[args.command]()
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 176, in main
    run_single_step(args.step_is_subprocess)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 134, in run_single_step
    _run_test(test_case, available_resources)
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 409, in _run_test
    _run_step(test_case, step, test_case.new_step_log_file,
  File "/gpfs/fs1/home/ac.althea/code/polaris/fix-restart-test-inputs-outputs/polaris/run/serial.py", line 499, in _run_step
    raise OSError(
OSError: output file(s) missing in step restart_run of ocean/baroclinic_channel/10km/restart: ['/home/ac.althea/ac.althea/polaris_tests/baroclinic/fix-restart-test-inputs-outputs/ocean/baroclinic_channel/10km/restart/restart_run/output.nc']

@xylar
Collaborator Author

xylar commented Jul 26, 2023

@altheaden, is this in a directory where you already ran the command successfully once? Even if so, it's weird that it doesn't just run successfully and instead has errors. We would probably need to look at log.ocean.0000.err to see what the issue was that led to the MPI_ABORT.

But it seems like you're seeing a rather different and more unexpected behavior than I was seeing. Maybe let's let it be for now. If we see this again, we can investigate further.

@altheaden
Collaborator

@xylar I actually just made sure to update the submodules and re-make before setting up the test again. Every time, I have been setting up a new directory and just doing the workflow I showed (cd init, polaris serial, cd restart_run, polaris serial). I just did it again and got the same results. Then, I went and manually ran the full run step before running the restart run step and they were both successful.

@altheaden
Collaborator

I just checked the error files from my restart_run test, and they all just say that the files in the restarts directory don't exist.

@altheaden
Collaborator

(polaris-test-2) [ac.althea@chr-0245 restart_run]$ cat *.err
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       0 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
ERROR: Error reading initial state in init
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       1 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       2 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20
----------------------------------------------------------------------
Beginning MPAS-ocean Error Log File for task       3 of       4
    Opened at 2023/07/26 13:41:20
----------------------------------------------------------------------

ERROR: Stream 'restart' attempted to read non-existent file '../restarts/rst.0001-01-01_00.05.00.nc'
CRITICAL ERROR: Core init failed for core ocean
Logging complete.  Closing file at 2023/07/26 13:41:20

@xylar
Collaborator Author

xylar commented Jul 26, 2023

@altheaden, those all look like errors I would have expected to see before this branch. Any chance you were accidentally testing from a different branch (e.g., an earlier version of main) rather than my fix-restart-test-inputs-outputs? That branch is now gone, but you could test with the latest main and it should behave like my test.

But also, like I said, it's not critical to figure this out if you'd rather let it go.

@altheaden
Collaborator

@xylar I just did as you asked and now I'm getting the same error you were, a missing input file. No idea why it was different for me before...
