Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

+More consistent treatment of input_filename = 'F' #1370

Merged
merged 4 commits into from
Apr 14, 2021

Conversation

Hallberg-NOAA
Copy link
Collaborator

Restructured the code slightly so that the output files that are generated
when input_filename = 'F' is exactly the same as if it is 'n' if there are no
restart files in the restart_input_dir, or as if it is 'r' if the restart files
are there. Previously, the solutions with 'F' worked this way, but no
ocean_geometry.nc or Vertical_grid.nc files were written when WRITE_GEOM=1,
regardless of the presence or absence of the restart files, and the
MOM_parameter_doc.all files differed slightly between the 'n' and 'F' or 'r'
cases. As a part of these changes, the determination of whether this is a new
run is moved earlier in the algorithm, and now sits outside of MOM_initialize_state.
All solutions are bitwise identical, but there are changes in the position of the
PARALLEL_RESTARTFILES and REFERENCE_HEIGHT entries in most
MOM_parameter_doc.all files.

Despite appearances to the contrary, this is actually a relatively simple
single-commit PR with just two files modified, that builds on MOM6 PR #1369.

Hallberg-NOAA and others added 2 commits April 10, 2021 08:39
  Restructured the code slightly so that the output files that are generated
when input_filename = 'F' is exactly the same as if it is 'n' if there are no
restart files in the restart_input_dir, or as if it is 'r' if the restart files
are there.  Previously, the solutions with 'F' worked this way, but no
ocean_geometry.nc or Vertical_grid.nc files were written when WRITE_GEOM=1,
regardless of the presence or absence of the restart files, and the
MOM_parameter_doc.all files differed slightly between the 'n' and 'F' or 'r'
cases.  As a part of these changes, the determination of whether this is a new
run is moved earlier in the algorithm, and now sits outside of
MOM_initialize_state.  All solutions are bitwise identical, but there are
changes in the position of the PARALLEL_RESTARTFILES and REFERENCE_HEIGHT
entries in most MOM_parameter_doc.all files.
@marshallward
Copy link
Collaborator

marshallward commented Apr 12, 2021

@marshallward
Copy link
Collaborator

There is an answer change in tc2.a when using restarts with GNU on Gaea.

Neither GitHub Actions nor my home machine caught this issue, which both use newer compiler versions (at least 9.2) whereas Gaea uses 7.3.0 so it may only happen in older compilers.

This may have something to do with expected default namelist values.

@Hallberg-NOAA
Copy link
Collaborator Author

I am able to reproduce the gnu restart problems with TC2 (and I know why I missed this problem). I will be debugging this PR and correcting it tomorrow, hopefully.

@marshallward
Copy link
Collaborator

If this is a real issue then we should try to modify our tests so that the newer compilers can catch it.

  Changed the declarations of the vardesc and fields arrays to allocatable in
write_ocean_geometry_files, primarily to get one of the TC test cases to run
properly with the gcc compiler by shifting the memory for these arrays from
stack to heap.  The reason why this change works is not clear.  Some comments
describing these variables were also added.  All answers are bitwise identical.
@Hallberg-NOAA
Copy link
Collaborator Author

After further investigation, the resstart reproducability appears to be tickling some sort of memory corruption issue in TC2 that I do not understand. This failure of TC2 is triggered if the call to ocean_write_geometry file occurs after the call to tracer_registry_init. In particular if only 18 variables' metadata is stored in the 23 member array of vardesc types in ocean_write_geometry, every thing is fine, but if the metadata for the 19th is stored, the TC2 restart tests start to fail with gcc 7.3.0, even if a return call is used to keep this array from actually being used after it is set. (Similar problems seem to arise with gcc 8.3.0 or 9.2.0 as well, although I haven't put all of the time into narrowing down exactly when in the code this occurs with those compiler versions.) All of the memory in question was being set automatically from stack, and not via an allocate call, so this seems to be some sort of memory management issue with the particular version of the gnu compiler that is being used on Gaea when working with arrays of structures of a sufficient size.

The fact that this issue only arises in one of the TCs, for which we have all diagnostics enabled, and not in any of our regression tests suggests to me that we might be approaching memory limits associated with the increasing number of diagnostics in MOM6. Perhaps a broad survey of the memory associated with each diagnostic (e.g., fixed character string lengths) or other use of static memory related to I/O would be in order.

Changing ocean_write_geometry to explicitly allocate and deallocate its memory from the heap avoids this problem, but I find it to be profoundly disconcerting that I do not understand how this actually arose, or what we could do to avoid anything similar in the future. I am adding a modification to this PR to MOM_shared_initialization.F90 that addresses this problem.

@marshallward
Copy link
Collaborator

marshallward commented Apr 14, 2021

It does seem a bit odd, maybe vars loses integrity inside of the function call as an array of derived types using a parameter for size. But if so then it is probably caused by compiler error. It would not be the first gcc 7.3.0 bug.

We could hit it with valgrind, but I think let's take this as a win and move on.

@marshallward
Copy link
Collaborator

marshallward commented Apr 14, 2021

https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/12461 ✔️ 🟡

This requires a parameter update.

@marshallward
Copy link
Collaborator

I tested this on Gaea with GCC 8.3 and it did not raise the error, so I am going to speculate that the issue was a compiler bug, and the change here is sufficient to avert it.

Maybe it's something more severe but I think let's leave this for now and just take a note in the future.

@marshallward marshallward merged commit fc75e0f into mom-ocean:dev/gfdl Apr 14, 2021
@Hallberg-NOAA Hallberg-NOAA deleted the F_case_file_output branch July 30, 2021 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants