Fix several bugs in fv_regional_bc.F90 relating to uninitialized or incorrectly initialized memory. #219
Conversation
These changes are going into the RRFS as an emergency fix, in this PR: It's headed to the RRFS_dev branch in the NOAA-GSL fork, which is used as the authoritative repo for RRFS parallels. I'm hoping to un-fork the GFDL_atmos_cubed_sphere as soon as possible. The developers do not want to wait the several weeks it'll take to update RRFS_dev to the authoritative branches. They want this fix as soon as possible, due to the severity of the bug and its widespread implications.
@yangfanglin Could we ask Kate or Kevin to review this PR? Thanks
Thank you Jun, I wasn't sure who to ask to review this. I can update the reviewers if you would like.
Thanks, Lauren. I think some dynamics experts on the UFS regional application can review the PR.
The original code looks good to me. The proposed change does not seem correct because it remaps the winds at cell edges based on surface pressure at the cell center. FV3 needs winds on the cell edges (C- and D-grids). When the model runs in regional mode, the remapping of the winds in the boundary domain requires the surface pressure to be on the cell edges, which is obtained by linear interpolation of the surface pressure at the cell centers; for example, 0.5*(psc(i,j-1)+psc(i,j)), as shown in the original code. I am uncertain whether there is an issue with the boundary condition files. There are four halo rows of data, which should be more than enough to construct winds at the edges of the three halo rows required by the FV3 algorithm. For this reason, I don't see how a lack of surface pressure for remapping the winds could arise.
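For reference, a minimal sketch of the two-point edge average described above; psc is the cell-center surface pressure named in the comment, and ps_edge is an assumed name for the edge value:

! The remap of a D-grid wind on a cell edge needs surface pressure on that edge,
! built from the two adjacent cell-center values. If psc(i,j-1) was never filled,
! this average is garbage (or NaN when the array is pre-filled with real_snan).
ps_edge = 0.5*(psc(i,j-1)+psc(i,j))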
Turn back on ps_reg=real_snan and you will see real_snan in those two lines and quite a few others. The code extensively uses uninitialized memory.
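For anyone who wants to reproduce this, a minimal self-contained sketch of the safeguard, assuming real_snan is built from the ieee_arithmetic intrinsics (the actual definition in the code may differ):

program snan_guard
  use ieee_arithmetic, only: ieee_value, ieee_signaling_nan, ieee_is_nan
  implicit none
  real :: real_snan
  real :: ps_reg(4,4)
  ! Pre-fill the whole array with a signaling NaN, then fill only part of it,
  ! the way the boundary reader would.
  real_snan = ieee_value(real_snan, ieee_signaling_nan)
  ps_reg = real_snan
  ps_reg(2:3,2:3) = 1.0e5
  ! Any element that was never overwritten is still NaN, so a read of
  ! uninitialized memory shows up immediately.
  print *, 'unfilled point is NaN: ', ieee_is_nan(ps_reg(1,1))
end program snan_guard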
There are three reasons I know of why surface pressure is unavailable in the wind remapping:
However, I don't even need to know that any of those are true, to know the code reads uninitialized areas. Just turn on the safeguard ps_reg=real_snan, and you'll see it fail all over the place.
Until we can fix this properly, the only way I see to put the i-1 and j-1 back in is to use i-1 or j-1 when they are available, and only the local point when they're not. I can use my "is_not_finite" subroutine to determine that. To fix it properly, we need PS on all the required points. There are points where i-1 or j-1 are inside the domain but pressure is unavailable, or beyond the boundary regions, where pressure is unavailable. We'll need PS in the boundary files to extend into those regions. I'll update my PS-related issue accordingly. Also, the "PS" reader will have to read into the next MPI rank's boundary region to get i-1 and j-1. Edit: This comment was supposed to address two situations: now, when we lack surface pressure, and later, when we have it.
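A sketch of the interim fallback described above; is_not_finite is the helper mentioned in this thread, while psc and ps_edge are assumed names:

! Use the two-point average onto the edge when the neighbouring surface pressure
! exists; otherwise fall back to the local cell-center value.
if (is_not_finite(psc(i,j-1))) then
  ps_edge = psc(i,j)                    ! neighbour unavailable
else
  ps_edge = 0.5*(psc(i,j-1)+psc(i,j))   ! normal two-point edge average
endif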
Have you confirmed the fix by setting ps_reg = real_snan?
Everyone is hyperfocusing on four lines of code when there are quite a few other bugs in fv_regional_bc.F90. In particular, the bug causing the model to abort in debug mode is NOT the one involving accessing i-1 and j-1. Since some people don't want to read my bug reports or pull request, I'll give you a hint: -9999999 Pa does propagate up to the model, inside the domain, not just at the boundary.
I do like the idea "ps_reg = real_snan" and I have it in my fix. It does not signal any error in my test case. I am not sure what other bugs you are referring to.
Good to hear. I will have to see your changes to know what (if anything) is not fixed.
@kaiyuan-cheng 's PR is here: #220, but it is on top of main rather than dev/emc. I lack the ability to test this in UFS since it won't compile within UFS (CMakeLists.txt is missing). That means I can't even confirm #220 will fix the bugs. It also leaves me wondering if some of the bugs originate from elsewhere in dev/emc, rather than the three boundary index lines fixed in #220. Without the ability to test, I do not know.
Although #220 does not fix the problem, I was able to dramatically reduce the complexity of my PR by replacing nearly all of my PS-related fixes with @kaiyuan-cheng's change. I believe this is the minimal possible set of fixes until someone has a more elegant fix to the blending bug.
The UFSWM PR that contains this is at the top of the commit queue. I recommend we get reviews/approvals here confirming there are no visible issues in the code before we start testing in UFSWM.
I concur with @BrianCurtis-NOAA. I'm less worried about the state of this PR after the fixes from #220 were merged, but I do want someone to vet the remaining changes before widespread testing.
is=min(ubound(Atm%ps,1),max(lbound(Atm%ps,1),is))
ie=min(ubound(Atm%ps,1),max(lbound(Atm%ps,1),ie))
js=min(ubound(Atm%ps,2),max(lbound(Atm%ps,2),js))
je=min(ubound(Atm%ps,2),max(lbound(Atm%ps,2),je))
I doubt that the modification here is really necessary.
Without these changes, the model uses data outside the boundary regions that are overwritten with real_snan in the following loop. This error was how I found the ps_reg=-9999999 and later bugs. (Maybe this is caused by a region of the code you don't typically use?)
However, even if we assume the changes are not necessary, they amount to a harmless bounds check on the copy into Atm%ps. As with initializing data to real_snan, bounds checks are a reasonable safeguard.
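For context, a sketch of how the clamped indices protect the subsequent copy; the loop body is illustrative (the source array name is an assumption), not the exact code:

! After the min/max clamping above, is:ie and js:je cannot step outside the
! allocated bounds of Atm%ps, so the copy below stays in bounds even if the
! incoming indices were slightly too wide.
do j=js,je
  do i=is,ie
    Atm%ps(i,j) = ps_reg(i,j)
  enddo
enddo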
if(is_not_finite(array(i,j,k))) then
  cycle ! Outside boundary
endif
I don't run the blending boundary condition so I cannot test it. This workaround looks fine to me, although I do think that revising the indices of the blending domain is the ultimate fix.
You're probably right about the indices, but I don't trust my knowledge of the code. If someone else knows the blending code, hopefully they'll review and suggest a better solution.
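For reviewers unfamiliar with the helper, a hypothetical sketch of what an is_not_finite function could look like using the ieee_arithmetic intrinsics; the actual routine in the PR may be written differently:

logical function is_not_finite(x)
  use ieee_arithmetic, only: ieee_is_finite
  implicit none
  real, intent(in) :: x
  ! True for NaN and +/- infinity, false for any ordinary value.
  is_not_finite = .not. ieee_is_finite(x)
end function is_not_finite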
I haven't had time to look in great detail, but philosophically it seems like a bandaid is being applied with these is_not_finite checks. It would be nice to know the root cause of why it is reaching out beyond defined data. How frequently are these checks excluding i,j values?
The other change that caught my eye was to the regional_bc_bounds variables (for south, north, and west). I would just be somewhat surprised if they had been wrong all this time, given the testing and vetting of the code by the original coders. But it certainly could be possible.
@MatthewPyle-NOAA Those index changes come from @kaiyuan-cheng in #220 and they do prevent most out-of-bounds access. However, the results are different than my original fixes in RRFS_dev so these have not been vetted in a parallel yet.
I have not tested that, but I recall from earlier testing that the first variable to show the problem is "pt". Also, those changes affect what is effectively a no-op; the areas it is blending have no valid data.
Many out-of-bounds bugs were masked by the old ps_reg=-9999999 initialization.
All tests for ufs-wm PR #1437 are finished. We can start merging this PR.
@laurenchilutti @SamuelTrahanNOAA @bensonr @kaiyuan-cheng just a reminder to get final approvals in for this PR.
@jkbk2004 There is nothing I can do to effect that end but wait.
@kaiyuan-cheng It seems Kaiyuan is not available. @laurenchilutti can you review/merge this PR? The commit queue has been on hold before moving to the next PR.
@laurenchilutti @kaiyuan-cheng @bensonr Thanks for following up with the PR!
* fixes io performance issues by making everyone a reader when io_layout=1,1; adds capability to use FMS feature to ignore data integrity checksums in restarts
* rename enforce_rst_cksum to ignore_rst_cksum and change the default value for compatibility
* updated UFS/GFS atmosphere.F90 driver as per @BinLiu-NOAA and @junwang-noaa
* Regional decomposition test fix (when nrows_blend > 0) (NOAA-GFDL#194)
* Add missing instance for hailwat
* Regional bc blend changes to extend into interior halos and overlap on corners. Still not working for u and v.
* atmosphere.F90: add hailwat to check; dyn_core.F90: fix from Jun Wang to correct sync of u,v; fv_regional_bc.F90: add check for nrows_blend > tile size; fix error when nrows_blend=1
* Explanatory comment added
* Removed commented code
* Clean old code
* In fv_fill.F90, use kind_phys for kind_phys instead of hard-coding 8 byte reals. (NOAA-GFDL#193)
* Expose remap_scalar and remap_dwinds to fv3-jedi (NOAA-GFDL#199)
* changed interface to public
* added public
* removed source
* mods for jedi build
* Transfer changes from PR NOAA-GFDL#202 to NOAA-GFDL#199; made small changes from PR NOAA-GFDL#202 manually
* returned ignore checksum
* fixed ignore checksum
* Fix several bugs in fv_regional_bc.F90 relating to uninitialized or incorrectly initialized memory. (NOAA-GFDL#219)
* fixes and workarounds for uninitialized memory in fv_regional_bc
* remove workarounds and fix remaining known bugs in ps_reg
* a few more surface pressure bug fixes; now the test case runs in debug mode
* workarounds and bug fixes from gnu compiler testing
* remove -9999999 commented-out code
* quiet the NaNs passed to Atmp%ps
* simplify comments and explain snan
* use i-1 & j-1 for two-point averages, when available
* Replace many changes with PR NOAA-GFDL#220

Co-authored-by: Rusty.Benson <[email protected]>
Co-authored-by: Ted Mansell <[email protected]>
Co-authored-by: Rusty Benson <[email protected]>
Co-authored-by: Samuel Trahan (NOAA contractor) <[email protected]>
Co-authored-by: Mark Potts <[email protected]>
Description
This removes the workaround of initializing ps_reg to -9999999 Pa in fv_regional_bc.F90 and fixes many other bugs that it caused or masked. One bug related to blending has a workaround instead of a fix. A lot of these bugs were fixed by merging in #220. That PR from @kaiyuan-cheng has an elegant replacement for some of my earlier kludgy fixes. Unfortunately, that PR by itself does not allow the model to run in debug mode. I believe this PR has the minimal set of fixes and workarounds needed until someone comes up with a more elegant solution to the blending bug.
What is Fixed?
This PR fixes everything listed here, while related PRs (listed below) at higher levels fix several other issues. See the next section for the one bug that has a workaround instead of a fix.
Workaround for Blending Code Receiving Data Outside the Boundary
The blending code's "array" contains NaNs in regions it is asked to blend. If my reading of the code is correct, the "array" is what is being blended, not the boundary conditions themselves. These are probably (but not certainly) regions beyond the boundary in the domain. Any operation with a NaN returns NaN (unless the NaN signals), so processing those locations is a no-op (or a SIGFPE). The workaround is to detect non-finite values using is_not_finite() and skip them.

I'm concerned by this bug, as it means something may be wrong elsewhere. Someone with better knowledge of the code should look into this. The trouble may originate from the model level, so I've made an issue there (NOAA-EMC/fv3atm#586). It may be as simple as the blending code looping one point farther than it should.
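To illustrate why a NaN at a blended point is effectively a no-op, here is a sketch of a generic blend update; the weight and variable names are assumptions, not the actual blending code:

! If array(i,j,k) is NaN, the blended value is NaN no matter what the weight is,
! so skipping the point with the is_not_finite() guard does not change the result;
! it only avoids the floating-point trap when the NaN is a signaling one.
if (is_not_finite(array(i,j,k))) cycle
array(i,j,k) = wt*bc_value + (1.0-wt)*array(i,j,k)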
Testing:
Fixes #218
The RRFS regression tests in the ufs-weather-model were compiled with DEBUG=ON and extensively tested on Hera with the GNU and Intel compilers, involving many hours of debugging with gdb, ncview, and other tools. Testing this also required fixing a numerical precision issue in one ccpp-physics scheme (ufs-community/ccpp-physics#5, ufs-community/ccpp-physics#6) and changes to the regression tests to get them to fit in 30 minutes (ufs-community/ufs-weather-model#1437). Ultimately, this fixes half of the RRFS test variants that were failing before (ufs-community/ufs-weather-model#1222). It also obviates the problem of not having enough PS boundary data (ufs-community/ufs-weather-model#1436).
Preliminary results from a real-time parallel have shown this has a small impact on the bulk 1D statistics in a large domain with a large number of ranks. (Hence, boundary ranks have a small fraction of the domain.) However, regression tests have shown it does change the fields. Nobody has done any analysis between those two extremes.
Checklist:
Please check all items, whether they apply or not.