
Fix several bugs in fv_regional_bc.F90 relating to uninitialized or incorrectly initialized memory. #219

Merged

Conversation


@SamuelTrahanNOAA commented Sep 22, 2022

Description

This removes the workaround of initializing ps_reg to -9999999 Pa in fv_regional_bc.F90 and fixes many other bugs that workaround caused or masked. One bug, related to blending, has a workaround instead of a fix. A lot of these bugs were fixed by merging in #220. That PR from @kaiyuan-cheng has an elegant replacement for some of my earlier kludgy fixes. Unfortunately, that PR alone does not allow the model to run in debug mode. I believe this PR has the minimal set of fixes and workarounds needed until someone comes up with a more elegant solution to the blending bug.

What is Fixed?

This PR fixes everything listed here, while related PRs in higher-level repositories (listed below) fix several other issues. See the next section for the one bug that has a workaround instead of a fix.

  1. Only valid boundary values are copied to Atm%ps (not to data within the domain)
  2. Many uninitialized variables are now initialized with signalling NaN (real_snan); see the sketch after this list
  3. ps_reg is no longer initialized to -9999999 Pa
  4. Replaced "go to 123" with "exit"
  5. PR Correct regional boundary condition indices #220 is included. Indices are corrected so that wind remapping never accesses data outside boundary regions.
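
For illustration, here is a minimal sketch of how an array can be pre-filled with a signaling NaN using the standard ieee_arithmetic module. This is an assumption about the technique, not the PR's exact code; the real_snan constant in the actual source may be defined differently.

subroutine fill_with_snan(a)
  use ieee_arithmetic, only: ieee_value, ieee_signaling_nan
  implicit none
  real, intent(out) :: a(:,:)
  real :: snan
  ! Only the kind of snan matters here; its (undefined) value is never read.
  snan = ieee_value(snan, ieee_signaling_nan)
  ! Any element never overwritten with real data can now trap on first use,
  ! provided floating-point exception halting is enabled.
  a = snan
end subroutine fill_with_snan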

Workaround for Blending Code Receiving Data Outside the Boundary

The blending code's array (a variable literally named array) contains NaNs in regions it is asked to blend. If my reading of the code is correct, that array is what is being blended, not the boundary conditions themselves. These are probably (but not certainly) regions of the domain beyond the boundary. Any operation with a NaN returns NaN (unless the NaN signals), so processing those locations is a no-op (or a SIGFPE). The workaround is to detect non-finite values using is_not_finite() and skip them; a sketch of such a helper follows.
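
For reference, a minimal sketch of what an is_not_finite() helper could look like, assuming the standard ieee_arithmetic module; the PR's actual implementation may differ.

logical function is_not_finite(x)
  use ieee_arithmetic, only: ieee_is_finite
  implicit none
  real, intent(in) :: x
  ! True for NaN and +/-Inf alike; the classification itself raises no exception.
  is_not_finite = .not. ieee_is_finite(x)
end function is_not_finite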

I'm concerned by this bug, as it means something may be wrong elsewhere. Someone with better knowledge of the code should look into this. The trouble may originate from the model level, so I've made an issue there (NOAA-EMC/fv3atm#586). It may be as simple as the blending code looping one point farther than it should.

Testing:

Fixes #218

The RRFS regression tests in the ufs-weather-model were compiled in DEBUG=ON mode and extensively tested on Hera with the GNU and Intel compilers. This took many hours of debugging with gdb, ncview, and other tools. Testing also required fixing a numerical precision issue in one CCPP physics scheme (ufs-community/ccpp-physics#5, ufs-community/ccpp-physics#6) and changes to the regression tests to get them to fit in 30 minutes (ufs-community/ufs-weather-model#1437). Ultimately, this fixes half of the RRFS test variants that were failing before (ufs-community/ufs-weather-model#1222). It also obviates the problem of not having enough PS boundary data (ufs-community/ufs-weather-model#1436).

Preliminary results from a real-time parallel have shown this has a small impact on the bulk 1D statistics in a large domain with a large number of ranks. (Hence, the boundary ranks cover only a small fraction of the domain.) However, regression tests have shown it does change the fields. Nobody has done any analysis between those two extremes.

Checklist:

Please check each item, whether or not it applies:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@SamuelTrahanNOAA (Author)

These changes are going into the RRFS as an emergency fix, in this PR:

NOAA-GSL#12

It's headed to the RRFS_dev branch in the NOAA-GSL fork, which is used as the authoritative repo for RRFS parallels. I'm hoping to un-fork the GFDL_atmos_cubed_sphere as soon as possible. The developers do not want to wait the several weeks it'll take to update RRFS_dev to the authoritative branches. They want this fix as soon as possible, due to the severity of the bug and its widespread implications.

@junwang-noaa (Collaborator)

@yangfanglin Could we ask Kate or Kevin to review this PR? Thanks

@laurenchilutti (Contributor)

Thank you, Jun. I wasn't sure whom to ask to review this. I can update the reviewers if you would like.

@junwang-noaa (Collaborator)

Thanks, Lauren. I think a dynamics expert on the UFS regional application can review the PR.

@kaiyuan-cheng (Contributor)

The original code looks good to me. The proposed change does not seem correct because it remaps the winds at cell edges based on surface pressure at the cell center. FV3 needs winds on the cell edges (C- and D-grids). When the model runs in regional mode, the remapping of the winds in the boundary domain requires the surface pressure on the cell edges, which is obtained by linear interpolation of the surface pressure at the cell centers; for example, 0.5*(psc(i,j-1)+psc(i,j)), as shown in the original code and sketched below. I am uncertain whether there is an issue with the boundary condition files. There are four halo rows of data, which should be more than enough to construct winds at the edges of the three halo rows required by the FV3 algorithm. For this reason, I don't see how a lack of surface pressure for remapping winds could happen.
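
To make the geometry concrete, the interpolation described above amounts to the following (psc, i, and j as in the comment; ps_edge is a stand-in name, and this is an illustration rather than the model's exact loop):

! Surface pressure on a cell edge for wind remapping:
! the average of the two adjacent cell-center values.
ps_edge = 0.5*(psc(i,j-1) + psc(i,j))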

@SamuelTrahanNOAA (Author)

For this reason, I don't see how a lack of surface pressure for remapping winds could happen.

Turn back on ps_reg=real_snan and you will see real_snan in those two lines and quite a few others. The code extensively uses uninitialized memory.

@SamuelTrahanNOAA (Author)

There are three reasons I know of why surface pressure is unavailable in the wind remapping:

  1. PS is unavailable in the file on some velocity points (it is only available on scalar points).
  2. The boundaries are processed in the wrong order, so the j-1 points are unavailable.
  3. Some of the i-1 and j-1 locations are outside the boundary region.

However, I don't even need to know that any of those are true to know the code reads uninitialized memory. Just turn on the safeguard ps_reg=real_snan, and you'll see it fail all over the place.
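
For context, this is why a signaling NaN makes such reads visible: with IEEE halting enabled, as debug builds typically arrange, the first arithmetic use of the NaN traps. A minimal sketch using the standard ieee_exceptions module (an illustration, not the model's actual build configuration):

use ieee_exceptions, only: ieee_set_halting_mode, ieee_invalid
! Abort (SIGFPE) the first time an invalid operand, such as a
! signaling NaN, enters a floating-point operation.
call ieee_set_halting_mode(ieee_invalid, .true.)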

@SamuelTrahanNOAA (Author) commented Sep 27, 2022

Until we can fix this properly, the only way I see to put the i-1 and j-1 points back in is to use i-1 or j-1 when they are available, and only the local point when they're not. I can use my "is_not_finite" routine to determine that; a sketch of the fallback follows.
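
A hedged sketch of that fallback, reusing names from the earlier snippets (ps_edge is a stand-in; this is not the exact PR code):

if (is_not_finite(psc(i,j-1))) then
   ps_edge = psc(i,j)                     ! neighbor missing: use the local point only
else
   ps_edge = 0.5*(psc(i,j-1) + psc(i,j))  ! normal two-point average
endif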

To fix it properly, we need PS on all the required points. There are points whose i-1 or j-1 neighbors lie inside the domain, and points whose neighbors lie beyond the boundary regions; in both cases, pressure is unavailable there. We'll need PS in the boundary files to extend into those regions. I'll update my PS-related issue accordingly.

Also, the "PS" reader will have to read into the next MPI rank's boundary region to get i-1 and j-1.

Edit: This comment was supposed to address two situations: now, when we lack surface pressure, and later, when we have it.

@SamuelTrahanNOAA (Author) commented Sep 28, 2022

Have you confirmed the fix by setting ps_reg = real_snan?
Have you fixed the other three bugs?

@SamuelTrahanNOAA (Author)

Everyone is hyperfocusing on four lines of code when there are quite a few other bugs in fv_regional_bc.F90. In particular, the bug causing the model to abort in debug mode is NOT the one involving access to i-1 and j-1. Since some people don't want to read my bug reports or pull request, I'll give you a hint: -9999999 Pa does propagate up to the model, inside the domain, not just at the boundary.

@kaiyuan-cheng (Contributor)

I do like the idea "ps_reg = real_snan" and I have it in my fix. It does not signal any error in my test case. I am not sure what other bugs you are referring to.

@SamuelTrahanNOAA (Author)

I do like the idea "ps_reg = real_snan" and I have it in my fix. It does not signal any error in my test case. I am not sure what other bugs you are referring to.

Good to hear. I will have to see your changes to know what (if anything) is not fixed.

@SamuelTrahanNOAA (Author) commented Sep 28, 2022

@kaiyuan-cheng's PR is here: #220, but it is on top of main rather than dev/emc. I lack the ability to test it in UFS since it won't compile within UFS (CMakeLists.txt is missing). That means I can't even confirm #220 will fix the bugs. It also leaves me wondering if some of the bugs originate from elsewhere in dev/emc rather than from the three boundary index lines fixed in #220. Without the ability to test, I do not know.

@SamuelTrahanNOAA (Author)

Although #220 does not fix the problem on its own, I was able to dramatically reduce the complexity of my PR by replacing nearly all of my PS-related fixes with @kaiyuan-cheng's change. I believe this is the minimal possible set of fixes until someone has a more elegant fix for the blending bug.

@BrianCurtis-NOAA

The UFSWM PR that contains this is at the top of the commit queue. I recommend we get reviews/approvals here confirming there are no visible issues in the code before we start testing in UFSWM.

@SamuelTrahanNOAA (Author)

I concur with @BrianCurtis-NOAA. I'm less worried about the state of this PR after the fixes from #220 were merged, but I do want someone to vet the remaining changes before widespread testing.

Comment on lines +3666 to +3669
is=min(ubound(Atm%ps,1),max(lbound(Atm%ps,1),is))
ie=min(ubound(Atm%ps,1),max(lbound(Atm%ps,1),ie))
js=min(ubound(Atm%ps,2),max(lbound(Atm%ps,2),js))
je=min(ubound(Atm%ps,2),max(lbound(Atm%ps,2),je))
Contributor:

I doubt that the modification here is really necessary.

Author:

Without these changes, the following loop uses data from outside the boundary regions, areas that were overwritten with real_snan. This error was how I found the ps_reg=-9999999 bug and the later ones. (Maybe this is caused by a region of the code you don't typically use?)

However, even if we assume the changes are not necessary, this is then a harmless bounds check on copying to Atm%ps. As with initializing data to real_snan, bounds checks are a reasonable safeguard.

Comment on lines +4936 to +4938
if(is_not_finite(array(i,j,k))) then
cycle ! Outside boundary
endif
Contributor:

I don't run the blending boundary condition so I cannot test it. This workaround looks fine to me, although I do think that revising the indices of the blending domain is the ultimate fix.

Author:

You're probably right about the indices, but I don't trust my knowledge of the code. If someone else knows the blending code, hopefully they'll review and suggest a better solution.

@MatthewPyle-NOAA (Contributor)

I haven't had time to look in great detail, but philosophically it seems like a band-aid is being applied with these is_not_finite checks. It would be nice to know the root cause of why the code is reaching out beyond defined data. How frequently are these checks excluding i,j values?

if(is_not_finite(array(i,j,k))) then
cycle ! Outside boundary
endif

The other change that caught my eye was the set of changes to the regional_bc_bounds variables (for south, north, and west). I would just be somewhat surprised if they had been wrong all this time, given the testing and vetting of the code by the original coders. But it certainly could be possible.

-        regional_bc_bounds%is_west_uvw=ied-nhalo_model-nrows_blend+1
+        regional_bc_bounds%is_west_uvw=ied-nhalo_model-nrows_blend+2

@SamuelTrahanNOAA (Author)

@MatthewPyle-NOAA Those index changes come from @kaiyuan-cheng in #220 and they do prevent most out-of-bounds access. However, the results are different from my original fixes in RRFS_dev, so they have not been vetted in a parallel yet.

@SamuelTrahanNOAA (Author) commented Oct 4, 2022

How frequently are these checks excluding i,j values?

I have not tested that, but I recall from earlier testing that the first variable to show the problem is "pt". Also, those checks only skip what was already a no-op; the areas being blended there have no valid data.

I would just be somewhat surprised if they had been wrong all this time, given the testing and vetting of the code by the original coders.

Many out-of-bounds bugs were masked by ps_reg=-9999999, which bypassed the normal safeguard of initializing to signalling NaN. My issue #218 lists the out-of-bounds accesses I found when restoring the safeguard.

@jkbk2004

All tests for ufs-wm PR #1437 are finished. We can start merging this PR.

@jkbk2004

@laurenchilutti @SamuelTrahanNOAA @bensonr @kaiyuan-cheng just a reminder to get final approvals for this PR.

@SamuelTrahanNOAA (Author)

@jkbk2004 There is nothing I can do to effect that end but wait.

@jkbk2004

It seems @kaiyuan-cheng is not available. @laurenchilutti, can you review/merge this PR? The commit queue is on hold until we can move to the next PR.

@laurenchilutti merged commit aa42f6e into NOAA-GFDL:dev/emc on Oct 17, 2022
@jkbk2004

@laurenchilutti @kaiyuan-cheng @bensonr Thanks for following up with the PR!

junwang-noaa pushed a commit to NOAA-EMC/GFDL_atmos_cubed_sphere that referenced this pull request Oct 21, 2022
* fixes io performance issues by making everyone a reader when io_layout=1,1; adds capability to use FMS feature to ignore data integrity checksums in restarts
* rename enforce_rst_cksum to ignore_rst_cksum and change the default value for compatibility
* updated UFS/GFS atmosphere.F90 driver as per @BinLiu-NOAA and @junwang-noaa
* Regional decomposition test fix (when nrows_blend > 0) (NOAA-GFDL#194)
* Add missing instance for hailwat
* Regional bc blend changes to extend into interior halos and overlap on corners. Still not working for u and v.
* atmosphere.F90 : add hailwat to check
dyn_core.F90 : Fix from Jun Wang to correct sync of u,v
fv_regional_bc.F90 : add check for nrows_blend > tile size; fix error when nrows_blend=1

* Explanatory comment added

* Removed commented code

* Clean old code

* In fv_fill.F90, use kind_phys instead of hard-coding 8-byte reals. (NOAA-GFDL#193)

* Expose remap_scalar and remap_dwinds to fv3-jedi (NOAA-GFDL#199)

* changed interface to public

* added public

* removed source

* mods for jedi build

* Transfer changes from PR NOAA-GFDL#202 to NOAA-GFDL#199

Made small changes from PR NOAA-GFDL#202 manually.

* returned ignore checksum

* fixed ignore checksum

* Fix several bugs in fv_regional_bc.F90 relating to uninitialized or incorrectly initialized memory. (NOAA-GFDL#219)

* fixes and workarounds for uninitialized memory in fv_regional_bc

* remove workarounds and fix remaining known bugs in ps_reg

* a few more surface pressure bug fixes; now the test case runs in debug mode

* workarounds and bug fixes from gnu compiler testing

* remove -9999999 commented-out code

* quiet the NaNs passed to Atmp%ps

* simplify comments and explain snan

* use i-1 & j-1 for two-point averages, when available

* Replace many changes with PR NOAA-GFDL#220

Co-authored-by: Rusty.Benson <[email protected]>
Co-authored-by: Ted Mansell <[email protected]>
Co-authored-by: Rusty Benson <[email protected]>
Co-authored-by: Samuel Trahan (NOAA contractor) <[email protected]>
Co-authored-by: Mark Potts <[email protected]>