coastal_ike_shinnecock_atm2sch2ww3 hanging on Orion with Intel compiler #4

Open
uturuncoglu opened this issue Mar 21, 2024 · 40 comments
Assignees
Labels
bug Something isn't working

Comments

@uturuncoglu
Collaborator

The coastal_ike_shinnecock_atm2sch2ww3 test case is hanging on Orion with the Intel compiler. I also tried running it on Hercules with Intel and it passes there. So, this could be a system issue, but it may be linked with #3 and needs to be investigated.

@uturuncoglu uturuncoglu added the bug Something isn't working label Mar 21, 2024
@uturuncoglu uturuncoglu self-assigned this Mar 21, 2024
@uturuncoglu uturuncoglu changed the title coastal_ike_shinnecock_atm2sch2ww3 hanging on Orion coastal_ike_shinnecock_atm2sch2ww3 hanging on Orion with Intel compiler Mar 21, 2024
@uturuncoglu
Collaborator Author

@josephzhang8 @platipodium I tracked down this issue and compiled the code in debug mode. It crashes with the trace below, throwing a floating point exception from the following line:

https://github.com/schism-dev/schism/blob/3c3a13c97aea9eb9f0739c934933318203399d22/src/Hydro/schism_init.F90#L7028

I am not sure whether this is related to the model or to the coupling interface. I'll check the information that goes to SCHISM, but maybe you have some idea.

13: [Orion-04-25:44363:0:44363] Caught signal 8 (Floating point exception: floating-point invalid operation)
13: ==== backtrace (tid:  44363) ====
13:  0 0x0000000003b35af1 schism_init_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM/src/Hydro/schism_init.F90:7028
13:  1 0x00000000038d7fd6 schism_nuopc_cap_mp_initializeadvertise_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM-ESMF/src/schism/schism_nuopc_cap.F90:345
13:  2 0x0000000000a77fe4 ESMCI::FTable::callVFuncPtr()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
13:  3 0x0000000000a7c0af ESMCI_FTableCallEntryPointVMHop()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
13:  4 0x0000000000bda1e7 ESMCI::VMK::enter()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1125
13:  5 0x00000000008ea0e2 ESMCI::VM::enter()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
13:  6 0x0000000000a7942a c_esmc_ftablecallentrypointvm_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
13:  7 0x000000000097b950 esmf_compmod_mp_esmf_compexecute_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1223
13:  8 0x0000000000d571b1 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1412
13:  9 0x000000000091e450 nuopc_driver_mp_loopmodelcompss_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:2886
13: 10 0x0000000000946734 nuopc_driver_mp_initializeipdv02p1_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:1313
13: 11 0x000000000095058b nuopc_driver_mp_initializegeneric_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:481
13: 12 0x0000000000a77fe4 ESMCI::FTable::callVFuncPtr()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
13: 13 0x0000000000a7c0af ESMCI_FTableCallEntryPointVMHop()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
13: 14 0x0000000000bd9fda ESMCI::VMK::enter()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2321
13: 15 0x00000000008ea0e2 ESMCI::VM::enter()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
13: 16 0x0000000000a7942a c_esmc_ftablecallentrypointvm_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
13: 17 0x000000000097b950 esmf_compmod_mp_esmf_compexecute_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1223
13: 18 0x0000000000d571b1 esmf_gridcompmod_mp_esmf_gridcompinitialize_()  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/cache/build_stage/spack-stage-esmf-8.5.0-7t7fsxpkw36g4ht6c6qbu4bvviztvaim/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1412
13: 19 0x000000000042c5a7 MAIN__()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/driver/UFS.F90:381
13: 20 0x0000000000429392 main()  ???:0
13: 21 0x0000000000022495 __libc_start_main()  ???:0
13: 22 0x00000000004292a9 _start()  ???:0
13: =================================
13: forrtl: error (75): floating point exception
13: Image              PC                Routine            Line        Source             
13: fv3.exe            00000000042E199B  Unknown               Unknown  Unknown
13: libpthread-2.17.s  00002BA28A9915D0  Unknown               Unknown  Unknown
13: fv3.exe            0000000003B35AF1  schism_init_             7028  schism_init.F90
13: fv3.exe            00000000038D7FD6  schism_nuopc_cap_         345  schism_nuopc_cap.F90
13: fv3.exe            0000000000A77FE4  Unknown               Unknown  Unknown
13: fv3.exe            0000000000A7C0AF  Unknown               Unknown  Unknown
13: fv3.exe            0000000000BDA1E7  Unknown               Unknown  Unknown
13: fv3.exe            00000000008EA0E2  Unknown               Unknown  Unknown
13: fv3.exe            0000000000A7942A  Unknown               Unknown  Unknown
13: fv3.exe            000000000097B950  Unknown               Unknown  Unknown
13: fv3.exe            0000000000D571B1  Unknown               Unknown  Unknown
13: fv3.exe            000000000091E450  Unknown               Unknown  Unknown
13: fv3.exe            0000000000946734  Unknown               Unknown  Unknown
13: fv3.exe            000000000095058B  Unknown               Unknown  Unknown
13: fv3.exe            0000000000A77FE4  Unknown               Unknown  Unknown
13: fv3.exe            0000000000A7C0AF  Unknown               Unknown  Unknown
13: fv3.exe            0000000000BD9FDA  Unknown               Unknown  Unknown
13: fv3.exe            00000000008EA0E2  Unknown               Unknown  Unknown
13: fv3.exe            0000000000A7942A  Unknown               Unknown  Unknown
13: fv3.exe            000000000097B950  Unknown               Unknown  Unknown
13: fv3.exe            0000000000D571B1  Unknown               Unknown  Unknown
13: fv3.exe            000000000042C5A7  MAIN__                    381  UFS.F90
13: fv3.exe            0000000000429392  Unknown               Unknown  Unknown
13: libc-2.17.so       00002BA28ADD8495  __libc_start_main     Unknown  Unknown
13: fv3.exe            00000000004292A9  Unknown               Unknown  Unknown

@uturuncoglu
Collaborator Author

@josephzhang8 @platipodium BTW, this happens during initialization, before anything else runs, and it is called from InitializeAdvertise.

@uturuncoglu
Collaborator Author

@josephzhang8 @platipodium Okay. I printed out the values of diffmin and dfv: diffmin is NaN and dfv is zero, as expected. I think this is a bug at the model level, since nothing in the configuration file indicates that diffmin must be set if itur = 0. In this test case, we have the following:

  itur = 0
  dfv0 = 0 !needed if itur=0
  dfh0 = 1.e-4 !needed if itur=0

Anyway, let me know if you agree about the issue and the bug, and what the solution should be. I'll try giving diffmin an initial value of 0, and I think the same is also required for diffmax.
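To illustrate the workaround proposed above, here is a minimal sketch (in Python rather than the model's Fortran, and not taken from SCHISM). It assumes the limits are simply absent from the configuration, gives them a defined default, and checks that every value is finite before use. The helper names `load_mixing_limits` and `check_finite` and the `config` dictionary are hypothetical.

```python
import math

# Hypothetical stand-in for the values read from the configuration file.
# diffmin and diffmax are deliberately missing, as in the failing test case.
config = {"itur": 0, "dfv0": 0.0, "dfh0": 1.0e-4}


def load_mixing_limits(cfg, default=0.0):
    """Return (diffmin, diffmax), falling back to a defined default value."""
    diffmin = cfg.get("diffmin", default)
    diffmax = cfg.get("diffmax", default)
    return diffmin, diffmax


def check_finite(**values):
    """Raise early, naming the variable, if any value is NaN or infinite."""
    bad = [name for name, value in values.items() if not math.isfinite(value)]
    if bad:
        raise ValueError("non-finite values: " + ", ".join(sorted(bad)))


diffmin, diffmax = load_mixing_limits(config)
check_finite(diffmin=diffmin, diffmax=diffmax, dfv0=config["dfv0"])
```

With a debug build that traps invalid operations, an uninitialized value only surfaces at the first arithmetic use, so a named check like this can make the source easier to find.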

@josephzhang8

Thx @uturuncoglu. I'll fix this bug. diffm[in,ax] are not used with itur=0, but it's best to init them.

@uturuncoglu
Collaborator Author

uturuncoglu commented Mar 26, 2024

@josephzhang8 Thanks. If you want, I could test your fix branch on my end to be sure that it fixes the issue. After passing this point, there might be other issues that you want to fix. Anyway, it is your call.

@uturuncoglu
Collaborator Author

@josephzhang8 BTW, I'll also test the code with GNU. Maybe the underlying issue is the same as in #3. I am also not sure why the atm2sch test is working without any issue. Maybe that one uses different options for itur. I'll check it. Anyway, having different tests and running them regularly in DEBUG mode and with different compilers will let us catch these issues in advance.

@josephzhang8

Yes we test different compilers regularly to catch potential issues. The fixes are needed no matter what, to make SCHISM robust on all platforms. Thx for working with us!

@josephzhang8

@uturuncoglu: I've fixed the bug in master version. Can u plz pull? Thx

@uturuncoglu
Collaborator Author

@josephzhang8 Okay. Thanks. I'll try to run the case with your fix and update you.

@uturuncoglu
Collaborator Author

@josephzhang8 Okay. The code passed that point but is now giving an error like the following:

10: [Orion-06-19:73244:0:73244] Caught signal 8 (Floating point exception: floating-point invalid operation)
10: ==== backtrace (tid:  73244) ====
10:  0 0x0000000004169c3d compute_wave_force_lon_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM/src/Hydro/misc_subs.F90:6194
10:  1 0x000000000393dda3 schism_esmf_util_mp_schism_stateimportwavetensor_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM-ESMF/src/schism/schism_esmf_util.F90:2592
10:  2 0x00000000038e6c0e schism_nuopc_cap_mp_schism_import_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM-ESMF/src/schism/schism_nuopc_cap.F90:1005
10:  3 0x00000000038e1483 schism_nuopc_cap_mp_modeladvance_()  /work/noaa/nems/tufuk/COASTAL/ufs-coastal_dev/SCHISM-interface/SCHISM-ESMF/src/schism/schism_nuopc_cap.F90:766

Here is the line: https://github.com/schism-dev/schism/blob/84866bf95d779a43056db4c8885908bd675010b3/src/Hydro/misc_subs.F90#L6194. I did not debug further, but it seems that RSXX0 is also NaN. I'll check it; I need to track down the source of it.

@uturuncoglu
Collaborator Author

@josephzhang8 this might be a bug on the cap side. I think that the variable assignments used in the following call are wrong:

  call compute_wave_force_lon(eastward_wave_radiation_stress, &
    eastward_northward_wave_radiation_stress,northward_wave_radiation_stress)

I think this needs to use a SCHISM_StateUpdate call rather than direct assignment, since we are not using element-based fields. Anyway, I'll try to fix it and let you know.

@uturuncoglu
Collaborator Author

uturuncoglu commented Mar 27, 2024

@josephzhang8 @platipodium I was looking into the issue related to compute_wave_force_lon. It seems that it is an issue with the array sizes. Why are the stress-related fields (like northward_wave_radiation_stress) defined with a size of isPtr%numOwnedNodes (or np), while other forcing variables like pr2 are defined with a size of npa? Is there any underlying reason for it? It is hard to follow that piece of code. At the end of the day, this information is used to set the wwave_force variable, which has a size of (2,nvrt,nsa). What is the relationship between np, nsa and npa? It seems that the hgrad_nodes call basically handles moving the information from RSXX to DSXX3D, but that also uses a size of npa. Anyway, maybe I am confused, but it seems the sizes of the arrays are not consistent. Let me know what you think.

@josephzhang8
Copy link

@uturuncoglu: npa=np+npg (resident + ghost). I thought ESMF does not handle the ghost zone, so the input arrays RSXX0 have a dim of np, and then RSXX etc. are used in the exchange to get the ghosts.

ns, nsa are # of edges of elements. Wave forces are defined at edges (side centers) due to gradient operator. The arrays in hgrad_nodes() have to be dim of npa (augmented).
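The size relationship described above (npa = np + npg) can be sketched as follows. This is an illustrative Python stand-in, not SCHISM or ESMF code: `to_augmented` and `exchange_ghosts` are hypothetical helpers, and the list passed as `ghost_sources` stands in for whatever a real halo exchange would deliver from neighboring ranks.

```python
def to_augmented(resident_values, n_ghost, fill=0.0):
    """Copy a resident-only array (length np) into an augmented array
    of length npa = np + npg, with ghost slots set to a defined value."""
    return list(resident_values) + [fill] * n_ghost


def exchange_ghosts(augmented, n_resident, ghost_sources):
    """Fill the ghost slots in place. ghost_sources[i] is the value for
    ghost i, as a real exchange would receive it from the owning rank."""
    if len(augmented) - n_resident != len(ghost_sources):
        raise ValueError("ghost count does not match the augmented array")
    for i, value in enumerate(ghost_sources):
        augmented[n_resident + i] = value
    return augmented


n_resident, n_ghost = 4, 2          # np and npg in the thread's notation
rsxx0 = [1.0, 2.0, 3.0, 4.0]        # resident-only input, length np
rsxx = to_augmented(rsxx0, n_ghost)  # augmented array, length npa
exchange_ghosts(rsxx, n_resident, [5.0, 6.0])
```

The point of the sketch is only that a routine expecting length npa cannot be handed a length-np array directly; the ghost part has to be allocated and filled first.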

@uturuncoglu
Collaborator Author

@josephzhang8 Yes, we don't have ghost elements anymore. So, we need to adjust the call based on this reality (we might have two versions of it, one for the node-based mesh and another for the element-based one, until we switch to element-based completely). Anyway, it would be hard for me to understand the logic over there. So, is there anybody on your side who could look closely at that part of the code? If not, I could try to look at it, but I am not sure when I would have the fix since I also need to spend time on other projects.

@josephzhang8

@uturuncoglu : that routine is basically the same as the (well tested) routine used by the internal wave module (WWM), so I think the only potential errors may be from the interface.

@uturuncoglu
Collaborator Author

@josephzhang8 Okay. So, you think that we don't need to change anything in the routine but maybe add some logic to fill the arrays in a correct way before passing to it. Right?

@josephzhang8

Hold on... I think I may have found the bug....

@josephzhang8

@uturuncoglu I just pushed a new master; can u plz check? Thx!
The bug: hgrad_nodes() expects 3D variables for radiation stress components.

@uturuncoglu
Collaborator Author

@josephzhang8 Okay. Let me test. This might still need a fix at the upper level to get the required data from the import state.

@josephzhang8

Yes that's the part I'm not sure about. I noticed last time u added some allocatable, target arrays in schism_glbl.F90

@uturuncoglu
Collaborator Author

@josephzhang8 I am getting an FPE from sum1=sum(RSXX0). This is probably because the RSXX0 field is not filled correctly and includes some NaN values (it is not initialized after allocation). I'll try to fix it on the cap side.
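The failure mode described above can be sketched in a few lines of Python (an illustration only, not the cap code). NaN is used here to simulate memory that was never initialized, and `allocate_field`, `fill_from_import` and `count_nan` are hypothetical helpers. Note that Python's `sum` silently propagates NaN instead of trapping, unlike a debug Fortran build with floating point exceptions enabled.

```python
import math


def allocate_field(size, fill=0.0):
    """Allocate a field with every entry set to a defined value."""
    return [fill] * size


def fill_from_import(field, imported):
    """Copy imported values into the front of the field. Entries that the
    import does not cover keep whatever value they had at allocation."""
    for i, value in enumerate(imported):
        field[i] = value
    return field


def count_nan(field):
    """Count NaN entries, as a cheap check before any reduction."""
    return sum(1 for value in field if math.isnan(value))


imported = [1.0, 2.0, 3.0]  # fewer values than the field holds

# Simulated uninitialized allocation: untouched entries stay NaN.
uninitialized = fill_from_import(allocate_field(5, fill=float("nan")), imported)

# Zero-initialized allocation: untouched entries stay 0.0.
initialized = fill_from_import(allocate_field(5), imported)
```

Here `count_nan(uninitialized)` reports the two entries the import never touched, while the zero-initialized field sums cleanly.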

@uturuncoglu
Collaborator Author

@josephzhang8 I think I fixed the issue. The wave-related fields look fine now. I'll do more tests and then create a PR in the SCHISM-ESMF repository. Just for your information, here are the changes: https://github.com/oceanmodeling/schism-esmf/pull/new/hotfix/wave_stress

@uturuncoglu
Collaborator Author

@josephzhang8 I ran the case and plotted rsxx, rsyy and rsxy for the last time step with my simple NCL script. It looks like rsxx and rsyy are consistent with the currents, but rsxy looks a little bit off to me. Since you are more experienced than me, I wonder if you have any idea. BTW, this is an idealized case that @pvelissariou1 created before, so the results could be weird.

[Figure: rsxx (plot_sch_rsxx)]

[Figure: rsyy (plot_sch_rsyy)]

[Figure: rsxy (plot_sch_rsxy)]

@uturuncoglu
Collaborator Author

@josephzhang8 So, we might have an issue in outputting rsxy or in calculating it. I'll check the import state for the wave field to see whether the same structure shows up there.

@josephzhang8

I don't see scale bar; maybe SXY is very small?

@uturuncoglu
Collaborator Author

@josephzhang8 JFYI, I checked the import state for rsxy and it looks fine there. So, I am not sure, but there could be an issue on the model side when it is trying to write it.
[Image: Screenshot 2024-03-28 at 2 38 36 PM]

@uturuncoglu
Collaborator Author

@josephzhang8 let me use the same scale in the preview and in NCL to double check.

@uturuncoglu
Collaborator Author

@josephzhang8 Okay. I think it is hard to make the scales the same, since SCHISM seems to be applying some unit conversion. The data coming from the wave model has a range of -9.07627 to 18.6689, but in the output the range is -8.24308e-07 to 7.92218e-07. Anyway, it seems that I was also plotting a different time step, but after fixing the plot range the NCL plot looks reasonable.

[Figure: rsxy (plot_sch_rsxy)]

Anyway, I think this is fine and the fix is working as expected. I'll also check the GNU issue to see if this fix handles that case too.

@josephzhang8

Great to know; thx @uturuncoglu!
I'm almost ready to work on the 3D (vortex) coupling (just found out the array names we need from WW3 this afternoon).

@pvelissariou1

@josephzhang8 Joseph, could you please update the document (if needed)? ww3-exports-ocn-3Dwave-terms

@uturuncoglu
Collaborator Author

@josephzhang8 It is also working with GNU. I will do one last test with GNU in DEBUG mode. If that also passes, I'll create the PR, and maybe we could close two issues at the same time.

@josephzhang8

gr8!

@josephzhang8

@pvelissariou1 : I just did that

@josephzhang8

@uturuncoglu : do u want me to review/merge the PR now? Thx

@josephzhang8

scratch that... I see the request from u now

@uturuncoglu
Collaborator Author

@josephzhang8 JFYI, I replied on the PR side. Once this is in, I am planning to define two more tests on the UFS Coastal side, one for GNU and one for DEBUG mode, to cover different cases. That way we would be sure that we are fine with those options in the future.

@pvelissariou1

@josephzhang8 Thank you Joseph. @uturuncoglu After the PR is merged, I guess you will update ufs-coastal as well. I am planning to check all SCHISM related tests.

@uturuncoglu
Collaborator Author

@pvelissariou1 You could test by checking out master in SCHISM and the https://github.com/oceanmodeling/schism-esmf/tree/hotfix/wave_stress branch on the SCHISM-ESMF side. If we have an issue, it would be better to know about it before the merge. @josephzhang8 maybe we could wait for Takis to perform an initial test.

@pvelissariou1

@uturuncoglu , @josephzhang8 Thank you very much both. Ufuk I will check SCHISM as you suggested and we will talk about this on Monday. Hopefully all will be fine.

@janahaddad
Collaborator

@mansurjisan to check

Projects
Status: Backlog
Development

No branches or pull requests

4 participants