Implement ESMF-managed threading for all coupled components #824

Closed
rsdunlapiv opened this issue Sep 23, 2021 · 23 comments · Fixed by NOAA-EMC/fv3atm#469, #1018 or NOAA-EMC/WW3#596
Labels
enhancement New feature or request

Comments

@rsdunlapiv

rsdunlapiv commented Sep 23, 2021

Description

To achieve optimal performance of coupled UFS applications, the number of threads needs to be tuned separately for each component.

Solution

ESMF recently introduced flexible threading options that allow each component model to set its own threading level independently. This was discussed at a recent UFS/CMEPS call (see slides).
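As a rough illustration of what per-component tuning means for a layout, the sketch below computes how many MPI tasks a component ends up with when it is given a fixed number of cores and a thread count. The function name and the example numbers are hypothetical, and the rule used here (tasks = cores / threads, with each task owning exactly that many cores) is an assumption for illustration, not something stated in this issue.

```shell
# Hypothetical helper: given the cores assigned to one component and its
# thread count, print the resulting number of MPI tasks.
# Assumption: every task owns exactly `threads` cores.
tasks_for_component() {
    cores=$1
    threads=$2
    # Reject a non-positive thread count or a core count that does not divide evenly.
    if [ "$threads" -le 0 ] || [ $((cores % threads)) -ne 0 ]; then
        echo "cores ($cores) must be a positive multiple of threads ($threads)" >&2
        return 1
    fi
    echo $((cores / threads))
}
```

For example, `tasks_for_component 312 2` would describe a component that runs 2 threads per task on 312 cores, while another component sharing the same nodes could use a different thread count.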

UFS will first need to be updated to ESMF 8.2.0 beta snapshot 20 (820bs20) or later.

@theurich and @mark-a-potts have started this work on branches.

Alternatives

There are some options for machine-specific threading layouts. However, these are not portable between machines and do not support setting per-component threading levels when the components are running on the same nodes.

Related to

Depends on: NOAA-EMC/HYCOM-src#1

@rsdunlapiv rsdunlapiv added the enhancement New feature or request label Sep 23, 2021
@rsdunlapiv
Author

@junwang-noaa there is a build of ESMF 820bs20 available on Hera for testing.

module use /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/modulefiles
module load mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20

If you cannot use these modules directly, you should be able to override existing modules for testing on Hera by setting this and leaving the rest of the modules unchanged:

export ESMFMKFILE=/scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

Can you please use this to run the RTs against this snapshot of ESMF?

@DusanJovic-NOAA
Collaborator

$ ls -l /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk
ls: cannot access /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk: No such file or directory

@rsdunlapiv
Author

Apologies @DusanJovic-NOAA. Can @mark-a-potts please help?

@DusanJovic-NOAA
Collaborator

DusanJovic-NOAA commented Sep 30, 2021

Something is wrong with the modulefile:

$ module use /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/modulefiles
$ module load mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20

$ module show mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20.lua:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
help([[]])
conflict("mpi/intel/18.0.5.274/impi/2018.0.4/esmf")
prepend_path("PATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/bin")
prepend_path("LD_LIBRARY_PATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib")
prepend_path("DYLD_LIBRARY_PATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib")
prepend_path("LIBRARY_PATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib")
prepend_path("MANPATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/share/man")
prepend_path("CPATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/include")
setenv("ESMF_ROOT","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20")
setenv("ESMF_DIR","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20")
setenv("ESMF_PATH","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20")
setenv("ESMF_BIN","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/bin")
setenv("ESMF_INC","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/include")
setenv("ESMF_INCLUDES","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/include")
setenv("ESMF_LIB","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib")
setenv("ESMF_LIBRARIES","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib")
setenv("ESMF_VERSION","8_2_0_beta_snapshot_20")
setenv("ESMF_MOD","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/mod")
setenv("ESMFMKFILE","/opt/modules/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk")
whatis("Name: mpi/intel/18.0.5.274/impi/2018.0.4/esmf")
whatis("Version: 8_2_0_beta_snapshot_20")
whatis("Category: library")
whatis("Description: ESMF library")

ESMFMKFILE points to "/opt/modules/....." which does not exist.

I should have loaded the hpc-stack modules first, but even then ESMFMKFILE points to a file that does not exist.
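A quick guard of this kind can catch the problem before a build is started. The following is a minimal sketch, assuming a POSIX shell; the function name `check_esmfmkfile` is hypothetical and not part of any UFS script.

```shell
# Hypothetical pre-build check: return 0 if ESMFMKFILE is set and names a
# readable file; otherwise print a diagnostic to stderr and return 1.
check_esmfmkfile() {
    if [ -z "${ESMFMKFILE:-}" ]; then
        echo "ESMFMKFILE is not set" >&2
        return 1
    fi
    if [ ! -r "$ESMFMKFILE" ]; then
        echo "ESMFMKFILE points to a missing or unreadable file: $ESMFMKFILE" >&2
        return 1
    fi
    echo "ESMFMKFILE ok: $ESMFMKFILE"
}
```

Calling `check_esmfmkfile` right after the `module load` lines would report a path under a prefix that does not exist, such as the `/opt/modules/...` one shown above.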

@mark-a-potts
Contributor

Weird. When I do a module show, the /opt/modules prefix is not there. The module file does contain this, though:

```lua
local pkgName = myModuleName()
local pkgVersion = myModuleVersion()
local pkgNameVer = myModuleFullName()

local hierA = hierarchyA(pkgNameVer,2)
local mpiNameVer = hierA[1]
local compNameVer = hierA[2]
local mpiNameVerD = mpiNameVer:gsub("/","-")
local compNameVerD = compNameVer:gsub("/","-")

conflict(pkgName)

local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"

local base = pathJoin(opt,compNameVerD,mpiNameVerD,pkgName,pkgVersion)

prepend_path("PATH", pathJoin(base,"bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(base,"lib"))
prepend_path("DYLD_LIBRARY_PATH", pathJoin(base,"lib"))
prepend_path("LIBRARY_PATH", pathJoin(base,"lib"))
prepend_path("MANPATH", pathJoin(base,"share","man"))
prepend_path("CPATH", pathJoin(base,"include"))

setenv( "ESMF_ROOT", base)
setenv( "ESMF_DIR", base)
```

Do you have HPC_OPT or OPT defined in your environment?
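The fallback chain in the module file (`HPC_OPT`, then `OPT`, then `/opt/modules`) can be reproduced in the shell to see which prefix a given environment would yield. This is a sketch only; `resolve_opt_prefix` is a hypothetical name. It uses `${VAR-default}` without a colon so that, as with Lua's `os.getenv(...) or ...`, only an unset variable falls through to the next choice.

```shell
# Hypothetical mirror of the module file's
#   os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
# ${VAR-default} substitutes the default only when VAR is unset,
# which matches Lua treating an empty string as a valid value.
resolve_opt_prefix() {
    echo "${HPC_OPT-${OPT-/opt/modules}}"
}
```

With neither variable set this prints `/opt/modules`, which is consistent with the paths in the first `module show` output in this thread.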

@DusanJovic-NOAA
Collaborator

In modulefiles/ufs_hera.intel I have:

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack

module load hpc/1.1.0

module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4

module load ufs_common

module use /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/modulefiles
module load mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20

After I load ufs_hera.intel I see:

$ module show mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20
------------------------------------------------------------------------------------------------------
   /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20.lua:
------------------------------------------------------------------------------------------------------

....

setenv("ESMFMKFILE","/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk")
whatis("Name: mpi/intel/18.0.5.274/impi/2018.0.4/esmf")
whatis("Version: 8_2_0_beta_snapshot_20")
whatis("Category: library")
whatis("Description: ESMF library")

@DusanJovic-NOAA
Collaborator

Looks like combining modules from two different stacks does not work.

@mark-a-potts
Contributor

If you only want to use the new version of ESMF from my install, you can unload the esmf module (or comment out the load in ufs_common) and then set ESMFMKFILE to point to my install in /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

I recently got rt.sh to find the right library by doing that and adding a "-DESMFMKFILE=/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk" to the cmake options in rt.conf.

@DusanJovic-NOAA
Collaborator

That file (/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/scratch1-NCEPDEV-da-Mark.Potts-sandbox-hpc-modules-modulefiles/mpi/intel/18.0.5.274/impi/2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk) does not exist. See my comment above.

Was your compilation successful?

@mark-a-potts
Contributor

Sorry, I pasted in the wrong path. Use this instead: /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

I was able to successfully compile and run the cpld_bmark_wave_v16 test with rt.sh, but that was the only one I tried.

@DusanJovic-NOAA
Collaborator

I added:

setenv ESMFMKFILE /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

to modulefiles/ufs_hera.intel, removed the load of the old esmf module from ufs_common, and ran the control test.

In the compile log I see:

-- Found ESMF library: /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/libesmf.a                                                                  

and compilation was successful, but the run failed with these errors:

  2: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: symbol lookup error: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: undefined symbol: __libm_feature_flag
 14: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: symbol lookup error: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: undefined symbol: __libm_feature_flag
 26: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: symbol lookup error: /scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_156621/control/./fv3.exe: undefined symbol: __libm_feature_flag

Obviously something in my environment is different.

@mark-a-potts
Contributor

Hmm. I was able to get the first cpld_control_wave_p7 to run, but then it failed on the restart. You can check out how I have things set up here: /scratch1/NCEPDEV/da/Mark.Potts/sandbox/tmp/ufs-weather-model

What happened to the cpld_bmark_wave_v16 test?

@junwang-noaa
Collaborator

junwang-noaa commented Oct 1, 2021 via email

@DeniseWorthen
Collaborator

@mark-a-potts With today's commit, the coupled regression tests have all been updated to the P7 configuration. The old cpld_bmark_wave_v16 test is now called cpld_bmark_p7.

@rsdunlapiv
Author

@DusanJovic-NOAA are you able to build now using the updated ESMF path from Mark?
/scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

@DusanJovic-NOAA
Collaborator

> @DusanJovic-NOAA are you able to build now using the updated ESMF path from Mark? /scratch1/NCEPDEV/da/Mark.Potts/sandbox/hpc-modules/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk

Yes. The compilation was successful, but the model crashed; see the runtime error above.

@rsdunlapiv
Author

@DusanJovic-NOAA that looks like a linking error, so it is probably still related to the build itself. We need a clear approach for updating the version of ESMF for testing, given an existing HPC stack. I don't understand why @mark-a-potts was able to get the test to run but Dusan could not.
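One way to compare the two environments is to check which shared libraries the executable resolves at run time. The sketch below filters `ldd`-style output for a given library name; `resolved_lib` is a hypothetical helper, and the guess that `__libm_feature_flag` comes from the Intel math runtime (`libimf.so`) is an assumption, not something established in this thread.

```shell
# Hypothetical helper: read `ldd`-style output on stdin and print the path
# that the named library resolves to, or "not found" if ldd could not
# locate it. Lines are expected in the form "name => path (address)".
resolved_lib() {
    awk -v name="$1" '
        index($1, name) == 1 && $2 == "=>" {
            if ($3 == "not") print "not found"; else print $3
        }'
}
```

Running `ldd ./fv3.exe | resolved_lib libimf.so` in both the working and the failing environment would show whether the two resolve to different compiler runtimes.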

@junwang-noaa
Collaborator

junwang-noaa commented Oct 6, 2021 via email

@rsdunlapiv
Author

A small update is needed in HYCOM before updating to ESMF8bs20.
NOAA-EMC/HYCOM-src#1

@rsdunlapiv
Author

@junwang-noaa @DusanJovic-NOAA
The following RTs are passing with ESMF 8.2.0bs20.

[Rocky.Dunlap@hfe02 log_hera.intel.esmf8bs20]$ ls -la rt_00*
-rw-r--r-- 1 Rocky.Dunlap stmp 2628 Oct 18 22:34 rt_001_cpld_control_p7.log
-rw-r--r-- 1 Rocky.Dunlap stmp 2450 Oct 18 22:53 rt_002_cpld_bmark_p7.log
-rw-r--r-- 1 Rocky.Dunlap stmp  453 Oct 18 23:10 rt_003_hafs_regional_atm.log
-rw-r--r-- 1 Rocky.Dunlap stmp  550 Oct 18 23:21 rt_004_hafs_regional_atm_ocn.log
-rw-r--r-- 1 Rocky.Dunlap stmp 2524 Oct 18 23:37 rt_005_control_atm_aerosols.log

See logs here: /scratch2/NCEPDEV/stmp1/Rocky.Dunlap/ufs-weather-model/tests/log_hera.intel.esmf8bs20

This requires merging this HYCOM PR first:
NOAA-EMC/HYCOM-src#1

Please let me know what else is needed to update ufs-weather-model to ESMF 8.2.0bs20. For further testing, you can use this build of ESMF on Hera:

setenv ESMFMKFILE /scratch2/NCEPDEV/stmp1/Rocky.Dunlap/esmftest/ESMF-INSTALL/intel-18.0.5.274/impi-2018.0.4/esmf/8_2_0_beta_snapshot_20/lib/esmf.mk
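To confirm which ESMF build a given `esmf.mk` describes before running the RTs, the version can be read from the file itself. This is a minimal sketch; `esmf_version_from_mk` is a hypothetical name, and it assumes the file carries an `ESMF_VERSION_STRING=` line.

```shell
# Hypothetical helper: print the value of ESMF_VERSION_STRING from the
# esmf.mk file given as the first argument.
# Assumption: the file contains a line of the form ESMF_VERSION_STRING=<value>.
# The "=" right after the name keeps ESMF_VERSION_STRING_GIT from matching.
esmf_version_from_mk() {
    sed -n 's/^ESMF_VERSION_STRING=//p' "$1"
}
```

For example, `esmf_version_from_mk "$ESMFMKFILE"` would print the version string of whichever install the environment currently points at.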

@junwang-noaa
Collaborator

@rsdunlapiv Thanks for the testing. Is the HYCOM change backward compatible? Do we need to commit the HYCOM PR along with the ESMF update?

@rsdunlapiv
Author

rsdunlapiv commented Oct 19, 2021 via email

@rsdunlapiv
Author

We decided to go ahead and update to the official release, ESMF_8_2_0, instead of the 820bs20 snapshot.

epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023
* add paths to recent MET/METplus installations on Gaea

* change data staging directory to ncep_shared
5 participants