
WCOSS2: pio install does not seem to support pnetcdf #2232

Closed
BrianCurtis-NOAA opened this issue Apr 11, 2024 · 97 comments · Fixed by #2302
Labels
bug Something isn't working

Comments

@BrianCurtis-NOAA
Collaborator

BrianCurtis-NOAA commented Apr 11, 2024

Description

PR #2145 brought in a change where CICE switched to use pnetcdf in PIO instead of hdf5. This worked on all machines except WCOSS2.

This leads us to believe that the PIO install on WCOSS2 was not built with proper pnetcdf support.

Efforts are ongoing to determine the specifics of any build differences between spack-stack and the hpc-stack install on WCOSS2.
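
For anyone triaging this, a minimal sketch of how one might confirm whether an installed stack exposes pnetcdf support; the module versions, the PIO_ROOT variable, and the pio_meta.h header path are assumptions for illustration, not the confirmed WCOSS2 layout:

# Sketch only: check pnetcdf support in the installed netcdf-c and PIO.
module load pnetcdf/1.12.2 netcdf/4.9.2 pio/2.5.10

# netcdf-c's nc-config reports whether it was built against pnetcdf.
nc-config --has-pnetcdf

# PIO records its build options in a generated header (name/path assumed);
# look for a PIO_USE_PNETCDF-style define being enabled.
grep -i "pnetcdf" "${PIO_ROOT:-/path/to/pio}/include/pio_meta.h"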

To Reproduce:

Run cpld_control_gfsv17 intel RT with develop branch of UFSWM (From PR #2145 ) on WCOSS2 dev machine

To be addressed alongside resolving this issue:

  1. Remove temporary workaround in default_vars.sh for WCOSS2
BrianCurtis-NOAA added the bug label Apr 11, 2024
@junwang-noaa
Collaborator

@HangLei-NOAA would you please check the library on wcoss2 and install a test version of netcdf with pio on acorn for us to test? Thanks

@BrianCurtis-NOAA
Collaborator Author

Apologies for not posting here sooner; Hang has made an install at: /lfs/h2/emc/eib/save/Hang.Lei/forgdit/nco_wcoss2/install

@DeniseWorthen
Collaborator

@BrianCurtis-NOAA Are you testing Hang's install?

@BrianCurtis-NOAA
Collaborator Author

Yes. I will do more today once I get today's PR started.

@BrianCurtis-NOAA
Collaborator Author

using the updated libraries:
/lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_63569/

MPICH ERROR [Rank 170] [job id 0882c1cf-0967-46e2-88a1-63f55d8cd95f] [Wed Apr 17 17:48:46 2024] [nid001019] - Abort(128) (rank 170 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 170
 (abort_ice)ABORTED:
 (abort_ice) called from ice_pio.F90
 (abort_ice) line number          223
 (abort_ice) error =
 (ice_pio_check)Unknown Error, (ice_pio_init) ERROR: Failed to create file ./history/iceh_ic.2021-03-22-21600.nc

@Hang-Lei-NOAA

@BrianCurtis-NOAA Please let me know if a specific version of UFS is being used for the testing. I just finished the GSI library task; I will start the UFS test.
If I still get the failure, I will install a new pnetcdf library.
Currently, we are using the system-installed pnetcdf library.

@BrianCurtis-NOAA
Collaborator Author

@BrianCurtis-NOAA Please let me know if a specific version of UFS is being used for the testing. I just finished the GSI library task; I will start the UFS test. If I still get the failure, I will install a new pnetcdf library. Currently, we are using the system-installed pnetcdf library.

@Hang-Lei-NOAA the develop branch of ufs-weather-model has the issue, use:

./rt.sh -a <ACCNR> -n "cpld_control_gfsv17 intel"

but first remove (or comment out):

if [[ ${MACHINE_ID} == wcoss2 ]]; then
export CICE_RESTART_FORMAT='hdf5'
export CICE_HISTORY_FORMAT='hdf5'
fi
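
(Removing that block lets the test fall back to the pnetcdf-based defaults. The exact default values live in default_vars.sh; illustratively, something along the lines of:)

# Illustrative only; check default_vars.sh for the actual defaults.
export CICE_RESTART_FORMAT='pnetcdf2'
export CICE_HISTORY_FORMAT='pnetcdf2'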

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Apr 19, 2024

@BrianCurtis-NOAA
I have checked, and fixed this by rebuilding netcdf/4.9.2 with pnetcdf.
/lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/tests/logs/log_wcoss2/run_cpld_control_gfsv17_intel.log

Please use:
/lfs/h2/emc/eib/noscrub/Hang.Lei/works/brianufs/modulefiles/ufs_wcoss2.intel.lua

Please copy it soon; I will do more sensitivity tests on the UFS with the system-installed libs this afternoon after 3pm. Thanks.

@BrianCurtis-NOAA
Collaborator Author

@Hang-Lei-NOAA I can confirm that your lua file works for that test. Please proceed with getting these adjustments made on WCOSS2 Dev.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Apr 22, 2024

@junwang-noaa @BrianCurtis-NOAA
Bongi has set up an installation on the Acorn system; please load the intel environment. You will see:
For now, here is what is deployed to production on Acorn:

$ module -t avail 2>&1 | grep -- "-C/."
esmf-C/8.6.0
fms-C/2023.04
hdf5-C/1.14.0
mapl-C/2.40.3
netcdf-C/4.9.2
pio-C/2.5.10
pnetcdf-C/1.12.2

Please test them and let me know if you find any issues. Thanks

@BrianCurtis-NOAA
Collaborator Author

I've been able to load those modules and build/compile a test case OK. I am running the full suite now on Acorn using WCOSS2 setup. I will pass along the results as soon as I can.

@BrianCurtis-NOAA
Collaborator Author

@Hang-Lei-NOAA Bongi needed Acorn for other things today, so I was only able to run a subset of tests with the -C libraries, but they included tests for cpld, control, regional, 2threads, mpi, restarts, p8, gfsv17, decomp (the problem case from before), all with success (PASS). I'm comfortable saying the -C libraries are OK to use on WCOSS2.

brian.curtis@alogin03:/lfs/h1/emc/nems/noscrub/brian.curtis/git/ufs-community/ufs-weather-model/tests/logs/log_acorn> grep -ril PASS ./rt*.log
./rt_control_2threads_p8_intel.log
./rt_control_c192_intel.log
./rt_control_c384gdas_intel.log
./rt_control_c384_intel.log
./rt_control_c48_intel.log
./rt_control_c48.v2.sfc_intel.log
./rt_control_CubedSphereGrid_intel.log
./rt_control_CubedSphereGrid_parallel_intel.log
./rt_control_decomp_p8_intel.log
./rt_control_flake_intel.log
./rt_control_iovr4_intel.log
./rt_control_iovr5_intel.log
./rt_control_latlon_intel.log
./rt_control_lndp_intel.log
./rt_control_noqr_p8_intel.log
./rt_control_p8_lndp_intel.log
./rt_control_p8_mynn_intel.log
./rt_control_p8_rrtmgp_intel.log
./rt_control_p8_ugwpv1_intel.log
./rt_control_p8.v2.sfc_intel.log
./rt_control_stochy_intel.log
./rt_control_stochy_restart_intel.log
./rt_control_wrtGauss_netcdf_parallel_intel.log
./rt_cpld_2threads_p8_intel.log
./rt_cpld_control_c48_intel.log
./rt_cpld_control_ciceC_p8_intel.log
./rt_cpld_control_gfsv17_intel.log
./rt_cpld_control_noaero_p8_agrid_intel.log
./rt_cpld_control_noaero_p8_intel.log
./rt_cpld_control_nowave_noaero_p8_intel.log
./rt_cpld_control_p8_faster_intel.log
./rt_cpld_control_p8_intel.log
./rt_cpld_control_p8_mixedmode_intel.log
./rt_cpld_control_p8.v2.sfc_intel.log
./rt_cpld_control_pdlib_p8_intel.log
./rt_cpld_control_qr_p8_intel.log
./rt_cpld_debug_gfsv17_intel.log
./rt_cpld_debug_pdlib_p8_intel.log
./rt_cpld_decomp_p8_intel.log
./rt_cpld_mpi_gfsv17_intel.log
./rt_cpld_mpi_p8_intel.log
./rt_cpld_mpi_pdlib_p8_intel.log
./rt_cpld_restart_gfsv17_intel.log
./rt_cpld_restart_p8_intel.log
./rt_cpld_restart_pdlib_p8_intel.log
./rt_cpld_restart_qr_p8_intel.log
./rt_cpld_s2sa_p8_intel.log
./rt_merra2_thompson_intel.log
./rt_regional_2threads_intel.log
./rt_regional_control_intel.log
./rt_regional_decomp_intel.log
./rt_regional_spp_sppt_shum_skeb_intel.log

@Hang-Lei-NOAA

Okay, let's push forward.
I tested the special case and one aerosol case last night. They are fine.
We will do full testing once it is temporarily set up on wcoss2.

@JessicaMeixner-NOAA
Collaborator

@Hang-Lei-NOAA Do you have an estimate of when this might be resolved? I'm asking in the context of efforts to update the global-workflow (NOAA-EMC/global-workflow#2505) and trying to figure out the fastest path to updating the model there. Currently the workflow cannot update, because HDF5 usage with CICE means you cannot use linked files; I confirmed the same behavior with hdf5 on Hera as well. While there are plans to move away from linked files in the global-workflow, that will take some time, so I'm curious whether this will be available relatively soon.

@Hang-Lei-NOAA

@JessicaMeixner-NOAA We have modified netcdf, pio, and ESMF, delivered the updates, and are working closely with GDIT. As of my recent check, they said it will be ready on WCOSS2 Cactus for testing this Thursday. It has been very fast. These updates are already available on Acorn; you can start testing on Acorn.

@JessicaMeixner-NOAA
Collaborator

Thanks for the information @Hang-Lei-NOAA

@Hang-Lei-NOAA

@BrianCurtis-NOAA @junwang-noaa The lib-C series is available on Cactus for testing.
Please test it fully as soon as possible.

@JessicaMeixner-NOAA
Collaborator

@Hang-Lei-NOAA apologies if I missed this information elsewhere, but can you share where exactly this new module file is on Cactus for testing?

@BrianCurtis-NOAA
Collaborator Author

I have a modulefile I'm testing; I'll pass it along if all goes well.

@DeniseWorthen
Collaborator

@BrianCurtis-NOAA I think it would be worthwhile to confirm that the G-W, using linked files, is functional. I presume that is the testing @JessicaMeixner-NOAA could do in parallel with yours.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented May 2, 2024 via email

@BrianCurtis-NOAA
Collaborator Author

Here's what I am using for testing.

/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/modulefiles/ufs_wcoss2.intel.lua

@JessicaMeixner-NOAA
Collaborator

Thanks @BrianCurtis-NOAA @DeniseWorthen and @Hang-Lei-NOAA.

I will test in the g-w this afternoon using the modules from @BrianCurtis-NOAA and will report back how this goes.

@BrianCurtis-NOAA
Collaborator Author

This is what I've got from my UFSWM testing; it also includes FMS 2023.04 and ESMF 8.6.0, with MAPL built against that.

in compile_atml_debug_intel:

/lfs/h2/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/c-libs/FV3/ccpp/physics/physics/Interstitials/UFS_SCM_NEPTUNE/gcycle.F90(236): error #8284: If the actual argument is scalar, the dummy argument shall be scalar unless the actual argument is of type character or is an element of an array that is not assumed shape, pointer, or polymorphic.   [SIG1T]
      CALL SFCCYCLE (9998, npts, max(lsoil,lsoil_lsm), sig1t, fhcyc, &
-----------^

in cpld_control_gfsv17_iau_intel:

Comparing history/iceh_06h.2021-03-23-43200.nc .....USING NCCMP......NOT IDENTICAL

cpld_restart_pdlib_p8 intel (finished but interrupted?)
control_p8_atmlnd_sbs intel ((wallclock) failed to complete run)
control_p8_atmlnd intel ((wallclock) failed to complete run)
control_restart_p8_atmlnd intel (compare test failed, not run)
control_p8_atmlnd_debug intel (compile failure, not run)

@DeniseWorthen I believe the iceh file not reproducing is expected because it switched to pnetcdf this time, correct?

I've seen the (finished but interrupted) issue before, but it's intermittent and not easily reproduced; rerunning is usually successful.

@junwang-noaa should the p8 atmlnd (& sbs) tests be running out of wallclock? It almost seems like they hung somewhere and hit the wallclock limit, rather than just not being able to complete in time. I recall a hang issue we've seen before, but I'm unsure whether it is even remotely related.

@uturuncoglu
Collaborator

@BrianCurtis-NOAA Just let me know if you need anything from my side. We were not having issues with wall-clock time for the land tests in the past, right? So I am not sure why they have issues now. Is this on a particular platform? I could also play with those tests and reduce the simulation length or I/O if we need to.

@DeniseWorthen
Collaborator

@BrianCurtis-NOAA I would not be surprised if the history file was different. Does the nccmp log give any information about the difference? If you want to place the baseline file and the new run file on hera so I can use cprnc, I can do that.

@DeniseWorthen
Collaborator

DeniseWorthen commented May 2, 2024

I can see in the global attributes that the file was created w/ io_pio2 pnetcdf2 vs the previous io_pio2 hdf5, so it is now able to use pnetcdf2. But cprnc and nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format show no differences in the two files you provided, so I'm not sure why the comparison failed in the RT.
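
For reference, the comparison runs roughly like this (file names are placeholders, not the actual RT paths):

# Placeholders for the baseline file and the new run file.
nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format baseline_iceh.nc run_iceh.nc

# The writing backend shows up in the global attributes (pnetcdf2 vs hdf5).
ncdump -h run_iceh.nc | grep -iE "pnetcdf|hdf5"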

@BrianCurtis-NOAA
Collaborator Author

@Hang-Lei-NOAA in test: datm_cdeps_lnd_gswp3_intel
/lfs/h2/emc/ptmp/brian.curtis/FV3_RT/rt_257182/datm_cdeps_lnd_gswp3_intel/PET150.ESMF_LogFile
/lfs/h1/emc/nems/noscrub/brian.curtis/git/ufs-community/ufs-weather-model/modulefiles/ufs_acorn.intel.lua.clibs

20240514 123445.985 INFO             PET150 (lnd_comp_domain):(lnd_set_decomp_and_domain_from_mosaic) : begl, endl, im =     1  384  384
20240514 123445.985 INFO             PET150 (lnd_comp_io): (read_tiled_file)  called for INPUT/oro_data.tile*.nc
20240514 123445.985 INFO             PET150 (lnd_comp_io): (read_tiled_file) adding land_frac to FB
20240514 123445.992 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.992 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile2.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.993 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile3.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.994 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile4.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.995 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile5.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.995 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile6.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240514 123445.996 ERROR            PET150 ESMCI_PIO_Handler.C:617 ESMCI::PIO_Handler::arrayReadOne Unable to read from file  - file not open
20240514 123445.996 ERROR            PET150 ESMCI_IO_Handler.C:405 ESMCI::IO_Handler::arrayRead() Unable to read from file  - Internal subroutine call returned Error
20240514 123445.996 ERROR            PET150 ESMCI_IO.C:382 ESMCI::IO::read() Unable to read from file  - Internal subroutine call returned Error

@Hang-Lei-NOAA

@BrianCurtis-NOAA I have been testing it too.
A different failure, but it failed. I could not say where the problem is.
I am doing some extra tests on my end and asking Bongi to move them to Cactus for testing.

@DusanJovic-NOAA
Collaborator

May I ask what's the reason for switching from hdf5 to pnetcdf? Is it faster, or does it produce smaller files?

@DeniseWorthen
Collaborator

DeniseWorthen commented May 15, 2024

@DusanJovic-NOAA From my CICE I/O tests, pnetcdf was much faster. See https://docs.google.com/spreadsheets/d/1xD0-gvbfI2Nwhf-ys_JdQEHR4Wibb0U5hGyVBBEqNUg/edit#gid=93260697

Here, hdf5 is the namelist setting, which writes through the pio hdf5/netcdf4 interface.

@DusanJovic-NOAA
Collaborator

Do you know whether hdf5 was configured to use compression/shuffling and/or chunking? How big are the output files?

@DeniseWorthen
Collaborator

DeniseWorthen commented May 15, 2024

When I was testing the CICE IO, I did not use any chunking or compression. The namelist allows you to set these, for example

    restart_chunksize   = 0,0
    restart_deflate     = 0

CICE output files are not that large: ~500 MB for a history file and ~2 GB for a restart at 1/4 deg.
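
If it helps, a quick way to check an existing output file for chunking/compression and its size (the file name is illustrative):

# ncdump -s exposes the special per-variable attributes HDF5 stores.
ncdump -hs iceh_sample.nc | grep -E "_ChunkSizes|_DeflateLevel|_Shuffle"
ls -lh iceh_sample.nc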

@Hang-Lei-NOAA

As recorded in the GDIT helpdesk ticket [Ticket#2024040310000021]:

Bongi has gotten HDF5, pnetcdf, PIO, FMS, ESMF, and MAPL working fine in the cactus testing area, but netcdf is still built with his own settings.
I keep a testing set on cactus at /lfs/h2/emc/eib/noscrub/hang.lei/c-libs/tests
to verify his changes.

@aerorahul
Contributor

@Hang-Lei-NOAA
Is there an update on this issue? This is really backing up a lot of work.

@Hang-Lei-NOAA

Hang-Lei-NOAA commented May 28, 2024 via email

@Hang-Lei-NOAA

Hang-Lei-NOAA commented May 28, 2024 via email

@BrianCurtis-NOAA
Collaborator Author

@Hang-Lei-NOAA is this separate from the C-libs?

@Hang-Lei-NOAA

Hang-Lei-NOAA commented May 28, 2024 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA would you please list your code directory so that we can take a look at the module files and the compile job? Thanks

@Hang-Lei-NOAA

Hang-Lei-NOAA commented May 29, 2024 via email

@DeniseWorthen
Collaborator

DeniseWorthen commented May 29, 2024

I just tried running the control_p8_atmlnd_debug_intel using Hang's module file (.bak version) and it completed w/ no other changes.

/lfs/h2/emc/ptmp/denise.worthen/FV3_RT/rt_46787/control_p8_atmlnd_debug_intel

@junwang-noaa
Collaborator

Thanks, Denise. @Hang-Lei-NOAA I want to clarify: ufs_wcoss2.intel.lua is the one with the libraries Bongi installed, which we are expected to use, right?

@Hang-Lei-NOAA

@junwang-noaa It is not; it's my testing one.
To summarize the findings and status so far:
(1) Using my builds always works. Used apart from my build, it does not work, and that is because of a UFS issue. The UFS issue is: fv3 will source ufs_common.lua at runtime; cp ufs_wcoss2.intel.lua ufs_common.lua will address this. I suggest that UFS change this setting in the script.

(2) Bongi's libraries have gradually been brought to match my script. I have resolved the netcdf and pio errors, but ESMF still has a problem. I insist on using my script, and he is getting close; I could not tell him exactly what the UFS error is, but fully matching mine will solve the issue. He updated it yesterday, but it was wrong, and I am waiting for his corrected update. My conversation with Bongi is recorded in the helpdesk ticket.

@DeniseWorthen
Collaborator

Thanks @Hang-Lei-NOAA There is a separate email chain w/ ESMF and PIO developers also on-going. I believe the fact we can run this test on wcoss2 using your modules shows this is not related to the input file type or esmf functionality, which makes sense since wcoss2 is the only platform showing this issue.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA To clarify issue 1): in the current ufs-weather-model develop branch, ufs_common.lua is copied to modulefiles/ under the run directory, but it is not loaded at runtime. In the job_card, we have:

module use $PWD/modulefiles
module load modules.fv3
module load cray-pals
module list

The modules.fv3 is the same as the ufs_wcoss2.intel.lua.
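
One way to double-check what a run directory actually loads at runtime (a sketch; it assumes the standard RT run-directory layout):

# From inside an RT run directory:
module use $PWD/modulefiles
module show modules.fv3    # should mirror ufs_wcoss2.intel.lua
module list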

@Hang-Lei-NOAA

@junwang-noaa If you set ufs_common.lua to load esmf/8.5.0 or 8.4.0, but only use ufs_wcoss2.intel.lua to load the libraries, your run will crash because esmf/8.5.0 is not available.

@Hang-Lei-NOAA

Or, if you just remove ufs_common.lua, the runtime error will say there is no ufs_common.lua to source.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Are you running with the latest develop branch on wcoss2?

@Hang-Lei-NOAA

I run Brian's copy.

@junwang-noaa
Collaborator

@BrianCurtis-NOAA do you see the issue 1) Hang mentioned?

@BrianCurtis-NOAA
Collaborator Author

@BrianCurtis-NOAA do you see the issue 1) Hang mentioned?

No.

@Hang-Lei-NOAA

@junwang-noaa
Bongi updated his installation on cactus but did not set up the modulefiles. I added an independent set; load:
/lfs/h2/emc/eib/noscrub/hang.lei/c-libs/modulefiles/ufs_wcoss2.intel.lua

I did comparison experiments this morning. Although his ESMF build used my settings, it still hits the over-walltime case. All other libs are using his installations. Please refer to the results:
my esmf: /lfs/h2/emc/eib/noscrub/hang.lei/c-libs/tests/logs/log_wcoss2_5302
Bongi's esmf: /lfs/h2/emc/eib/noscrub/hang.lei/c-libs/tests/logs/log_wcoss2_5301

We need to find out the reason. Thanks

@Hang-Lei-NOAA

I have added ldd in front of fv3.exe. The comparison cannot pick out any difference other than the one I found in the ESMF build log.
Case /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_159983/control_p8_atmlnd_intel/ is the one that went over walltime and used Bongi's esmf.
Case /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_73442/control_p8_atmlnd_intel/ is the one that succeeded and used my esmf.

diff /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_159983/control_p8_atmlnd_intel/out /lfs/h2/emc/ptmp/hang.lei/FV3_RT/rt_73442/control_p8_atmlnd_intel/out >zzz.log

The err file shows the library loading; there is no difference in the loaded libraries.

Please check if anything is different.
Thanks,
Hang

@Hang-Lei-NOAA

Here is the ESMF build log file difference:
/lfs/h2/emc/eib/noscrub/hang.lei/c-libs/tests/lei.log
