Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Integrate 'online-cmaq' into 'develop' branch: Part 2, build script and module files #549

Merged
merged 16 commits into from
Feb 10, 2023

Conversation

chan-hoo
Copy link
Collaborator

@chan-hoo chan-hoo commented Jan 19, 2023

DESCRIPTION OF CHANGES:

Currently, the Online-CMAQ for air quality modeling is separately managed in the 'online-cmaq' branch of the authoritative UFS SRW App repository. The main target of the current 'online-cmaq' branch is the implementation of AQM ver.7 and delivery to NCO in late March. Therefore, it has been developed and tested on WCOSS2 and Hera. The Online-CMAQ checks out the same ufs weather model as the 'develop' branch. However, since it couples another component 'AQM', the compile process is different from that of the 'develop' branch.

This 'online-cmaq' branch has beeen integrating into the 'develop' branch with PR #536, this PR, and the following PR. For the convenience of the reviewers, we split the 'online-cmaq' branch into three parts as follows:

Part 1: j-job and ex- scripts (PR #536)
Part 2: build scripts and module files (this PR)
Part 3: workflow set-up, generation, run scripts

For Online-CMAQ, a separate Externals.cfg file named Externals_AQM.cfg is introduced and it is automatically read by running devbuild.sh with the argument -a=ATMAQ. This is because Online-CMAQ may use a different hash of the ufs weather model unliess its hash in 'Externals.cfg' is always up to date.

  • To compile Online-CMAQ:
git clone -b feature/aqm_part2 https://github.com/chan-hoo/ufs-srweather-app
./manage_externals/checkout_externals
./devbuild.sh -p=wcoss2 -a=ATMAQ
  • Note) In compiling the app for Online-CMAQ, the 'coupled' ufs weather model failed with the current integrated cmake build system. Therefore, the 'coupled' ufs weather model is separately compiled first and then the other external components are compiled together only for Online-CMAQ.

As described above, in case of different hashes between 'Externals.cfg' and 'Externals_AQM.cfg', the versions of the modules would not be the same. Online-CMAQ has been developed and tested on wcoss2 and Hera. To avoid any conflicts on other HPC platforms, the versions of the modules are separately controlled by the version file `build.ver.[platform] in the 'versions' directory.

This PR should not affect any existing configuration and test cases.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • WE2E tests on wcoss2:
    grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
    grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
    MET_verification
    community_ensemble_2mems_stoch

  • hera.intel

  • orion.intel

  • cheyenne.intel

  • cheyenne.gnu

  • gaea.intel

  • jet.intel

  • wcoss2.intel

  • NOAA Cloud (indicate which platform)

  • Jenkins

  • fundamental test suite

  • comprehensive tests (specify which if a subset was used)

ISSUE:

Fixes issue mentioned in #534.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo From the Contributor's Guide, all externals live in a single Externals.cfg file. From your detailed write-up, I understand why you added Externals_AQM.cfg (different hashes between SRW App and Online-CMAQ), but this goes against the current coding standards. I would like to hear from others to see if they are okay with this before approving myself.

Having said that, I was able to successfully clone your fork's feature/aqm_part2 branch, checkout the necessary components and build the Online-CMAQ using the instructions that you provided, which completes the work associated with this PR.

@chan-hoo
Copy link
Collaborator Author

chan-hoo commented Jan 20, 2023

@MichaelLueken, thank you for the comment. I understand. Practically, the Online-CMAQ and the current develop branch (FV3-LAM) only share 'UFS-UTILS' in Externals.cfg. The hash of the ufs weather model in the develop branch is not up-to-date mostly. The major issue when we try to update the hash of the ufs weather model in Externals.cfg is that we (EMC) don't have access to other HPC platforms. So it will take long to update the hash. In addition, we don't want to check out 'upp', 'rrfs-utls', and 'gsi' for Online-CMAQ. I think this will be the same to FV3-LAM. The users of FV3-LAM would not want to check out the components only for Online-CMAQ. I agree with you. We should hear from others.

@danielabdi-noaa
Copy link
Collaborator

@chan-hoo If you decide to go the route of one Externals.cfg file, you can check out the components you need by listing them after the checkout_externals command, where all the components have lowercase names. I don't have an opinion on which route is better.

@chan-hoo
Copy link
Collaborator Author

@danielabdi-noaa, Thanks for the comment. In my opinion, one Externals.cfg file for one app is better for the reason I described above (maintenance at EMC). For example, 'Externals.cfg' for FV3-LAM (or RRFS) and 'Externals_AQM.cfg' for Online-CMAQ. If other applications are incorporated in the ufs srweather app in the future, I think, this approach will be more convenient. However, I'll follow the majority decision :)

Copy link
Collaborator

@christopherwharrop-noaa christopherwharrop-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so concerned about the use of an additional Externals.cfg file as long as it is for the right reasons. In this case, one of the stated reasons was to circumvent basic development processes because testing updating of ufs-weather-model will take too long. That is not an acceptable justification. The required updates need to be made to make it all work with develop. The objection to checking out components that aren't needed does make sense, however.

Another concern I have is that the build process is becoming too complex. There is a comment in the devbuild.sh script that says:

NOTE: This script is for internal developer use only;

But that is now a lie. I would like to see the sequence of commands needed to build the online-cmaq without using devbuild.sh because devbuild.sh was originally meant only to be a "convenience" script that experts could use to short-cut the process and make their work go faster. But, now, that appears no longer to be the case. One of the most difficult and insidious problems of modeling systems infrastructure developed over the past 20 years was the very non-portable and incomprehensible build systems that were developed. That has largely been mitigated in the current UFS and I would urge us to resist adding anything that takes the UFS backwards in that arena. What I'm saying is, it's starting to get to the point where a user can't possibly build the UFS without a very complicated, hard-to-understand (for an end-user), script with ever growing maintenance costs. It should be possible, without too much difficulty to build CMAQ without devbuild.sh and I'd like to see how that is done.

devbuild.sh Outdated Show resolved Hide resolved
@chan-hoo
Copy link
Collaborator Author

@christopherwharrop-noaa, I agree with you in principle. If the hash of the ufs weather model is updated with the latest hash of the authoritative repository, we won't have to maintain it separately. However, as you may know, the hash of the ufs weather model has not been updated frequently as expected. This is always a big issue on our side. I described it above from the practical point of view. I would like to hear from others if they also think this is not an acceptable justification to them either :) I thought it was very acceptable justification to me for having a separate Externals.cfg file for Online-CMAQ.

@arunchawla-NOAA
Copy link

I wanted to check where we are with this PR. My understanding is that it is waiting for a PR to the weather model that will move to a different HPC-Stack location that will allow the App to also use the same stack location? Is that correct ? Is there a time line on that?

@MichaelLueken
Copy link
Collaborator

I wanted to check where we are with this PR. My understanding is that it is waiting for a PR to the weather model that will move to a different HPC-Stack location that will allow the App to also use the same stack location? Is that correct ? Is there a time line on that?

@arunchawla-NOAA - Yes, you are correct. Applications are not supposed to move ahead from the ufs-weather-model with respect to HPC-stack modules. The ufs-weather-model repo has opened PR #1596 to bring in the EPIC AUS team's new HPC-stack locations for Tier-1 machines. I'm ready to assist @chan-hoo with updating this PR with the new stack locations once this PR has been opened and merged.

Unfortunately, due to issues encountered with the regression testing on Gaea and Orion, I have no ETA on when the weather model's PR will be moved to ready to review and ultimately merged.

@mkavulich
Copy link
Collaborator

A note for posterity: this branch was also affected by the rebase to remove large files introduced in PR #556. If you are working on a local copy of this repository it may be necessary to re-download a fresh copy. Please contact me if you have any questions about this.

@arunchawla-NOAA
Copy link

I wanted to check where we are with this PR. My understanding is that it is waiting for a PR to the weather model that will move to a different HPC-Stack location that will allow the App to also use the same stack location? Is that correct ? Is there a time line on that?

@arunchawla-NOAA - Yes, you are correct. Applications are not supposed to move ahead from the ufs-weather-model with respect to HPC-stack modules. The ufs-weather-model repo has opened PR #1596 to bring in the EPIC AUS team's new HPC-stack locations for Tier-1 machines. I'm ready to assist @chan-hoo with updating this PR with the new stack locations once this PR has been opened and merged.

Unfortunately, due to issues encountered with the regression testing on Gaea and Orion, I have no ETA on when the weather model's PR will be moved to ready to review and ultimately merged.

what are the issues on regression testing on Gaea and Orion ?

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Feb 6, 2023

Waiting on @natalie-perlin comments at ufs-community/ufs-weather-model#1596.

@MichaelLueken
Copy link
Collaborator

I wanted to check where we are with this PR. My understanding is that it is waiting for a PR to the weather model that will move to a different HPC-Stack location that will allow the App to also use the same stack location? Is that correct ? Is there a time line on that?

@arunchawla-NOAA - Yes, you are correct. Applications are not supposed to move ahead from the ufs-weather-model with respect to HPC-stack modules. The ufs-weather-model repo has opened PR #1596 to bring in the EPIC AUS team's new HPC-stack locations for Tier-1 machines. I'm ready to assist @chan-hoo with updating this PR with the new stack locations once this PR has been opened and merged.
Unfortunately, due to issues encountered with the regression testing on Gaea and Orion, I have no ETA on when the weather model's PR will be moved to ready to review and ultimately merged.

what are the issues on regression testing on Gaea and Orion ?

@arunchawla-NOAA - Looking at the conversations taking place in the WM's PR, it looks like the WM is failing to build with the following message on Orion:

cmake /work/noaa/epic-ps/jongkim/UFS-RT-tests/rt-1595 -DAPP=S2S -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_sfcocn -DCMEPS_AOFLUX=ON -DMPI=ON -DCMAKE_BUILD_TYPE=Release -DMOM6SOLO=ON CMake Error at /apps/cmake-3.22.1/share/cmake-3.22/Modules/CMakeDetermineCCompiler.cmake:49 (message):   Could not find compiler set in environment variable CC:

and on Gaea:

On gaea, some issues as well:
Lmod has detected the following error: /lustre/f2/pdata/ncep/Jong.Kim/rt-1595/modulefiles/ufs_gaea.intel: (ufs_gaea.intel): invalid command name "export"

@zach1221
Copy link

zach1221 commented Feb 6, 2023

@arunchawla-NOAA We're currently working to test the new HPC-Stack locations, using the RT suite, and tracking communication/progress in this issue created by Natalie.: #1465

The cmake related error experienced on Orion (referred to above by Mike) was also experienced on Jet. The stack has recently been rebuilt by AUS on Jet to resolve this issue, and I'm re-testing on Jet now. AUS is currently working to rebuild the stack as well on Orion/Gaea.

@MichaelLueken
Copy link
Collaborator

@chan-hoo The weather model updated their modulefiles earlier today. I have pushed updated modulefiles to this PR. I will now kick off the Jenkins tests once again.

Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo Finally, I have some very good news! Following my update of the modulefiles, the rest of the Jenkins tests are now passing! Please note that the Hera tests are still associated with this PR's build on Jenkins, so the abort in the details is due to needing to abort the Hera tests.

I have one minor request in devbuild.sh - if you would rather not move the highlighted section, that's fine, but I think that the Check out external components does need to be renamed, because that is no longer happening within devbuild.sh. Additionally, since wgrib2 is loaded in modulefiles/srw_common.lua, please remove the entry from modulefiles/build_hera_intel.lua.

Once you have made the changes, I'll be ready to give my approval to these changes.

Thanks!

devbuild.sh Outdated Show resolved Hide resolved
modulefiles/build_hera_intel.lua Outdated Show resolved Hide resolved
@MichaelLueken MichaelLueken added the ci-hera-intel-WE Kicks off automated workflow test on hera with intel label Feb 10, 2023
@venitahagerty venitahagerty removed the ci-hera-intel-WE Kicks off automated workflow test on hera with intel label Feb 10, 2023
Copy link
Collaborator

@MichaelLueken MichaelLueken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chan-hoo Thanks for making the requested changes! Approved.

Copy link
Collaborator

@danielabdi-noaa danielabdi-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@MichaelLueken
Copy link
Collaborator

Hi @christopherwharrop-noaa -
Just wanting to check and see if you are okay with these changes, or if you would like to see further changes be made before you would be comfortable moving forward. Thank you very much!

@chan-hoo
Copy link
Collaborator Author

Thanks, @MichaelLueken , @danielabdi-noaa !!

@MichaelLueken
Copy link
Collaborator

For additional discussion on Externals.cfg, please see and discuss further in #602.

Copy link
Collaborator

@christopherwharrop-noaa christopherwharrop-noaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking the time to address all the comments.

@christopherwharrop-noaa
Copy link
Collaborator

@MichaelLueken - I'm good with moving forward at this point. Thank you for asking and helping to get things synced up.

@venitahagerty
Copy link
Collaborator

venitahagerty commented Feb 10, 2023

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1210791792/20230210163510/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 10 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
Experiment Succeeded on hera: pregen_grid_orog_sfc_climo
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: community_ensemble_2mems_stoch
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

@MichaelLueken MichaelLueken merged commit d036849 into ufs-community:develop Feb 10, 2023
@chan-hoo chan-hoo deleted the feature/aqm_part2 branch February 27, 2023 15:24
MichaelLueken pushed a commit that referenced this pull request Mar 14, 2023
The devbuild.sh is unable to build the rrfs_utils app even if it is being called from the command line like in:
./devbuild.sh -p=<platform> all or ./devbuild.sh -p=<platform> gsi rrfs_utils. After some investigation it appears the -DBUILD_RRFS_UTILS option was accidentally removed from the cmake settings in PR #549. Adding this line back in causes the rrfs_utils app to build again.

Co-authored-by: Edward Snyder <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.