Updated detect_machine.sh to be consistent with UFSWM. #691

@HenryRWinterbottom HenryRWinterbottom commented Jan 24, 2024

DUE DATE for merger of this PR into develop is 3/6/2024 (six weeks after PR creation).

Description

This PR addresses issue #690. The following is accomplished:

  • the GSI ush/detect_machine.sh is replaced by the UFS weather-model tests/detect_machine.sh prepared by @BrianCurtis-NOAA

Closes #690.

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

CI/CD will test the change.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA
Contributor

Hera test
Clone HenryWinterbottom-NOAA:feature/gsi_enkf_issue_690 on Hera. Execute GSI build script ush/build.sh. GSI build fails with

+ source /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr691/ush/detect_machine.sh
++ [[ -n '' ]]
++ case $(hostname -f) in
+++ hostname -f
++ MACHINE_ID=hera
++ [[ hera == \U\N\K\N\O\W\N ]]
++ MACHINE_ID=hera
++ [[ hera != \U\N\K\N\O\W\N ]]
++ return
+ set +x
Lmod has detected the following error:  The following module(s) are unknown: "gsi_hera"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "gsi_hera"

Also make sure that all modulefiles written in TCL start with the string #%Module

GSI has two modulefiles on Hera: gsi_hera.gnu.lua and gsi_hera.intel.lua. The GSI develop snapshot of detect_machine.sh contains the following at the end of the script

# Append compiler (only on machines that have multiple compilers)
if [ $MACHINE_ID = hera ] || [ $MACHINE_ID = cheyenne ]; then
    MACHINE_ID=${MACHINE_ID}.${COMPILER}
fi

Add the above scripting to ush/build.sh after sourcing detect_machine.sh as shown below

# Detect machine (sets MACHINE_ID)
source $DIR_ROOT/ush/detect_machine.sh

# Append compiler (only on machines that have multiple compilers)
if [ $MACHINE_ID = hera ] || [ $MACHINE_ID = cheyenne ]; then
    MACHINE_ID=${MACHINE_ID}.${COMPILER}
fi

ush/build.sh runs to completion on Hera after making this change.
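
For reference, a slightly more defensive way to write the same append step is sketched below. The function name append_compiler and the fallback to intel when COMPILER is unset are assumptions made for illustration; neither is taken from ush/build.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ush/build.sh): append the compiler to the
# machine ID only on machines that have one modulefile per compiler.
append_compiler() {
    local machine_id="$1"
    local compiler="${2:-intel}"   # assumption: fall back to intel if unset
    case "${machine_id}" in
        hera|cheyenne) echo "${machine_id}.${compiler}" ;;
        *)             echo "${machine_id}" ;;
    esac
}

# Possible use after sourcing detect_machine.sh:
#   MACHINE_ID="$(append_compiler "${MACHINE_ID}" "${COMPILER:-intel}")"
```

Quoting the variables avoids a test syntax error if MACHINE_ID is ever empty, and the case statement keeps the list of multi-compiler machines in one place.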

The modified build.sh is on Hera in /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr691/ush.

@HenryWinterbottom-NOAA , would you please commit this change to your branch, HenryWinterbottom-NOAA:feature/gsi_enkf_issue_690

@RussTreadon-NOAA
Contributor

WCOSS2 ctests

russ.treadon@clogin01:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr691/build> ctest -j 7
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr691/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  482.87 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  606.37 sec
3/7 Test #7: global_enkf ......................   Passed  613.07 sec
4/7 Test #2: rtma .............................   Passed  1108.18 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1329.80 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1451.14 sec
7/7 Test #1: global_4denvar ...................   Passed  1503.88 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1503.89 sec

It is not strictly necessary to run ctests, since this PR only changes how the machine is detected for the build. The build completed successfully on WCOSS2 (Cactus), and that is sufficient. Still, it is good to see that all ctests pass as expected.

@RussTreadon-NOAA
Contributor

Confirm that builds on Orion and Hercules work as intended with the changes found in this PR.

@RussTreadon-NOAA RussTreadon-NOAA left a comment

detect_machine.sh works as intended but we also need to modify ush/build.sh as described in the comments.

@RussTreadon-NOAA
Contributor

@HenryWinterbottom-NOAA , what is the status of this PR?

Two items:

  1. Your branch, HenryWinterbottom-NOAA:feature/gsi_enkf_issue_690, is two commits behind the head of GSI develop. Please update your branch with the recent changes to develop. These changes do not conflict with yours.

  2. Changes need to be made to ush/build.sh in your branch as described above. Please commit the indicated change to your branch.

We should be able to move forward with this PR once items 1 and 2 are addressed.

@HenryRWinterbottom
Contributor Author

@RussTreadon-NOAA It should all be updated now.

@aerorahul aerorahul left a comment

The changes here look right.
However, the ush/sub_ scripts will also need updating to include the compiler (intel) in the module names they load.
I am not sure how this impacts the GSI regression tests, but since there is no change to code or modulefile contents, I expect no change to RT results.
I will wait to approve until the ush/sub_ scripts are updated.

@RussTreadon-NOAA
Contributor

Thank you @HenryWinterbottom-NOAA for the update.

Question: Is adding the compiler to the name of the modulefile something we plan on doing across all repositories cloned by global-workflow?

@aerorahul
Contributor

> Thank you @HenryWinterbottom-NOAA for the update.
>
> Question: Is adding the compiler to the name of the modulefile something we plan on doing across all repositories cloned by global-workflow?

@RussTreadon-NOAA
Yes.
That is for consistency, as well as for expandability when a GNU compiler stack becomes available on the machine.

@RussTreadon-NOAA
Contributor

While one would expect detect_machine to be used in determining which ush/sub script to use, that's not how the GSI ctests actually work. I won't go through the twisted web, but eventually regression/regression_var.sh sets the local variable machine using the directory-based logic we are trying to remove:

# Determine the machine
if [[ -d /glade ]]; then # Cheyenne
  export machine="Cheyenne"
elif [[ -d /scratch1 ]]; then # Hera
  export machine="Hera"
elif [[ -d /mnt/lfs4 || -d /jetmon || -d /mnt/lfs1 ]]; then # Jet
  export machine="Jet"
elif [[ -d /discover ]]; then # NCCS Discover
  export machine="Discover"
elif [[ -d /sw/gaea ]]; then # Gaea
  export machine="Gaea"
elif [[ -d /data/prod ]]; then # S4
  export machine="S4"
elif [[ -d /work && $(hostname) =~ "Orion" ]]; then # Orion
  export machine="Orion"
elif [[ -d /work && $(hostname) =~ "hercules" ]]; then # Hercules
  export machine="Hercules"
elif [[ -d /lfs/h2 ]]; then # wcoss2
  export machine="wcoss2"
fi
echo "Running Regression Tests on '$machine'";

regression/regression_param.sh sets sub_cmd based on this machine value.

This is problematic, but @aerorahul identified the real problem we now face with this PR. With the change in the name of the GSI modulefiles, ctests will fail because the machine-specific sub commands load the GSI modulefiles by their previous names:

sub_cheyenne:echo "module load gsi_cheyenne.intel" >> $cfile
sub_discover:echo "module load gsi_discover" >> $cfile
sub_gaea:echo "module load gsi_gaea" >> $cfile
sub_hera:echo "module load gsi_hera.intel" >> $cfile
sub_hercules:echo "module load gsi_hercules" >> $cfile
sub_jet:echo "module load gsi_jet" >> $cfile
sub_orion:echo "module load gsi_orion" >> $cfile
sub_wcoss2:echo "module load gsi_wcoss2"               >> $cfile

If we want to fully pivot to detect_machine.sh, there is more work to do in this PR. I'll leave it up to you whether to go through this effort now, or to revert the name change to the GSI modulefiles and open a new GSI issue and subsequent PR to transition GSI ctests to detect_machine.sh.
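
As a rough illustration of what that pivot might involve, the sketch below maps a MACHINE_ID value onto the machine names used in regression/regression_var.sh. The function name machine_from_id is hypothetical, and the lowercase IDs on the left are assumptions about what detect_machine.sh returns; only the names on the right come from the snippet above.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from MACHINE_ID (as set by detect_machine.sh) to the
# "machine" names that regression/regression_var.sh currently exports.
machine_from_id() {
    local id="${1%%.*}"   # drop any compiler suffix, e.g. hera.intel -> hera
    case "${id}" in
        cheyenne) echo "Cheyenne" ;;
        hera)     echo "Hera" ;;
        jet)      echo "Jet" ;;
        discover) echo "Discover" ;;
        gaea)     echo "Gaea" ;;
        s4)       echo "S4" ;;
        orion)    echo "Orion" ;;
        hercules) echo "Hercules" ;;
        wcoss2)   echo "wcoss2" ;;
        *)        echo "unknown machine id: $1" >&2; return 1 ;;
    esac
}

# Possible use in place of the directory checks:
#   export machine="$(machine_from_id "${MACHINE_ID}")"
```

A mapping like this would keep the existing machine names, so regression/regression_param.sh could continue to select sub_cmd from them unchanged.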

My two cents is to open a future GSI issue and PR in which GSI regression tests are refactored. The checks in the current suite of tests need to be updated to be more meaningful and robust. Some of the current checks often generate false positives (i.e., the ctest fails but not for a fatal reason). We also need to update the actual tests. I'm not sure how well each test actually reflects what we currently run in operations in terms of configuration and observations. Only two out of the seven GSI ctests are global. A future refactoring of GSI ctests will require close coordination with regional DA teams.

@aerorahul
Contributor

I don't think we need to refactor the regression tests for this.
We can simply add .intel to the lines identified in the sub_MACHINE scripts.

The issue with regression testing against develop will only affect this PR. Once it is merged, later PRs will not have the issue.

My 2c would be to merge this and leave the regression testing elephant for another day.
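
The edit to the sub_MACHINE scripts could be sketched with sed as follows. The function name append_intel_suffix is hypothetical, and the pattern assumes the module load lines look like the ones listed earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical filter: append ".intel" to 'module load gsi_<machine>' lines
# that do not already carry a compiler suffix. Reads stdin, writes stdout.
append_intel_suffix() {
    # The machine name must be followed directly by the closing quote, so
    # lines such as 'module load gsi_hera.intel' do not match and pass through.
    sed -E 's/(module load gsi_[A-Za-z0-9]+)"/\1.intel"/'
}

# Possible use on one script (writes a new file rather than editing in place):
#   append_intel_suffix < ush/sub_orion > ush/sub_orion.new
```

Lines that already name a compiler, such as those in sub_cheyenne and sub_hera, are left unchanged, so running the filter twice has no further effect.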

@RussTreadon-NOAA
Contributor

Whatever works for the g-w team works for me. It's entirely possible that the GSI ctests will never be refactored. No one wants to tackle that elephant.

@HenryRWinterbottom
Contributor Author

@RussTreadon-NOAA I made the changes to the sub_* files as noted above.

@RussTreadon-NOAA @aerorahul I apologize for my delayed responses to the PR comments. For whatever reason, GitHub is not sending me email with respect to this repo.

@RussTreadon-NOAA
Contributor

Adding .intel to the module load gsi_ line in sub_* works in my local copy of HenryWinterbottom-NOAA:feature/gsi_enkf_issue_690 on Cactus.

@RussTreadon-NOAA
Contributor

WCOSS2 (Cactus) ctest
Install HenryWinterbottom-NOAA:feature/gsi_enkf_issue_690 at 2e07a29 on Cactus. Build and run ctest using develop at 8ed034f as control. All 7 ctests pass

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr691/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  542.72 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  544.87 sec
3/7 Test #7: global_enkf ......................   Passed  667.18 sec
4/7 Test #2: rtma .............................   Passed  1026.46 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1149.86 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1209.73 sec
7/7 Test #1: global_4denvar ...................   Passed  1322.06 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1322.06 sec

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Jan 30, 2024

Hera ctest

Repeat Cactus test on Hera with following results

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr691/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  551.64 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  607.72 sec
3/7 Test #7: global_enkf ......................***Failed  1064.48 sec
4/7 Test #2: rtma .............................   Passed  1219.40 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1292.67 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1351.21 sec
7/7 Test #1: global_4denvar ...................   Passed  1621.12 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1621.14 sec

The following tests FAILED:
          7 - global_enkf (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr691/build/Testing/Temporary/LastTest.log

The global_enkf test failed due to the maximum threshold time check

The runtime for global_enkf_hiproc_updat is 64.597350 seconds.  This has exceeded maximum allowable threshold time of 63.817169 seconds,
resulting in Failure timethresh2 of the regression test.

A check of the enkf.x wall times for the various jobs does not show anomalous behavior

global_enkf_hiproc_contrl/stdout:The total amount of wall time                        = 58.015609
global_enkf_hiproc_updat/stdout:The total amount of wall time                        = 64.597350
global_enkf_loproc_contrl/stdout:The total amount of wall time                        = 79.493454
global_enkf_loproc_updat/stdout:The total amount of wall time                        = 80.999979

This is not a fatal failure.
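
The numbers above are consistent with a time threshold of roughly 1.1 times the control wall time (58.015609 × 1.1 ≈ 63.817). A minimal sketch of such a check follows; the function name exceeds_time_threshold is hypothetical, and the 1.1 factor is inferred from the logged values rather than read from the regression scripts.

```shell
#!/usr/bin/env bash
# Returns 0 (true) when the update run exceeds the allowed wall time.
# Assumption: allowed time = control wall time * factor, with factor 1.1
# inferred from the log output, not taken from the regression scripts.
exceeds_time_threshold() {
    local control="$1" update="$2" factor="${3:-1.1}"
    awk -v c="${control}" -v u="${update}" -v f="${factor}" \
        'BEGIN { exit !(u > c * f) }'
}

# Hera global_enkf hiproc wall times from the log above
if exceeds_time_threshold 58.015609 64.597350; then
    echo "time threshold exceeded: update run is more than 10% slower than control"
fi
```

With a margin this tight, a difference of under a second between the updat and contrl runs is enough to trip the check, which is why this kind of failure is treated as non-fatal here.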

@RussTreadon-NOAA
Contributor

Orion ctest
Run ctests on Orion with the following results

Test project /work2/noaa/da/rtreadon/git/gsi/pr691/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  483.25 sec
2/7 Test #7: global_enkf ......................   Passed  548.89 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  605.30 sec
4/7 Test #2: rtma .............................   Passed  1028.59 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1336.53 sec
6/7 Test #1: global_4denvar ...................   Passed  1622.20 sec
7/7 Test #5: hafs_4denvar_glbens ..............***Failed  1639.32 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1639.33 sec

The following tests FAILED:
          5 - hafs_4denvar_glbens (Failed)
Errors while running CTest
Output from these tests are in: /work2/noaa/da/rtreadon/git/gsi/pr691/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

The hafs_4denvar_glbens test failed due to the maximum threshold time check

The runtime for hafs_4denvar_glbens_loproc_updat is 509.156037 seconds.  This has exceeded maximum allowable threshold time of 428.433941 seconds,
resulting in Failure time-thresh of the regression test.

A check of the gsi.x wall times shows that the loproc_updat job ran significantly longer than the contrl

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 282.159384
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 300.306300
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 389.485401
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 509.156037

GSI ctests run in the /work fileset on Orion. Considerable run-to-run wall time variability can be observed when running jobs in this fileset. Given past experience on Orion, this is not viewed as a fatal failure.

@RussTreadon-NOAA
Contributor

Normally we ask the PR assignee to run ctests. I don't know if you, @HenryWinterbottom-NOAA, have run GSI ctests before. Hence my running the tests and reporting results in this PR. If you plan on opening future GSI PRs, it would be good to learn how to run GSI ctests.

We need two peer reviewers for this PR. Who would you like to review this PR? I am the handling reviewer for this PR, so I can't serve as a peer reviewer.

@RussTreadon-NOAA
Contributor

OK, I see @aerorahul as a reviewer. This is sufficient, along with @BrianCurtis-NOAA.

@HenryRWinterbottom
Contributor Author

@RussTreadon-NOAA Thank you for the review. And, yes, moving forward I will run and report the results of the ctests.

Thank you, again.

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Changes have been tested on WCOSS2 (Cactus), Hera, and Orion. Build and ctests work as expected.

Approve

@aerorahul aerorahul left a comment

Looks good.

@RussTreadon-NOAA
Contributor

Thank you @aerorahul and @BrianCurtis-NOAA for your reviews and @HenryWinterbottom-NOAA for this PR. I'll cross check with the GSI Handling review team and merge this PR into develop as soon as I can.

@RussTreadon-NOAA RussTreadon-NOAA merged commit a898668 into NOAA-EMC:develop Jan 31, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

Updates to detect_machine.sh needed after the UFSWM updated the return value