Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the use of temporary files in MET and reduce it as much as reasonably possible #2690

Closed
8 of 21 tasks
JohnHalleyGotway opened this issue Sep 22, 2023 · 5 comments · Fixed by #2693
Closed
8 of 21 tasks
Assignees
Labels
component: code optimization Code optimization issue priority: high High Priority requestor: NOAA/other NOAA Laboratory, not otherwise specified type: enhancement Improve something that it is currently doing
Milestone

Comments

@JohnHalleyGotway
Copy link
Collaborator

JohnHalleyGotway commented Sep 22, 2023

Describe the Enhancement

This issue arose via the dtcenter/METplus#2364 discussion. @johnlwagner has encountered severe issues running Stat-Analysis jobs on WCOSS2. @dkokron was able to trace the performance issue back to MET's writing of temporary files. When he reconfigured with tmp_dir = "/tmp";, a large Stat-Analysis job completed in minutes rather than hanging for several hours. He found that MET was writing/reading/deleting thousands of tiny temp ascii files. Writing many of these tiny temp files to a project space across the lustre file system used by WCOSS2 is incredibly slow. Whereas writing to the local /tmp directory relieves this pinch point.

However, in 2021, NCO directed NOAA staff to explicitly not write to the local /tmp directory on the compute nodes to keep the output for each job well organized and prevent an accumulation of stale data in the local /tmp directory.

For this issue, do the following:

  1. Survey the use of temporary files in the MET codebase by searching for occurrences of MET_TMP_DIR and get_tmp_dir().
  2. Consider providing a workaround to prevent the need for writing a temp file.

Pay close attention to the MetConfig::read_string(const char * s) function. As described in this GitHub Discussion comment, avoiding writing a temp file there may go a long way to addressing this issue.

@johnlwagner notes that since this has to do with the processing of threshold strings, it may well likely be related to the perplexing behavior described in this dtcenter/METplus#1506 discussion.

Time Estimate

2 days

Sub-Issues

Consider breaking the enhancement down into sub-issues.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Review default alert labels
  • Select component(s)
  • Select priority
  • Select requestor(s)

Milestone and Projects

  • Select Milestone as the next official version or Backlog of Development Ideas
  • For the next official version, select the MET-X.Y.Z Development project

Define Related Issue(s)

Consider the impact to the other METplus components.

Enhancement Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of develop.
    Branch name: feature_<Issue Number>_<Description>
  • Complete the development and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Push local changes to GitHub.
  • Submit a pull request to merge into develop.
    Pull request: feature <Issue Number> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issue
    Select: Milestone as the next official version
    Select: MET-X.Y.Z Development project for development toward the next official release
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Close this issue.
@JohnHalleyGotway JohnHalleyGotway added type: enhancement Improve something that it is currently doing requestor: NOAA/other NOAA Laboratory, not otherwise specified component: code optimization Code optimization issue priority: high High Priority labels Sep 22, 2023
@JohnHalleyGotway JohnHalleyGotway added this to the MET 12.0.0 milestone Sep 22, 2023
@JohnHalleyGotway JohnHalleyGotway self-assigned this Sep 22, 2023
@JohnHalleyGotway
Copy link
Collaborator Author

JohnHalleyGotway commented Sep 22, 2023

@rgbullock the vx_config library in MET reads/writes too many temp files to run well on lustre file systems. For this issue, I'd like to modify it to read characters from a buffer rather than a FILE * pointer... essentially switch from FILE * configin to use a different input method. I went hunting around for alternatives and think that std::stringbuf might be a reasonable option. I note that it provides an snextc function which may be a convenient way of stepping through the buffer... similar to the existing calls to nextchar() and fgetc().

My thought is that I'd read the input config file line-by-line. For each line, recursively interpret any environment variables present, and then append the result to the end of the std::stringbuf object. After reading all the input lines, update the scanner to step through the characters of that stringbuf object rather than the file stream.

The obvious downside is that we'd be storing the entire config file in memory at once. But fortunately, these ASCII config files are relatively small.

As the original author of this library, do you think this approach is worth pursing? Any alternative approaches to recommend?

@mollybsmith-noaa
Copy link
Contributor

I've been having a similar issue with job slowness on Jet, where ever since mid-August, METplus GridStat jobs which should finish in 20-40 minutes started timing out after four hours. Jet is much busier than usual during hurricane season due to HFIP, which puts a high load on the lustre file system, but I was still surprised that everything was taking this long. Moving the METplus tmp output to /tmp allows jobs to finish in about an hour. It's not fully back to where it was, but it's a vast improvement.

JohnHalleyGotway added a commit that referenced this issue Sep 22, 2023
…into a stringstream stream rather than reading/writing/deleting temp files each time a config file or config string is read.
@JohnHalleyGotway
Copy link
Collaborator Author

JohnHalleyGotway commented Sep 22, 2023

The changes for eliminating temp files from the vx_config library came together easier than I expected. I am running the full set of unit tests now using the GHA testing workflow run and will report back on the status when the tests complete.

@JohnHalleyGotway
Copy link
Collaborator Author

As discussed with @jprestop on 9/25, recommend adding a new chapter to the Contributor's Guide between chapters 2 and 3 to describe the logic employed in MET. Specifically, add a sub-section describing our use of temporary files. Update the description of tmp_dir to link to that new sub-section.

JohnHalleyGotway added a commit that referenced this issue Sep 26, 2023
JohnHalleyGotway added a commit that referenced this issue Sep 27, 2023
@JohnHalleyGotway JohnHalleyGotway linked a pull request Sep 27, 2023 that will close this issue
15 tasks
@JohnHalleyGotway JohnHalleyGotway changed the title Reduce MET's use of temporary files as much as reasonably possible Document MET's use of temporary files and reduce it as much as reasonably possible Sep 27, 2023
JohnHalleyGotway added a commit that referenced this issue Sep 28, 2023
…into a stringstream stream rather than reading/writing/deleting temp files each time a config file or config string is read.
@JohnHalleyGotway
Copy link
Collaborator Author

This is fixed by PR #2693 and at least one related issue to review temp file use in stat_analysis has been created.

But this task to document our current use of temp files in the Contributor's Guide is now complete.

@JohnHalleyGotway JohnHalleyGotway changed the title Document MET's use of temporary files and reduce it as much as reasonably possible Document the use of temporary files in MET and reduce it as much as reasonably possible Nov 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: code optimization Code optimization issue priority: high High Priority requestor: NOAA/other NOAA Laboratory, not otherwise specified type: enhancement Improve something that it is currently doing
Projects
Status: 🏁 Done
Development

Successfully merging a pull request may close this issue.

2 participants