Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make espresso outfiles/outdir water tight #1956

Merged
merged 20 commits into from
Mar 31, 2024

Conversation

tomdemeyere
Copy link
Contributor

Summary of Changes

Currently, users can still bypass Quacc's intended behavior by using environment variables and other workarounds to obtain output directories outside of Quacc's control. This PR aims to address this issue by enforcing specific keywords to be either deleted or set to certain values at the last moment before writing the input files, ensuring that no tampering can occur.
In addition to this change, I would like to discuss a current problem faced when using Quacc with Espresso for larger systems:

Let's consider a scenario where I want to perform a calculation on a relatively large system (600 atoms), with a wavefunction file size of approximately 1TB. If I want to execute a static job followed by a projwfc job to calculate the projected DOS, I will end up with two folders:

tmp-static: This job will take around 3 hours of compute time and produce 1TB of data.
tmp-projwfc: This job will need to copy the data from tmp-static and run for 5 minutes.

The problem is straightforward: it seems inefficient to copy 1TB of data for a job that will run for only 5 minutes.

As a potential solution, I propose the following (although I suspect it may not be well-received): This solution would require setting GZIP_FILES: False and WORKFLOW_ENGINE: None. When both of these conditions are met, instead of copying files to new jobs, we would set the previous directory inside Espresso's keyword. Espresso will then read from the previous data without the need for copying.

Pros:

  • Not intrusive on the code side, requiring only a few additional lines.
  • Current working directories are still used, but not for storing already existing data.
  • It solves the problem entirely, allowing users with limited scratch space to use Quacc and Espresso.

Cons:

  • This mode requires setting unrelated settings, such as GZIP_FILES: False (because QE cannot read gzipped files) and WORKFLOW_ENGINE: None (because I'm unsure of the behavior if multiple QE instances try to read the same files).
  • It may go against Quacc's design philosophy to some extent.

It's worth noting that this problem is somewhat niche, and Espresso's extreme modularity is defintely the problem here.

Checklist

  • I have read the "Guidelines" section of the contributing guide. Don't lie! 😉
  • My PR is on a custom branch and is not named main.
  • I have added relevant, comprehensive unit tests.

Notes

  • Your PR will likely not be merged without proper and thorough tests.
  • If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.
  • When your code is ready for review, ping one of the active maintainers.

@buildbot-princeton
Copy link
Collaborator

Can one of the admins verify this patch?

@Andrew-S-Rosen
Copy link
Member

Andrew-S-Rosen commented Mar 30, 2024

@tomdemeyere: Regarding the first part (file handling), all efforts are appreciated here. Espresso is especially tricky because most other codes don't give the user an option on where to write things --- it just dumps everything in the current working directory. Espresso's significant flexibility here is a blessing and a curse, mostly the latter in the context of a workflow engine. I'll trust your approach on this. If it's not too painful, you could consider logging a warning if the user sets one of these parameters in Espresso, but I anticipate that there are so many of them that this might not be a worthwhile effort.

As for the second topic about space, it is an interesting point to consider. These kinds of scales are just not possible with other codes. For VASP, I've run 500+ atom calculations and the wavefunctions are never more than a few GBs (1 TB would be absolutely wild). For VASP, there's also no way to avoid file copying because VASP has no knowledge of how to deal with files that aren't in the current working directory, and you'd overwrite the original output files if you ran in the same directory. That's true for most other codes as well, although Espresso is flexible here --- that's the blessing rather than the curse.

I think it is worth finding a mechanism to do this cleanly in Espresso given that it supports this kind of thing. Now for some implementation brainstorming:

  • I think there might be a way to avoid hard-coding the GZIP_FILES = False behavior. Presumably, in the Espresso calculator, we could check to see if the prev_dir parameter (or whatever that setting is called in Espresso --- sorry, I don't know) is set and if it is, automatically gunzip everything there. That might be a robust approach. We may just want to be careful about if it's recursive or not, e.g. if the prev_dir is someone's root directory, then you're going to get a big ol' problem.
  • Regarding the workflow engine business, that might be one scenario worth trying to find out. Concurrent read operations really shouldn't be a problem. Concurrent read operations are only generally an issue when you have some locking mechanism, which Espresso has no reason to implement. So, perhaps one would not need to play with this.
  • As a final thought, we want behavior to be clear and consistent for users. So, we should either always use the internal Espresso previous directory handling (i.e. avoid copying) for recipes like projwfc or we should have a kwarg or new Espresso setting or something that can be toggled. Basically, we just want to make it obvious to the user where their files will be and not have be dependent on some subtle criteria.

Copy link

codecov bot commented Mar 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.29%. Comparing base (68ad7f9) to head (1da55f9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1956      +/-   ##
==========================================
- Coverage   99.29%   99.29%   -0.01%     
==========================================
  Files          81       81              
  Lines        3270     3254      -16     
==========================================
- Hits         3247     3231      -16     
  Misses         23       23              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tomdemeyere
Copy link
Contributor Author

@Andrew-S-Rosen

Some further details:

  • I agree for GZIP_FILES, it should not be that hard to get around.
  • Good news: It seems that it does not matter how many instances are reading files.
  • Bad news: some post-processing binary still write files to the "outdir" (xml files), meaning that potentially multiple instances will be writing to the same file.

I think that you actually found the solution: adding a kwargs to the concerned post processing jobs (they don't have many kwargs anyway) like "in_place", or "prev_outdir" giving the choice to the users about how to do things. If they are concerned about disk space then they should use it. With proper documentation and a warning about the potential writing conflict that should be ok.

This PR is good to go if you are ok with it.

@Andrew-S-Rosen
Copy link
Member

@tomdemeyere: Thanks! I'm setting this to auto-merge. I fixed an issue with a docstring, but the rest is fine.

Agreed regarding the file copying approach. If there is a case of file clashing, we need to ensure that users have the option to avoid that. Giving users the option here seems like the most logical approach to me. I don't think that would be too much of a challenge.

@Andrew-S-Rosen Andrew-S-Rosen enabled auto-merge (squash) March 31, 2024 19:22
@Andrew-S-Rosen Andrew-S-Rosen changed the title [Draft] Make espresso outfiles/outdir water tight Make espresso outfiles/outdir water tight Mar 31, 2024
@Andrew-S-Rosen Andrew-S-Rosen merged commit 533b686 into Quantum-Accelerators:main Mar 31, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants