Make espresso outfiles/outdir water tight #1956
Can one of the admins verify this patch?
@tomdemeyere: Regarding the first part (file handling), all efforts are appreciated here. Espresso is especially tricky because most other codes don't give the user an option on where to write things; they just dump everything in the current working directory. Espresso's significant flexibility here is a blessing and a curse, mostly the latter in the context of a workflow engine. I'll trust your approach on this. If it's not too painful, you could consider logging a warning if the user sets one of these parameters in Espresso, but I anticipate that there are so many of them that this might not be a worthwhile effort.
As for the second topic about space, it is an interesting point to consider. These kinds of scales are just not possible with other codes. For VASP, I've run 500+ atom calculations and the wavefunctions are never more than a few GBs (1 TB would be absolutely wild). For VASP, there's also no way to avoid file copying, because VASP has no knowledge of how to deal with files that aren't in the current working directory, and you'd overwrite the original output files if you ran in the same directory. That's true for most other codes as well, although Espresso is flexible here, which is the blessing rather than the curse. I think it is worth finding a mechanism to do this cleanly in Espresso given that it supports this kind of thing.
Now for some implementation brainstorming:
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1956      +/-   ##
==========================================
- Coverage   99.29%   99.29%   -0.01%
==========================================
  Files          81       81
  Lines        3270     3254      -16
==========================================
- Hits         3247     3231      -16
  Misses         23       23
Some further details:
I think that you actually found the solution: adding a kwarg such as "in_place" or "prev_outdir" to the relevant post-processing jobs (they don't have many kwargs anyway), so that users can choose how files are handled. Users who are concerned about disk space can then opt in. With proper documentation and a warning about the potential write conflict, that should be OK. This PR is good to go if you are OK with it.
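A minimal sketch of what such an opt-in kwarg could look like, for discussion only. The function name `resolve_projwfc_outdir` and its return shape are hypothetical and are not part of quacc; the only Espresso-specific assumption is that the post-processing input accepts an `outdir` entry in its `projwfc` namelist.

```python
import warnings


def resolve_projwfc_outdir(prev_outdir=None):
    """Decide where a post-processing job reads the previous job's data.

    Returns (input_data, needs_copy). `prev_outdir` is a hypothetical
    opt-in kwarg; leaving it as None keeps the copy-based default.
    """
    if prev_outdir is None:
        # Default: the previous results are copied into this job's own directory.
        return {"projwfc": {"outdir": "."}}, True

    # Opt-in: read in place, and tell the user about the write-conflict risk.
    warnings.warn(
        f"Reading in place from {prev_outdir}; this job may write into the "
        "previous job's directory and clash with its files.",
        UserWarning,
        stacklevel=2,
    )
    return {"projwfc": {"outdir": str(prev_outdir)}}, False
```

Calling `resolve_projwfc_outdir("tmp-static")` would skip the copy and emit the warning, while the default call keeps today's copy-then-run behaviour.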
@tomdemeyere: Thanks! I'm setting this to auto-merge. I fixed an issue with a docstring, but the rest is fine. Agreed regarding the file copying approach. If there is a case of file clashing, we need to ensure that users have the option to avoid that. Giving users the option here seems like the most logical approach to me. I don't think that would be too much of a challenge.
Summary of Changes
Currently, users can still bypass Quacc's intended behavior by using environment variables and other workarounds to obtain output directories outside of Quacc's control. This PR aims to address this issue by enforcing specific keywords to be either deleted or set to certain values at the last moment before writing the input files, ensuring that no tampering can occur.
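As a rough illustration of the idea (not the code in this PR), the last-moment enforcement could look like the sketch below. The `FORCED` and `REMOVED` tables and the function name are hypothetical, and the keys listed (`outdir`, `wfcdir` in the `control` namelist) are only examples of output-location keywords, not the full set the PR handles.

```python
from copy import deepcopy

# Illustrative policy only: keys forced to a fixed value, and keys deleted.
FORCED = {"control": {"outdir": "."}}
REMOVED = {"control": ("wfcdir",)}


def enforce_output_keys(input_data, forced=FORCED, removed=REMOVED):
    """Return a copy of input_data with output-location keys overridden.

    Intended to run right before the input file is written, so that
    nothing set earlier by the user can survive.
    """
    cleaned = deepcopy(input_data)
    for section, keys in removed.items():
        for key in keys:
            # Drop the key if present; a missing section or key is ignored.
            cleaned.get(section, {}).pop(key, None)
    for section, values in forced.items():
        # Create the section if needed and overwrite the forced keys.
        cleaned.setdefault(section, {}).update(values)
    return cleaned
```

Working on a deep copy keeps the user's original dictionary untouched, which makes it easy to compare the two and report what was changed.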
In addition to this change, I would like to discuss a current problem faced when using Quacc with Espresso for larger systems:
Let's consider a scenario where I want to perform a calculation on a relatively large system (600 atoms), with a wavefunction file size of approximately 1TB. If I want to execute a static job followed by a projwfc job to calculate the projected DOS, I will end up with two folders:
tmp-static: This job will take around 3 hours of compute time and produce 1TB of data.
tmp-projwfc: This job will need to copy the data from tmp-static and run for 5 minutes.
The problem is straightforward: it seems inefficient to copy 1TB of data for a job that will run for only 5 minutes.
As a potential solution, I propose the following (although I suspect it may not be well received). It would require setting GZIP_FILES: False and WORKFLOW_ENGINE: None. When both of these conditions are met, instead of copying files to the new job, we would point Espresso's output-directory keyword at the previous job's directory. Espresso would then read the previous data in place, without the need for copying.
Pros:
Cons:
It's worth noting that this problem is somewhat niche, and Espresso's extreme modularity is definitely the problem here.
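The settings-gated behaviour proposed above could be sketched roughly as follows. This is an assumption-laden illustration: `plan_restart` and its returned dictionary are hypothetical, and `gzip_files` and `workflow_engine` simply stand in for the GZIP_FILES and WORKFLOW_ENGINE settings.

```python
from pathlib import Path


def plan_restart(prev_dir, new_dir, gzip_files, workflow_engine):
    """Choose between reading in place and copying the previous results.

    In-place reading is only considered when files are not gzipped and
    no workflow engine is in use, as in the proposal above.
    """
    if not gzip_files and workflow_engine is None:
        # Read directly from the previous job's directory; nothing is copied.
        return {"outdir": str(Path(prev_dir)), "copy_from": None}
    # Otherwise keep the current behaviour: copy, then run in the new directory.
    return {"outdir": str(Path(new_dir)), "copy_from": str(Path(prev_dir))}
```

For the example above, `plan_restart("tmp-static", "tmp-projwfc", False, None)` would leave the 1TB of data where it is, while any other combination of settings falls back to copying.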