Add `provenance_exclude_list` attribute to `CalcInfo` data structure #3720
@sphuber What happens if the delete operation fails?
The logic is now this:
This whole |
Thanks. It would be important to be sure the file is not there. If the delete operation fails (without Python catching an exception) we could possibly end up including the file anyway. One possibility is to control the copy/move side of the operation, such that if that fails, there is no way the file is present in the repository. Maybe we could introduce a second sandbox folder?
2046bc0
to
14eb99a
Compare
@giovannipizzi and @espenfl : I have added a commit that changes the method from moving the sandbox to the repository, to manually recursively copying files that are not excluded. As noted in the commit message, this is less performant but once the new repository interface is there and we can no longer assume a file system solution, the move method will no longer be possible anyway. The only fundamental difference still that might be problematic is that with the current copy implementation, empty folders (or folders that would have been empty when taking the excludes into account) will not be present in the repository. Is this a problem? I know that certain solutions depend on empty folders being in the sandbox as they are uploaded to the working directory (such as the |
@sphuber That is formally a safer approach so I fully support this. Thanks for submitting the update. One can use a copy operation that uses |
I am not too worried about the performance hit, since anyway in the near future we will have to implement that change. So I take it you are happy with this implementation then? If @giovannipizzi is also ok, I will add the necessary documentation and squash the last commit in and then it should be ready to go in.
I am happy with this. Thanks a lot @sphuber. I only have a question at this point.
    for filename in filenames:
        filepath = os.path.join(root, filename)
        relpath = os.path.relpath(filepath, folder.abspath)
        if relpath not in provenance_exclude_list:
Is there a way that `relpath` and `provenance_exclude_list` can refer to the same file, but not contain the same string? Except of course if the plugin supplies something that is not correct for `provenance_exclude_list`?
If the `provenance_exclude_list` is defined correctly, then no, this should work. The `provenance_exclude_list` should contain file paths relative to the base path of the `folder` sandbox. Since the files therein are also created with relative filepaths in the `prepare_for_submission` where the `provenance_exclude_list` is created, there is a one-to-one correspondence. So I think the risk for mistakes should be minimal.
Example would be:
    import os

    def prepare_for_submission(self, folder):
        provenance_exclude_list = []
        for element, pseudo in self.inputs.pseudos.items():
            with pseudo.open(mode='rb') as source:
                # Path relative to the base of the sandbox, e.g. 'pseudos/Si'
                filename = os.path.join('pseudos', element)
                folder.create_file_from_filelike(source, filename)
                # Record the same relative path so the file is kept out of the repository
                provenance_exclude_list.append(filename)
The `provenance_exclude_list` will now for example contain:
    ['pseudos/Si', 'pseudos/Ge']
and these files will not be stored in the repository. For your use case it will be even easier because you just specify `provenance_exclude_list = ['POSCAR']` and you're done.
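To illustrate the question above: the check in the diff is a plain string comparison, so two spellings of the same relative path would not match. The sketch below is not part of this PR. `is_excluded` is a hypothetical helper that normalises both sides with `os.path.normpath` before comparing, and the incorrectly written entries in `exclude` are made up for the example.

```python
import os


def is_excluded(relpath, provenance_exclude_list):
    """Return True if ``relpath`` matches an entry once both sides are normalised.

    Hypothetical helper for illustration only, not part of the PR.
    """
    normalised = {os.path.normpath(entry) for entry in provenance_exclude_list}
    return os.path.normpath(relpath) in normalised


# Entries a plugin might supply that point at the right files but are spelled differently
exclude = ['./pseudos/Si', 'pseudos//Ge']

# The plain membership test used in the diff misses the file
plain_match = 'pseudos/Si' in exclude

# Comparing normalised paths finds it
normalised_match = is_excluded('pseudos/Si', exclude)
```

As long as the plugin writes the file and the exclude entry from the same `filename` variable, as in the example above, the plain comparison is sufficient.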
Thanks for commenting clearly on this.
caadf0d
to
cd68342
Compare
Since I think we are happy with naming/interface/implementation I wrote the necessary documentation and so this is now ready for final review @giovannipizzi
Thanks a lot! Looks great
This new attribute takes a flat list of relative filepaths, which correspond to files in the `folder` sandbox passed to the `prepare_for_submission` call of the `CalcJob`, that should not be copied to the repository of the `CalcJobNode`. This functionality is useful to prevent the content of input files, which must be copied to the working directory of the calculation, from also being stored permanently in the file repository. Example use cases are very large input files or files whose content is proprietary. Both use cases could already be implemented using the `local_copy_list`, but only for files of an input node in its entirety. The syntax of the `local_copy_list` does not support the exclusion of arbitrary files that are written by the calculation plugin to the sandbox folder.

Before the addition of this new feature, the contents of the sandbox folder were added to the repository of the calculation node simply by moving the sandbox in its entirety to the repository. This was changed to an explicit loop over the contents that copies only those files that do not appear in the `provenance_exclude_list`.

The advantage of recursively looping over the contents of the sandbox folder and *copying* each file to the repository as long as it is not part of `provenance_exclude_list`, over deleting the excluded files from the sandbox before *moving* the remaining content to the repository, is that the former gives a better guarantee that the excluded files do not accidentally end up in the repository due to an unnoticed problem in the deletion from the sandbox.

The moving method is of course a lot more efficient than copying files one by one. However, moving is only possible as long as the repository is implemented on the same filesystem as the sandbox. Once the new repository interface is fully implemented, where non-filesystem repositories are also possible, moving the sandbox folder to the repository will no longer be possible, so it is acceptable to make this change now.
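The copy loop described in this commit message can be sketched roughly as follows. This is a minimal standalone illustration that works on plain directories with `os.walk` and `shutil.copyfile`, not the actual AiiDA implementation. The function name `copy_sandbox` and the file names in the demo are assumptions made for the example.

```python
import os
import shutil
import tempfile


def copy_sandbox(sandbox, repository, provenance_exclude_list=()):
    """Copy every file under ``sandbox`` to ``repository`` unless its relative path is excluded.

    Hypothetical stand-in for the loop described above, using plain directories.
    """
    excluded = set(provenance_exclude_list)
    copied = []
    for root, _dirnames, filenames in os.walk(sandbox):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            relpath = os.path.relpath(filepath, sandbox)
            if relpath in excluded:
                continue  # excluded files are never written to the repository
            target = os.path.join(repository, relpath)
            # Target directories are created only when a file is actually copied into them
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copyfile(filepath, target)
            copied.append(relpath)
    return sorted(copied)


with tempfile.TemporaryDirectory() as sandbox, tempfile.TemporaryDirectory() as repository:
    os.makedirs(os.path.join(sandbox, 'pseudos'))
    for rel in ('aiida.in', os.path.join('pseudos', 'Si')):
        with open(os.path.join(sandbox, rel), 'w') as handle:
            handle.write('content')

    copied = copy_sandbox(sandbox, repository, [os.path.join('pseudos', 'Si')])
    excluded_present = os.path.exists(os.path.join(repository, 'pseudos', 'Si'))
    pseudos_dir_present = os.path.isdir(os.path.join(repository, 'pseudos'))
```

Because directories are created only on demand, the `pseudos` folder does not appear in the target at all here, which is the empty-folder behaviour raised earlier in the conversation.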
cd68342
to
cb4f9f1
Compare
Fixes #2956
This new attribute takes a flat list of relative filepaths, which correspond to files in the `folder` sandbox passed to the `prepare_for_submission` call of the `CalcJob`, that should not be copied to the repository of the `CalcJobNode`. This functionality is useful to avoid the content of input files, that should be copied to the working directory of the calculation, to also be stored permanently in the file repository. Example use cases are for very large input files or files whose content is proprietary. Both use cases could already be implemented using the `local_copy_list` but only in the case of files of an input node in its entirety. The syntax of the `local_copy_list` does not support the exclusion of arbitrary files that are written by the calculation plugin to the sandbox folder.
Question: currently in `aiida.engine.processes.calcjobs.daemon.upload_calculation` when I delete the files in `provenance_exclude_list` from the sandbox `folder` I do not catch any errors. Do we maybe want to ignore files that do not exist?

Note: once this is tested and approved, I will add the documentation. First wanted to agree on the nomenclature and functionality so that I do not have to adapt the text every time if we need changes.
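For the question about missing files, the two options could look roughly like this. It is a sketch on a plain directory, assuming deletion with `os.remove`. The name `remove_excluded` and its `ignore_missing` flag are hypothetical and do not reflect the code in `upload_calculation`.

```python
import os
import tempfile


def remove_excluded(sandbox, provenance_exclude_list, ignore_missing=True):
    """Delete the excluded files from ``sandbox`` and return the paths actually removed.

    Hypothetical helper: with ``ignore_missing=False`` a missing file raises instead.
    """
    removed = []
    for relpath in provenance_exclude_list:
        try:
            os.remove(os.path.join(sandbox, relpath))
        except FileNotFoundError:
            if not ignore_missing:
                raise
        else:
            removed.append(relpath)
    return removed


with tempfile.TemporaryDirectory() as sandbox:
    with open(os.path.join(sandbox, 'POSCAR'), 'w') as handle:
        handle.write('structure')

    # 'missing.dat' was never written to the sandbox and is silently skipped
    removed = remove_excluded(sandbox, ['POSCAR', 'missing.dat'])
    poscar_left = os.path.exists(os.path.join(sandbox, 'POSCAR'))
```

Ignoring missing files is lenient towards plugins that list a file they did not write, while raising would surface a typo in the exclude list early.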