Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add provenance_exclude_list attribute to CalcInfo data structure #3720

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Jan 20, 2020

Fixes #2956

This new attribute takes a flat list of relative filepaths, which
correspond to files in the folder sandbox passed to the
prepare_for_submission call of the CalcJob, that should not be
copied to the repository of the CalcJobNode. This functionality is
useful to avoid the content of input files, that should be copied to the
working directory of the calculation, to also be stored permanently in
the file repository. Example use cases are for very large input files or
files whose content is proprietary. Both use cases could already be
implemented using the local_copy_list but only in the case of files of
an input node in its entirety. The syntax of the local_copy_list does
not support the exclusion of arbitrary files that are written by the
calculation plugin to the sandbox folder.

Question: currently in aiida.engine.processes.calcjobs.daemon.upload_calculation when I delete the files in provenance_exclude_list from the sandbox folder I do not catch any errors. Do we maybe want to ignore files that do not exist?

Note: once this is tested and approved, I will add the documentation. First wanted to agree on the nomenclature and functionality so that I do not have to adapt the text every time if we need changes.

@espenfl
Copy link
Contributor

espenfl commented Jan 20, 2020

@sphuber What happens if the delete operation fails?

@sphuber
Copy link
Contributor Author

sphuber commented Jan 20, 2020

@sphuber What happens if the delete operation fails?

The logic is now this:

  1. Take the folder sandbox and copy contents to remote work dir
  2. Remove the provenance_exclude_list files from the sandbox
  3. Move remaining contents of folder to the repository of the CalcJobNode

This whole upload_calculation is wrapped in the exponential backoff mechanism as part of the UPLOAD transport task. So if the deletion of the file from the sandbox fails, it will raise an exception and the task will be retried later on. If the failure was due to a programming error by the CalcJob implementation, for example the file in provenance_exclude_list does not exist causing the delete to fail, it will try the 5 times and then pause the process. The user will then notice the programming mistake and have to kill the paused process

@espenfl
Copy link
Contributor

espenfl commented Jan 20, 2020

Thanks. It would be important to be sure the file is not there. If the delete operation fails (without Python catching an exception) we could possibly end up including the file anyway. One possibility is to control the copy/move side of the operation, such that if that fails, there is no way the file is present in the repository. Maybe we could introduce a second sandbox folder?

@sphuber sphuber force-pushed the fix_2956_calcjob_unpersistable_input_files branch 2 times, most recently from 2046bc0 to 14eb99a Compare January 21, 2020 15:13
@sphuber
Copy link
Contributor Author

sphuber commented Jan 21, 2020

@giovannipizzi and @espenfl : I have added a commit that changes the method from moving the sandbox to the repository, to manually recursively copying files that are not excluded. As noted in the commit message, this is less performant but once the new repository interface is there and we can no longer assume a file system solution, the move method will no longer be possible anyway. The only fundamental difference still that might be problematic is that with the current copy implementation, empty folders (or folders that would have been empty when taking the excludes into account) will not be present in the repository. Is this a problem?

I know that certain solutions depend on empty folders being in the sandbox as they are uploaded to the working directory (such as the pseudo directory for PwCalculation) and that still works, except they won't be present in the repository. For efficiency reasons its obviously better to not store empty directories, but is there a situation possible where they carry important meaning for the provenance?

@espenfl
Copy link
Contributor

espenfl commented Jan 22, 2020

@sphuber That is formally a safer approach so I fully support this. Thanks for submitting the update. One can use a copy operation that uses reflinks which basically gives you the same performance as a move operation. This works on most systems. Python support for this is another question. There is a reflink library as I mentioned in #2956 here: https://gitlab.com/rubdos/pyreflink. If that is something we want to include, I do not know. Also, at some point we will leave file operations altogether. The future of this also connects with possibly considering a new backend for handling repository operations as I mentioned in #335 (comment). So my initial view of the performance at this point is that it is less critical since we know we will at some point in the not too distant future abandon such operations. It is much better to be safe than sorry in my view when it comes to these things at this point.

@sphuber
Copy link
Contributor Author

sphuber commented Jan 22, 2020

I am not too worried about the performance hit, since anyway in the near future we will have to implement that change. So I take it you are happy with this implementation then? If @giovannipizzi is also ok, I will add the necessary documentation and squash the last commit in and then it should be ready to go in.

Copy link
Contributor

@espenfl espenfl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with this. Thanks a lot @sphuber. I only have a question at this point.

for filename in filenames:
filepath = os.path.join(root, filename)
relpath = os.path.relpath(filepath, folder.abspath)
if relpath not in provenance_exclude_list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way that relpath and provenance_exclude_list can refer to the same file, but not contain the same string? Except of course if the plugin supply something that is not correct for provenance_exclude_list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the provenance_exclude_list is defined correctly, then no, this should work. The provenance_exclude_list should contain file paths relative to the base path of the folder sandbox. Since the files therein are created also with relative filepaths in the prepare_for_submission where the provenance_exclude_list is created, there is a one-to-one correspondence. So I think the risk for mistakes should be minimal.
Example would be:

def prepare_for_submission(self, folder):
    provenance_exclude_list = []
    for element, pseudo in self.inputs.pseudos.items():
        with pseudo.open(mode='rb') as source:
            filename = os.path.join('pseudos', element)
            folder.create_file_from_filelike(source, filename)
            provenance_exclude_list.append(filename)

the provenance_exclude_list will now for example contain:

['pseudos/Si', 'pseudos/Ge']

which will now not be stored in the repository. For your use case it will be even easier because you just specify `provenance_exclude_list = ['POSCAR']`` and you're done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for commenting clearly on this.

@sphuber sphuber force-pushed the fix_2956_calcjob_unpersistable_input_files branch 2 times, most recently from caadf0d to cd68342 Compare January 24, 2020 11:18
@sphuber
Copy link
Contributor Author

sphuber commented Jan 24, 2020

Since I think we are happy with naming/interface/implementation I wrote the necessary documentation and so this is now ready for final review @giovannipizzi

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! Looks great

This new attribute takes a flat list of relative filepaths, which
correspond to files in the `folder` sandbox passed to the
`prepare_for_submission` call of the `CalcJob`, that should not be
copied to the repository of the `CalcJobNode`. This functionality is
useful to avoid the content of input files, that should be copied to the
working directory of the calculation, to also be stored permanently in
the file repository. Example use cases are for very large input files or
files whose content is proprietary. Both use cases could already be
implemented using the `local_copy_list` but only in the case of files of
an input node in its entirety. The syntax of the `local_copy_list` does
not support the exclusion of arbitrary files that are written by the
calculation plugin to the sandbox folder.

Before the addition of this new feature, the contents of the sandbox
folder were added to the repository of the calculation node simply by
moving the contents of the sandbox entirely to the repository. This was
changed to an explicit loop over the contents and only copying those
files that do not appear in the `provenance_exclude_list` list.

The advantage of recursively looping over the contents of the sandbox
folder and *copying* the contents to the repository as long as it is not
part of `provenance_exclude_list`, over deleting those excluded files
from the sandbox before *moving* the remaining content to the
repository, is that in the former there is a better guarantee that the
excluded files do not accidentally end up in the repository due to an
unnoticed problem in the deletion from the sandbox.

The moving method is of course a lot more efficient then copying files
one by one. However, this moving approach is only possible now that the
repository is still implemented on the same filesystem as the sandbox.
Once the new repository interface is fully implemented, where non file
system repositories are also possible, moving the sandbox folder to the
repository will no longer be possible anyway, so it is acceptable to
already make this change now, since it will have to be done at some
point anyway.
@sphuber sphuber force-pushed the fix_2956_calcjob_unpersistable_input_files branch from cd68342 to cb4f9f1 Compare February 10, 2020 16:52
@sphuber sphuber merged commit 53bbc74 into aiidateam:develop Feb 10, 2020
@sphuber sphuber deleted the fix_2956_calcjob_unpersistable_input_files branch February 10, 2020 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support to write CalcJob input files that are not stored in the repository
3 participants