Caching of file-set hashes by local path and mtimes #700
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #700 +/- ##
==========================================
+ Coverage 83.93% 84.22% +0.29%
==========================================
Files 24 25 +1
Lines 5029 5123 +94
Branches 1429 1449 +20
==========================================
+ Hits 4221 4315 +94
Misses 802 802
Partials 6 6
I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR. AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location, which was set to a temporary folder unless overridden.

Each task's cache folder name was composed of the task's name and a hash value, the latter computed from the task's input field values. The cache would be re-used if the name and hash values did not change and the same cache folder was specified.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?
No worries, I just put you all down as reviewers so you were notified. Don't feel like you need to contribute if the area isn't familiar
Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so that files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed. It does this by hashing the path and mtime of the file/directory (not to be confused with hashing the file/directory contents) and using the result as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.
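To make the scheme concrete, here is a minimal sketch of a persistent hash cache keyed by (path, mtime). All names, the directory layout, and the key scheme here are illustrative assumptions, not pydra's actual implementation:

```python
import hashlib
from pathlib import Path

# Demo-only default location; pydra's real cache location differs.
DEFAULT_CACHE = Path.home() / ".cache" / "hash-cache-demo"

def _cache_key(path: Path) -> str:
    # The key is derived from the path and mtime only, so computing it
    # never requires reading the file's contents.
    stamp = f"{path.absolute()}:{path.stat().st_mtime}"
    return hashlib.blake2b(stamp.encode(), digest_size=16).hexdigest()

def hash_file(path: Path, cache_dir: Path = DEFAULT_CACHE) -> str:
    """Content hash of *path*, reusing a stored value when the
    (path, mtime) pair has been seen before (sketch, not pydra's API)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / _cache_key(path)
    if entry.exists():            # cache hit: no re-hashing needed
        return entry.read_text()
    digest = hashlib.blake2b(path.read_bytes()).hexdigest()
    entry.write_text(digest)      # remember for the next lookup
    return digest
```

One consequence of keying on mtime is that a file rewritten within the filesystem's timestamp granularity can return a stale hash, which is why the tests in this thread need to manipulate mtimes rather than rely on real clock time.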
Better to mock the call than to sleep. Here's an example where I've done that.
Interesting, I assumed that the value for the mtime was controlled by the file-system not Python. |
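mtimes can in fact be written from Python. One sleep-free alternative for tests is to set them directly with the stdlib's os.utime; this is a generic test-helper sketch, not the mocking example referenced above:

```python
import os
from pathlib import Path

def bump_mtime(path: Path, seconds: float = 10.0) -> None:
    """Push *path*'s mtime forward without sleeping (test helper sketch).

    os.utime writes the (atime, mtime) pair directly, so a test can
    simulate a 'later' modification instantly instead of sleeping past
    the filesystem's timestamp granularity.
    """
    st = path.stat()
    os.utime(path, (st.st_atime, st.st_mtime + seconds))
```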
Some suggestions and a question mark regarding the cache path in the home folder. Maybe also worth having a look at mtime mocking instead of sleep steps, as suggested by Chris.
Looks solid otherwise 👍
@tclose - you can try with the new slurm testing workflow, it should work now!
lgtm! just left comments regarding the location of the cache directory
Force-pushed from 0940383 to 1c1c309
@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures
Force-pushed from 21540ad to 96dcc48
@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.
@tclose - the recent commit to this PR comes from some rebasing?
I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes |
Ok, trying to figure out if I should review it again, since I already accepted it.
Sorry, I've completely lost the plot on this PR. I don't understand what's going on, but there are some things that look wrong to me.
Sorry, I'm sure we discussed it at some point, but I'm not sure about one important thing... it looks like if I move a file around my filesystem I have no way of reusing the previous tasks now, is that right?
Sorry, I was wrong; the task has the correct hash when the file is the same but at a different path. Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568 (I was planning to create a PR to you, but I must have merged more things into my branch)
Ok, nice addition
I've just realized that we should also have a similar test for when the persistent cache is used, but it's not enough to set
@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do
def test_task_files_persistentcache(tmp_path):
@tclose - where are you setting the persistent cache path? i.e. PYDRA_HASH_CACHE
At @ghisvail's suggestion, the hash cache is stored in a system-dependent user cache directory using platformdirs.user_cache_dir by default (e.g. /Users/<username>/Library/Caches/pydra/<version-number> on macOS). I have just tweaked the code so that it is now put in a hashes subdirectory of that user cache dir (accessible via the pydra.utils.user_cache_dir variable), in case any other cache data needs to be stored there in the future.
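For illustration, a default location like this can be derived with the third-party platformdirs package. The app name and version string below are placeholder assumptions, and the ImportError fallback is only a rough sketch so the example runs without the dependency:

```python
import sys
from pathlib import Path

try:
    from platformdirs import user_cache_dir  # third-party package
except ImportError:
    # Rough fallback sketch; the real platformdirs handles many more
    # platforms and conventions than this.
    def user_cache_dir(appname, appauthor=None, version=None):
        if sys.platform == "darwin":
            base = Path.home() / "Library" / "Caches"
        elif sys.platform == "win32":
            base = Path.home() / "AppData" / "Local"
        else:
            base = Path.home() / ".cache"
        path = base / appname
        if version:
            path = path / version
        return str(path)

# "0.23" is a placeholder version string, not necessarily pydra's.
hashes_dir = Path(user_cache_dir("pydra", version="0.23")) / "hashes"
```

Versioning the cache directory means a release that changes the hashing scheme will not pick up stale entries written by an older release.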
Oh, got it! Sorry I missed that. I've just realized that I made a typo when I was setting PYDRA_HASH_CACHE, which is why it wasn't saving the hashes there, and I was confused about where they were being saved.
@tclose - thanks so much for the work!
Types of changes

Summary

Will address #683 by:

- adding bytes_repr() overloads for FileSets to yield a "time-stamp" object, consisting of a key (file path) and modification time, as the first item in the generator
- having hash_single() check for these key/mtime pairs in a global "local hashes cache" dict

Checklist
Notes

I'm pretty happy with how this turned out, with the exception of a few wrinkles. Any suggestions would be most welcome:

- the cache_location path (my original plan) given that checksum and hash are object properties instead of methods
- ~/.pydra-hash-cache