A casualty of the new hashing framework introduced in pydra#662 was the removal of file-hash caching (calculating the hash of each file only once per task). For large files this could be a significant performance regression, so it would be good to work out how to add it back in.
Suggestions
Just cache the checksum in the task object (if we place guards on it changing post-execution, see pydra#681, this might be sufficient)
Return the file mtime as part of bytes_repr and use it to build a local cache. This mapping could potentially be stored on disk for persistence between runs (a rough sketch follows below)
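A minimal sketch of what the second suggestion could look like, assuming a hypothetical cache file location and key format (neither is an existing pydra API):

```python
import json
from pathlib import Path

# Hypothetical persistent cache: maps "path:mtime_ns" -> hex digest.
# The file location and key format are illustrative assumptions.
CACHE_FILE = Path("~/.cache/pydra/hashes.json").expanduser()

def load_hash_cache() -> dict:
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def save_hash_cache(cache: dict) -> None:
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))

def cached_file_hash(path: Path, compute_hash) -> str:
    """Reuse a stored hash for as long as the file's mtime is unchanged."""
    cache = load_hash_cache()
    key = f"{path}:{path.stat().st_mtime_ns}"
    if key not in cache:
        cache[key] = compute_hash(path)
        save_hash_cache(cache)
    return cache[key]
```

Keying on the mtime means a touched or rewritten file invalidates its entry automatically, at the cost of rehashing files whose contents are unchanged but whose timestamps are not.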
My idea for this feature is for the bytes_repr overload for file classes (and potentially any other type whose hashes you want to cache) to yield a "time-stamp" object, consisting of a key (the file path) and the modification time, as the first item in the generator.
The calling hash_single function can then check for these key/mtime pairs in a "hashes cache" dict loaded from the cache directory and return the cached hash if one is present. Otherwise it proceeds through the remaining byte chunks, calculates the hash, and saves it into the hashes cache under that key/mtime pair. A sketch of this flow is shown below.
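Something like the following, perhaps. The signatures and the plain cache dict here are simplified placeholders rather than pydra's actual API; the point is the protocol of yielding the cache key before any byte chunks:

```python
import hashlib
from pathlib import Path
from typing import Generator, Tuple, Union

def bytes_repr_file(path: Path, chunk_size: int = 8192) -> Generator[Union[Tuple, bytes], None, None]:
    # First item: a key/mtime pair identifying the file's current state.
    yield (str(path), path.stat().st_mtime_ns)
    # Remaining items: the file's byte chunks.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

def hash_single(path: Path, hash_cache: dict) -> str:
    gen = bytes_repr_file(path)
    key = next(gen)  # the key/mtime pair yielded first
    if key in hash_cache:
        gen.close()  # cache hit: skip reading the file body entirely
        return hash_cache[key]
    h = hashlib.blake2b(digest_size=16)
    for chunk in gen:
        h.update(chunk)
    hash_cache[key] = h.hexdigest()
    return hash_cache[key]
```

On a cache hit the generator is closed before the file is ever opened, which is where the savings for large files come from.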
Yes, that makes sense to me. I would probably type the yield generically as Union[CacheKey, bytes], where CacheKey is a tuple[Hashable, ...] newtype. You could imagine some type wanting something other than (Path, int), e.g. an S3 path with a version key.
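Roughly like this (the S3 overload is a hypothetical illustration of a non-filesystem key shape, not an existing implementation):

```python
from typing import Generator, Hashable, NewType, Tuple, Union

# CacheKey as a newtype over a tuple of hashables, so each type can
# define its own key shape.
CacheKey = NewType("CacheKey", Tuple[Hashable, ...])

def bytes_repr_s3(bucket: str, key: str, version_id: str) -> Generator[Union[CacheKey, bytes], None, None]:
    yield CacheKey((bucket, key, version_id))  # version id instead of mtime
    ...  # then yield the object's byte chunks as usual
```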