Recipe hashes #349
Conversation
Note that Python's built-in hash function is not deterministic across invocations. I would consider using a different hashing function. For example, you could pickle objects to bytes and compute the SHA-2 of those.
I now see that this is noted in the official docs. But these same docs still recommend implementing the built-in protocol as:

```python
def __hash__(self):
    return hash((self.name, self.nick, self.color))
```

My guess is that this is because built-in `hash` is typically only used within a single interpreter session (e.g., for dict and set membership), where cross-invocation determinism doesn't matter. But perhaps you're envisioning writing these hashes to the database, in which case consistency matters?
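For reference, a minimal sketch of the non-determinism under discussion (the salting behavior is standard CPython; the exact values printed will vary per machine):

```python
import os
import subprocess

# str and bytes hashes are salted per interpreter process (PYTHONHASHSEED),
# so two separate invocations of the same one-liner generally print
# different values.
cmd = ["python", "-c", "print(hash('pangeo'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())

# Fixing PYTHONHASHSEED disables the randomization, making the output
# reproducible across invocations (for str/bytes; not a general solution).
env = {**os.environ, "PYTHONHASHSEED": "0"}
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout.strip())
```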
Based on https://eng.lyft.com/hashing-and-equality-in-python-2ea8c738fb9d, I'm now wondering if this PR would be better titled Recipe equality, with the implementation built on `__eq__`.
Yes, exactly. How else could we compare against previous runs of the recipe, which will occur not only in a different session but in a completely different VM, etc.? Recipe equality sounds good to me, but I wonder if it relies on hashing under the hood. Having a deterministic SHA-2 hash for each recipe would be a cool feature to have regardless.
This now demonstrates a presumably rather naive implementation of what should be consistent SHA-2 hashes for both patterns and recipes:

```python
In [1]: from pangeo_forge_recipes.patterns import pattern_from_file_sequence
   ...: from pangeo_forge_recipes.recipes import XarrayZarrRecipe
   ...:
   ...: pattern = pattern_from_file_sequence(["a", "b"], "time")
   ...: recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 1})

In [2]: hash(pattern)
Out[2]: 1353826159998106392

In [3]: hash(recipe)
Out[3]: 803547776625927664
```
I'm making some opinionated choices in these methods, for example excluding runtime-specific attributes such as `storage_config` from the hash.

Beyond these opinionated choices, both the official Python docs and various blog posts seem to advise rather extreme caution when considering coercing a mutable object to be hashable. Everyone seems to agree that making the object itself frozen is a better path. So @rabernat, what do you think:

1. Keep the recipe classes mutable and implement `__hash__` for them carefully, or
2. refactor the recipe classes into frozen objects?

While option 1 does come with some risk of downstream bugs, I'll note that the official docs don't say never to do this, just to be careful. And though option 2 is less risky in terms of downstream bugs, the opportunity cost of undertaking such a major refactor right now feels like a significant risk in itself, albeit of a different kind. Regarding what bugs we are exposing ourselves to by hashing mutable objects, the main concern seems to be that an object's hash can change after mutation, silently breaking any dict or set that already contains it.
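As an aside, a minimal sketch of the frozen-object path, using a hypothetical `Dim` class rather than the actual recipe classes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    name: str
    length: int

# frozen=True makes instances immutable; together with eq=True (the default),
# dataclasses auto-generates a __hash__, so instances can safely be used as
# dict keys or set members without the mutation hazards discussed above.
d = {Dim("time", 10): "ok"}
print(d[Dim("time", 10)])  # -> "ok"
```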
Nice work exploring this complex issue, Charles. I think your suggestion of excluding storage_config from the recipe's hash is a good one regardless of how we implement it. I think of that as something set at runtime, not part of the recipe itself. The code should reflect this somehow.

In retrospect it might have been good for us to sync up a bit before you got into implementing this. If we had, I would probably have clarified the following point: I don't think we need to use Python's built-in hash function here. I think our needs are a bit different from what that was intended for. So I think we can bypass a bit of boilerplate and find a simpler solution. If I wanted to compute a hash of a recipe, I would do it like this:

```python
import hashlib
import pickle

serialized = pickle.dumps(recipe)
sha256 = hashlib.sha256(serialized).hexdigest()
```

This is literally just taking the Python object, pickling it into bytes, and computing the hash of those bytes. Let's link up tomorrow and discuss the options here.
A few observations from experimenting with this approach just now: pickling the same recipe in separate sessions does not produce identical bytes, so the resulting SHA-256 digests differ across invocations. My guess is that pickle is not designed to produce outputs with deterministic hashes, but rather only to allow recovery of functionally equivalent objects. I do agree that staying away from built-in `hash` is the right call here.
This might be a case for using cloudpickle instead. (That's what Dask uses to send such functions over the wire.)
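A minimal sketch of that idea, assuming cloudpickle is installed, and with the caveat from this thread that byte-level determinism still isn't guaranteed:

```python
import hashlib

import cloudpickle  # third-party: pip install cloudpickle

def preprocess(ds):
    # stand-in for a user-supplied preprocessing function
    return ds

# cloudpickle can serialize functions defined interactively or in __main__,
# which stdlib pickle cannot; hashing the bytes yields *a* digest, but not
# necessarily the same digest across environments or library versions.
digest = hashlib.sha256(cloudpickle.dumps(preprocess)).hexdigest()
print(digest)
```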
Interesting discovery. I wonder what the source of this variability is. Let's discuss at our meeting.
Pickle definitely makes no such guarantee. My simplest guess is that the order of the objects in the output changes, but it could be any number of effects. You would end up defining your own pickle hooks (`__reduce__`, etc.) to control it. I think it's worth saying that you can probably find a way to guarantee that when two recipe realisations have the same hash they are equivalent, but not that realisations with different hashes are necessarily not equivalent.
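One concrete, reproducible source of this kind of variability, as a minimal sketch: the per-process string-hash salting discussed above changes set iteration order, which changes the pickled bytes.

```python
import hashlib
import pickle

# The iteration order of a set of strings depends on the per-process hash
# salt, so across interpreter runs this set can pickle to different bytes,
# even though the unpickled sets compare equal.
s = {"a", "b", "c"}
print(hashlib.sha256(pickle.dumps(s)).hexdigest())
# Run this script twice (without fixing PYTHONHASHSEED) and compare digests.
```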
This blog post is extremely useful: https://death.andgravity.com/stable-hashing. It explains why the approaches we have tried so far don't work. It suggests serializing the class to JSON and then computing the hash of the JSON as bytes. That also won't work out of the box here, because our dataclasses contain other dataclasses and/or Python functions (preprocess, etc.) and FilePattern objects. I think the way forward is something like this, in pseudocode:

```python
d = recipe.to_dict()

# now traverse the dictionary and hash everything that is not json-serializable
d_ser = {}
for k, v in d.items():
    if is_json_serializable(v):
        d_ser[k] = v
    elif hasattr(v, "sha256"):
        d_ser[k] = v.sha256()  # we have to define that method, e.g. for FilePattern
    else:
        raise ValueError("Don't know how to serialize this")
```

So if we can figure out how to compute the sha256 for a FilePattern in a deterministic way, I think we should be good. This will probably fail on recipes with preprocessing functions. Maybe that's okay for now...we punt and say that if a recipe has preprocessing, it can't be hashed and therefore has to always be rerun. Would that be an okay tradeoff? If not, we need to solve the problem of how to deterministically serialize / hash a generic Python function.
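The `is_json_serializable` helper in the pseudocode above is hypothetical; a minimal sketch of one way to implement it would be to probe with `json.dumps`:

```python
import json

def is_json_serializable(obj) -> bool:
    # Probe serialization directly; json.dumps raises TypeError on
    # unsupported types (and ValueError for e.g. circular references).
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False
```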
In terms of implementation.
Here is the idea we just discussed for encoding the FilePattern as a blockchain:

```python
import pandas as pd
from dataclasses import asdict
from enum import Enum
from hashlib import sha256
from json import dumps

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

URL_FORMAT = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{time:%Y%m}/oisst-avhrr-v02r01.{time:%Y%m%d}.nc"
)

def make_url(time):
    return URL_FORMAT.format(time=time)

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)

def json_default(thing):
    # custom serializer for FileType, CombineOp, etc.
    if isinstance(thing, Enum):
        return thing.value
    raise TypeError(f"object of type {type(thing).__name__} not serializable")

# https://death.andgravity.com/stable-hashing
def dict_to_sha256(thing):
    b = dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )
    return sha256(b.encode('utf-8')).digest()

def dimindex_to_sha256(di):
    d = asdict(di)
    del d['sequence_len']  # don't want this in the hash; could change
    return dict_to_sha256(d)

def pattern_blockchain(pattern):
    # this contains general aspects of the file pattern
    # we exclude the file pattern and concat dims because they are generated by iterating
    # if they are generated in a different way, we ultimately don't care
    # The only exception is nitems_per_file...need some way to include that
    root = {
        'fsspec_open_kwargs': pattern.fsspec_open_kwargs,
        'query_string_secrets': pattern.query_string_secrets,
        'file_type': pattern.file_type,
        'is_opendap': pattern.is_opendap,
    }
    root_sha256 = dict_to_sha256(root)

    blockchain = [root_sha256]
    for k, v in pattern.items():
        key_hash = b''.join(sorted([dimindex_to_sha256(dimindex) for dimindex in k]))
        value_hash = sha256(v.encode('utf-8')).digest()
        new_hash = key_hash + value_hash
        new_block = sha256(blockchain[-1] + new_hash).digest()
        blockchain.append(new_block)

    return blockchain

chain = pattern_blockchain(pattern)
last_hash = chain[-1]

# now make a new file pattern that extends the old one
new_dates = pd.date_range('1981-09-01', '2022-05-01', freq='D')
new_pattern = FilePattern(make_url, ConcatDim("time", new_dates, nitems_per_file=1))
new_chain = pattern_blockchain(new_pattern)

for k, h in zip(new_pattern, new_chain):
    if h == last_hash:
        print(f"Found a match for {h.hex()}!")
        break
    print(k)
```

Because each block folds in the previous one, a pattern that merely appends new files reproduces the original chain as a prefix; locating `last_hash` in the new chain identifies exactly where the new inputs begin, so only those need to be processed. This approach only makes sense for FilePatterns with a single concat dim. We would need to take care that the merge dim does not break it.
@rabernat, do you think the file pattern should contribute to the recipe's hash? Adapting the blog, I had been excluding keys in

```python
_hash_exclude_ = ["file_pattern", "storage_config"]
```

but in the last commit I walked this back to remove `file_pattern` from the exclusions (see `pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 802 to 812 at e78febe), so even if we exclude `storage_config`, the pattern is still reflected in the recipe's hash. Thoughts?
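A minimal, hypothetical sketch of how a `_hash_exclude_` attribute might be consumed (the actual implementation lives in the lines referenced above):

```python
from dataclasses import asdict

def hashable_dict(obj):
    # Build the dict to be stably hashed from a dataclass instance, dropping
    # any fields the class marks as excluded (e.g. runtime-only settings
    # like storage_config).
    d = asdict(obj)
    for key in getattr(obj, "_hash_exclude_", []):
        d.pop(key, None)
    return d
```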
Personally, I think it actually makes a lot of sense to include the file pattern in the recipe's hash. When re-running recipes in a feedstock, is there any advantage to independent recipe and pattern SHAs that I'm missing?

Edit: I guess this is less important than the practical implications, but logically a recipe can't be instantiated without a file pattern, so it makes sense that this should be part of its unique SHA identifier.
I just reviewed this, and I am very fine with this ☝️ decision.
@rabernat, this is ready for review. I've opened placeholder issues to indicate what remains to do before we can close pangeo-forge/user-stories#3. Aside from the release notes, I haven't updated the docs, because I haven't been thinking of this as a user-facing feature. If you see a good place to mention it in the docs, I'm happy to write something up.
This is looking great! Just a few minor comments.
@rabernat, codecov is working in my browser now. Based on the coverage report, it looks like we have near-100% line coverage on the additions in this PR. AFAICT, only one new line is not yet covered.
a5bcf1e should cover this.
WIP which aims to close #269 when complete.
Part of a solution for pangeo-forge/user-stories#3. pangeo-forge/cmip6-feedstock#2 is an example of a current, actual feedstock PR for which this would be useful.