Recipe hashes #349
Conversation
Note that Python's built-in hash function is not deterministic across invocations. I would consider using a different hashing function. For example, you could pickle objects to bytes and compute the SHA-2 of those.
I now see that this is noted in the official docs. But these same docs still recommend implementing the built-in protocol as:

```python
def __hash__(self):
    return hash((self.name, self.nick, self.color))
```

My guess is that this is because built-in `hash` is typically only used within a single interpreter session (e.g., for dict and set membership), where cross-invocation determinism doesn't matter. But perhaps you're envisioning writing these hashes to the database, in which case consistency matters?
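For reference, a minimal sketch of the non-determinism under discussion (the salting behavior is standard CPython; the exact values printed will vary per machine):

```python
import os
import subprocess

# str and bytes hashes are salted per interpreter process (PYTHONHASHSEED),
# so two separate invocations of the same one-liner generally print
# different values.
cmd = ["python", "-c", "print(hash('pangeo'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())

# Fixing PYTHONHASHSEED disables the randomization, making the output
# reproducible across invocations (for str/bytes; not a general solution).
env = {**os.environ, "PYTHONHASHSEED": "0"}
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True, env=env).stdout.strip())
```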
Based on https://eng.lyft.com/hashing-and-equality-in-python-2ea8c738fb9d, I'm now wondering if this PR would be better titled Recipe equality, with the implementation built on `__eq__`.
Yes, exactly. How else could we compare against previous runs of the recipe, which will occur not only in a different session but in a completely different VM, etc.? Recipe equality sounds good to me, but I wonder if it relies on hashing under the hood. Having a deterministic SHA-2 hash for each recipe would be a cool feature to have regardless.
This now demonstrates a presumably rather naive implementation of what should be consistent SHA-2 hashes for both patterns and recipes:

```python
In [1]: from pangeo_forge_recipes.patterns import pattern_from_file_sequence
   ...: from pangeo_forge_recipes.recipes import XarrayZarrRecipe
   ...:
   ...: pattern = pattern_from_file_sequence(["a", "b"], "time")
   ...: recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 1})

In [2]: hash(pattern)
Out[2]: 1353826159998106392

In [3]: hash(recipe)
Out[3]: 803547776625927664
```
I'm making some opinionated choices in these methods, for example excluding runtime-specific attributes such as `storage_config` from the hash.

Beyond these opinionated choices, both the official Python docs and various blog posts seem to advise rather extreme caution when considering coercing a mutable object to be hashable. Everyone seems to agree that making the object itself frozen is a better path. So @rabernat, what do you think:

1. Keep the recipe classes mutable and implement `__hash__` for them carefully, or
2. refactor the recipe classes into frozen objects?

While option 1 does come with some risk of downstream bugs, I'll note that the official docs don't say never to do this, just to be careful. And though option 2 is less risky in terms of downstream bugs, the opportunity cost of undertaking such a major refactor right now feels like a significant risk in itself, albeit of a different kind. Regarding what bugs we are exposing ourselves to by hashing mutable objects, the main concern seems to be that an object's hash can change after mutation, silently breaking any dict or set that already contains it.
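As an aside, a minimal sketch of the frozen-object path, using a hypothetical `Dim` class rather than the actual recipe classes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    name: str
    length: int

# frozen=True makes instances immutable; together with eq=True (the default),
# dataclasses auto-generates a __hash__, so instances can safely be used as
# dict keys or set members without the mutation hazards discussed above.
d = {Dim("time", 10): "ok"}
print(d[Dim("time", 10)])  # -> "ok"
```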
Nice work exploring this complex issue, Charles. I think your suggestion of excluding storage_config from the recipe's hash is a good one regardless of how we implement it. I think of that as something set at runtime, not part of the recipe itself. The code should reflect this somehow.

In retrospect it might have been good for us to sync up a bit before you got into implementing this. If we had, I would probably have clarified the following point: I don't think we need to use Python's built-in hash function here. I think our needs are a bit different from what that was intended for. So I think we can bypass a bit of boilerplate and find a simpler solution. If I wanted to compute a hash of a recipe, I would do it like this:

```python
import hashlib
import pickle

serialized = pickle.dumps(recipe)
sha256 = hashlib.sha256(serialized).hexdigest()
```

This is literally just taking the Python object, pickling it into bytes, and computing the hash of those bytes. Let's link up tomorrow and discuss the options here.
A few observations from experimenting with this approach just now: pickling the same recipe in separate sessions does not produce identical bytes, so the resulting SHA-256 digests differ across invocations. My guess is that pickle is not designed to produce outputs with deterministic hashes, but rather only to allow recovery of functionally equivalent objects. I do agree that staying away from built-in `hash` is the right call here.
This might be a case for using cloudpickle instead. (That's what Dask uses to send such functions over the wire.)
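A minimal sketch of that idea, assuming cloudpickle is installed, and with the caveat from this thread that byte-level determinism still isn't guaranteed:

```python
import hashlib

import cloudpickle  # third-party: pip install cloudpickle

def preprocess(ds):
    # stand-in for a user-supplied preprocessing function
    return ds

# cloudpickle can serialize functions defined interactively or in __main__,
# which stdlib pickle cannot; hashing the bytes yields *a* digest, but not
# necessarily the same digest across environments or library versions.
digest = hashlib.sha256(cloudpickle.dumps(preprocess)).hexdigest()
print(digest)
```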
Interesting discovery. I wonder what the source of this variability is. Let's discuss at our meeting.
Pickle definitely makes no such guarantee. My simplest guess is that the order of the objects in the output changes, but it could be any number of effects. You would end up defining your own pickle hooks (`__reduce__`, etc.) to control it. I think it's worth saying that you can probably find a way to guarantee that when two recipe realisations have the same hash they are equivalent, but not that realisations with different hashes are necessarily not equivalent.
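One concrete, reproducible source of this kind of variability, as a minimal sketch: the per-process string-hash salting discussed above changes set iteration order, which changes the pickled bytes.

```python
import hashlib
import pickle

# The iteration order of a set of strings depends on the per-process hash
# salt, so across interpreter runs this set can pickle to different bytes,
# even though the unpickled sets compare equal.
s = {"a", "b", "c"}
print(hashlib.sha256(pickle.dumps(s)).hexdigest())
# Run this script twice (without fixing PYTHONHASHSEED) and compare digests.
```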
This blog post is extremely useful: https://death.andgravity.com/stable-hashing. It explains why the approaches we have tried so far don't work. It suggests serializing the class to JSON and then computing the hash of the JSON as bytes. That also won't work out of the box here, because our dataclasses contain other dataclasses and/or Python functions (preprocess, etc.) and FilePattern objects. I think the way forward is something like this, in pseudocode:

```python
d = recipe.to_dict()

# now traverse the dictionary and hash everything that is not json-serializable
d_ser = {}
for k, v in d.items():
    if is_json_serializable(v):
        d_ser[k] = v
    elif hasattr(v, "sha256"):
        d_ser[k] = v.sha256()  # we have to define that method, e.g. for FilePattern
    else:
        raise ValueError("Don't know how to serialize this")
```

So if we can figure out how to compute the sha256 for a FilePattern in a deterministic way, I think we should be good. This will probably fail on recipes with preprocessing functions. Maybe that's okay for now...we punt and say that if a recipe has preprocessing, it can't be hashed and therefore has to always be rerun. Would that be an okay tradeoff? If not, we need to solve the problem of how to deterministically serialize / hash a generic Python function.
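The `is_json_serializable` helper in the pseudocode above is hypothetical; a minimal sketch of one way to implement it would be to probe with `json.dumps`:

```python
import json

def is_json_serializable(obj) -> bool:
    # Probe serialization directly; json.dumps raises TypeError on
    # unsupported types (and ValueError for e.g. circular references).
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False
```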
In terms of implementation.
Here is the idea we just discussed for encoding the FilePattern as a blockchain:

```python
import pandas as pd
from dataclasses import asdict
from enum import Enum
from hashlib import sha256
from json import dumps

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

dates = pd.date_range('1981-09-01', '2022-02-01', freq='D')

URL_FORMAT = (
    "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/"
    "v2.1/access/avhrr/{time:%Y%m}/oisst-avhrr-v02r01.{time:%Y%m%d}.nc"
)

def make_url(time):
    return URL_FORMAT.format(time=time)

time_concat_dim = ConcatDim("time", dates, nitems_per_file=1)
pattern = FilePattern(make_url, time_concat_dim)

def json_default(thing):
    # custom serializer for FileType, CombineOp, etc.
    if isinstance(thing, Enum):
        return thing.value
    raise TypeError(f"object of type {type(thing).__name__} not serializable")

# https://death.andgravity.com/stable-hashing
def dict_to_sha256(thing):
    b = dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )
    return sha256(b.encode('utf-8')).digest()

def dimindex_to_sha256(di):
    d = asdict(di)
    del d['sequence_len']  # don't want this in the hash; could change
    return dict_to_sha256(d)

def pattern_blockchain(pattern):
    # this contains general aspects of the file pattern
    # we exclude the file pattern and concat dims because they are generated by iterating
    # if they are generated in a different way, we ultimately don't care
    # The only exception is nitems_per_file...need some way to include that
    root = {
        'fsspec_open_kwargs': pattern.fsspec_open_kwargs,
        'query_string_secrets': pattern.query_string_secrets,
        'file_type': pattern.file_type,
        'is_opendap': pattern.is_opendap,
    }
    root_sha256 = dict_to_sha256(root)

    blockchain = [root_sha256]
    for k, v in pattern.items():
        key_hash = b''.join(sorted([dimindex_to_sha256(dimindex) for dimindex in k]))
        value_hash = sha256(v.encode('utf-8')).digest()
        new_hash = key_hash + value_hash
        new_block = sha256(blockchain[-1] + new_hash).digest()
        blockchain.append(new_block)

    return blockchain

chain = pattern_blockchain(pattern)
last_hash = chain[-1]

# now make a new file pattern that extends the old one
new_dates = pd.date_range('1981-09-01', '2022-05-01', freq='D')
new_pattern = FilePattern(make_url, ConcatDim("time", new_dates, nitems_per_file=1))
new_chain = pattern_blockchain(new_pattern)

for k, h in zip(new_pattern, new_chain):
    if h == last_hash:
        print(f"Found a match for {h.hex()}!")
        break
    print(k)
```

Because each block folds in the previous one, a pattern that merely appends new files reproduces the original chain as a prefix; locating `last_hash` in the new chain identifies exactly where the new inputs begin, so only those need to be processed. This approach only makes sense for FilePatterns with a single concat dim. We would need to take care that the merge dim does not break it.
@rabernat, do you think the file pattern should contribute to the recipe's hash? Adapting the blog, I had been excluding keys in

```python
_hash_exclude_ = ["file_pattern", "storage_config"]
```

but in the last commit I walked this back to remove `file_pattern` from the exclusions (see `pangeo_forge_recipes/recipes/xarray_zarr.py`, lines 802 to 812 at e78febe), so even if we exclude `storage_config`, the pattern is still reflected in the recipe's hash. Thoughts?
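A minimal, hypothetical sketch of how a `_hash_exclude_` attribute might be consumed (the actual implementation lives in the lines referenced above):

```python
from dataclasses import asdict

def hashable_dict(obj):
    # Build the dict to be stably hashed from a dataclass instance, dropping
    # any fields the class marks as excluded (e.g. runtime-only settings
    # like storage_config).
    d = asdict(obj)
    for key in getattr(obj, "_hash_exclude_", []):
        d.pop(key, None)
    return d
```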
Personally, I think it actually makes a lot of sense to include the file pattern in the recipe's hash. When re-running recipes in a feedstock, is there any advantage to independent recipe and pattern SHAs that I'm missing?

Edit: I guess this is less important than the practical implications, but logically a recipe can't be instantiated without a file pattern, so it makes sense that this should be part of its unique SHA identifier.
I just reviewed this, and I am very fine with this ☝️ decision.
@rabernat, this is ready for review. I've opened placeholder issues to indicate what remains to do before we can close pangeo-forge/user-stories#3. Aside from the release notes, I haven't updated the docs, because I haven't been thinking of this as a user-facing feature. If you see a good place to mention it in the docs, I'm happy to write something up.
This is looking great! Just a few minor comments.
@rabernat, codecov is working in my browser now. Based on the coverage report, it looks like we have near-100% line coverage on the additions in this PR. AFAICT, only one new line is not yet covered.
a5bcf1e should cover this.
WIP which aims to close #269 when complete.
Part of a solution for pangeo-forge/user-stories#3. pangeo-forge/cmip6-feedstock#2 is an example of a current, actual feedstock PR for which this would be useful.