
Collisions when hashing unbounded strings #288

Closed
ArvidJB opened this issue Nov 7, 2023 · 2 comments · Fixed by #289

Comments

@ArvidJB
Collaborator

ArvidJB commented Nov 7, 2023

Because dtype='O' arrays of strings are hashed by computing a SHA-256 hash of their concatenated contents, two arrays collide whenever the concatenations of their elements are the same:

>>> import h5py
... import versioned_hdf5
... import numpy as np
...
... path = "temp.h5"
...
... with h5py.File(path, mode="w") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     with file.stage_version("r0") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...     with file.stage_version("r1") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...
... with h5py.File(path, mode="r") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     for i in range(2):
...         print(file[f"r{i}"]["values"][:])
...
[b'a' b'bc']
[b'a' b'bc']

Here the two versions were supposed to contain [b'a' b'bc'] and [b'ab' b'c'], but, because the concatenation of the strings is the same, the hashes are the same, and the chunk gets (incorrectly) reused.
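The collision can be reproduced directly with hashlib, independent of versioned-hdf5: once the elements are concatenated, the element boundaries are gone, so both arrays produce the same digest. A minimal sketch:

```python
import hashlib

def naive_hash(values):
    """Hash a sequence of bytestrings by feeding them to SHA-256 in order."""
    h = hashlib.sha256()
    for v in values:
        h.update(v)  # equivalent to hashing the plain concatenation
    return h.hexdigest()

# Both arrays concatenate to b"abc", so the digests are identical.
print(naive_hash([b"a", b"bc"]) == naive_hash([b"ab", b"c"]))  # True: collision
```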

@ArvidJB
Collaborator Author

ArvidJB commented Nov 7, 2023

I think we need to include the length of each string in the hash as well. Something like

```python
def hash(self, data: np.ndarray):
    """
    Compute hash for `data` array.
    """
    # Object dtype arrays store the ids of the elements, which may or may not
    # be reused, making them unsuitable for hashing. Instead, we need to make
    # a combined hash with the value of each element.
    if data.dtype == 'object':
        hash_value = self.hash_function()
        for value in data.flat:
            if isinstance(value, str):
                # default to utf-8 encoding since it's a superset of ascii
                # (the only other valid encoding supported in h5py)
                value = value.encode('utf-8')
            # prefix each element with its length so that element boundaries
            # contribute to the hash
            hash_value.update(str(len(value)).encode('ascii'))
            hash_value.update(value)
        hash_value.update(bytes(str(data.shape), 'utf-8'))
        return hash_value.digest()
    else:
        return self.hash_function(
            data.data.tobytes() + bytes(str(data.shape), 'ascii')
        ).digest()
```
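With the length prefix, the two example arrays hash differently, since the byte streams fed to SHA-256 become `b"1a2bc"` and `b"2ab1c"`. A standalone sketch of the same idea, using `hashlib.sha256` directly rather than the class above:

```python
import hashlib

def length_prefixed_hash(values):
    """Hash a sequence of bytestrings, prefixing each with its length."""
    h = hashlib.sha256()
    for v in values:
        # the length prefix makes element boundaries part of the hashed stream
        h.update(str(len(v)).encode('ascii'))
        h.update(v)
    return h.hexdigest()

# The arrays from the issue no longer collide.
print(length_prefixed_hash([b"a", b"bc"]) != length_prefixed_hash([b"ab", b"c"]))  # True
```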

ArvidJB added a commit to ArvidJB/versioned-hdf5 that referenced this issue Nov 7, 2023
When hashing `dtype=object` arrays we need to include the
length of each individual string; otherwise, arrays whose
elements concatenate to identical contents
hash to the same value:
```
>>> import h5py
... import versioned_hdf5
... import numpy as np
...
... path = "temp.h5"
...
... with h5py.File(path, mode="w") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     with file.stage_version("r0") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...     with file.stage_version("r1") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...
... with h5py.File(path, mode="r") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     for i in range(2):
...         print(file[f"r{i}"]["values"][:])
...
[b'a' b'bc']
[b'a' b'bc']
```

By including the length of each string in the hash, the two
arrays now hash to distinct values.

This also includes the changes in deshaw#287
and supersedes that PR.

Fixes deshaw#288
@peytondmurray
Collaborator

You're absolutely right about this: `h.update(a)` followed by `h.update(b)` is equivalent to `h.update(a + b)`. Reviewing now!
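That equivalence is documented behavior of hashlib's incremental API, and is exactly why concatenation alone loses the element boundaries. It is easy to check:

```python
import hashlib

# Incremental updates...
h1 = hashlib.sha256()
h1.update(b"ab")
h1.update(b"c")

# ...produce the same digest as hashing the concatenation in one call.
h2 = hashlib.sha256(b"abc")

print(h1.digest() == h2.digest())  # True
```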

peytondmurray pushed a commit that referenced this issue Nov 9, 2023
* Include string length when hashing


* Update comment