
Collisions when hashing unbounded strings #288

Closed
ArvidJB opened this issue Nov 7, 2023 · 2 comments · Fixed by #289

Comments

@ArvidJB
Collaborator

ArvidJB commented Nov 7, 2023

Because dtype='O' arrays of strings are hashed by computing a SHA-256 hash of their concatenated contents, two arrays collide whenever the concatenations of their elements are the same:

>>> import h5py
... import versioned_hdf5
... import numpy as np
...
... path = "temp.h5"
...
... with h5py.File(path, mode="w") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     with file.stage_version("r0") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...     with file.stage_version("r1") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...
... with h5py.File(path, mode="r") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     for i in range(2):
...         print(file[f"r{i}"]["values"][:])
...
[b'a' b'bc']
[b'a' b'bc']

Here the two versions were supposed to contain [b'a' b'bc'] and [b'ab' b'c'], but, because the concatenation of the strings is the same, the hashes are the same, and the chunk gets (incorrectly) reused.
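The collision can be reproduced directly with hashlib, independent of versioned-hdf5: once the elements are concatenated, the element boundaries are gone, so both arrays produce the same digest. A minimal sketch:

```python
import hashlib

def naive_hash(values):
    """Hash a sequence of bytestrings by feeding them to SHA-256 in order."""
    h = hashlib.sha256()
    for v in values:
        h.update(v)  # equivalent to hashing the plain concatenation
    return h.hexdigest()

# Both arrays concatenate to b"abc", so the digests are identical.
print(naive_hash([b"a", b"bc"]) == naive_hash([b"ab", b"c"]))  # True: collision
```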

@ArvidJB
Collaborator Author

ArvidJB commented Nov 7, 2023

I think we need to include the length of each string in the hash as well. Something like

```python
def hash(self, data: np.ndarray):
    """
    Compute hash for `data` array.
    """
    # Object dtype arrays store the ids of the elements, which may or may not
    # be reused, making them unsuitable for hashing. Instead, we need to make
    # a combined hash with the value of each element.
    if data.dtype == 'object':
        hash_value = self.hash_function()
        for value in data.flat:
            if isinstance(value, str):
                # default to utf-8 encoding since it's a superset of ascii
                # (the only other valid encoding supported in h5py)
                value = value.encode('utf-8')
            # prefix each element with its length so that element boundaries
            # contribute to the hash
            hash_value.update(str(len(value)).encode('ascii'))
            hash_value.update(value)
        hash_value.update(bytes(str(data.shape), 'utf-8'))
        return hash_value.digest()
    else:
        return self.hash_function(
            data.data.tobytes() + bytes(str(data.shape), 'ascii')
        ).digest()
```
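With the length prefix, the two example arrays hash differently, since the byte streams fed to SHA-256 become `b"1a2bc"` and `b"2ab1c"`. A standalone sketch of the same idea, using `hashlib.sha256` directly rather than the class above:

```python
import hashlib

def length_prefixed_hash(values):
    """Hash a sequence of bytestrings, prefixing each with its length."""
    h = hashlib.sha256()
    for v in values:
        # the length prefix makes element boundaries part of the hashed stream
        h.update(str(len(v)).encode('ascii'))
        h.update(v)
    return h.hexdigest()

# The arrays from the issue no longer collide.
print(length_prefixed_hash([b"a", b"bc"]) != length_prefixed_hash([b"ab", b"c"]))  # True
```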

ArvidJB added a commit to ArvidJB/versioned-hdf5 that referenced this issue Nov 7, 2023
When hashing `dtype=object` arrays we need to include the
length of each individual string; otherwise, arrays whose
elements concatenate to identical contents
hash to the same value:
```
>>> import h5py
... import versioned_hdf5
... import numpy as np
...
... path = "temp.h5"
...
... with h5py.File(path, mode="w") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     with file.stage_version("r0") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...     with file.stage_version("r1") as group:
...         group.create_dataset(
...             "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object),
...             maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False,
...         )
...
... with h5py.File(path, mode="r") as f:
...     file = versioned_hdf5.VersionedHDF5File(f)
...     for i in range(2):
...         print(file[f"r{i}"]["values"][:])
...
[b'a' b'bc']
[b'a' b'bc']
```

By including the length of each string in the hash, the two
arrays now hash to distinct values.

This also includes the changes in deshaw#287
and supersedes that PR.

Fixes deshaw#288
@peytondmurray
Collaborator

You're absolutely right about this: `h.update(a)` followed by `h.update(b)` is equivalent to `h.update(a + b)`. Reviewing now!
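That equivalence is documented behavior of hashlib's incremental API, and is exactly why concatenation alone loses the element boundaries. It is easy to check:

```python
import hashlib

# Incremental updates...
h1 = hashlib.sha256()
h1.update(b"ab")
h1.update(b"c")

# ...produce the same digest as hashing the concatenation in one call.
h2 = hashlib.sha256(b"abc")

print(h1.digest() == h2.digest())  # True
```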

peytondmurray pushed a commit that referenced this issue Nov 9, 2023
* Include string length when hashing


* Update comment