-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collisions when hashing unbounded strings #288
Comments
I think we need to include the length of the string into the hash as well. Something like
|
ArvidJB
added a commit
to ArvidJB/versioned-hdf5
that referenced
this issue
Nov 7, 2023
When hashing `dtype=object` arrays we need to include the length of the individual strings, otherwise arrays with identical contents after concatenating all elements hash to the same value: ``` >>> import h5py ... import versioned_hdf5 ... import numpy as np ... ... path = "temp.h5" ... ... with h5py.File(path, mode="w") as f: ... file = versioned_hdf5.VersionedHDF5File(f) ... with file.stage_version("r0") as group: ... group.create_dataset( ... "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object), ... maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False, ... ) ... with file.stage_version("r1") as group: ... group.create_dataset( ... "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object), ... maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False, ... ) ... ... with h5py.File(path, mode="r") as f: ... file = versioned_hdf5.VersionedHDF5File(f) ... for i in range(2): ... print(file[f"r{i}"]["values"][:]) ... [b'a' b'bc'] [b'a' b'bc'] ``` By including the length of each string as part of the hash the hashes will now be unique. This also includes the changes in deshaw#287 and supersedes that PR. Fixes deshaw#288
You're absolutely right about this, |
peytondmurray
pushed a commit
that referenced
this issue
Nov 9, 2023
* Include string length when hashing When hashing `dtype=object` arrays we need to include the length of the individual strings, otherwise arrays with identical contents after concatenating all elements hash to the same value: ``` >>> import h5py ... import versioned_hdf5 ... import numpy as np ... ... path = "temp.h5" ... ... with h5py.File(path, mode="w") as f: ... file = versioned_hdf5.VersionedHDF5File(f) ... with file.stage_version("r0") as group: ... group.create_dataset( ... "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"a", b"bc"], dtype=object), ... maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False, ... ) ... with file.stage_version("r1") as group: ... group.create_dataset( ... "values", shape=(2,), dtype=h5py.string_dtype(encoding='ascii'), data=np.array([b"ab", b"c"], dtype=object), ... maxshape=(None,), chunks=(10000,), compression="gzip", shuffle=False, ... ) ... ... with h5py.File(path, mode="r") as f: ... file = versioned_hdf5.VersionedHDF5File(f) ... for i in range(2): ... print(file[f"r{i}"]["values"][:]) ... [b'a' b'bc'] [b'a' b'bc'] ``` By including the length of each string as part of the hash the hashes will now be unique. This also includes the changes in #287 and supersedes that PR. Fixes #288 * Update comment
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Because
dtype='O'
arrays of strings are hashed by computing a SHA256 hash of their contents we can get collisions if the concatenation of the contents of the array are the same:Here the two versions were supposed to contain
[b'a' b'bc']
and[b'ab' b'c']
, but, because the concatenation of the strings is the same, the hashes are the same, and the chunk gets (incorrectly) reused.The text was updated successfully, but these errors were encountered: