Hash collision checks (PyInf#11487) #294

Closed
ArvidJB opened this issue Nov 20, 2023 · 4 comments · Fixed by #296

ArvidJB commented Nov 20, 2023

In light of our recent issues with hashing (see #280 and #288), we should add a simple sanity check when we write chunks. In particular, I think we should modify the logic in write_dataset here:

idx = hashtable.largest_index
data_s = data[s.raw]
# Tentative location for the new chunk at the end of the raw data
raw_slice = Slice(idx*chunk_size, idx*chunk_size + data_s.shape[0])
# Deduplicate: if this hash is already in the table, reuse the existing raw slice
data_hash = hashtable.hash(data_s)
raw_slice2 = hashtable.setdefault(data_hash, raw_slice)
if raw_slice2 == raw_slice:
    # New hash: this chunk actually needs to be written
    slices_to_write[raw_slice] = s
slices[s] = raw_slice2

to add a check like

if raw_slice2 == raw_slice:
    slices_to_write[raw_slice] = s
elif not np.array_equal(ds[Tuple(raw_slice2, *[slice(0, i) for i in data_s.shape[1:]]).raw],
                        data_s, equal_nan=True):
    # The hash matched an existing chunk but the underlying data differs:
    # either a hash collision or a corrupted hashtable.
    raise ValueError('hashed data chunk and new data chunk do not match')

I'd argue that this check will be very fast (compared to the SHA256 computation at least!) so the extra overhead will be negligible.
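For a rough sense of scale, here is a hypothetical micro-benchmark (not from the repository; the chunk size is chosen arbitrarily) comparing a NaN-aware elementwise comparison of one chunk against computing its SHA256 digest. On typical hardware the comparison should come out several times cheaper:

import hashlib
import timeit

import numpy as np

chunk = np.random.default_rng(0).random(2**16)   # arbitrary illustrative chunk
other = chunk.copy()

# Total seconds for 1000 runs equals milliseconds per call.
t_cmp = timeit.timeit(lambda: np.array_equal(chunk, other, equal_nan=True), number=1000)
t_sha = timeit.timeit(lambda: hashlib.sha256(chunk.tobytes()).digest(), number=1000)

print(f"array_equal with equal_nan: {t_cmp:.3f} ms per call")
print(f"sha256 digest:              {t_sha:.3f} ms per call")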

ArvidJB commented Nov 20, 2023

Internal ticket for tracking: PyInf#11487.

@peytondmurray

Reading this code a little more closely, I think this will not work in its current form, because writing data to a file happens in two stages:

  1. We update the hashtable, adding the hashes of new chunks we will write
  2. We actually write the data

So we can't actually index into ds, because it won't be "up to date" with the hashtable at the point of this check. I'll think a little more about how to do this; maybe we can defer the check until after writing to ds? Alternatively, I guess we can index into data instead of ds, although that only works for hashes that are being written on this call to write_dataset.
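
To make the ordering issue concrete, here is a toy illustration in plain numpy (not the actual write_dataset code; only the names hashtable and slices_to_write mirror the snippet above, everything else is made up):

import numpy as np

chunk_size = 4
data = np.tile(np.arange(chunk_size), 2)   # two identical chunks written in one call
ds = np.empty(0)                           # the raw dataset on disk, still empty

hashtable = {}          # data hash -> start index of the raw slice
slices_to_write = {}    # raw slice start -> data slice start

# Stage 1: update the hashtable and decide which chunks need writing.
for start in range(0, data.size, chunk_size):
    chunk = data[start:start + chunk_size]
    raw_start = len(hashtable) * chunk_size
    reused_start = hashtable.setdefault(hash(chunk.tobytes()), raw_start)
    if reused_start == raw_start:
        slices_to_write[raw_start] = start
    else:
        # The second chunk deduplicates against the first, but ds is still
        # empty here: a check that reads ds[reused_start:...] would not see
        # the chunk it is supposed to compare against.
        print(f"chunk at {start} reuses raw slice {reused_start}, "
              f"but ds only has {ds.size} elements so far")

# Stage 2: only now are the scheduled chunks actually written to ds.
ds = np.concatenate([data[s:s + chunk_size] for s in slices_to_write.values()])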

ArvidJB commented Nov 30, 2023

I see, we also need to consider slices_to_write, since the same chunk could show up a second time in the data being written. I think that's basically your second proposal? Or we could defer the check until after writing as well.
Either way, I think this should still be very doable.

ArvidJB commented Nov 30, 2023

Something like this should work?

                if raw_slice2 == raw_slice:
                    slices_to_write[raw_slice] = s
                else:
                    # check that the reused data is the same
                    if raw_slice2 in slices_to_write:
                        # chunk will be written in this commit
                        reused_s = data[slices_to_write[raw_slice2].raw]
                    else:
                        # chunk already exists in raw data from previous commit
                        reused_s = ds[Tuple(raw_slice2, *[slice(0, i) for i in data_s.shape[1:]]).raw]
                    if not np.array_equal(reused_s, data_s, equal_nan=True):
                        raise ValueError(f'Hashed data chunk {reused_s} and new data chunk {data_s} do not match '
                                         f'for hash {data_hash}.')
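
One note on the equal_nan=True flag used above: without it, a float chunk containing NaN would compare unequal to an identical reused chunk and raise spuriously, since NaN never compares equal to NaN elementwise. A quick standalone illustration in plain numpy:

import numpy as np

chunk = np.array([1.0, np.nan, 3.0])
reused = chunk.copy()

print(np.array_equal(reused, chunk))                  # False: NaN != NaN elementwise
print(np.array_equal(reused, chunk, equal_nan=True))  # True: identical chunks match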
