Hash collision checks (PyInf#11487) #294

Closed
ArvidJB opened this issue Nov 20, 2023 · 4 comments · Fixed by #296

ArvidJB commented Nov 20, 2023

In light of our recent issues with hashing (see #280 and #288), we should add a simple sanity check when we write chunks. In particular, I think we should modify the logic in write_dataset here:

idx = hashtable.largest_index
data_s = data[s.raw]
# Tentative location for the new chunk at the end of the raw data
raw_slice = Slice(idx*chunk_size, idx*chunk_size + data_s.shape[0])
# Deduplicate: if this hash is already in the table, reuse the existing raw slice
data_hash = hashtable.hash(data_s)
raw_slice2 = hashtable.setdefault(data_hash, raw_slice)
if raw_slice2 == raw_slice:
    # New hash: this chunk actually needs to be written
    slices_to_write[raw_slice] = s
slices[s] = raw_slice2

to add a check like

if raw_slice2 == raw_slice:
    slices_to_write[raw_slice] = s
elif not np.array_equal(ds[Tuple(raw_slice2, *[slice(0, i) for i in data_s.shape[1:]]).raw],
                        data_s, equal_nan=True):
    # The hash matched an existing chunk but the underlying data differs:
    # either a hash collision or a corrupted hashtable.
    raise ValueError('hashed data chunk and new data chunk do not match')

I'd argue that this check will be very fast (compared to the SHA256 computation at least!) so the extra overhead will be negligible.
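For a rough sense of scale, here is a hypothetical micro-benchmark (not from the repository; the chunk size is chosen arbitrarily) comparing a NaN-aware elementwise comparison of one chunk against computing its SHA256 digest. On typical hardware the comparison should come out several times cheaper:

import hashlib
import timeit

import numpy as np

chunk = np.random.default_rng(0).random(2**16)   # arbitrary illustrative chunk
other = chunk.copy()

# Total seconds for 1000 runs equals milliseconds per call.
t_cmp = timeit.timeit(lambda: np.array_equal(chunk, other, equal_nan=True), number=1000)
t_sha = timeit.timeit(lambda: hashlib.sha256(chunk.tobytes()).digest(), number=1000)

print(f"array_equal with equal_nan: {t_cmp:.3f} ms per call")
print(f"sha256 digest:              {t_sha:.3f} ms per call")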

ArvidJB commented Nov 20, 2023

Internal ticket for tracking: PyInf#11487.

@peytondmurray

Reading this code a little more closely, I think this will not work in its current form, because writing data to a file happens in two stages:

  1. We update the hashtable, adding the hashes of new chunks we will write
  2. We actually write the data

So we can't actually index into ds, because it won't be "up to date" with the hashtable at the point of this check. I'll think a little more about how to do this; maybe we can defer the check until after writing to ds? Alternatively, I guess we can index into data instead of ds, although that only works for hashes that are being written on this call to write_dataset.
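
To make the ordering issue concrete, here is a toy illustration in plain numpy (not the actual write_dataset code; only the names hashtable and slices_to_write mirror the snippet above, everything else is made up):

import numpy as np

chunk_size = 4
data = np.tile(np.arange(chunk_size), 2)   # two identical chunks written in one call
ds = np.empty(0)                           # the raw dataset on disk, still empty

hashtable = {}          # data hash -> start index of the raw slice
slices_to_write = {}    # raw slice start -> data slice start

# Stage 1: update the hashtable and decide which chunks need writing.
for start in range(0, data.size, chunk_size):
    chunk = data[start:start + chunk_size]
    raw_start = len(hashtable) * chunk_size
    reused_start = hashtable.setdefault(hash(chunk.tobytes()), raw_start)
    if reused_start == raw_start:
        slices_to_write[raw_start] = start
    else:
        # The second chunk deduplicates against the first, but ds is still
        # empty here: a check that reads ds[reused_start:...] would not see
        # the chunk it is supposed to compare against.
        print(f"chunk at {start} reuses raw slice {reused_start}, "
              f"but ds only has {ds.size} elements so far")

# Stage 2: only now are the scheduled chunks actually written to ds.
ds = np.concatenate([data[s:s + chunk_size] for s in slices_to_write.values()])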

ArvidJB commented Nov 30, 2023

I see, we also need to consider slices_to_write, since the same chunk could show up a second time in the data being written. I think that's basically your second proposal? Or we could defer the check until after writing as well.
Either way, I think this should still be very doable.

ArvidJB commented Nov 30, 2023

Something like this should work?

                if raw_slice2 == raw_slice:
                    slices_to_write[raw_slice] = s
                else:
                    # check that the reused data is the same
                    if raw_slice2 in slices_to_write:
                        # chunk will be written in this commit
                        reused_s = data[slices_to_write[raw_slice2].raw]
                    else:
                        # chunk already exists in raw data from previous commit
                        reused_s = ds[Tuple(raw_slice2, *[slice(0, i) for i in data_s.shape[1:]]).raw]
                    if not np.array_equal(reused_s, data_s, equal_nan=True):
                        raise ValueError(f'Hashed data chunk {reused_s} and new data chunk {data_s} do not match '
                                         f'for hash {data_hash}.')
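
One note on the equal_nan=True flag used above: without it, a float chunk containing NaN would compare unequal to an identical reused chunk and raise spuriously, since NaN never compares equal to NaN elementwise. A quick standalone illustration in plain numpy:

import numpy as np

chunk = np.array([1.0, np.nan, 3.0])
reused = chunk.copy()

print(np.array_equal(reused, chunk))                  # False: NaN != NaN elementwise
print(np.array_equal(reused, chunk, equal_nan=True))  # True: identical chunks match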
