
Add check of hashed data when writing new data #296

Merged: peytondmurray merged 1 commit into deshaw:master from add-hashtable-check on Dec 13, 2023

Conversation

peytondmurray (Collaborator) commented Nov 30, 2023

This PR adds a check when writing data to a file: when a chunk's hash already exists in the hashtable, we verify that the stored data actually matches the data we are trying to write.

This adds some performance overhead when writing datasets that reuse chunks, but it ensures we won't corrupt hashtables in the future. Closes #294.
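
The idea, in a minimal sketch (not the actual versioned_hdf5 code; `write_chunk`, the dict-based `hashtable`, and the list-based `raw_data` are simplified stand-ins):

```python
import hashlib

import numpy as np

def write_chunk(hashtable: dict, raw_data: list, chunk: np.ndarray) -> int:
    """Store `chunk` in `raw_data`, reusing an existing slot on a hash hit."""
    key = hashlib.sha256(chunk.tobytes()).digest()  # stand-in for the real chunk hash
    if key in hashtable:
        slot = hashtable[key]
        # The new check: a hash hit must correspond to identical data, or
        # reusing the stored chunk would silently corrupt the dataset.
        if not np.array_equal(raw_data[slot], chunk):
            raise ValueError(
                "hash already present in the hashtable, but the stored "
                "chunk does not match the data being written"
            )
        return slot
    # Unseen chunk: append it to the raw data and record its hash.
    raw_data.append(chunk)
    hashtable[key] = len(raw_data) - 1
    return hashtable[key]
```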

Note: I'm currently seeing a local test failure with test_delete_string_dataset on master. The failure happens when trying to open the file after running h5repack on the HDF5 file, with OSError: Unable to open file (bad object header version number). I'm opening this PR to see whether CI hits it too.

ArvidJB (Collaborator) commented Nov 30, 2023

The same check should also go into write_dataset_chunks.

Can you add a test where the hash function is mocked to be one of the previous data_version hash functions? The check should get triggered.
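
For illustration, a hypothetical sketch of such a test (the patch target `versioned_hdf5.hashtable.Hashtable.hash`, the `vfile` fixture, and the raised exception type are assumptions about the implementation; a constant hash stands in for the old data_version hash functions):

```python
from unittest import mock

import numpy as np
import pytest

def test_mismatched_hash_triggers_check(vfile):
    # Patch the hash with a constant so every chunk collides; the real test
    # would patch in one of the data_version 2/3 hash functions instead.
    with mock.patch(
        "versioned_hdf5.hashtable.Hashtable.hash", return_value=b"\x00" * 32
    ):
        # Two chunks with different contents but identical (mocked) hashes:
        # the new check should refuse to reuse the first chunk's data.
        with pytest.raises(ValueError):
            with vfile.stage_version("r0") as sv:
                sv.create_dataset("data", data=np.arange(10), chunks=(5,))
```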

peytondmurray (Collaborator, Author) commented Nov 30, 2023

Okay, this is really strange. The error in the test I mentioned above happens because h5repack segfaults when it is called in the subprocess for that test. Here's the file itself - all it has is a few strings in an array.

file.hdf5.zip

When I run h5repack -v 2 <test_data.hdf5> <repacked.hdf5.tmp> I can see that the segfault arises after handling prev_version:

```
Making new file ...
-----------------------------------------------------------------
 Type     Filter (Compression)        Timing read/write    Name
-----------------------------------------------------------------
 group                                                    /
 group                                                    /_version_data
 group                                                    /_version_data/foo
  attr                         1.470000e-06/1.889999e-06   largest_index
 dset                         0.000000e+00/5.443900e-05   /_version_data/foo/hash_table
 dset                         5.605000e-05/5.415900e-05   /_version_data/foo/raw_data
  attr                         1.119999e-06/1.580000e-06   chunks
 group                                                    /_version_data/versions
  attr                         9.500000e-07/1.270000e-06   data_version
  attr                         3.010000e-06/2.629999e-06   current_version
 group                                                    /_version_data/versions/__first_version__
  attr                         2.229999e-06/4.030000e-06   timestamp
 group                                                    /_version_data/versions/r1
  attr                         1.290000e-06/1.439999e-06   committed
  attr                         2.519999e-06/4.039999e-06   timestamp
  attr                         1.630000e-06/2.030001e-06   prev_version
Segmentation fault (core dumped)
```

peytondmurray (Collaborator, Author) commented:

It looks like a local problem with h5repack is responsible for the failing test I'm seeing; in any case, the failure is unrelated to this change.


Changes

  • Removed a little dead code
  • Made the logic more readable by removing the use of setdefault (see the sketch after this list); the added clarity made it straightforward to work out the check of the hashed data
  • Added the check of the hashed data
  • Moved code that only needs to run on one branch inside that branch
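
As a toy illustration of the setdefault change (hypothetical code, not the actual diff in versioned_hdf5/backend.py):

```python
counter = 0

def append_chunk() -> int:
    """Pretend to append a chunk to raw data, returning its slot index."""
    global counter
    counter += 1
    return counter

hashtable: dict[bytes, int] = {b"abc": 0}

# With setdefault, the reuse path is implicit, and the default argument is
# evaluated eagerly: append_chunk() runs even though b"abc" is present.
slot = hashtable.setdefault(b"abc", append_chunk())

# With an explicit branch, both paths are visible, and the hash-hit path
# becomes the natural place to verify the hashed data.
if b"abc" in hashtable:
    slot = hashtable[b"abc"]  # reuse; the new check goes here
else:
    slot = hashtable[b"abc"] = append_chunk()
```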

peytondmurray marked this pull request as ready for review on December 4, 2023 17:56
ArvidJB (Collaborator) commented Dec 5, 2023

Looking pretty good!

  • Can you add a test which uses the old hash functions from data_version 2 and 3?
  • Add a test with NaN values, since I think assert_array_equal will not handle those correctly by default (see the sketch after this list).
  • Add a test with strings to check that we correctly compare encoded bytes vs. strings.
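
For reference, a quick numpy illustration of the NaN point: plain equality treats NaN as unequal to itself, so an equality-based check needs something like equal_nan:

```python
import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])

print(np.nan == np.nan)                      # False: NaN never equals NaN
print(np.array_equal(a, b))                  # False: default comparison sees a mismatch
print(np.array_equal(a, b, equal_nan=True))  # True: NaNs in matching positions accepted
```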

peytondmurray (Collaborator, Author) commented:

Okay, I think we have our bases covered with these tests.

ArvidJB (Collaborator) left a review comment


Looks good to me!

versioned_hdf5/backend.py (review thread resolved)
peytondmurray merged commit 7662655 into deshaw:master on Dec 13, 2023
7 checks passed
peytondmurray deleted the add-hashtable-check branch on December 13, 2023 00:34
Successfully merging this pull request may close: Hash collision checks (PyInf#11487)