Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hashtable rebuilding logic to not corrupt hashtables #282

Merged
merged 4 commits into from
Oct 19, 2023

Conversation

ArvidJB
Copy link
Collaborator

@ArvidJB ArvidJB commented Oct 18, 2023

The changes in b0e6d19
unfortunately corrupted the hashtables by picking raw chunks
in increasing order rather than picking the correct raw chunk.

Note that this broke all hashtables, not just dtype='O'
hashtables which were the target of that fix. Because of
that all data_version == 2 files are corrupted.

This commit fixes the rebuild hashtable logic by actually
looking up the raw data slice for each chunk. Furthermore

  1. rebuilding is now a separate command that needs to be
    called implicitly, rather than it happening silently.
  2. if it encounters a data_version == 2 file it raises
    an exception and asks the user to recreate the file.
  3. it only rebuilds hashtables for dtype='O'.

Fixes #280.

The changes in b0e6d19
unfortunately corrupted the hashtables by picking raw chunks
in increasing order rather than picking the correct raw chunk.

Note that this broke *all* hashtables, not just dtype='O'
hashtables which were the target of that fix. Because of
that all `data_version == 2` files are corrupted.

This commit fixes the rebuild hashtable logic by actually
looking up the raw data slice for each chunk. Furthermore
1. rebuilding is now a separate command that needs to be
   called implicitly, rather than it happening silently.
2. if it encounters a `data_version == 2` file it raises
   an exception and asks the user to recreate the file.
3. it only rebuilds hashtables for dtype='O'.

Fixes deshaw#280.
@ArvidJB
Copy link
Collaborator Author

ArvidJB commented Oct 19, 2023

This needs more tests:

  • add write chunk reuse tests to other rebuild tests
  • multi-dimensional arrays
  • non-ascii unicode characters

Copy link
Collaborator

@peytondmurray peytondmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArvidJB Thanks for looking into this! I'm glad there's a path forward where the hash tables can be fixed - I know in the issue you mentioned it might just be worth making old data versions incompatible.

@ArvidJB ArvidJB merged commit d382673 into deshaw:master Oct 19, 2023
7 checks passed
@ArvidJB ArvidJB deleted the rebuild_hashtable_fixes branch October 19, 2023 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Corrupted hash_table after rebuilding - PyInf#11257
2 participants