Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix chunk reuse verification for string dtype arrays #348

Merged

Conversation

peytondmurray
Copy link
Collaborator

@peytondmurray peytondmurray commented Jun 21, 2024

This PR fixes an issue with string datasets where reused chunks were not correctly verified.

Previously, chunks that were written to the dataset and then reused contained bytes elements, but chunks that were pending a write but being reused (e.g. by some other chunk in the pending write operation) could contain str elements, causing problems for the array comparison. With this change, both the chunk that the user is trying to write and the chunk to be reused are coerced to object dtype arrays of bytes before the comparison is completed.

Additionally multidimensional string datasets are now correctly verified as well, closes #339 and closes #338.

Copy link
Collaborator

@ArvidJB ArvidJB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

It seems like I was wrong about the issue in #339 - it seems like the problem was not that the array has more than one dimension?

@peytondmurray
Copy link
Collaborator Author

No, I think the call to vectorize should broadcast across all dimensions. I think the exception in that issue happens because of the way that we detect whether we need to cast each element of the array as a bytes object. We can't use the dtype of the array because string arrays are read out of the file as object dtype arrays, so instead do this by looking at the type of the first element of the array:

if len(arr) > 0 and isinstance(arr.flatten()[0], bytes):
#                                     ^
#                            multidimensional datasets need to be flattened first!

Previously we just weren't flattening the multidimensional arrays, which meant we ended up trying to call bytes on a bytes object, which fails.

@peytondmurray peytondmurray merged commit d7721dc into deshaw:master Jun 26, 2024
7 checks passed
@peytondmurray peytondmurray deleted the 338-fix-string-reuse-verification branch June 26, 2024 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants