Fix chunk reuse verification for string dtype arrays #348

peytondmurray · 2024-06-21T23:37:54Z

This PR fixes an issue with string datasets where reused chunks were not correctly verified.

Previously, chunks that were written to the dataset and then reused contained bytes elements, but chunks that were pending a write but being reused (e.g. by some other chunk in the pending write operation) could contain str elements, causing problems for the array comparison. With this change, both the chunk that the user is trying to write and the chunk to be reused are coerced to object dtype arrays of bytes before the comparison is completed.
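As a rough illustration of the coercion described above, here is a minimal sketch using NumPy. The helper names `to_bytes_array` and `chunks_match` are hypothetical and are not taken from the PR; the UTF-8 encoding is also an assumption.

```python
import numpy as np


def to_bytes_array(arr):
    """Return an object-dtype array whose elements are all bytes.

    Hypothetical helper for illustration; not the project's actual code.
    """
    arr = np.asarray(arr, dtype=object)
    out = np.empty(arr.shape, dtype=object)
    for idx, item in np.ndenumerate(arr):
        # Encode str elements (assumed UTF-8); leave bytes elements untouched.
        out[idx] = item.encode("utf-8") if isinstance(item, str) else item
    return out


def chunks_match(new_chunk, reused_chunk):
    """Compare two string chunks after normalizing both to bytes."""
    return np.array_equal(to_bytes_array(new_chunk), to_bytes_array(reused_chunk))


pending = np.array(["a", "bc"], dtype=object)    # str elements (pending write)
stored = np.array([b"a", b"bc"], dtype=object)   # bytes elements (already written)

naive_equal = np.array_equal(pending, stored)    # str never compares equal to bytes
coerced_equal = chunks_match(pending, stored)    # both sides normalized first
```

Without the coercion, the two chunks compare as different even though they hold the same text; after normalizing both sides, they compare as equal.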

Additionally, multidimensional string datasets are now correctly verified. Closes #339 and closes #338.

ArvidJB

Looks good.

It seems I was wrong about the issue in #339: the problem was not that the array has more than one dimension?

peytondmurray · 2024-06-26T15:34:58Z

No, I think the call to vectorize should broadcast across all dimensions. I think the exception in that issue happens because of the way that we detect whether we need to cast each element of the array as a bytes object. We can't use the dtype of the array because string arrays are read out of the file as object dtype arrays, so instead we do this by looking at the type of the first element of the array:

if len(arr) > 0 and isinstance(arr.flatten()[0], bytes):
#                                     ^
#                            multidimensional datasets need to be flattened first!

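Previously, we just weren't flattening the multidimensional arrays, so the check looked at the first row (itself an array) rather than the first element and never detected that the elements were already bytes. That meant we ended up trying to call bytes on a bytes object, which fails.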
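The difference between the two checks can be reproduced with a small sketch. The `to_bytes` converter below is an assumption about how the conversion might be written (a vectorized `bytes(s, encoding="utf-8")` call); the PR discussion does not show the actual conversion code.

```python
import numpy as np

# Assumed conversion for illustration: encode each str element to bytes.
to_bytes = np.vectorize(lambda s: bytes(s, encoding="utf-8"), otypes=[object])

# A 2D chunk as read back from the file: object dtype holding bytes elements.
stored = np.array([[b"a", b"b"], [b"c", b"d"]], dtype=object)

# Without flattening, index 0 is the first row (an ndarray), not an element.
unflattened_check = len(stored) > 0 and isinstance(stored[0], bytes)
# With flattening, index 0 is the first scalar element.
flattened_check = len(stored) > 0 and isinstance(stored.flatten()[0], bytes)

# vectorize broadcasts over every dimension of a str array.
pending = np.array([["a", "b"], ["c", "d"]], dtype=object)
converted = to_bytes(pending)

# Converting elements that are already bytes raises under this assumed converter.
try:
    to_bytes(stored)
    double_conversion_failed = False
except TypeError:
    double_conversion_failed = True
```

Here the unflattened check evaluates to `False` and the flattened one to `True`, so only the flattened check skips the redundant conversion.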

Fix chunk reuse verification for string dtype arrays

40f699f

peytondmurray requested a review from ArvidJB June 21, 2024 23:37

ArvidJB approved these changes Jun 26, 2024

peytondmurray merged commit d7721dc into deshaw:master Jun 26, 2024
7 checks passed

peytondmurray deleted the 338-fix-string-reuse-verification branch June 26, 2024 16:33
