Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. #222
There's something still taking time within assigning starts and stops.
Ah, CI is failing due to needing to have pyarrow deeper into the mix...
As long as it works, the only "request changes" I have is to not depend on pyarrow at top level. That transitively makes uproot depend on pyarrow, which is a much heavier dependency than users expect.
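One common way to keep a heavy package out of the top-level import path is to import it inside the function that needs it. The sketch below only illustrates that pattern; `require_optional` is a hypothetical helper, and the body of `toarrow` here is a placeholder, not the code in this PR.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency only when a feature actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            "{0} requires the optional package {1!r}".format(feature, module_name)
        ) from err


def toarrow(obj):
    # Placeholder body (assumption): pyarrow is imported at call time, so
    # importing the enclosing package never requires pyarrow to be installed.
    pyarrow = require_optional("pyarrow", "toarrow")
    return pyarrow.array(obj)
```

With this layout, a downstream package can import the module without pyarrow present; the ImportError only appears if `toarrow` is actually called.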
I got your comments after writing mine. Let me know if you decide to cancel this PR.
maybe take the offsets passthrough from mine, where I handle the soft type check?
1a1f6ca to b08f5b7
7f52c3d to 2586acf
I'll rebase this on master when #221 gets merged.
Ah, there appears to be a large string/unicode/bytes type as well.
@jpivarski I think I can separate this PR entirely from Nick's. I'll fix that up, is there anything else you want before merge?
I have preferences described inline but not hard rules, if you just want to get this done. I'd rather you discover use_large_index as described inline, but I won't be a stickler about it.
I do think your if-statements in jagged.py conflict with @nsmith-'s PR, and that needs to be coordinated.
awkward/arrow.py
Outdated
return pyarrow.ListArray.from_arrays(obj.offsets, recurse(obj.content, mask))
arrow_type = pyarrow.ListArray
if hasattr(pyarrow, 'LargeListArray') and use_large_index:
    arrow_type = pyarrow.LargeListArray
Instead of making use_large_index a user-configurable parameter, how about using it if hasattr(pyarrow, "LargeListArray") and (obj.starts.itemsize > 4 or obj.stops.itemsize > 4)? There's a one-to-one relationship between pyarrow.LargeListArray and awkward.JaggedArray with 64-bit starts and stops. I don't think this choice should be in the user's hands: the starts and stops are either 64-bit or they're not.
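The check suggested in this comment can be sketched in isolation. This is only an illustration under stated assumptions: `choose_list_type` is a hypothetical helper, the `SimpleNamespace` objects stand in for older and newer pyarrow modules, and `array.array` stands in for the NumPy index arrays (typecode "i" is 4 bytes on common platforms, "q" is 8 bytes).

```python
from array import array
from types import SimpleNamespace


def choose_list_type(arrow, starts, stops):
    """Pick the list type from the index width rather than from a user flag."""
    needs_large = starts.itemsize > 4 or stops.itemsize > 4
    if hasattr(arrow, "LargeListArray") and needs_large:
        return arrow.LargeListArray
    return arrow.ListArray


# Stand-ins (assumption): strings instead of the real pyarrow classes.
old_arrow = SimpleNamespace(ListArray="ListArray")
new_arrow = SimpleNamespace(ListArray="ListArray", LargeListArray="LargeListArray")

starts32, stops32 = array("i", [0, 3]), array("i", [3, 5])  # 32-bit indices
starts64, stops64 = array("q", [0, 3]), array("q", [3, 5])  # 64-bit indices

print(choose_list_type(new_arrow, starts32, stops32))
print(choose_list_type(new_arrow, starts64, stops64))
print(choose_list_type(old_arrow, starts64, stops64))
```

The third call shows the fallback: when the installed library has no LargeListArray, the ordinary ListArray is returned even for 64-bit indices.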
Yeah, good point. I did this thinking from the perspective that, up to now, all serialization to arrow turned 64-bit indices into 32-bit ones and then back.
But I think it's better to adopt the correct behavior that you've suggested.
awkward/array/jagged.py
Outdated
if hasattr(offsets.base, 'base') and str(type(offsets.base.base)) == "<class 'pyarrow.lib.Buffer'>":
    # special exception to prevent copy in awkward.fromarrow
    pass
elif offsets.base is not None:
The if-statements above need to be coordinated with @nsmith-'s PR, right? I'm confused about which PR does what.
I had some old commits from @nsmith- in this PR. I rebased them out, and now the two PRs are completely separate; this one focuses only on the new arrow types.
I'll merge when you tell me whether you're going to make the suggested change or not (it's optional) and when you tell me that you've coordinated the if-statements in jagged.py with @nsmith-.
6cb4c8e to 6ac50fc
Sorry, that was a misclick, no need for a review.
This no longer makes changes to jagged.py, so there's no issue with merging. It doesn't replace the user-configurable
@jpivarski I've implemented your requests locally, but that turned up some problems with parquet and arrow. The former is a limitation; the latter may be a bug.
Yep, there is no validation for LargeListArray in arrow yet: But there is for binary array? It's just a completeness issue. Anyway, I will comment that out until a fix is made... I might contribute it.
Ah, no, that was false: there was a forced conversion to 32-bit offsets for strings and binary.
@jpivarski I've changed the code so that we presently never serialize into 64-bit offset types, but we can deserialize them when they are encountered. Once things are a bit more mature on the arrow side, we can uncomment the serialization parts.
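For readers unfamiliar with what serializing only 32-bit offsets implies, here is a minimal sketch of narrowing 64-bit offsets with a range check. It is an assumption-laden illustration, not the code in this PR: `offsets_for_serialization` is a hypothetical name, and `array.array` stands in for the NumPy offset arrays.

```python
from array import array

INT32_MAX = 2**31 - 1


def offsets_for_serialization(offsets):
    """Return 32-bit offsets, refusing values that would not fit."""
    values = list(offsets)
    if values and max(values) > INT32_MAX:
        raise OverflowError("offsets exceed the 32-bit range")
    return array("i", values)  # "i" is a 32-bit signed int on common platforms


narrow = offsets_for_serialization(array("q", [0, 3, 5]))
print(narrow.typecode, list(narrow))
```

The explicit check matters because a silent cast of an offset above 2**31 - 1 would wrap around and corrupt the list boundaries.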
Nice, thanks! I'll merge this as soon as the tests pass.
There are now 64-bit indexed ListArray, StringArray, and BinaryArray types in pyarrow.
I've put a defaulted-false argument in toarrow() where the user can drive the index type used when serializing jagged arrays.