Skip to content
This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. #222

Merged
merged 6 commits into from
Dec 11, 2019

Conversation

lgray
Copy link
Contributor

@lgray lgray commented Dec 6, 2019

There's now a 64-bit indexed ListArray, StringArray, and BinaryArray types in pyarrow.

I've put in a defaulted-false argument in toarrow() where the user can drive the index type used when serializing jagged arrays.

@lgray
Copy link
Contributor Author

lgray commented Dec 6, 2019

There's something still taking time within assigning starts and stops.

@lgray
Copy link
Contributor Author

lgray commented Dec 6, 2019

Ah, CI is failing due to needing to have pyarrow deeper into the mix...
I'm fine with waiting for awkward1 for this one if it's too much of a hassle.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it works, the only "request changes" I have is to not depend on pyarrow at top level. That transitively makes uproot depend on pyarrow, which is a much heavier dependency than users expect.

awkward/__init__.py Outdated Show resolved Hide resolved
awkward/arrow.py Show resolved Hide resolved
awkward/arrow.py Show resolved Hide resolved
awkward/array/jagged.py Outdated Show resolved Hide resolved
awkward/array/jagged.py Outdated Show resolved Hide resolved
@jpivarski
Copy link
Member

I got your comments after writing mine. Let me know if you decide to cancel this PR.

@nsmith-
Copy link
Contributor

nsmith- commented Dec 6, 2019

maybe take the offsets passthrough from mine, where I handle the soft type check?

@lgray lgray changed the title First stab at making fromarrow really zero copy Add LargeListArray to pyarrow serialization options. Dec 7, 2019
@lgray
Copy link
Contributor Author

lgray commented Dec 7, 2019

I'll rebase this on master when #221 gets merged.

@lgray
Copy link
Contributor Author

lgray commented Dec 7, 2019

Ah, there appears to be a large string/unicode/bytes type as well.
I will get to those too.

@lgray lgray changed the title Add LargeListArray to pyarrow serialization options. Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. Dec 7, 2019
@lgray
Copy link
Contributor Author

lgray commented Dec 9, 2019

@jpivarski I think I can separate this PR entirely from Nick's. I'll fix that up, is there anything else you want before merge?

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have preferences described inline but not hard rules, if you just want to get this done. I'd rather you discover use_large_index as described inline, but I won't be a stickler about it.

I do think your if-statements in jagged.py conflict with @nsmith-'s PR, and that needs to be coordinated.

awkward/arrow.py Outdated
return pyarrow.ListArray.from_arrays(obj.offsets, recurse(obj.content, mask))
arrow_type = pyarrow.ListArray
if hasattr(pyarrow, 'LargeListArray') and use_large_index:
arrow_type = pyarrow.LargeListArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making use_large_index a user-configurable parameter, how about using it if hasattr(pyarrow, "LargeListArray") and (obj.starts.itemsize > 4 or obj.stops.itemsize > 4)? There's a one-to-one relationship between pyarrow.LargeListArray and awkward.JaggedArray with 64-bit starts and stops. I don't think this choice should be in the user's hands: the starts and stops are either 64-bit or they're not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I did this thinking from the perspective that up to now all serialization to arrow turned 64bit indices into 32bit ones and then back.

But I think it's better to take the correct behavior that you've suggest.

awkward/arrow.py Show resolved Hide resolved
awkward/arrow.py Show resolved Hide resolved
if hasattr(offsets.base, 'base') and str(type(offsets.base.base)) == "<class 'pyarrow.lib.Buffer'>":
# special exception to prevent copy in awkward.fromarrow
pass
elif offsets.base is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if-statements above need to be coordinated with @nsmith-'s PR, right? I'm confused about which PR does what.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some old commits from @nsmith- in this PR. I rebased them out and now the two PRs are completely separate, and this one just focused on the new arrow types.

@jpivarski
Copy link
Member

I'll merge when you tell me whether you're going to make the suggested change or not (it's optional) and when you tell me that you've coordinated the if-statements in jagged.py with @nsmith-.

Copy link
Contributor Author

@lgray lgray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that was a misclick, no need for a review.

@jpivarski
Copy link
Member

This no longer makes changes to jagged.py, so there's no issue with merging. It doesn't replace the user-configurable use_large_index with a direct detection based on the starts, stops integer size. Do you want me to merge it anyway? (I.e. are you punting on that?)

@lgray
Copy link
Contributor Author

lgray commented Dec 9, 2019

@jpivarski I've implemented your requests locally but it cropped up some problems with parquet and arrow. The former a limitation, the latter may be a bug.

@lgray
Copy link
Contributor Author

lgray commented Dec 9, 2019

Yep there is no validation for LargeListArray in arrow yet:
https://github.com/apache/arrow/blob/e902b24e9de79f18d542e6d29a55ced26b2dc696/cpp/src/arrow/array/validate.cc#L78

But there is for binary array? It's just a completeness issue. Anyway, will comment that out until a fix is made.... I might contribute it.

@lgray
Copy link
Contributor Author

lgray commented Dec 9, 2019

Ah, no that was false, there was a forced conversion to 32bit offsets for strings and binary.
Removed that and now everything works!

@lgray
Copy link
Contributor Author

lgray commented Dec 11, 2019

@jpivarski I've changed the code such that we never presently serialize into 64-bit offset types, but we may deserialize them when they are encountered.

Once things are a bit more mature on the arrow side we can uncomment the serialization parts.

@jpivarski
Copy link
Member

Nice, thanks! I'll merge this as soon as the tests pass.

@jpivarski jpivarski merged commit 03c1ef4 into scikit-hep:master Dec 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants