Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. #222

lgray · 2019-12-06T22:45:18Z

pyarrow now has 64-bit-indexed ListArray, StringArray, and BinaryArray types.

I've added an argument to toarrow(), defaulting to False, that lets the user choose the index type used when serializing jagged arrays.

lgray · 2019-12-06T22:47:34Z

Something is still taking time when assigning starts and stops.

lgray · 2019-12-06T22:56:02Z

Ah, CI is failing due to needing to have pyarrow deeper into the mix...
I'm fine with waiting for awkward1 for this one if it's too much of a hassle.

jpivarski

As long as it works, the only "request changes" I have is to not depend on pyarrow at top level. That transitively makes uproot depend on pyarrow, which is a much heavier dependency than users expect.
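A common way to keep pyarrow out of the top-level import is to defer the import to the function that needs it. The following is a minimal sketch of that pattern, not the code in this PR: `import_optional` and `toarrow_sketch` are hypothetical names, and the `name` parameter exists only so the pattern can be exercised without pyarrow installed.

```python
import importlib


def import_optional(name="pyarrow"):
    """Import an optional dependency only when it is actually needed.

    Hypothetical helper for illustration; not awkward's actual code.
    """
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            "{0} is required for this conversion; install it to use this feature".format(name)
        ) from err


def toarrow_sketch(obj):
    # The import happens at call time, so importing the enclosing package
    # never pulls in pyarrow for users who do not call this function.
    pyarrow = import_optional("pyarrow")
    return pyarrow.array(obj)
```

With this layout, a downstream package that imports the library but never converts to Arrow does not need pyarrow installed.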

awkward/__init__.py

awkward/arrow.py

awkward/array/jagged.py

jpivarski · 2019-12-06T22:57:41Z

I got your comments after writing mine. Let me know if you decide to cancel this PR.

nsmith- · 2019-12-06T23:01:41Z

maybe take the offsets passthrough from mine, where I handle the soft type check?

lgray · 2019-12-07T19:33:43Z

I'll rebase this on master when #221 gets merged.

lgray · 2019-12-07T19:48:15Z

Ah, there appears to be a large string/unicode/bytes type as well.
I will get to those too.

lgray · 2019-12-09T15:23:11Z

@jpivarski I think I can separate this PR entirely from Nick's. I'll fix that up. Is there anything else you want before merge?

jpivarski

I have preferences described inline, but they're not hard rules if you just want to get this done. I'd rather you detect use_large_index automatically, as described inline, but I won't be a stickler about it.

I do think your if-statements in jagged.py conflict with @nsmith-'s PR, and that needs to be coordinated.

jpivarski · 2019-12-09T15:54:53Z

awkward/arrow.py

-            return pyarrow.ListArray.from_arrays(obj.offsets, recurse(obj.content, mask))
+            arrow_type = pyarrow.ListArray
+            if hasattr(pyarrow, 'LargeListArray') and use_large_index:
+                arrow_type = pyarrow.LargeListArray


Instead of making use_large_index a user-configurable parameter, how about using it if hasattr(pyarrow, "LargeListArray") and (obj.starts.itemsize > 4 or obj.stops.itemsize > 4)? There's a one-to-one relationship between pyarrow.LargeListArray and awkward.JaggedArray with 64-bit starts and stops. I don't think this choice should be in the user's hands: the starts and stops are either 64-bit or they're not.

Yeah, good point. I did this thinking from the perspective that, up to now, all serialization to arrow turned 64-bit indices into 32-bit ones and then back.

But I think it's better to adopt the correct behavior that you've suggested.
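The detection suggested in this thread can be sketched as a small function that picks the list type from the index width instead of a user flag. This is an illustration under stated assumptions: `choose_list_type` is a hypothetical name, and `SimpleNamespace` objects stand in for pyarrow versions with and without `LargeListArray` so the sketch runs without pyarrow.

```python
from types import SimpleNamespace


def choose_list_type(arrow, starts_itemsize, stops_itemsize):
    """Pick the Arrow list type from the index width, not from a user flag."""
    needs_large = starts_itemsize > 4 or stops_itemsize > 4
    if needs_large and hasattr(arrow, "LargeListArray"):
        return arrow.LargeListArray
    # Either the indices are 32-bit, or this pyarrow has no large type.
    return arrow.ListArray


# Stand-ins for an older and a newer pyarrow module (illustration only).
old_arrow = SimpleNamespace(ListArray="ListArray")
new_arrow = SimpleNamespace(ListArray="ListArray", LargeListArray="LargeListArray")

print(choose_list_type(new_arrow, 8, 8))
print(choose_list_type(new_arrow, 4, 4))
print(choose_list_type(old_arrow, 8, 8))
```

In the fallback case (64-bit indices with an older pyarrow), the offsets would still have to be narrowed to 32 bits before building the array.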

awkward/arrow.py

jpivarski · 2019-12-09T15:58:30Z

awkward/array/jagged.py

+        if hasattr(offsets.base, 'base') and str(type(offsets.base.base)) == "<class 'pyarrow.lib.Buffer'>":
+            # special exception to prevent copy in awkward.fromarrow
+            pass
+        elif offsets.base is not None:


The if-statements above need to be coordinated with @nsmith-'s PR, right? I'm confused about which PR does what.

I had some old commits from @nsmith- in this PR. I rebased them out, so the two PRs are now completely separate, and this one focuses only on the new arrow types.

jpivarski · 2019-12-09T16:03:03Z

I'll merge when you tell me whether you're going to make the suggested change or not (it's optional) and when you tell me that you've coordinated the if-statements in jagged.py with @nsmith-.

lgray

Sorry, that was a misclick, no need for a review.

jpivarski · 2019-12-09T17:04:18Z

This no longer makes changes to jagged.py, so there's no issue with merging. It doesn't replace the user-configurable use_large_index with direct detection based on the integer size of starts and stops. Do you want me to merge it anyway? (I.e. are you punting on that?)

lgray · 2019-12-09T17:06:52Z

@jpivarski I've implemented your requests locally, but that turned up some problems with parquet and arrow. The former is a limitation; the latter may be a bug.

lgray · 2019-12-09T17:21:02Z

Yep, there is no validation for LargeListArray in arrow yet:
https://github.com/apache/arrow/blob/e902b24e9de79f18d542e6d29a55ced26b2dc696/cpp/src/arrow/array/validate.cc#L78

But there is for binary array? It's just a completeness issue. Anyway, I will comment that out until a fix is made. I might contribute it.

lgray · 2019-12-09T17:27:37Z

Ah, no, that was wrong: there was a forced conversion to 32-bit offsets for strings and binary.
I removed that and now everything works!

lgray · 2019-12-11T16:20:47Z

@jpivarski I've changed the code so that we currently never serialize into 64-bit offset types, but we can deserialize them when they are encountered.

Once things are a bit more mature on the arrow side we can uncomment the serialization parts.
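Serializing only to 32-bit offset types is safe as long as every offset fits in a signed 32-bit integer. The check below is a minimal sketch of that condition, not the code merged here: `can_narrow_offsets` and `narrow_offsets` are hypothetical names, and plain Python lists stand in for the offsets arrays.

```python
INT32_MAX = 2**31 - 1


def can_narrow_offsets(offsets):
    """Return True if every offset is representable as a signed 32-bit integer."""
    return all(0 <= int(o) <= INT32_MAX for o in offsets)


def narrow_offsets(offsets):
    """Return the offsets unchanged in value, refusing any that overflow int32."""
    if not can_narrow_offsets(offsets):
        raise OverflowError(
            "offsets exceed the 32-bit range; a large-offset type would be needed"
        )
    return [int(o) for o in offsets]
```

Data whose total content length exceeds 2**31 - 1 elements is the case where the large-offset types become necessary.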

jpivarski · 2019-12-11T16:25:51Z

Nice, thanks! I'll merge this as soon as the tests pass.

jpivarski suggested changes Dec 6, 2019

lgray force-pushed the less_arrow_copying branch from 1a1f6ca to b08f5b7 Compare December 7, 2019 14:35

lgray changed the title ~~First stab at making fromarrow really zero copy~~ Add LargeListArray to pyarrow serialization options. Dec 7, 2019

lgray force-pushed the less_arrow_copying branch from 7f52c3d to 2586acf Compare December 7, 2019 19:08

lgray changed the title ~~Add LargeListArray to pyarrow serialization options.~~ Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. Dec 7, 2019

jpivarski reviewed Dec 9, 2019

jpivarski approved these changes Dec 9, 2019

lgray added 4 commits December 9, 2019 10:40

add LargeListArray to arrow serialization options

4f42205

add support for binary arrays and fix behavior of fromarrow for pyarrow.ChunkedArray

a17fd06

bring in the last of the large indexed types

f9ec255

make sure to use right binary encoding type

6ac50fc

lgray force-pushed the less_arrow_copying branch from 6cb4c8e to 6ac50fc Compare December 9, 2019 16:40

lgray commented Dec 9, 2019

lgray added 2 commits December 9, 2019 11:32

all tests function, had to add default-false flag to handle parquet conversion

40c80bb

remove serialization of large-offset types but allow their deserialization if encountered

5a513fb

jpivarski merged commit 03c1ef4 into scikit-hep:master Dec 11, 2019
