ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696

Open · wants to merge 4 commits into main
Conversation

@jorisvandenbossche (Member) commented Sep 3, 2024:

See #59631

For now, this adds the most basic version of the method: just converting an Arrow tabular object, without exposing any keyword arguments (and without exposing it in pd.DataFrame() directly)
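
For illustration, a minimal sketch of the intended usage (made-up data; assumes a pandas build that includes this branch):

import pandas as pd
import pyarrow as pa

# Any Arrow tabular object exposing __arrow_c_stream__ (or a
# struct-typed __arrow_c_array__) can be imported directly:
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pd.DataFrame.from_arrow(table)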

@jorisvandenbossche jorisvandenbossche added the Arrow pyarrow functionality label Sep 3, 2024
@jorisvandenbossche (Member Author) commented:

cc @kylebarron

@@ -1746,6 +1746,52 @@ def __rmatmul__(self, other) -> DataFrame:
# ----------------------------------------------------------------------
# IO methods (to / from other formats)

@classmethod
def from_arrow(cls, data) -> DataFrame:
@jorisvandenbossche (Member Author) commented on the diff:

We might want to type data through some typing protocol? (@kylebarron, like the ArrayStreamExportable you have in https://kylebarron.dev/arro3/latest/api/core/table/#arro3.core.Table.from_arrow)

I am not super familiar with typing, but can I just copy-paste https://github.com/kylebarron/arro3/blob/main/arro3-core/python/arro3/core/types.py ?

@kylebarron (Contributor) commented Sep 3, 2024:

Indeed, you can copy from there. Those come originally from the part of the spec here: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints

I wanted to add a bit more documentation on them so that the docs website would be friendly.
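
For reference, the typehints on that spec page are small typing.Protocol classes along these lines (a paraphrase of the spec, not code from this PR):

from __future__ import annotations

from typing import Protocol, Tuple

class ArrowSchemaExportable(Protocol):
    # Object exporting an ArrowSchema PyCapsule.
    def __arrow_c_schema__(self) -> object: ...

class ArrowArrayExportable(Protocol):
    # Object exporting a (schema, array) pair of PyCapsules.
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]: ...

class ArrowStreamExportable(Protocol):
    # Object exporting an ArrowArrayStream PyCapsule.
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...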

"'_arrow_c_array__' or '__arrow_c_stream__' method), got "
f"'{type(data).__name__}' instead."
)
data = pa.table(data)
@WillAyd (Member) commented:

Does this actually work for things that only expose __arrow_c_array__?

In [28]: arr = pa.array([1, 2, 3])

In [29]: hasattr(arr, "__arrow_c_array__")
Out[29]: True

In [30]: pa.table(arr)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 pa.table(arr)

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:6022, in pyarrow.lib.table()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:5841, in pyarrow.lib.record_batch()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:3886, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Cannot import schema: ArrowSchema describes non-struct type int64

@kylebarron (Contributor) commented:

Exposing __arrow_c_array__ is necessary but not sufficient. Both Array and RecordBatch expose the same __arrow_c_array__ interface. It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.
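
A quick pyarrow illustration of that overload (made-up data):

import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
arr = pa.array([1, 2, 3])

# Both objects expose the same dunder...
assert hasattr(batch, "__arrow_c_array__")
assert hasattr(arr, "__arrow_c_array__")

# ...but only the RecordBatch exports a struct-typed schema, so only
# it can be imported as tabular data:
pa.table(batch)  # ok: one-batch Table
# pa.table(arr)  # raises ArrowInvalid: non-struct type int64 (see traceback above)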

@jorisvandenbossche (Member Author) commented:

And to be fair, RecordBatch has both __arrow_c_array__ and __arrow_c_stream__ dunder methods, so just testing with RecordBatch does not actually prove that pa.table(..) works with objects that only implement the array version. But because I wrap the record batch in the tests in a dummy object that only exposes __arrow_c_array__, the tests should cover this and assert DataFrame.from_arrow() works with both dunder methods.
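
A sketch of that dummy-wrapper pattern (hypothetical names; not the actual test code from this PR):

import pyarrow as pa

class ArrayOnlyWrapper:
    # Expose only __arrow_c_array__, hiding __arrow_c_stream__.
    def __init__(self, batch):
        self._batch = batch

    def __arrow_c_array__(self, requested_schema=None):
        return self._batch.__arrow_c_array__(requested_schema)

wrapped = ArrayOnlyWrapper(pa.RecordBatch.from_pydict({"a": [1, 2, 3]}))
assert not hasattr(wrapped, "__arrow_c_stream__")
pa.table(wrapped)  # exercises the array-only import path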

@WillAyd (Member) commented Sep 4, 2024:

> It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.

Ah OK, that's good to know. So essentially it's up to the producer to determine if this makes sense, right?

I think there is still a consistency problem with how we as a consumer then work. A RecordBatch can be read through both the array and stream interface, but a Table can only be read through the latter (unless it is forced to consolidate chunks and produce an Array).

I'm sure PyArrow has that covered well, but unless something gets clarified in the spec about how the array interface is expected to work, that might push libraries into making the (assumedly poor) decision that their streams should also produce consolidated array data.

@kylebarron (Contributor) commented:

I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data".

But I think the key added part is user intention. A struct array can represent either one array or a full RecordBatch, and we need a hint from the user for which is which. This is why I couldn't add PyCapsule Interface support to polars.from_arrow, because it's missing the user intention of "this object is a series" or "this object is a DataFrame".

I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
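
To illustrate the ambiguity (made-up data):

import pyarrow as pa

# The same struct-typed data can be meant as a single (struct) column...
struct_arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])

# ...or as a two-column batch:
batch = pa.RecordBatch.from_pydict({"a": [1, 2], "b": ["x", "y"]})

# Both export an identical struct<a: int64, b: string> schema through
# __arrow_c_array__, so a consumer needs the user to say whether a
# Series or a DataFrame was intended.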

@WillAyd (Member) commented Sep 4, 2024:

> I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.

A theoretical example is a library that produces Arrow data and thinks it needs to implement __arrow_c_array__ for its "Table" equivalent because it did so for its RecordBatch equivalent. If the Table contained multiple chunks of data, I assume it would need to combine all of the chunks to pass the data on through the __arrow_c_array__ interface.
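
A sketch of that trade-off using pyarrow's own dunders (made-up data):

import pyarrow as pa

table = pa.concat_tables([pa.table({"a": [1]}), pa.table({"a": [2]})])
assert table.column("a").num_chunks == 2

# The stream export hands over the chunks one at a time, zero copy:
stream_capsule = table.__arrow_c_stream__()

# A single-array export would force consolidation first:
one_batch = table.combine_chunks().to_batches()[0]
schema_capsule, array_capsule = one_batch.__arrow_c_array__()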

@kylebarron (Contributor) commented:

Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a RecordBatch can implement both, because both are zero copy, but a Table should only implement the stream interface, because only the stream interface is always zero copy.

I raised an issue a while ago to discuss consumer implications, if you haven't seen it: apache/arrow#40648

@WillAyd (Member) commented:

Ah OK great - thanks for sharing. I'll track that issue upstream

libraries.

.. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
.. _Arrow C Data Interface: https://arrow.apache.org/docs/format/CDataInterface.html
@kylebarron (Contributor) commented on the diff:

Nit: maybe this should link to the stream interface page instead? https://arrow.apache.org/docs/format/CStreamInterface.html

@WillAyd (Member) left a comment:

lgtm ex @kylebarron feedback

github-actions bot commented Oct 5, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 5, 2024
Labels: Arrow (pyarrow functionality), Stale