ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696

Open · wants to merge 4 commits into main
Conversation

@jorisvandenbossche (Member) commented Sep 3, 2024:

See #59631

For now, this adds the most basic version of the method: just converting an Arrow tabular object, without exposing any keyword arguments (and without exposing it in pd.DataFrame() directly)
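
For illustration, a minimal sketch of the intended usage (made-up data; assumes a pandas build that includes this branch):

import pandas as pd
import pyarrow as pa

# Any Arrow tabular object exposing __arrow_c_stream__ (or a
# struct-typed __arrow_c_array__) can be imported directly:
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pd.DataFrame.from_arrow(table)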

@jorisvandenbossche jorisvandenbossche added the Arrow pyarrow functionality label Sep 3, 2024
@jorisvandenbossche (Member Author) commented:

cc @kylebarron

@@ -1746,6 +1746,52 @@ def __rmatmul__(self, other) -> DataFrame:
# ----------------------------------------------------------------------
# IO methods (to / from other formats)

@classmethod
def from_arrow(cls, data) -> DataFrame:
@jorisvandenbossche (Member Author) commented on the diff:

We might want to type data through some typing protocol? (@kylebarron, like the ArrayStreamExportable you have in https://kylebarron.dev/arro3/latest/api/core/table/#arro3.core.Table.from_arrow)

I am not super familiar with typing, but can I just copy-paste https://github.com/kylebarron/arro3/blob/main/arro3-core/python/arro3/core/types.py ?

@kylebarron (Contributor) commented Sep 3, 2024:

Indeed, you can copy from there. Those come originally from the part of the spec here: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints

I wanted to add a bit more documentation on them so that the docs website would be friendly.
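
For reference, the typehints on that spec page are small typing.Protocol classes along these lines (a paraphrase of the spec, not code from this PR):

from __future__ import annotations

from typing import Protocol, Tuple

class ArrowSchemaExportable(Protocol):
    # Object exporting an ArrowSchema PyCapsule.
    def __arrow_c_schema__(self) -> object: ...

class ArrowArrayExportable(Protocol):
    # Object exporting a (schema, array) pair of PyCapsules.
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]: ...

class ArrowStreamExportable(Protocol):
    # Object exporting an ArrowArrayStream PyCapsule.
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...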

"'_arrow_c_array__' or '__arrow_c_stream__' method), got "
f"'{type(data).__name__}' instead."
)
data = pa.table(data)
@WillAyd (Member) commented:

Does this actually work for things that only expose __arrow_c_array__?

In [28]: arr = pa.array([1, 2, 3])

In [29]: hasattr(arr, "__arrow_c_array__")
Out[29]: True

In [30]: pa.table(arr)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 pa.table(arr)

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:6022, in pyarrow.lib.table()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:5841, in pyarrow.lib.record_batch()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:3886, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Cannot import schema: ArrowSchema describes non-struct type int64

@kylebarron (Contributor) commented:

Exposing __arrow_c_array__ is necessary but not sufficient. Both Array and RecordBatch expose the same __arrow_c_array__ interface. It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.
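
A quick pyarrow illustration of that overload (made-up data):

import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3]})
arr = pa.array([1, 2, 3])

# Both objects expose the same dunder...
assert hasattr(batch, "__arrow_c_array__")
assert hasattr(arr, "__arrow_c_array__")

# ...but only the RecordBatch exports a struct-typed schema, so only
# it can be imported as tabular data:
pa.table(batch)  # ok: one-batch Table
# pa.table(arr)  # raises ArrowInvalid: non-struct type int64 (see traceback above)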

@jorisvandenbossche (Member Author) commented:

And to be fair, RecordBatch has both __arrow_c_array__ and __arrow_c_stream__ dunder methods, so just testing with RecordBatch does not actually prove that pa.table(..) works with objects that only implement the array version. But because I wrap the record batch in the tests in a dummy object that only exposes __arrow_c_array__, the tests should cover this and assert DataFrame.from_arrow() works with both dunder methods.
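
A sketch of that dummy-wrapper pattern (hypothetical names; not the actual test code from this PR):

import pyarrow as pa

class ArrayOnlyWrapper:
    # Expose only __arrow_c_array__, hiding __arrow_c_stream__.
    def __init__(self, batch):
        self._batch = batch

    def __arrow_c_array__(self, requested_schema=None):
        return self._batch.__arrow_c_array__(requested_schema)

wrapped = ArrayOnlyWrapper(pa.RecordBatch.from_pydict({"a": [1, 2, 3]}))
assert not hasattr(wrapped, "__arrow_c_stream__")
pa.table(wrapped)  # exercises the array-only import path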

@WillAyd (Member) commented Sep 4, 2024:

> It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.

Ah OK, that's good to know. So essentially it's up to the producer to determine if this makes sense, right?

I think there is still a consistency problem with how we as a consumer then work. A RecordBatch can be read through both the array and stream interface, but a Table can only be read through the latter (unless it is forced to consolidate chunks and produce an Array).

I'm sure PyArrow has that covered well, but unless something gets clarified in the spec about how the array interface is expected to work, that might push libraries into making the (assumedly poor) decision that their streams should also produce consolidated array data.

@kylebarron (Contributor) commented:

I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data".

But I think the key added part is user intention. A struct array can represent either one array or a full RecordBatch, and we need a hint from the user for which is which. This is why I couldn't add PyCapsule Interface support to polars.from_arrow, because it's missing the user intention of "this object is a series" or "this object is a DataFrame".

I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
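
To illustrate the ambiguity (made-up data):

import pyarrow as pa

# The same struct-typed data can be meant as a single (struct) column...
struct_arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])

# ...or as a two-column batch:
batch = pa.RecordBatch.from_pydict({"a": [1, 2], "b": ["x", "y"]})

# Both export an identical struct<a: int64, b: string> schema through
# __arrow_c_array__, so a consumer needs the user to say whether a
# Series or a DataFrame was intended.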

@WillAyd (Member) commented Sep 4, 2024:

> I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.

A theoretical example is a library that produces Arrow data and thinks it needs to implement __arrow_c_array__ for its "Table" equivalent because it did so for its RecordBatch equivalent. If the Table contained multiple chunks of data, I assume it would need to combine all of the chunks to pass the data on through the __arrow_c_array__ interface.
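
A sketch of that trade-off using pyarrow's own dunders (made-up data):

import pyarrow as pa

table = pa.concat_tables([pa.table({"a": [1]}), pa.table({"a": [2]})])
assert table.column("a").num_chunks == 2

# The stream export hands over the chunks one at a time, zero copy:
stream_capsule = table.__arrow_c_stream__()

# A single-array export would force consolidation first:
one_batch = table.combine_chunks().to_batches()[0]
schema_capsule, array_capsule = one_batch.__arrow_c_array__()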

@kylebarron (Contributor) commented:

Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a RecordBatch can implement both, because both are zero copy, but a Table should only implement the stream interface, because only the stream interface is always zero copy.

I raised an issue a while ago to discuss consumer implications, if you haven't seen it: apache/arrow#40648

@WillAyd (Member) commented:

Ah OK great - thanks for sharing. I'll track that issue upstream

libraries.

.. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
.. _Arrow C Data Interface: https://arrow.apache.org/docs/format/CDataInterface.html
@kylebarron (Contributor) commented on the diff:

Nit: maybe this should link to the stream interface page instead? https://arrow.apache.org/docs/format/CStreamInterface.html

@WillAyd (Member) left a comment:

lgtm ex @kylebarron feedback

github-actions bot commented Oct 5, 2024

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 5, 2024
Labels: Arrow (pyarrow functionality), Stale