ENH: add basic DataFrame.from_arrow class method for importing through Arrow PyCapsule interface #59696

Open · wants to merge 4 commits into base: main

Changes from 3 commits
46 changes: 46 additions & 0 deletions pandas/core/frame.py
@@ -1746,6 +1746,52 @@ def __rmatmul__(self, other) -> DataFrame:
# ----------------------------------------------------------------------
# IO methods (to / from other formats)

    @classmethod
    def from_arrow(cls, data) -> DataFrame:
Member Author commented:
We might want to type `data` through some typing protocol? (@kylebarron, like you have `ArrayStreamExportable` in https://kylebarron.dev/arro3/latest/api/core/table/#arro3.core.Table.from_arrow)

I am not super familiar with typing, but can I just copy-paste https://github.com/kylebarron/arro3/blob/main/arro3-core/python/arro3/core/types.py?

kylebarron (Contributor) commented on Sep 3, 2024:

Indeed, you can copy from there. Those come originally from this part of the spec: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints

I wanted to add a bit more documentation on them so that the docs website would be friendly.
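
A sketch of what such protocol typehints could look like, adapted from the spec section linked above (the exact formulation here is illustrative, not the code this PR adopts):

from typing import Protocol, Tuple


class ArrowArrayExportable(Protocol):
    """An object exporting Arrow data via the array PyCapsule protocol."""

    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        # returns a pair of (ArrowSchema, ArrowArray) PyCapsules
        ...


class ArrowStreamExportable(Protocol):
    """An object exporting Arrow data via the stream PyCapsule protocol."""

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        # returns an ArrowArrayStream PyCapsule
        ...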

"""
Construct a DataFrame from a tabular Arrow object.

This function accepts any tabular Arrow object implementing
the `Arrow PyCapsule Protocol`_ (i.e. having an ``__arrow_c_array__``
or ``__arrow_c_stream__`` method).

This function currently relies on ``pyarrow`` to convert the tabular
object in Arrow format to pandas.

.. _Arrow PyCapsule Protocol: https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

.. versionadded:: 3.0

Parameters
----------
data : pyarrow.Table or Arrow-compatible table
Any tabular object implementing the Arrow PyCapsule Protocol
(i.e. has an ``__arrow_c_array__`` or ``__arrow_c_stream__``
method).

Returns
-------
DataFrame

"""
        pa = import_optional_dependency("pyarrow", min_version="14.0.0")
        if not isinstance(data, pa.Table):
            if not (
                hasattr(data, "__arrow_c_array__")
                or hasattr(data, "__arrow_c_stream__")
            ):
                # explicitly test this, because otherwise we would accept
                # various other input types through the pa.table(..) call
                raise TypeError(
                    "Expected an Arrow-compatible tabular object (i.e. having an "
                    "'__arrow_c_array__' or '__arrow_c_stream__' method), got "
                    f"'{type(data).__name__}' instead."
                )
            data = pa.table(data)
Member commented:
Does this actually work for things that only expose __arrow_c_array__?

In [28]: arr = pa.array([1, 2, 3])

In [29]: hasattr(arr, "__arrow_c_array__")
Out[29]: True

In [30]: pa.table(arr)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[30], line 1
----> 1 pa.table(arr)

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:6022, in pyarrow.lib.table()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:5841, in pyarrow.lib.record_batch()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/table.pxi:3886, in pyarrow.lib.RecordBatch._import_from_c_device_capsule()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/mambaforge/envs/scratchpad/lib/python3.12/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Cannot import schema: ArrowSchema describes non-struct type int64

Contributor commented:

Exposing __arrow_c_array__ is necessary but not sufficient. Both Array and RecordBatch expose the same __arrow_c_array__ interface. It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.
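
A small illustration of that overloading (pyarrow >= 14 assumed; the wrapper class is hypothetical, mirroring the test helpers added in this PR):

import pyarrow as pa


class ArrayOnlyWrapper:
    # Hypothetical producer exposing only __arrow_c_array__.
    def __init__(self, array):
        self.array = array

    def __arrow_c_array__(self, requested_schema=None):
        return self.array.__arrow_c_array__(requested_schema)


# A struct-typed array describes a struct schema, so pa.table() can
# reinterpret it as a RecordBatch with one column per struct field...
struct_arr = pa.array([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
print(pa.table(ArrayOnlyWrapper(struct_arr)).column_names)  # ['a', 'b']

# ...while wrapping a plain int64 array raises ArrowInvalid, as in the
# traceback above.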

Member Author commented:

And to be fair, RecordBatch has both __arrow_c_array__ and __arrow_c_stream__ dunder methods, so just testing with RecordBatch does not actually prove that pa.table(..) works with objects that only implement the array version. But because I wrap the record batch in the tests in a dummy object that only exposes __arrow_c_array__, the tests should cover this and assert DataFrame.from_arrow() works with both dunder methods.

WillAyd (Member) commented on Sep 4, 2024:

> It's overloaded to be able to interpret a RecordBatch as the same as an Array with type Struct.

Ah OK, that's good to know. So essentially it's up to the producer to determine whether this makes sense, right?

I think there is still a consistency problem with how we as a consumer then work. A RecordBatch can be read through both the array and stream interfaces, but a Table can only be read through the latter (unless it is forced to consolidate chunks and produce an Array).

I'm sure PyArrow has that covered well, but unless the spec clarifies how the array interface is expected to work, that might push libraries into making the (presumably poor) decision that their streams should also produce consolidated array data.

Contributor commented:

I'd say it's up to the consumer to decide if the input makes sense. The producer just says "here's my data".

But I think the key added part is user intention. A struct array can represent either one array or a full RecordBatch, and we need a hint from the user for which is which. This is why I couldn't add PyCapsule Interface support to polars.from_arrow, because it's missing the user intention of "this object is a series" or "this object is a DataFrame".

I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.
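
To make the missing-intent problem concrete, a hedged sketch using standard pyarrow calls (the framing is illustrative):

import pyarrow as pa

# One struct-typed array, two equally valid readings:
arr = pa.array([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])

# ...as a single column of structs (Series-like):
as_column = pa.table({"s": arr})

# ...as a batch of two columns (DataFrame-like):
as_frame = pa.Table.from_batches([pa.RecordBatch.from_struct_array(arr)])

print(as_column.column_names)  # ['s']
print(as_frame.column_names)   # ['x', 'y']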

WillAyd (Member) commented on Sep 4, 2024:

> I'm not sure I follow the rest of your comment @WillAyd. A stream never needs to concatenate data before starting the stream.

A theoretical example is a library that produces Arrow data and thinks it needs to implement __arrow_c_array__ for its "Table" equivalent because it did so for its RecordBatch equivalent. If the Table contained multiple chunks of data, I assume it would need to combine all of the chunks to pass data on through the __arrow_c_array__ interface.

Contributor commented:

Maybe the spec should be more explicit about when to implement which interface. I think it's implicit that a RecordBatch can implement both, because both are zero copy, but a Table should only implement the stream interface, because only the stream interface is always zero copy.

I raised an issue a while ago to discuss consumer implications, if you haven't seen it: apache/arrow#40648
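
A quick sketch of why the stream interface stays zero copy for a chunked Table (RecordBatchReader.from_stream requires pyarrow >= 14, as already used in the tests below):

import pyarrow as pa

# A Table built from two batches streams them one at a time;
# nothing is concatenated up front.
chunked = pa.Table.from_batches(
    [
        pa.record_batch([[1, 2]], names=["a"]),
        pa.record_batch([[3]], names=["a"]),
    ]
)
reader = pa.RecordBatchReader.from_stream(chunked)
print([len(batch) for batch in reader])  # [2, 1] -- original chunking preserved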

Member commented:

Ah OK great - thanks for sharing. I'll track that issue upstream


        df = data.to_pandas()
        return df

    @classmethod
    def from_dict(
        cls,
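
For reference, a minimal usage sketch of the method this diff adds, assuming a pandas build with this patch and pyarrow >= 14.0 installed:

import pyarrow as pa

import pandas as pd

# Any object implementing __arrow_c_array__ or __arrow_c_stream__ is accepted;
# a pyarrow.Table is the simplest case.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pd.DataFrame.from_arrow(table)
print(df)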
44 changes: 44 additions & 0 deletions pandas/tests/frame/test_arrow_interface.py
@@ -7,6 +7,7 @@
import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

pa = pytest.importorskip("pyarrow")

@@ -47,3 +48,46 @@ def test_dataframe_to_arrow():
    table = pa.RecordBatchReader.from_stream(df, schema=schema).read_all()
    expected = expected.cast(schema)
    assert table.equals(expected)


class ArrowArrayWrapper:
    def __init__(self, batch):
        self.array = batch

    def __arrow_c_array__(self, requested_schema=None):
        return self.array.__arrow_c_array__(requested_schema)


class ArrowStreamWrapper:
    def __init__(self, table):
        self.stream = table

    def __arrow_c_stream__(self, requested_schema=None):
        return self.stream.__arrow_c_stream__(requested_schema)


@td.skip_if_no("pyarrow", min_version="14.0")
def test_dataframe_from_arrow():
    # objects with __arrow_c_stream__
    table = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})

    result = pd.DataFrame.from_arrow(table)
    expected = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
    tm.assert_frame_equal(result, expected)

    # not only pyarrow objects are supported
    result = pd.DataFrame.from_arrow(ArrowStreamWrapper(table))
    tm.assert_frame_equal(result, expected)

    # objects with __arrow_c_array__
    batch = pa.record_batch([[1, 2, 3], ["a", "b", "c"]], names=["a", "b"])

    result = pd.DataFrame.from_arrow(batch)
    tm.assert_frame_equal(result, expected)

    result = pd.DataFrame.from_arrow(ArrowArrayWrapper(batch))
    tm.assert_frame_equal(result, expected)

    # only accept actual Arrow objects
    with pytest.raises(TypeError, match="Expected an Arrow-compatible tabular object"):
        pd.DataFrame.from_arrow({"a": [1, 2, 3], "b": ["a", "b", "c"]})