
[Python]: Support PyCapsule Interface Objects as input in more places #43410

Open

kylebarron opened this issue Jul 24, 2024 · 3 comments

@kylebarron (Contributor)

Describe the enhancement requested

Now that the PyCapsule Interface is starting to gain more traction (#39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.

Do people have opinions on what functions should or should not check for these objects? I'd argue that file-format writers should check for them, because it's only a couple of lines of code, and the input stream will be fully iterated over regardless. E.g. looking at the Parquet writer: the high-level API doesn't currently accept a RecordBatchReader either, so support for both could be added at the same time.

from dataclasses import dataclass
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class ArrowCStream:
    """Minimal wrapper that exposes only the Arrow PyCapsule stream protocol."""

    obj: Any

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)


table = pa.table({"a": [1, 2, 3, 4]})
pq.write_table(table, "test.parquet")  # works

reader = pa.RecordBatchReader.from_stream(table)
pq.write_table(reader, "test.parquet")  # fails
pq.write_table(ArrowCStream(table), "test.parquet")  # fails

I'd argue that the writer should be generalized to accept any object with an __arrow_c_stream__ dunder, while ensuring the stream is not materialized as a table.

Component(s)

Python

@jorisvandenbossche (Member)

Specifically for pq.write_table(), this might be a bit trickier (without consuming the stream), because it currently uses parquet::arrow::FileWriter::WriteTable, which explicitly requires a table as input. The FileWriter interface also supports writing record batches, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called write_table?).
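
For illustration, a rough Python-level sketch of that batch-wise approach, streaming to Parquet via pq.ParquetWriter instead of FileWriter::WriteTable. The helper name write_stream is hypothetical; this is not pyarrow's actual implementation:

import pyarrow as pa
import pyarrow.parquet as pq


def write_stream(data, where):
    # Hypothetical helper, not an existing pyarrow API.
    if isinstance(data, pa.Table):
        # Existing fast path.
        pq.write_table(data, where)
        return
    if hasattr(data, "__arrow_c_stream__"):
        # Wrap any PyCapsule stream object and write batch by batch,
        # so the stream is never materialized as a table.
        reader = pa.RecordBatchReader.from_stream(data)
        with pq.ParquetWriter(where, reader.schema) as writer:
            for batch in reader:
                writer.write_batch(batch)
        return
    raise TypeError("expected a pyarrow.Table or an Arrow stream-compatible object")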

But in general, certainly +1 on more widely supporting the interface.

Some other possible areas:

  • The dataset API for writing. pyarrow.dataset.write_dataset already accepts a record batch reader, so this should be straightforward to extend.
  • Compute functions from pyarrow.compute? Those could certainly accept objects with __arrow_c_array__, and in theory also __arrow_c_stream__, but they would fully consume the stream and return a materialized result, so I'm not sure whether that matches expectations (although, if you know those functions, that is kind of expected, so maybe this just requires good documentation).
  • Many of the methods on the Array/RecordBatch/Table classes accept similar objects (e.g. arr.take(..)). Not sure if we want to make those work with interface objects as well. What we currently accept as input is a bit inconsistent (only strictly a pyarrow array, or also a numpy array, a list, anything array-like, or any sequence or collection?), so if we harmonized that with some helper, we could at the same time easily add support for any arrow-array-like object (see the sketch after this list).
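
As a rough illustration of such a helper (hypothetical name, not an existing pyarrow API; it assumes pa.array() accepts objects implementing __arrow_c_array__, which recent pyarrow versions do):

import pyarrow as pa


def as_arrow_array(values) -> pa.Array:
    # Hypothetical helper, not part of pyarrow: normalize anything
    # "arrow-array-like" into a pyarrow.Array before a method uses it.
    if isinstance(values, pa.Array):
        return values
    if hasattr(values, "__arrow_c_array__"):
        # Objects from other Arrow implementations; assumes pa.array()
        # understands the PyCapsule array protocol (recent pyarrow does).
        return pa.array(values)
    # Lists, numpy arrays, and other sequences go through the normal path.
    return pa.array(values)


# Usage: arr.take(as_arrow_array(indices)) would then work for a pyarrow
# array, a numpy array, a Python list, or any __arrow_c_array__ object.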

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Aug 20, 2024
@jorisvandenbossche (Member)

Started with exploring write_dataset -> #43771

@kylebarron (Contributor, Author)

That sounds awesome.

For reference, in my own experiments in https://github.com/kylebarron/arro3 I created an ArrayReader class, essentially a RecordBatchReader generalized to yield generic Arrays. Then, for example, cast is overloaded: if it sees an object with __arrow_c_array__, it immediately returns an arro3.Array with the result; if it sees an object with __arrow_c_stream__, it creates a new ArrayReader holding an iterator that applies the compute function, so it lazily yields casted chunks.
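
As a rough pyarrow-only sketch of that dispatch (hypothetical helper, not arro3's actual API; the stream path assumes single-column batches):

import pyarrow as pa
import pyarrow.compute as pc


def cast_dispatch(obj, target_type: pa.DataType):
    # Hypothetical helper illustrating the eager/lazy dispatch; not arro3's API.
    if hasattr(obj, "__arrow_c_array__"):
        # Eager path: materialize a single array and return the casted result.
        return pc.cast(pa.array(obj), target_type)
    if hasattr(obj, "__arrow_c_stream__"):
        # Lazy path: wrap the stream and yield casted chunks one at a time
        # (assumes each batch has a single column).
        reader = pa.RecordBatchReader.from_stream(obj)
        return (pc.cast(batch.column(0), target_type) for batch in reader)
    raise TypeError("expected an object implementing an Arrow PyCapsule protocol")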
