
[Python]: Support PyCapsule Interface Objects as input in more places #43410

Open

kylebarron opened this issue Jul 24, 2024 · 3 comments

@kylebarron (Contributor)

Describe the enhancement requested

Now that the PyCapsule Interface is starting to gain more traction (#39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.

Do people have opinions on what functions should or should not check for these objects? I'd argue that file-format writers should check for them, because it's only a couple of lines of code, and the input stream will be fully iterated over regardless. E.g. looking at the Parquet writer: the high-level API doesn't currently accept a RecordBatchReader either, so support for both could be added at the same time.

from dataclasses import dataclass
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class ArrowCStream:
    """Minimal wrapper that exposes only the Arrow PyCapsule stream protocol."""

    obj: Any

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)


table = pa.table({"a": [1, 2, 3, 4]})
pq.write_table(table, "test.parquet")  # works

reader = pa.RecordBatchReader.from_stream(table)
pq.write_table(reader, "test.parquet")  # fails
pq.write_table(ArrowCStream(table), "test.parquet")  # fails

I'd argue that the writer should be generalized to accept any object with an __arrow_c_stream__ dunder, while ensuring the stream is not materialized as a table.

Component(s)

Python

@jorisvandenbossche (Member)

Specifically for pq.write_table(), this might be a bit trickier (without consuming the stream), because it currently uses parquet::arrow::FileWriter::WriteTable, which explicitly requires a table as input. The FileWriter interface also supports writing record batches, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called write_table?).
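
For illustration, a rough Python-level sketch of that batch-wise approach, streaming to Parquet via pq.ParquetWriter instead of FileWriter::WriteTable. The helper name write_stream is hypothetical; this is not pyarrow's actual implementation:

import pyarrow as pa
import pyarrow.parquet as pq


def write_stream(data, where):
    # Hypothetical helper, not an existing pyarrow API.
    if isinstance(data, pa.Table):
        # Existing fast path.
        pq.write_table(data, where)
        return
    if hasattr(data, "__arrow_c_stream__"):
        # Wrap any PyCapsule stream object and write batch by batch,
        # so the stream is never materialized as a table.
        reader = pa.RecordBatchReader.from_stream(data)
        with pq.ParquetWriter(where, reader.schema) as writer:
            for batch in reader:
                writer.write_batch(batch)
        return
    raise TypeError("expected a pyarrow.Table or an Arrow stream-compatible object")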

But in general, certainly +1 on more widely supporting the interface.

Some other possible areas:

  • The dataset API for writing. pyarrow.dataset.write_dataset already accepts a record batch reader, so this should be straightforward to extend.
  • Compute functions from pyarrow.compute? Those could certainly accept objects with __arrow_c_array__, and in theory also __arrow_c_stream__, but they would fully consume the stream and return a materialized result, so I'm not sure whether that matches expectations (although, if you know those functions, that is kind of expected, so maybe this just requires good documentation).
  • Many of the methods on the Array/RecordBatch/Table classes accept similar objects (e.g. arr.take(..)). Not sure if we want to make those work with interface objects as well. What we currently accept as input is a bit inconsistent (only strictly a pyarrow array, or also a numpy array, a list, anything array-like, or any sequence or collection?), so if we harmonized that with some helper, we could at the same time easily add support for any arrow-array-like object (see the sketch after this list).
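
As a rough illustration of such a helper (hypothetical name, not an existing pyarrow API; it assumes pa.array() accepts objects implementing __arrow_c_array__, which recent pyarrow versions do):

import pyarrow as pa


def as_arrow_array(values) -> pa.Array:
    # Hypothetical helper, not part of pyarrow: normalize anything
    # "arrow-array-like" into a pyarrow.Array before a method uses it.
    if isinstance(values, pa.Array):
        return values
    if hasattr(values, "__arrow_c_array__"):
        # Objects from other Arrow implementations; assumes pa.array()
        # understands the PyCapsule array protocol (recent pyarrow does).
        return pa.array(values)
    # Lists, numpy arrays, and other sequences go through the normal path.
    return pa.array(values)


# Usage: arr.take(as_arrow_array(indices)) would then work for a pyarrow
# array, a numpy array, a Python list, or any __arrow_c_array__ object.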

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Aug 20, 2024
@jorisvandenbossche (Member)

Started with exploring write_dataset -> #43771

@kylebarron (Contributor, Author)

That sounds awesome.

For reference, in my own experiments in https://github.com/kylebarron/arro3 I created an ArrayReader class, essentially a RecordBatchReader generalized to yield generic Arrays. Then, for example, cast is overloaded: if it sees an object with __arrow_c_array__, it immediately returns an arro3.Array with the result; if it sees an object with __arrow_c_stream__, it creates a new ArrayReader holding an iterator that applies the compute function, so it lazily yields casted chunks.
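
As a rough pyarrow-only sketch of that dispatch (hypothetical helper, not arro3's actual API; the stream path assumes single-column batches):

import pyarrow as pa
import pyarrow.compute as pc


def cast_dispatch(obj, target_type: pa.DataType):
    # Hypothetical helper illustrating the eager/lazy dispatch; not arro3's API.
    if hasattr(obj, "__arrow_c_array__"):
        # Eager path: materialize a single array and return the casted result.
        return pc.cast(pa.array(obj), target_type)
    if hasattr(obj, "__arrow_c_stream__"):
        # Lazy path: wrap the stream and yield casted chunks one at a time
        # (assumes each batch has a single column).
        reader = pa.RecordBatchReader.from_stream(obj)
        return (pc.cast(batch.column(0), target_type) for batch in reader)
    raise TypeError("expected an object implementing an Arrow PyCapsule protocol")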
