
Read Arrow C Stream from Arrow PyCapsule Interface #501

Open · wants to merge 6 commits into main
Conversation

@kylebarron commented Jul 23, 2024

Closes #498.

This is essentially just a vendoring of these lines of code (then backported to pyo3 0.20).

@kylebarron kylebarron changed the title Read from Arrow C Data interface Read Arrow C Stream from Arrow PyCapsule Interface Jul 23, 2024
@jonmmease (Collaborator)

Thanks @kylebarron, could you add a quick comment somewhere in the code for how we could replace the vendored code after updating arrow and pulling in pyo3-arrow?

Comment on lines 273 to 276
// pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {
// let (batches, schema) = table.into_inner();
// Ok(VegaFusionTable::try_new(schema, batches)?)
// }
@kylebarron (Author)

Only a couple lines of code with pyo3_arrow!

@kylebarron (Author)

Aside: I never know whether there's consistency in the ecosystem between an ordering of (schema, batches) or (batches, schema) 🫠

@kylebarron (Author)

@jonmmease are you able to see why the CI checks failed? It just says "job failed" for me

@jonmmease (Collaborator)

I haven't had a chance to dig into it, but it's fine to leave things here. I'll finish it up for the next VegaFusion release

@jonmmease (Collaborator)

Quick update. I'm slogging through updating DataFusion and arrow-rs over in #504. Once that lands we can rebase this and drop the vendored code.

@kylebarron (Author)

Awesome. And I just published pyo3-arrow 0.2 last night

@jonmmease (Collaborator) commented Aug 15, 2024

@kylebarron I made some updates to this PR after updating arrow to version 52, and updating pyo3-arrow to 0.2.0. If you have a chance to see if what I have here makes sense, that would be great.

I tested this using Polars, and it seems to be working in general, but I immediately ran into an issue: string columns from Polars fail to convert with:

C Data interface error: The datatype ""vu"" is still not supported in Rust implementation

Looks like arrow-rs isn't supporting the new UTF8View type in https://github.com/apache/arrow-rs/blob/ca5fe7df5ca0bdcfd6f732cc3f10da511b753c5f/arrow/src/datatypes/ffi.rs#L30. Does that track your understanding?

It seems some work is still needed in arrow-rs, and I don't think I can route Polars through this path until that support is released.


Update: Oh, this was merged two weeks ago in apache/arrow-rs#6171, but arrow-rs 52.2.0 was released three weeks ago. So we'll just need to wait a bit, I guess. We can still merge this as a fallback behind the dataframe protocol and then reorder the two paths later.
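For reference, the "vu" in that error is the Arrow C Data Interface format string for Utf8View. A minimal lookup sketch, covering only a small subset of the spec's format strings for illustration:

```python
# A few Arrow C Data Interface format strings (subset; see the Arrow
# C data interface specification for the full list).
FORMATS = {
    "u": "utf8",
    "U": "large_utf8",
    "vu": "utf8_view",  # the type arrow-rs 52.x could not yet import
    "z": "binary",
    "Z": "large_binary",
    "vz": "binary_view",
}

def describe(fmt: str) -> str:
    """Map a C Data Interface format string to a readable type name."""
    return FORMATS.get(fmt, f"unknown format {fmt!r}")
```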

Comment on lines +276 to +277
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
@kylebarron (Author)

If you keep this method the same to only support PyTable, PyTable implements FromPyObject, so you can make it the parameter directly.

Suggested change:
- pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
-     let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
+ pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {

pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
let (batches, schema) = pytable.into_inner();
Ok(VegaFusionTable::try_new(schema, batches)?)
@kylebarron (Author)

I never know whether batches, schema or schema, batches is a better ordering 🥲

@@ -271,6 +272,13 @@ impl VegaFusionTable {
}
}

#[cfg(feature = "pyarrow")]
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
@kylebarron (Author)

Is this a public API, or an internal API called by your Python code? If it's public, it might be preferable to overload the existing from_pyarrow method.

You can check for both the PyCapsule Interface and fall back to pyarrow in a single method; that's the approach I'd recommend here.

You can do this by first trying to extract a PyTable, which will work for any recent pyarrow Table object (or pyarrow RecordBatchReader), and then falling back to your existing from_pyarrow code.

    pub fn from_arrow(table: &Bound<PyAny>) -> PyResult<Self> {
        if let Ok(pytable) = table.extract::<PyTable>() {
            let (batches, schema) = pytable.into_inner();
            Ok(VegaFusionTable::try_new(schema, batches)?)
        } else {
            Self::from_pyarrow(table)
        }
    }
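The same prefer-PyCapsule-then-fall-back dispatch can be sketched from the Python side with plain duck typing. The helper and stand-in classes below are hypothetical, not VegaFusion's actual API, and the string return values stand in for real PyCapsule and pyarrow objects:

```python
def to_table(obj):
    """Hypothetical converter: prefer the PyCapsule stream protocol,
    then fall back to an explicit pyarrow-based path."""
    if hasattr(obj, "__arrow_c_stream__"):
        # Any PyCapsule-capable producer (pyarrow Table,
        # RecordBatchReader, Polars DataFrame, ...) lands here.
        return ("pycapsule", obj.__arrow_c_stream__())
    if hasattr(obj, "to_pyarrow"):
        # Fallback for objects that only expose a pyarrow representation.
        return ("pyarrow", obj.to_pyarrow())
    raise TypeError(f"{type(obj).__name__} does not expose Arrow data")

class Capsuled:
    """Stand-in for a PyCapsule-capable producer."""
    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule"

class Legacy:
    """Stand-in for a pyarrow-only producer."""
    def to_pyarrow(self):
        return "table"
```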

@kylebarron (Author)

C Data interface error: The datatype ""vu"" is still not supported in Rust implementation

Ah yes, of course we'd hit this.

Correct: arrow-rs won't support the new StringView type until arrow 53 (arrow-rs didn't merge any breaking changes in 52.1 or 52.2). I believe everything is in place for this support in arrow 53, though (due in September).

The PyCapsule Interface provides a mechanism to handle this: schema requests. This is a way for consumers to ask producers to cast their data to a specific type before export. I'd like to do a follow-up PR to Polars to implement this, but I haven't yet: pola-rs/polars#17676 (comment). So Polars currently ignores schema requests.

So the two routes are either:

  • Wait for arrow 53
  • Implement schema requests in polars; wait for new polars release; implement schema request logic on the consumer side here (to ask for normal string types instead of string views)
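The schema-request flow described above can be sketched as follows. The Producer class is hypothetical; a real producer receives and returns ArrowSchema/ArrowArrayStream PyCapsules rather than the strings used here:

```python
class Producer:
    """Hypothetical producer that honors an optional requested_schema."""

    def __init__(self, native_type: str):
        self.native_type = native_type

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation would parse the ArrowSchema capsule and
        # cast its data before export; here we just honor the request
        # by swapping the exported type name.
        export_type = requested_schema or self.native_type
        return f"stream<{export_type}>"

# Consumer side: ask for plain utf8 instead of utf8_view.
polars_like = Producer("utf8_view")
stream = polars_like.__arrow_c_stream__(requested_schema="utf8")
```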

@jonmmease (Collaborator)

Thanks for the context @kylebarron, I'm fine to wait for Arrow 53. But this will also require waiting for a DataFusion release based on Arrow 53, so I guess that could end up being a month later.

Separately, does the Arrow PyCapsule interface provide a way to query for whether extracting the arrow data is a cheap operation? Ibis is one scenario I have in mind. I believe Ibis supports the PyCapsule interface, but this must require running the SQL query and pulling down every possible column, which can be very expensive. For Ibis, it's much more efficient to request only the specific columns that we actually need. For a library that already has the Arrow representation materialized in memory, I would expect it to be cheaper to return the full representation rather than build a new representation that has only the needed columns.

It would be nice if there were a way to say, please return these columns, but it's fine to return more if that's cheaper.

@kylebarron (Author)

does the Arrow PyCapsule interface provide a way to query for whether extracting the arrow data is a cheap operation?

No, it doesn't, but I think this largely comes down to user semantics. In the case of ibis, you might have an ibis Table representing a full SQL table. Passing that as arrow_func(ibis_table) will materialize the entire table, but the user can consciously call arrow_func(ibis_table.select(column_names)), which limits the amount of data collected and passed to arrow_func.
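That select-before-export pattern can be illustrated with a minimal stand-in for a lazy table. The class is hypothetical (real ibis tables export PyCapsules, not tuples):

```python
class LazyTable:
    """Hypothetical lazy table: nothing is materialized until export."""

    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, names):
        # Narrow the projection before export, like ibis_table.select(...).
        return LazyTable([c for c in self.columns if c in names])

    def __arrow_c_stream__(self, requested_schema=None):
        # Export "materializes" whatever columns remain in the projection.
        return tuple(self.columns)

t = LazyTable(["a", "b", "c", "d"])
full = t.__arrow_c_stream__()                        # materializes everything
narrow = t.select(["a", "b"]).__arrow_c_stream__()   # only what's needed
```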

In my mind, the interface is mostly about efficient communication between libraries that already have an Arrow representation.

@kylebarron (Author)

this will also require waiting for a DataFusion release based on Arrow 53, so I guess that could end up being a month later.

Right. I would like to implement schema requests in polars regardless, but it's a lower priority than getting it implemented to start with.

@kylebarron (Author)

As an update here: lamentably, arrow 53 supports StringView/BinaryView, but not through the C FFI; see apache/arrow-rs#6368. So if you're restricted to a released version of arrow (i.e. not a git tag), we'll have to wait for arrow 53.1.

@jonmmease (Collaborator)

I think we are restricted to using a released crate: we publish the Rust crates to crates.io so that conda-forge can rebuild the sdist from source, and I don't think that will work with a git-tagged dependency.

@kylebarron (Author)

If you publish to crates.io, then yes, you can only use released versions. I might be able to get around this in arro3 specifically by vendoring the relevant parts of the FFI fixes

@kylebarron (Author) commented Oct 6, 2024

Arrow 53.1 was released, which appears to fully support string view and binary view: kylebarron/arro3#219. I'd love to get this merged! Let me know if you need any help!

@jonmmease (Collaborator)

Awesome, I'm working on updating to DataFusion 42 and Arrow 53.1 over in #513. I've started a v2 branch and I'm planning to retarget this PR to v2 after merging that one.

For VegaFusion 2, I want to make pandas and pyarrow optional and this PR is a crucial part of that plan.

@jonmmease (Collaborator)

I started trying to port this to my v2 branch that's updated to arrow 53.1, but ran into a roadblock. The pyarrow feature in the arrow crate requires PyO3 0.22, which is what I've updated VegaFusion to use in v2. But it looks like pyo3-arrow requires PyO3 0.21.

Are there any blockers for you in updating to PyO3 0.22? I may be able to drop the arrow/pyarrow feature flag eventually, but there's more to figure out here.

In particular, some of VegaFusion's logic currently returns Arrow data from Rust to Python as pyarrow, and we'll need to think about what to do here in order for pyarrow to be an optional dependency. I suppose we could return the PyCapsule object from Rust and then create pyarrow or Polars objects from that.

@kylebarron (Author) commented Oct 7, 2024

Are there any blockers for you in updating to PyO3 0.22?

Yes. I need the numpy crate to support 0.22, which should happen any day now: PyO3/rust-numpy#442, PyO3/rust-numpy#435.

In particular, some of VegaFusion's logic currently returns Arrow data from Rust to Python as pyarrow, and we'll need to think about what to do here in order for pyarrow to be an optional dependency.

All of the relevant pyo3_arrow objects have a to_pyarrow method. Currently, this works in terms of the PyCapsule Interface, so it has a pretty high pyarrow minimum version (pyarrow v14+ for Table), but in theory we could implement the older C interface approach with pyarrow for older versions.

@jonmmease (Collaborator)

All of the relevant pyo3_arrow objects have a to_pyarrow method. Currently, this works in terms of the PyCapsule Interface, so it has a pretty high pyarrow minimum version (pyarrow v14+ for Table), but in theory we could implement the older C interface approach with pyarrow for older versions.

Ok, so I could remove the arrow/pyarrow feature flag and use these methods for the conversion back to pyarrow? I'm ok with the higher pyarrow minimum version since this will be part of a major version update in VegaFusion, and will hopefully be optional.

@kylebarron (Author)

Ok, so I could remove the arrow/pyarrow feature flag and use these methods for the conversion back to pyarrow?

Yep!

@kylebarron (Author)

Looks like it's scheduled for Wednesday: PyO3/rust-numpy#453

@kylebarron (Author)

I published pyo3-arrow 0.5 using pyo3 and numpy 0.22: https://crates.io/crates/pyo3-arrow

@jonmmease (Collaborator)

Starting to work on updating to pyo3-arrow 0.5, and it looks like the minimum Python version is now 3.11 (at least, I get compilation errors with abi3-py310 and below). Is that intentional? I guess this is technically exactly in line with spec-0000, but Altair is generally a bit more generous here, tracking the Python EOL schedule more closely, which would still require Python 3.9 support.

Here are the compilation errors with abi3-py310

   Compiling pyo3-arrow v0.5.0
error[E0432]: unresolved import `pyo3::buffer`
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:16:11
    |
16  | use pyo3::buffer::{ElementType, PyBuffer};
    |           ^^^^^^ could not find `buffer` in `pyo3`
    |
note: found an item that was configured out
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.4/src/lib.rs:442:9
    |
442 | pub mod buffer;
    |         ^^^^^^

error[E0412]: cannot find type `Py_buffer` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:49:25
   |
49 |         view: *mut ffi::Py_buffer,
   |                         ^^^^^^^^^ not found in `ffi`

error[E0425]: cannot find function, tuple struct or tuple variant `PyBuffer_FillInfo` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:54:28
   |
54 |             let ret = ffi::PyBuffer_FillInfo(
   |                            ^^^^^^^^^^^^^^^^^ not found in `ffi`

error[E0412]: cannot find type `Py_buffer` in module `ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:71:75
   |
71 |     unsafe fn __releasebuffer__(mut slf: PyRefMut<Self>, _view: *mut ffi::Py_buffer) {
   |                                                                           ^^^^^^^^^ not found in `ffi`

error[E0412]: cannot find type `Py_buffer` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0425]: cannot find function `getbufferproc` in module `pyo3::impl_::trampoline`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::impl_::trampoline`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0412]: cannot find type `getbufferproc` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0425]: cannot find function `releasebufferproc` in module `pyo3::impl_::trampoline`
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
    |
34  | #[pymethods]
    | ^^^^^^^^^^^^ not found in `pyo3::impl_::trampoline`
    |
note: found an item that was configured out
   --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.4/src/impl_/trampoline.rs:132:15
    |
132 | pub unsafe fn releasebufferproc(
    |               ^^^^^^^^^^^^^^^^^
    = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0412]: cannot find type `releasebufferproc` in module `pyo3::ffi`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/buffer.rs:34:1
   |
34 | #[pymethods]
   | ^^^^^^^^^^^^ not found in `pyo3::ffi`
   |
   = note: this error originates in the attribute macro `pymethods` (in Nightly builds, run with -Z macro-backtrace for more info)


@kylebarron (Author)

The minimum Python version does not need to be 3.11. You have three options:

  • use abi3-py311
  • Remove any abi3* flag
  • Take off the buffer_protocol feature flag with no-default-features

pyo3-arrow now supports buffer-protocol input (like numpy arrays) with zero copy, as long as they are C contiguous. But the buffer protocol only became part of Python's ABI-stable subset as of Python 3.11.

Because of this, arro3 builds per-Python version wheels now. https://pypi.org/project/arro3-core/#files
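In Cargo.toml terms, the three options above look roughly like this (a sketch; crate versions are taken from this thread, and the buffer_protocol default feature from the message above):

```toml
# Option 1: raise the abi3 floor so the buffer protocol is ABI-stable
# pyo3 = { version = "0.22", features = ["abi3-py311"] }

# Option 2: drop abi3 entirely and build per-Python-version wheels
# pyo3 = "0.22"

# Option 3: keep an older abi3 floor and disable pyo3-arrow's
# buffer-protocol support by turning off default features
pyo3 = { version = "0.22", features = ["abi3-py310"] }
pyo3-arrow = { version = "0.5", default-features = false }
```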

@jonmmease (Collaborator)

Thanks for the context @kylebarron! When I disable the default features, I'm seeing this build error (regardless of Python version).

   Compiling pyo3-arrow v0.5.0
error[E0432]: unresolved import `crate::PyArrowBuffer`
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/array.rs:31:13
   |
31 | use crate::{PyArrowBuffer, PyDataType, PyField};
   |             ^^^^^^^^^^^^^ no `PyArrowBuffer` in the root
   |
note: found an item that was configured out
  --> /Users/jonmmease/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-arrow-0.5.0/src/lib.rs:25:17
   |
25 | pub use buffer::PyArrowBuffer;
   |                 ^^^^^^^^^^^^^
   = note: the item is gated behind the `buffer_protocol` feature

For more information about this error, try `rustc --explain E0432`.
error: could not compile `pyo3-arrow` (lib) due to 1 previous error

It looks like the import here might need to be conditioned on the buffer_protocol feature. Does that make sense?

https://github.com/kylebarron/arro3/blob/523585bc95fd2a0ee010c5f47a798e808e8f388d/pyo3-arrow/src/array.rs#L31

@kylebarron (Author)

Ah, I need to add a CI check for no-default-features :/


Successfully merging this pull request may close these issues: "Support ingesting objects that support the Arrow PyCapsule API".