GH-39984: [Python] Add ChunkedArray import/export to/from C #39985
Conversation
python/pyarrow/table.pxi
Outdated
if requested_schema is not None:
    out_schema = Schema._import_from_c_capsule(requested_schema)
    if self.schema != out_schema:
        table = self.cast(out_schema)
Not strictly necessary, but it would be nicer (both for memory consumption and for latency) to cast each batch when required, rather than the whole table up front.
You could simply use RecordBatchReader.from_batches with a generator that casts each batch in turn. Something like:
batches = self.to_batches()
schema = self.schema
if requested_schema is not None:
    out_schema = Schema._import_from_c_capsule(requested_schema)
    if schema != out_schema:
        batches = (batch.cast(out_schema) for batch in batches)
        schema = out_schema
return RecordBatchReader.from_batches(schema, batches)
(or you can fold the functionality directly into PyRecordBatchReader)
python/pyarrow/table.pxi
Outdated
@@ -4932,7 +4994,13 @@ cdef class Table(_Tabular):
         -------
         PyCapsule
         """
-        return self.to_reader().__arrow_c_stream__(requested_schema)
+        cdef Table table = self
This is probably not required.
python/pyarrow/table.pxi
Outdated
if requested_schema is not None:
    out_type = DataType._import_from_c_capsule(requested_schema)
    if self.type != out_type:
        chunked = self.cast(out_type)
Same remark as in Table.__arrow_c_stream__.
Is this as much of a concern as with a Table? I can't think of any clean way to do this lazily on a per-chunk basis, although I'm happy to remove the feature if it's that bad of an idea.
A quick check suggests casting up front is faster... perhaps because it obliterates the chunks:
import pyarrow as pa
import numpy as np

n = int(1e6)
n_chunks = 1000
per_chunk = n // n_chunks
chunks = [np.random.random(per_chunk) for _ in range(n_chunks)]
chunked = pa.chunked_array(chunks)

def roundtrip_chunked():
    stream_capsule = chunked.__arrow_c_stream__()
    chunked._import_from_c_capsule(stream_capsule)

%timeit roundtrip_chunked()
#> 3.72 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

def roundtrip_chunked_cast():
    stream_capsule = chunked.cast(pa.float32()).__arrow_c_stream__()
    chunked._import_from_c_capsule(stream_capsule)

%timeit roundtrip_chunked_cast()
#> 1.52 ms ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Well, I gave a possible solution above.
Also, your benchmark is bizarre: where is the casting in roundtrip_chunked?
Nowhere! A roundtrip without casting is almost certainly faster than a roundtrip with casting, and it is already twice as slow as the cast + roundtrip. I'm sure there's room to make a better benchmark (I'm also running a C++ debug build), but I'm personally convinced that the cast + export solution is not so bad that it should not be attempted.
> Well, I gave a possible solution above.

For RecordBatches? I don't think we have a way to do that for a stream of Array in Arrow C++ or in pyarrow.
I'm happy to remove the feature as well and leave it to be implemented properly later... I didn't anticipate it being controversial.
> A roundtrip without casting is almost certainly faster than a roundtrip with casting, and it is already twice as slow as the cast + roundtrip.
If it doesn't make sense then there's something wrong.
> I'm happy to remove the feature as well and leave it to be implemented properly later... I didn't anticipate it being controversial.
It's not controversial. The implementation is. I'm sure for simple benchmarks with a small dataset, an otherwise idle machine, and enough RAM to hold multiple copies of the dataset, casting everything at once can seem slightly faster because it saves some overhead. That doesn't make it a viable strategy in the general case.
Got it! I dropped the cast and we can circle back and do it right. I opened a PR to do it properly for the batch-wise RecordBatchReader export and that might serve as a template for how this should work, too.
python/pyarrow/tests/test_cffi.py
Outdated
batch = make_batch()
requested_schema = pa.schema([('ints', pa.list_(pa.int64()))])
requested_capsule = requested_schema.__arrow_c_schema__()
# RecordBatch has no cast() method
This should be fixed instead of working around it.
@paleolimbot I think it would be nice to add a test_nanoarrow.py to exercise nanoarrow integration.
> Not strictly necessary, but it would be nicer (both for memory consumption and for latency) to cast each batch when required, rather than the whole table up front.
I removed table casting for now... I'm still game to try in #40066, maybe with a CastingRecordBatchReader : public RecordBatchReader that would solve the issue for streams, too.
> I think it would be nice to add a test_nanoarrow.py to exercise nanoarrow integration.
I opened #40065... probably best done when nanoarrow's Python is a little more stable (i.e., when there is a nanoarrow.Array and a nanoarrow.ArrayStream, neither of which currently exist).
Thank you @paleolimbot !
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 91bf1c9. There was 1 benchmark result with an error.
There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
GH-39984: [Python] Add ChunkedArray import/export to/from C (apache#39985)

### Rationale for this change

ChunkedArrays have an unambiguous representation as a stream of arrays. apache#39455 added the ability to import/export in C++; this PR wires up the new functions in pyarrow.

### What changes are included in this PR?

- Added `__arrow_c_stream__()` and `_import_from_c_capsule()` to the `ChunkedArray`

### Are these changes tested?

Yes! Tests were added.

### Are there any user-facing changes?

Yes! But I'm not sure where the protocol methods are documented.

```python
import pyarrow as pa
import nanoarrow as na

chunked = pa.chunked_array([pa.array([0, 1, 2]), pa.array([3, 4, 5])])
[na.c_array_view(item) for item in na.c_array_stream(chunked)]
```

```
[<nanoarrow.c_lib.CArrayView>
 - storage_type: 'int64'
 - length: 3
 - offset: 0
 - null_count: 0
 - buffers[2]:
   - <bool validity[0 b] >
   - <int64 data[24 b] 0 1 2>
 - dictionary: NULL
 - children[0]:,
 <nanoarrow.c_lib.CArrayView>
 - storage_type: 'int64'
 - length: 3
 - offset: 0
 - null_count: 0
 - buffers[2]:
   - <bool validity[0 b] >
   - <int64 data[24 b] 3 4 5>
 - dictionary: NULL
 - children[0]:]
```

```python
stream_capsule = chunked.__arrow_c_stream__()
chunked2 = chunked._import_from_c_capsule(stream_capsule)
chunked2
```

```
<pyarrow.lib.ChunkedArray object at 0x105bb70b0>
[
  [0, 1, 2],
  [3, 4, 5]
]
```

* Closes: apache#39984

Lead-authored-by: Dewey Dunnington <[email protected]>
Co-authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
ChunkedArrays have an unambiguous representation as a stream of arrays. #39455 added the ability to import/export in C++...this PR wires up the new functions in pyarrow.
What changes are included in this PR?
- Added `__arrow_c_stream__()` and `_import_from_c_capsule()` to the `ChunkedArray`
Are these changes tested?
Yes! Tests were added.
Are there any user-facing changes?
Yes! But I'm not sure where the protocol methods are documented.