
GH-39984: [Python] Add ChunkedArray import/export to/from C #39985

Merged: 11 commits merged into apache:main from paleolimbot:python-chunked-array-dunder on Feb 14, 2024

Conversation

@paleolimbot (Member) commented Feb 7, 2024

Rationale for this change

ChunkedArrays have an unambiguous representation as a stream of arrays. #39455 added the ability to import/export them in C++; this PR wires up the new functions in pyarrow.

What changes are included in this PR?

  • Added `__arrow_c_stream__()` and `_import_from_c_capsule()` to `ChunkedArray`

Are these changes tested?

Yes! Tests were added.

Are there any user-facing changes?

Yes! But I'm not sure where the protocol methods are documented.

```python
import pyarrow as pa
import nanoarrow as na

chunked = pa.chunked_array([pa.array([0, 1, 2]), pa.array([3, 4, 5])])
[na.c_array_view(item) for item in na.c_array_stream(chunked)]
```

    [<nanoarrow.c_lib.CArrayView>
     - storage_type: 'int64'
     - length: 3
     - offset: 0
     - null_count: 0
     - buffers[2]:
       - <bool validity[0 b] >
       - <int64 data[24 b] 0 1 2>
     - dictionary: NULL
     - children[0]:,
     <nanoarrow.c_lib.CArrayView>
     - storage_type: 'int64'
     - length: 3
     - offset: 0
     - null_count: 0
     - buffers[2]:
       - <bool validity[0 b] >
       - <int64 data[24 b] 3 4 5>
     - dictionary: NULL
     - children[0]:]

```python
stream_capsule = chunked.__arrow_c_stream__()
chunked2 = chunked._import_from_c_capsule(stream_capsule)
chunked2
```

    <pyarrow.lib.ChunkedArray object at 0x105bb70b0>
    [
      [
        0,
        1,
        2
      ],
      [
        3,
        4,
        5
      ]
    ]

github-actions bot commented Feb 7, 2024

⚠️ GitHub issue #39984 has been automatically assigned in GitHub to PR creator.

@paleolimbot paleolimbot marked this pull request as ready for review February 9, 2024 13:52
```python
if requested_schema is not None:
    out_schema = Schema._import_from_c_capsule(requested_schema)
    if self.schema != out_schema:
        table = self.cast(out_schema)
```
@pitrou (Member) commented:
Not strictly necessary, but it would be nicer (both for memory consumption and for latency) to cast each batch when required, rather than the whole table up front.

You could simply use RecordBatchReader.from_batches with a generator that casts each batch in turn. Something like:

```python
batches = table.to_batches()
if requested_schema is not None:
    out_schema = Schema._import_from_c_capsule(requested_schema)
    if self.schema != out_schema:
        batches = (batch.cast(out_schema) for batch in batches)

return RecordBatchReader.from_batches(batches)
```

(or you can fold the functionality directly in PyRecordBatchReader)

```diff
@@ -4932,7 +4994,13 @@ cdef class Table(_Tabular):
         -------
         PyCapsule
         """
-        return self.to_reader().__arrow_c_stream__(requested_schema)
+        cdef Table table = self
```
@pitrou (Member) commented:

This is probably not required.

```python
if requested_schema is not None:
    out_type = DataType._import_from_c_capsule(requested_schema)
    if self.type != out_type:
        chunked = self.cast(out_type)
```
@pitrou (Member) commented:

Same remark as in Table.__arrow_c_stream__.

@paleolimbot (Member, Author) replied:

Is this as much of a concern as with a Table? I can't think of any clean way to do this lazily on a per-chunk basis, although I'm happy to remove the feature if it's that bad of an idea.

A quick check suggests casting up front is faster, perhaps because it obliterates the chunks:

```python
import pyarrow as pa
import numpy as np

n = int(1e6)
n_chunks = 1000
per_chunk = n // n_chunks
chunks = [np.random.random(per_chunk) for i in range(n_chunks)]

chunked = pa.chunked_array(chunks)

def roundtrip_chunked():
    stream_capsule = chunked.__arrow_c_stream__()
    chunked._import_from_c_capsule(stream_capsule)

%timeit roundtrip_chunked()
#> 3.72 ms ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

def roundtrip_chunked_cast():
    # pa.float() is not a pyarrow type; pa.float32() is assumed here to make
    # the snippet runnable (the source data from np.random.random is float64)
    stream_capsule = chunked.cast(pa.float32()).__arrow_c_stream__()
    chunked._import_from_c_capsule(stream_capsule)

%timeit roundtrip_chunked_cast()
#> 1.52 ms ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

@pitrou (Member) replied:

Well, I gave a possible solution above.

Also, your benchmark is bizarre: where is the casting in roundtrip_chunked?

@paleolimbot (Member, Author) replied:

Nowhere! A roundtrip without casting is almost certainly faster than a roundtrip with casting, and it is already twice as slow as the cast + roundtrip. I'm sure there's room to make a better benchmark (I'm also running a C++ debug build), but I'm personally convinced that the cast + export solution is not so bad that it should not be attempted.

> Well, I gave a possible solution above.

For RecordBatches? I don't think we have a way to do that for a stream of Arrays in Arrow C++ or in pyarrow.

I'm happy to remove the feature as well and leave it to be implemented properly later; I didn't anticipate it being controversial.

@pitrou (Member) replied:

> A roundtrip without casting is almost certainly faster than a roundtrip with casting, and it is already twice as slow as the cast + roundtrip.

If it doesn't make sense then there's something wrong.

@pitrou (Member) added:

> I'm happy to remove the feature as well and leave it to be implemented properly later; I didn't anticipate it being controversial.

It's not controversial. The implementation is. I'm sure for simple benchmarks with a small dataset, an otherwise idle machine, and enough RAM to hold multiple copies of the dataset, casting everything at once can seem slightly faster because it saves some overhead. That doesn't make it a viable strategy in the general case.

@paleolimbot (Member, Author) replied:

Got it! I dropped the cast and we can circle back and do it right. I opened a PR to do it properly for the batch-wise RecordBatchReader export and that might serve as a template for how this should work, too.

```python
batch = make_batch()
requested_schema = pa.schema([('ints', pa.list_(pa.int64()))])
requested_capsule = requested_schema.__arrow_c_schema__()
# RecordBatch has no cast() method
```
@pitrou (Member) commented:

This should be fixed instead of working around it.
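For context, a minimal sketch of the workaround being discussed: since `RecordBatch` had no `cast()` method at the time, the cast can go through `Table`. The `cast_batch` helper name is hypothetical, not part of pyarrow:

```python
import pyarrow as pa

def cast_batch(batch: pa.RecordBatch, schema: pa.Schema) -> pa.RecordBatch:
    # Wrap the single batch in a Table (which does have cast()), cast it,
    # then collapse back to one RecordBatch.
    table = pa.Table.from_batches([batch]).cast(schema)
    return table.combine_chunks().to_batches()[0]
```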

@pitrou (Member) commented Feb 12, 2024:

@paleolimbot I think it would be nice to add a test_nanoarrow.py to exercise nanoarrow integration.
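A minimal sketch of what such a test might look like, reusing the nanoarrow calls from the PR description above; the file name, test body, and the `.length` assertion are assumptions, not the eventual implementation:

```python
# test_nanoarrow.py (sketch)
import nanoarrow as na
import pyarrow as pa

def test_chunked_array_stream():
    chunked = pa.chunked_array([pa.array([0, 1, 2]), pa.array([3, 4, 5])])
    # __arrow_c_stream__ lets nanoarrow consume the ChunkedArray as a
    # stream of arrays, one view per chunk
    views = [na.c_array_view(item) for item in na.c_array_stream(chunked)]
    assert len(views) == 2
    assert views[0].length == 3
```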

@paleolimbot (Member, Author) left a comment:

> Not strictly necessary, but it would be nicer (both for memory consumption and for latency) to cast each batch when required, rather than the whole table up front.

I removed table casting for now; I'm still game to try in #40066, maybe with a `CastingRecordBatchReader : public RecordBatchReader` that would solve the issue for streams, too (a rough sketch of the idea follows below).
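As a rough Python-level approximation of that idea (a sketch only; `casting_reader` is a hypothetical helper, and the per-batch cast goes through `Table` because `RecordBatch` has no `cast()`):

```python
import pyarrow as pa

def casting_reader(reader: pa.RecordBatchReader,
                   target_schema: pa.Schema) -> pa.RecordBatchReader:
    # Cast lazily, one batch at a time, so peak memory stays bounded by a
    # single batch rather than a fully materialized cast of the whole stream.
    def batches():
        for batch in reader:
            yield pa.Table.from_batches([batch]).cast(target_schema).to_batches()[0]

    return pa.RecordBatchReader.from_batches(target_schema, batches())
```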

> I think it would be nice to add a test_nanoarrow.py to exercise nanoarrow integration.

I opened #40065; probably best done when nanoarrow's Python is a little more stable (i.e., when there is a nanoarrow.Array and a nanoarrow.ArrayStream, neither of which currently exists).

@pitrou (Member) left a comment:

Thank you @paleolimbot !

@pitrou pitrou merged commit 91bf1c9 into apache:main Feb 14, 2024
12 of 13 checks passed
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 91bf1c9.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@paleolimbot paleolimbot deleted the python-chunked-array-dunder branch February 16, 2024 15:27
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024

thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024

Each of these commits quotes the PR description above and closes apache#39984 (Lead-authored-by: Dewey Dunnington <[email protected]>, Signed-off-by: Antoine Pitrou <[email protected]>).