[Arrow] Support producing an "arrow_array_stream" PyCapsule #13418

Merged on Aug 15, 2024 (4 commits)

Conversation

@Tishj Tishj commented Aug 14, 2024

This PR implements #10716

Through DuckDBPyRelation.__arrow_c_stream__ we can now produce an arrow_array_stream PyCapsule.

Some things to note:
The ArrowArrayStream contains a QueryResult. If this is a StreamQueryResult and a new query is executed before the stream has been fully exhausted, the result is invalidated and no further chunks can be fetched from the stream.

This currently produces a materialized result, meaning it is independent of the connection and is not affected by running other queries. The trade-off is that it does not support larger-than-memory result sets.

@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 14, 2024 14:44
@Tishj Tishj marked this pull request as ready for review August 14, 2024 18:56
@duckdb-draftbot duckdb-draftbot marked this pull request as draft August 15, 2024 09:26
@Tishj Tishj marked this pull request as ready for review August 15, 2024 12:14
@Mytherin Mytherin merged commit c6ab646 into duckdb:main Aug 15, 2024
17 checks passed
@Mytherin

Thanks!

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Aug 15, 2024
Merge pull request duckdb/duckdb#13433 from lnkuiper/jemalloc_32bit
Merge pull request duckdb/duckdb#13418 from Tishj/produce_arrow_pycapsule

https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html
)";
m.def("__arrow_c_stream__", &DuckDBPyRelation::ToArrowCapsule, capsule_docs);


Not entirely familiar with the syntax here, but my reading is that this method has no keyword arguments?

It would be good to add the requested_schema keyword, even if you simply ignore it for now (which is fine, because the spec states that handling of this keyword is "best effort" anyway). Without that keyword, consumers that pass it will get an error (pyarrow.table(..), for example, always passes it, even when it is None).


Indeed, this does fail for pyarrow.table:

import duckdb
import pyarrow as pa
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})
con = duckdb.connect()
sql = "SELECT * from df"
query = con.query(sql)

test = pa.table(query)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /Users/kyle/tmp/duckdb/tmp3.py:1
----> 1 test = pa.table(query)

File ~/tmp/duckdb/.venv/lib/python3.11/site-packages/pyarrow/table.pxi:6009, in pyarrow.lib.table()

TypeError: __arrow_c_stream__(): incompatible function arguments. The following argument types are supported:
    1. (self: duckdb.duckdb.DuckDBPyRelation) -> object

Invoked with: ┌───────┬─────────┐
│   a   │    b    │
│ int64 │ varchar │
├───────┼─────────┤
│     1 │ a       │
│     2 │ b       │
│     3 │ c       │
│     4 │ d       │
└───────┴─────────┘
, None
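
The failure above can be reproduced without DuckDB or pyarrow. The following is a minimal sketch, not the actual binding code: both classes and the `consume` function are hypothetical stand-ins, and the string `"capsule"` is a placeholder for the PyCapsule a real producer would return. It assumes, based on the `Invoked with: ..., None` line in the traceback, that the consumer passes `requested_schema` as an argument even when it is `None`.

```python
class RelationWithoutKeyword:
    """Hypothetical stand-in for a producer whose method takes no extra argument."""

    def __arrow_c_stream__(self):
        # Placeholder return value; a real producer returns a PyCapsule.
        return "capsule"


class RelationWithKeyword:
    """Hypothetical stand-in that accepts requested_schema and ignores it."""

    def __arrow_c_stream__(self, requested_schema=None):
        # Ignoring requested_schema is allowed: the spec treats it as best effort.
        return "capsule"


def consume(obj):
    """Mimic a consumer that always passes requested_schema, even when it is None."""
    return obj.__arrow_c_stream__(None)


def try_consume(obj):
    """Return the capsule placeholder, or a description of the TypeError raised."""
    try:
        return consume(obj)
    except TypeError as exc:
        return f"TypeError: {exc}"


result_without = try_consume(RelationWithoutKeyword())
result_with = try_consume(RelationWithKeyword())
print(result_without)  # a TypeError message about unexpected arguments
print(result_with)
```

The only difference between the two stand-ins is the extra parameter with a `None` default, which is why accepting and ignoring the keyword is enough to satisfy such consumers.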


@@ -412,6 +412,7 @@ class DuckDBPyRelation:
def list(self, column: str, groups: str = ..., window_spec: str = ..., projected_columns: str = ...) -> DuckDBPyRelation: ...

def arrow(self, batch_size: int = ...) -> pyarrow.lib.Table: ...
def __arrow_c_stream__(self) -> object: ...


Along with @jorisvandenbossche's comment, this should be updated to match the spec https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...
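
Until a producer exposes the signature above, a consumer-side workaround is possible. The sketch below is an illustration under stated assumptions, not part of DuckDB or pyarrow: `StreamAdapter`, `LegacyProducer`, and `accepts_requested_schema` are hypothetical names, and `"capsule"` again stands in for a real PyCapsule. The adapter adds the `requested_schema` parameter and ignores it, and the helper uses `inspect.signature` to check whether a method already accepts it.

```python
import inspect


class LegacyProducer:
    """Hypothetical producer whose __arrow_c_stream__ takes no requested_schema."""

    def __arrow_c_stream__(self):
        # Placeholder return value; a real producer returns a PyCapsule.
        return "capsule"


class StreamAdapter:
    """Hypothetical wrapper that adds requested_schema to a producer lacking it."""

    def __init__(self, producer):
        self._producer = producer

    def __arrow_c_stream__(self, requested_schema=None):
        # requested_schema is accepted for compatibility and then ignored.
        return self._producer.__arrow_c_stream__()


def accepts_requested_schema(obj) -> bool:
    """Report whether obj.__arrow_c_stream__ declares a requested_schema parameter."""
    try:
        params = inspect.signature(obj.__arrow_c_stream__).parameters
    except (TypeError, ValueError):
        # Some extension-module methods expose no introspectable signature.
        return False
    return "requested_schema" in params


legacy = LegacyProducer()
adapted = StreamAdapter(legacy)
print(accepts_requested_schema(legacy))
print(accepts_requested_schema(adapted))
```

Signature introspection may not work for methods defined in compiled extension modules, so the helper treats a failed lookup as "not supported" rather than raising.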

5 participants