-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arrow] Support consuming an "arrow_array_stream" PyCapsule #13386
Conversation
… still valid before initiating the scan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Tishj, thanks for picking up this improvement. It is looking good and I've added some minor comments.
table_function.function = make_uniq<FunctionExpression>("arrow_scan", std::move(children)); | ||
if (type == PyArrowObjectType::PyCapsule) { | ||
// Disable projection+filter pushdown | ||
table_function.function = make_uniq<FunctionExpression>("arrow_scan_dumb", std::move(children)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe arrow_no_pushdown
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"arrow_scan_dumb" already exists, I believe we can enable projection pushdown, but I didn't want to add the overhead of testing that, that sounds more like a future optimization if this functionality becomes popular/used a lot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've seen it, it still is not a great name name since it conveys very little of the difference between a regular arrow_scan
and a "dumb" one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I agree, but it's used by duckdb-java, see the initial PR and the JDBC Client source code
I'm hesitant to rename something that's used by third party clients
…le interface without pyarrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks! |
Merge pull request duckdb/duckdb#13383 from Tmonster/fix_vector_sizes_nightly_failure Merge pull request duckdb/duckdb#13386 from Tishj/pycapsule_test
This PR implements #10716
This allows consuming arrow array streams from PyCapsules, which do not necessarily have to be created by
pyarrow
.For regular arrow scans we use
pyarrow
to perform filter pushdown on the stream, constructing a scanner to do so.When scanning from a PyCapsule we disable this so no interaction with
pyarrow
is required.