-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Support intersect as a DataFrame API #3134
base: main
Are you sure you want to change the base?
Conversation
src/daft-dsl/src/lit.rs
Outdated
StructArray::new(struct_field, values, None).into_series() | ||
} | ||
} | ||
} | ||
|
||
pub fn to_series(&self) -> Series { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous to_series
doesn't work with Struct, as struct has its own field names but to_series
always generate field with "literal".
src/daft-dsl/src/lit.rs
Outdated
#[cfg(feature = "python")] | ||
DataType::Python => { | ||
use pyo3::prelude::*; | ||
Self::Python(PyObjectWrapper(Python::with_gil(|py| py.None()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether None
is the right choice for Python type.
CodSpeed Performance ReportMerging #3134 will not alter performanceComparing Summary
|
@kevinzwang @universalmind303 appreciated if you can take a look at this. |
51da893
to
f0ed52b
Compare
daft/expressions/expressions.py
Outdated
@@ -133,6 +134,39 @@ def lit(value: object) -> Expression: | |||
return Expression._from_pyexpr(lit_value) | |||
|
|||
|
|||
def zero_lit(dt: DataType) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to expose this to Python side as well.
I can remove this if it's not desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @advancedxy, thank you for working on this! Really appreciate the work that you've done.
I don't have specific comments about the code yet, but from a cursory look at the PR, I don't see the need for the zero_lit
expression. If you take a look at our join functions, we use arrow2's build_multi_array_is_equal
to construct an equality check, and that function takes an argument nulls_equal
.
Daft/src/daft-table/src/ops/joins/hash_join.rs
Lines 55 to 60 in 975c09e
let is_equal = build_multi_array_is_equal( | |
lkeys.columns.as_slice(), | |
rkeys.columns.as_slice(), | |
false, | |
false, | |
)?; |
I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.
Could you give that a try?
Hey @kevinzwang kevin, that was exactly my first thought as well. When I created the original issue, I think we can add null equal safe joins first and then leverage that to support the intersect operation. However, passing the parameter from the python side all the way down to the Rust's physical join plan, it seems it might touch a lot of code and I referenced other(a.k.a Spark) query engine's implementation and noticed that null safe equality could be effective rewrote as
Of course, and taking a step back from here, I think we can add null safe equal in joins in the Rust side first, then leverage that to support this PR's intersection operator and then finish Python side's API. How does that sound to you? |
Yep, that sounds like a good plan. Thank you again for taking this on! |
692704e
to
31af0e7
Compare
Hey @kevinzwang @universalmind303 PTAL at this after #3161 is merged, thanks. |
31af0e7
to
8b9fda5
Compare
8b9fda5
to
82e545d
Compare
The CI failure seems unrelated. Close and re-open to trigger a new CI run. |
82e545d
to
dcddfb5
Compare
Since #3161 is merged, I think this is ready for review. |
This commit leverages null safe equal support in joins(see #3069 and #3161) to support intersect API.
Partially fixes #3122.