[FEAT] Support intersect as a DataFrame API #3134

Open
wants to merge 2 commits into
base: main

Conversation

advancedxy
Contributor

@advancedxy advancedxy commented Oct 28, 2024

This commit leverages the null-safe equality support in joins (see #3069 and #3161) to support the intersect API.

Partially fixes #3122.
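
For context on why intersect depends on null-safe equality: SQL INTERSECT treats two NULLs as the same value, whereas an ordinary `=` join condition does not. The following is a minimal plain-Python sketch of that semantics, written as a distinct semi join. The `intersect_rows` helper and the list-of-tuples representation are hypothetical and are not Daft's implementation.

```python
from typing import Hashable, List, Optional, Tuple

# A row is a tuple of values; None stands in for NULL (illustrative only).
Row = Tuple[Optional[Hashable], ...]


def intersect_rows(left: List[Row], right: List[Row]) -> List[Row]:
    """Return the distinct rows of `left` that also appear in `right`.

    Python compares None == None as True, so tuple equality here behaves
    like the null-safe equality that INTERSECT requires.
    """
    right_keys = set(right)  # build side of the semi join
    seen = set()
    result = []
    for row in left:  # probe side
        if row in right_keys and row not in seen:
            seen.add(row)  # INTERSECT (DISTINCT) drops duplicates
            result.append(row)
    return result


left = [(1, None), (1, None), (2, "a"), (3, "b")]
right = [(1, None), (2, "a"), (4, "c")]
print(intersect_rows(left, right))  # [(1, None), (2, 'a')]
```

The row `(1, None)` survives only because the two NULLs are considered equal; with ordinary join equality it would be dropped.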

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2024
StructArray::new(struct_field, values, None).into_series()
}
}
}

pub fn to_series(&self) -> Series {
Contributor Author

The previous to_series did not work with Struct: a struct carries its own field names, but to_series always generated a field named "literal".

#[cfg(feature = "python")]
DataType::Python => {
use pyo3::prelude::*;
Self::Python(PyObjectWrapper(Python::with_gil(|py| py.None())))
Contributor Author

Not sure whether None is the right choice for the Python type.


codspeed-hq bot commented Oct 28, 2024

CodSpeed Performance Report

Merging #3134 will not alter performance

Comparing advancedxy:intersect_operation (dcddfb5) with main (baca61e)

Summary

✅ 17 untouched benchmarks

@advancedxy
Contributor Author

@kevinzwang @universalmind303 I'd appreciate it if you could take a look at this.

@@ -133,6 +134,39 @@ def lit(value: object) -> Expression:
return Expression._from_pyexpr(lit_value)


def zero_lit(dt: DataType) -> Expression:
Contributor Author

@advancedxy advancedxy Oct 29, 2024

It might be useful to expose this to Python side as well.

I can remove this if it's not desired.

@kevinzwang kevinzwang self-requested a review October 29, 2024 17:04
Member

@kevinzwang kevinzwang left a comment

Hi @advancedxy, thank you for working on this! Really appreciate the work that you've done.

I don't have specific comments about the code yet, but from a cursory look at the PR, I don't see the need for the zero_lit expression. If you take a look at our join functions, we use arrow2's build_multi_array_is_equal to construct an equality check, and that function takes an argument nulls_equal.

let is_equal = build_multi_array_is_equal(
    lkeys.columns.as_slice(),
    rkeys.columns.as_slice(),
    false,
    false,
)?;

I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.

Could you give that a try?
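
To illustrate the suggestion above, here is a rough Python sketch of a multi-column equality check with a `nulls_equal` switch. It is only a model of the idea: `make_row_comparator` and the list-of-columns layout are hypothetical names, and this is not the signature or behaviour of the Rust function quoted above.

```python
from typing import Callable, Optional, Sequence

# One column is a sequence of values; None models NULL (illustrative only).
Column = Sequence[Optional[int]]


def make_row_comparator(
    left: Sequence[Column], right: Sequence[Column], nulls_equal: bool
) -> Callable[[int, int], bool]:
    """Build a function that compares row i of `left` with row j of `right`."""

    def is_equal(i: int, j: int) -> bool:
        for lcol, rcol in zip(left, right):
            a, b = lcol[i], rcol[j]
            if a is None or b is None:
                # Two nulls match only when nulls_equal is set;
                # a null never matches a non-null value.
                if not (nulls_equal and a is None and b is None):
                    return False
            elif a != b:
                return False
        return True

    return is_equal


lkeys = [[1, None]]  # a single key column with rows 1 and NULL
rkeys = [[1, None]]
strict = make_row_comparator(lkeys, rkeys, nulls_equal=False)
null_safe = make_row_comparator(lkeys, rkeys, nulls_equal=True)
print(strict(1, 1), null_safe(1, 1))  # False True
```

With the flag off, the NULL rows never match (ordinary join semantics); with it on, they do, which is the behaviour a null-safe join needs.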

@advancedxy
Contributor Author

> I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.

Hey @kevinzwang, that was exactly my first thought as well. When I created the original issue, I thought we could add null-safe equal joins first and then leverage them to support the intersect operation. However, passing the parameter from the Python side all the way down to Rust's physical join plan seemed likely to touch a lot of code. I looked at how another query engine (Spark) implements this and noticed that null-safe equality can effectively be rewritten as nvl + is_null, hence this PR. Rewriting null-safe equality this way might also open up more optimization opportunities, though that is not the case for Arrow/Daft right now. I'm OK with not introducing that expression, then.
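
One way to read the nvl + is_null rewrite mentioned here is sketched below in plain Python. The function names and the choice of `0` as the type-default "zero" value are assumptions made for illustration; they are not taken from Spark or Daft source code.

```python
from typing import Optional


def null_safe_eq(a: Optional[int], b: Optional[int]) -> bool:
    """Reference semantics: NULL equals NULL, NULL never equals a value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def rewritten_eq(a: Optional[int], b: Optional[int], zero: int = 0) -> bool:
    """The same predicate expressed with two ordinary equalities.

    nvl(x, zero) replaces NULL with a default value of the column's type;
    the extra is_null comparison stops a real `zero` from matching a NULL.
    """
    nvl_a = zero if a is None else a
    nvl_b = zero if b is None else b
    return nvl_a == nvl_b and (a is None) == (b is None)


# Exhaustively compare both forms on a small domain that includes
# NULL and the zero value itself.
values = [None, 0, 1]
assert all(
    null_safe_eq(a, b) == rewritten_eq(a, b) for a in values for b in values
)
```

Under this reading, a per-type default literal (such as the `zero_lit` expression discussed in this PR) would supply the `zero` argument.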

> Could you give that a try?

Of course. Taking a step back, I think we can add null-safe equality to joins on the Rust side first, then leverage that to support this PR's intersect operator, and finally finish the Python-side API. How does that sound to you?

@kevinzwang
Member

> Of course. Taking a step back, I think we can add null-safe equality to joins on the Rust side first, then leverage that to support this PR's intersect operator, and finally finish the Python-side API. How does that sound to you?

Yep, that sounds like a good plan. Thank you again for taking this on!

@advancedxy
Contributor Author

Hey @kevinzwang @universalmind303, PTAL at this after #3161 is merged. Thanks.

@advancedxy
Contributor Author

The CI failure seems unrelated. Closing and re-opening to trigger a new CI run.

@advancedxy
Contributor Author

Since #3161 is merged, I think this is ready for review.

Development

Successfully merging this pull request may close these issues.

SQL: Add INTERSECT support
3 participants