Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add Expression.list.contains #1174

Closed
wants to merge 8 commits into from
Closed

[FEAT] Add Expression.list.contains #1174

wants to merge 8 commits into from

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Jul 19, 2023

Adds an Expression.list.contains function to check whether an element is contained within lists in another column

Related to: #993

@github-actions github-actions bot added the enhancement New feature or request label Jul 19, 2023
@codecov
Copy link

codecov bot commented Jul 19, 2023

Codecov Report

Merging #1174 (0ea30c7) into main (8fe5597) will increase coverage by 0.97%.
Report is 1 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 0ea30c7 differs from pull request most recent head 1b7ab89. Consider uploading reports for the commit 1b7ab89 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1174      +/-   ##
==========================================
+ Coverage   87.45%   88.43%   +0.97%     
==========================================
  Files          62       54       -8     
  Lines        5955     5558     -397     
==========================================
- Hits         5208     4915     -293     
+ Misses        747      643     -104     
Files Changed Coverage Δ
daft/expressions/expressions.py 91.64% <100.00%> (-0.20%) ⬇️

... and 34 files with indirect coverage changes

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 20, 2023
@jaychia jaychia requested a review from samster25 July 20, 2023 23:18
@@ -35,6 +36,31 @@ fn join_arrow_list_of_utf8s(
})
}

fn arrow_list_contains(
list_elements: impl Iterator<Item = Option<Box<dyn arrow2::array::Array>>>,
elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like I'm using Box<dyn arrow2::array::Array> as a bootleg "AnyValue" here.

Technically, a single-element Series could be used as a bootleg AnyValue too 😛

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lol

@@ -35,6 +36,31 @@ fn join_arrow_list_of_utf8s(
})
}

fn arrow_list_contains(
list_elements: impl Iterator<Item = Option<Box<dyn arrow2::array::Array>>>,
elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah lol

list_elements: impl Iterator<Item = Option<Box<dyn arrow2::array::Array>>>,
elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
) -> DaftResult<arrow2::array::BooleanArray> {
let mut contains_elements: Vec<Option<bool>> = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for intermediate memory here

elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
) -> DaftResult<arrow2::array::BooleanArray> {
let mut contains_elements: Vec<Option<bool>> = Vec::new();
for (arr, element) in list_elements.zip(elements) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can run this as a map and give an iterator then then create BooleanArray with from_iter: https://github.com/jorgecarleitao/arrow2/blob/main/src/array/boolean/from.rs#L12

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I was going to do that and forgot... Thanks!

Copy link
Contributor Author

@jaychia jaychia Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually, I'm unable to do this because the code that executes in the Iterator can potentially return a DaftError (we run a .equals in that code).

The iterator that returns thus can't be piped into BooleanArray. The only way I could find to make it work still requires intermediate memory:

let results: Result<Vec<_>> = list_elements.zip(elements).map(...some code that throws errors).collect();
Ok(BooleanArray::from(results?))

Is there a better way? I guess we could just panic in there instead of throwing an error.

Box::new(std::iter::repeat(elements.to_arrow()))
} else {
assert!(elements.len() == self.len());
Box::new((0..self.len()).map(|i| elements.inner.to_arrow().sliced(i, 1)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually want as_arrow: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/array/ops/as_arrow.rs#L21. to_arrow is for exporting through ffi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also once you call as_arrow you should be able to call .into_iter() on the concrete type https://github.com/jorgecarleitao/arrow2/blob/main/src/array/fixed_size_list/iterator.rs

Copy link
Contributor Author

@jaychia jaychia Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't do as_arrow because it is only defined over DataArray, and I can't get the DataArray for elements because I don't know the concrete type of elements.inner without doing daft_match... statement on the dtype.

Is there a better workaround here?

Copy link
Contributor Author

@jaychia jaychia Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could add an Series::as_arrow(), but that feels like it would be pretty confusing vs Series::to_arrow()?

Edit: Maybe we could do:

  1. Add Series::as_arrow() which delegates to self.inner.downcast().as_arrow() by matching over the Series' dtype
  2. Rename Series::to_arrow() to something like Series::export_arrow_for_ffi()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably go that route since that would be correct. you can do a with_match_physical after converting it to the physical array. and Since you know that self.child and elements are the same dtype, it should be just be 1 match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably go that route

Are you referring to my suggestion: " I can't get the DataArray for elements because I don't know the concrete type of elements.inner without doing daft_match... statement on the dtype."?

I just tried your suggestion, but am running into more problems

let elements = elements.as_physical()?;
with_match_physical_daft_types!(elements.data_type(), |$T| {
    let elements = elements.downcast::<$T>()?;
    elements.as_arrow() // <- Failing here because of ExtensionType, which seems to be a PhysicalType but I guess it doesn't map directly to an arrow2 concrete type array?
})

In general our codebase feels really hard to work with when we have more than one level of types and the nested types are unknown... Super gnarly.

Box::new(std::iter::repeat(elements.to_arrow()))
} else {
assert!(elements.len() == self.len());
Box::new((0..self.len()).map(|i| elements.inner.to_arrow().sliced(i, 1)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above: use as_arrow and user the into_iter method

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants