[FEAT] Add Expression.list.contains #1174
Codecov Report
@@            Coverage Diff             @@
##             main    #1174      +/-   ##
==========================================
+ Coverage   87.45%   88.43%   +0.97%
==========================================
  Files          62       54       -8
  Lines        5955     5558     -397
==========================================
- Hits         5208     4915     -293
+ Misses        747      643     -104
@@ -35,6 +36,31 @@ fn join_arrow_list_of_utf8s(
    })
}

fn arrow_list_contains(
    list_elements: impl Iterator<Item = Option<Box<dyn arrow2::array::Array>>>,
    elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
Seems like I'm using Box<dyn arrow2::array::Array> as a bootleg "AnyValue" here.
Technically, a single-element Series could be used as a bootleg AnyValue too 😛
Yeah lol
    list_elements: impl Iterator<Item = Option<Box<dyn arrow2::array::Array>>>,
    elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
) -> DaftResult<arrow2::array::BooleanArray> {
    let mut contains_elements: Vec<Option<bool>> = Vec::new();
no need for intermediate memory here
    elements: impl Iterator<Item = Box<dyn arrow2::array::Array>>,
) -> DaftResult<arrow2::array::BooleanArray> {
    let mut contains_elements: Vec<Option<bool>> = Vec::new();
    for (arr, element) in list_elements.zip(elements) {
You can run this as a map to produce an iterator, then create the BooleanArray with from_iter: https://github.com/jorgecarleitao/arrow2/blob/main/src/array/boolean/from.rs#L12
Oh yes, I was going to do that and forgot... Thanks!
Hmm actually, I'm unable to do this because the code that executes in the iterator can potentially return a DaftError (we run a .equals in that code).
The resulting iterator yields Result values, so it can't be piped directly into BooleanArray. The only way I could find to make it work still requires intermediate memory:
let results: Result<Vec<_>> = list_elements.zip(elements).map(...some code that throws errors).collect();
Ok(BooleanArray::from(results?))
Is there a better way? I guess we could just panic in there instead of throwing an error.
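For reference, the collect-then-convert pattern described in this comment can be sketched with only the standard library. This is a minimal sketch and not the PR's code: `contains_all`, `contains`, and `ContainsError` are hypothetical names, and `Vec<i64>` stands in for the arrow2 arrays. Collecting into `Result<Vec<_>, _>` stops at the first error, but it still allocates the intermediate `Vec` that the review asked to avoid.

```rust
/// Hypothetical error type standing in for DaftError.
#[derive(Debug, PartialEq)]
pub struct ContainsError(pub String);

/// Hypothetical fallible comparison standing in for the `.equals` call.
/// It rejects negative needles purely so the error path can be exercised.
fn contains(list: &[i64], needle: i64) -> Result<bool, ContainsError> {
    if needle < 0 {
        return Err(ContainsError(format!("unsupported needle: {}", needle)));
    }
    Ok(list.contains(&needle))
}

/// Zips lists with needles; a `None` list yields a `None` result.
/// Collecting into `Result<Vec<_>, _>` short-circuits on the first `Err`.
pub fn contains_all(
    lists: impl Iterator<Item = Option<Vec<i64>>>,
    needles: impl Iterator<Item = i64>,
) -> Result<Vec<Option<bool>>, ContainsError> {
    lists
        .zip(needles)
        .map(|(list, needle)| match list {
            Some(list) => contains(&list, needle).map(Some),
            None => Ok(None),
        })
        .collect()
}
```

The returned `Vec<Option<bool>>` plays the role of `results?` in the snippet above; converting it into a boolean array is a separate step.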
    Box::new(std::iter::repeat(elements.to_arrow()))
} else {
    assert!(elements.len() == self.len());
    Box::new((0..self.len()).map(|i| elements.inner.to_arrow().sliced(i, 1)))
I think you actually want as_arrow: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/array/ops/as_arrow.rs#L21
to_arrow is for exporting through FFI.
Also, once you call as_arrow you should be able to call .into_iter() on the concrete type: https://github.com/jorgecarleitao/arrow2/blob/main/src/array/fixed_size_list/iterator.rs
I can't do as_arrow because it is only defined over DataArray, and I can't get the DataArray for elements because I don't know the concrete type of elements.inner without doing a daft_match... statement on the dtype.
Is there a better workaround here?
I guess I could add a Series::as_arrow(), but that feels like it would be pretty confusing vs Series::to_arrow()?
Edit: Maybe we could do:
- Add Series::as_arrow(), which delegates to self.inner.downcast().as_arrow() by matching over the Series' dtype
- Rename Series::to_arrow() to something like Series::export_arrow_for_ffi()
I would probably go that route since that would be correct. You can do a with_match_physical after converting it to the physical array, and since you know that self.child and elements are the same dtype, it should be just 1 match.
I would probably go that route
Are you referring to my suggestion: " I can't get the DataArray for elements because I don't know the concrete type of elements.inner without doing daft_match... statement on the dtype."?
I just tried your suggestion, but am running into more problems
let elements = elements.as_physical()?;
with_match_physical_daft_types!(elements.data_type(), |$T| {
let elements = elements.downcast::<$T>()?;
elements.as_arrow() // <- Failing here because of ExtensionType, which seems to be a PhysicalType but I guess it doesn't map directly to an arrow2 concrete type array?
})
In general our codebase feels really hard to work with when we have more than one level of types and the nested types are unknown... Super gnarly.
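To make the shape of the problem concrete, here is a std-only sketch of dispatching on a dtype tag to recover a concrete array type. Everything in it is hypothetical (`DType`, `ErasedArray`, `Int64Array`, `Utf8Array`, `ExtensionArray`, `concrete_len`) and none of it is Daft's or arrow2's API. It only illustrates why a single match works for types backed by a concrete array, and why a type with no concrete backing needs an explicit arm.

```rust
use std::any::Any;

/// Hypothetical dtype tag; `Extension` has no concrete array type in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType {
    Int64,
    Utf8,
    Extension,
}

/// Hypothetical type-erased array, standing in for a Series' inner value.
pub trait ErasedArray {
    fn dtype(&self) -> DType;
    fn as_any(&self) -> &dyn Any;
}

pub struct Int64Array(pub Vec<i64>);
pub struct Utf8Array(pub Vec<String>);
pub struct ExtensionArray;

impl ErasedArray for Int64Array {
    fn dtype(&self) -> DType { DType::Int64 }
    fn as_any(&self) -> &dyn Any { self }
}

impl ErasedArray for Utf8Array {
    fn dtype(&self) -> DType { DType::Utf8 }
    fn as_any(&self) -> &dyn Any { self }
}

impl ErasedArray for ExtensionArray {
    fn dtype(&self) -> DType { DType::Extension }
    fn as_any(&self) -> &dyn Any { self }
}

/// One match on the dtype recovers the concrete type via a downcast.
/// The extension arm returns an error because nothing concrete backs it.
pub fn concrete_len(arr: &dyn ErasedArray) -> Result<usize, String> {
    match arr.dtype() {
        DType::Int64 => arr
            .as_any()
            .downcast_ref::<Int64Array>()
            .map(|a| a.0.len())
            .ok_or_else(|| "dtype/array mismatch".to_string()),
        DType::Utf8 => arr
            .as_any()
            .downcast_ref::<Utf8Array>()
            .map(|a| a.0.len())
            .ok_or_else(|| "dtype/array mismatch".to_string()),
        DType::Extension => Err("no concrete array for extension types".to_string()),
    }
}
```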
    Box::new(std::iter::repeat(elements.to_arrow()))
} else {
    assert!(elements.len() == self.len());
    Box::new((0..self.len()).map(|i| elements.inner.to_arrow().sliced(i, 1)))
Same as above: use as_arrow and use the into_iter method.
0ea30c7 to 1b7ab89
Adds an Expression.list.contains function to check whether an element is contained within lists in another column.
Related to: #993
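As a rough illustration of the row-wise semantics, the following std-only sketch checks each row's list against that row's element. It is not the PR's implementation: `list_contains` here is a hypothetical free function over `Vec<i64>` rows, and two behaviours are assumptions rather than confirmed facts about Daft: a length-1 `elements` input is broadcast to every row (suggested by the `std::iter::repeat` branch in the reviewed diff), and a null list produces a null result.

```rust
/// For each row, reports whether that row's list contains the row's element.
/// `elements` must have length 1 (broadcast to every row) or match `lists`.
/// A `None` list produces a `None` result (an assumption of this sketch).
pub fn list_contains(lists: &[Option<Vec<i64>>], elements: &[i64]) -> Vec<Option<bool>> {
    // Box the two iterator shapes behind one trait object type.
    let elems: Box<dyn Iterator<Item = i64> + '_> = if elements.len() == 1 {
        Box::new(std::iter::repeat(elements[0]))
    } else {
        assert_eq!(elements.len(), lists.len());
        Box::new(elements.iter().copied())
    };
    lists
        .iter()
        .zip(elems)
        .map(|(list, e)| list.as_ref().map(|l| l.contains(&e)))
        .collect()
}
```

Zipping against the finite `lists` iterator bounds the infinite `repeat` iterator, so the broadcast branch yields exactly one result per row.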