-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] is_in expression #1811
[FEAT] is_in expression #1811
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1811 +/- ##
==========================================
- Coverage 84.73% 84.26% -0.47%
==========================================
Files 55 55
Lines 5659 5734 +75
==========================================
+ Hits 4795 4832 +37
- Misses 864 902 +38
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not as familiar with the lit
machinery, @samster25 should take a look there
daft/expressions/expressions.py
Outdated
@@ -387,6 +387,22 @@ def not_null(self) -> Expression: | |||
expr = self._expr.not_null() | |||
return Expression._from_pyexpr(expr) | |||
|
|||
def is_in(self, items: object) -> Expression: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have items
be well-typed as a union of list[Any] | Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Cargo.lock
Outdated
@@ -1224,6 +1224,7 @@ dependencies = [ | |||
name = "daft-dsl" | |||
version = "0.2.0-dev0" | |||
dependencies = [ | |||
"arrow2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a need for this dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed anymore, removed
// match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>), | ||
// attept to cast self to the same datatype as items, then check if self is in items | ||
match items.data_type() { | ||
crate::datatypes::DataType::Null => self.is_null(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you can do a use crate::datatypes::DataType
on the top of the file to make these references to DataType
less verbose
} | ||
// match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>), | ||
// attept to cast self to the same datatype as items, then check if self is in items | ||
match items.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be the other way around?
- Matching on
self.data_type()
- Attempt to coerce the type of
items
intoself.data_type()
for comparison
This is because the data in self
is probably always much "better typed", and the data in items
is passed in by the user via the Python .is_in(...)
API and will need to be coerced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep makes sense, i changed my logic to this: https://github.com/Eventual-Inc/Daft/pull/1811/files#diff-2330210aef0d266cd0fdcb54102535bd0bdcf08337fe0da49758239a28730532R18-R63, which follows the same typecasting logic that binary ops use: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/series/array_impl/binary_ops.rs#L115
src/daft-dsl/src/lit.rs
Outdated
@@ -57,6 +57,8 @@ pub enum LiteralValue { | |||
Date(i32), | |||
/// A 64-bit floating point number. | |||
Float64(f64), | |||
/// A list | |||
List(Vec<LiteralValue>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a Vec<LiteralValue>
, We should instead just have a SeriesLiteral. This will allow you to avoid perform all the parsing and validation yourself. In the python side you should just be able to do Series.from_*
to support a variety of input types such as a list, series, arrow_array, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome! just did this in latest commit
src/daft-dsl/src/python.rs
Outdated
@@ -71,6 +73,55 @@ pub fn lit(item: &PyAny) -> PyResult<PyExpr> { | |||
} else if let Ok(pybytes) = item.downcast::<PyBytes>() { | |||
let bytes = pybytes.as_bytes(); | |||
Ok(crate::lit(bytes).into()) | |||
} else if let Ok(pylist) = item.downcast::<PyList>() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above about using a SeriesLiteral. You can then avoid all this parsing code. Espically if you end up with some input like [False, 1, 1.0]
src/daft-core/src/array/ops/is_in.rs
Outdated
|
||
fn is_in(&self, rhs: &$arr) -> Self::Output { | ||
// collect to vec because floats don't implement Hash | ||
let vec = rhs.as_arrow().iter().collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use BTreeSet
which has log(n) lookups and doesn't require Hash. However you may have to implement Ord
on a struct FloatWrapper
that wraps the f32/f64.
https://doc.rust-lang.org/std/collections/struct.BTreeSet.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just realised that we have a hashable_float_wrapper: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/utils/hashable_float_wrapper.rs , I could implement Eq
and PartialEq
for this wrapper and be able to use HashSet, or should we go the safer route with BTreeSet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would favor BTreeSet for this since floats don't hold the invariant that something that is equal will give the same hash value.
daft/expressions/expressions.py
Outdated
if not (isinstance(items, Expression) or isinstance(items, list)): | ||
raise TypeError(f"expected a python list or Daft Expression, got {type(items)}") | ||
|
||
if isinstance(items, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should support all the types we see here
Line 151 in 5814922
if isinstance(v, list): |
You can probably factor the inner loop into a helper function that takes in an item and always returns a series.
daft/utils.py
Outdated
left_pylist: list, | ||
right_pylist: list, | ||
) -> list: | ||
return [elem in right_pylist for elem in left_pylist] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Favor using a set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will first try set and fallback to list if the objects are not hashable, implemented in latest commit
src/daft-core/src/array/from.rs
Outdated
@@ -144,6 +144,14 @@ impl From<(&str, &[u8])> for BinaryArray { | |||
} | |||
} | |||
|
|||
impl From<(&str, &[&[u8]])> for BinaryArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, thanks for the catch!
src/daft-core/src/array/ops/is_in.rs
Outdated
.iter() | ||
.map(|x| hashset.contains(&x)) | ||
.collect::<Vec<_>>(); | ||
Ok(BooleanArray::from(($self.name(), result.as_slice()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write a BooleanArray::from_iter
which then doesn't need to materialize the Vec
pub fn from_iter( |
src/daft-core/src/array/ops/is_in.rs
Outdated
let result = $self | ||
.as_arrow() | ||
.iter() | ||
.map(|x| hashset.contains(&x)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if x is None
? maybe this should be x.map(|v| hashset.contains(&v))
|
||
impl Series { | ||
pub fn is_in(&self, items: &Self) -> DaftResult<Series> { | ||
let default = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should only create default
if you actually want to return it. Otherwise you always have this allocation
src/daft-core/src/series/ops/mod.rs
Outdated
@@ -88,3 +89,36 @@ macro_rules! py_binary_op_utilfn { | |||
} | |||
#[cfg(feature = "python")] | |||
pub(super) use py_binary_op_utilfn; | |||
|
|||
#[cfg(feature = "python")] | |||
macro_rules! py_membership_op_utilfn { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this have to be a macro? Can this be a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to function
src/daft-dsl/src/lit.rs
Outdated
@@ -108,6 +117,7 @@ impl Display for LiteralValue { | |||
Date(val) => write!(f, "{}", display_date32(*val)), | |||
Timestamp(val, tu, tz) => write!(f, "{}", display_timestamp(*val, tu, tz)), | |||
Float64(val) => write!(f, "{val:.1}"), | |||
Series(series) => write!(f, "{}", series), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this repr look like in python? I believe that this might be multiline and not look great in our expression repr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep you're right, i made a separate display function for series literals, which will end up looking like: lit([1, 2, 3])
. I could also make it more detailed, like: `lit(Series(int64): [1, 2, 3]), what do you think?
"[{}]", | ||
(0..series.len()) | ||
.map(|i| series.str_value(i).unwrap()) | ||
.collect::<Vec<_>>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe itertools has a join
that works without collecting to a vec first
@@ -0,0 +1,43 @@ | |||
use std::cmp::Ordering; | |||
|
|||
pub struct FloatWrapper<T>(pub T); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consolidate these impls on the one we use for the hash_float_wrapper
} | ||
} | ||
|
||
impl Ord for FloatWrapper<f32> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could use a macro for this double impl
Closes #993
The
is_in
expression checks whether the values of a series are contained in a given list of items, and produces a series of boolean values as the results of this membership test.Changes:
is_in
expression and kernel