[FEAT] is_in expression #1811

colin-ho · 2024-01-23T01:34:30Z

Closes #993

The is_in expression checks whether the values of a series are contained in a given list of items, and produces a series of boolean values as the results of this membership test.

Changes:

Added a Literal Series so that Series can be passed into the expression
Added is_in expression and kernel
Added tests
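
For readers skimming the thread, here is a minimal plain-Python sketch of the membership test described above. It is not the Daft kernel: the name `is_in_sketch` is hypothetical, and null handling is deliberately left out.

```python
from typing import Any, List


def is_in_sketch(values: List[Any], items: List[Any]) -> List[bool]:
    """Return one boolean per value: True if that value appears in items."""
    # Hypothetical helper for illustration only; the real kernel is in Rust.
    return [value in items for value in values]


result = is_in_sketch([1, 2, 3, 4], [2, 4])
print(result)  # [False, True, False, True]
```

The output has the same length as the input values, which matches the "series of boolean values" wording in the description.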

codecov · 2024-01-23T03:44:15Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4bc8952) 84.73% compared to head (c725306) 84.26%.
Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
- Coverage   84.73%   84.26%   -0.47%     
==========================================
  Files          55       55              
  Lines        5659     5734      +75     
==========================================
+ Hits         4795     4832      +37     
- Misses        864      902      +38

Files	Coverage Δ
daft/expressions/expressions.py	`92.87% <100.00%> (+0.50%)`	⬆️
daft/series.py	`92.84% <100.00%> (+0.26%)`	⬆️
daft/table/table.py	`81.87% <100.00%> (-2.01%)`	⬇️
daft/utils.py	`87.75% <100.00%> (+1.70%)`	⬆️

... and 11 files with indirect coverage changes

jaychia

I'm not as familiar with the lit machinery, @samster25 should take a look there

jaychia · 2024-01-23T03:36:58Z

daft/expressions/expressions.py

@@ -387,6 +387,22 @@ def not_null(self) -> Expression:
        expr = self._expr.not_null()
        return Expression._from_pyexpr(expr)

+    def is_in(self, items: object) -> Expression:


We should have items be well-typed as a union of list[Any] | Series

jaychia · 2024-01-23T05:04:51Z

Cargo.lock

@@ -1224,6 +1224,7 @@ dependencies = [
 name = "daft-dsl"
 version = "0.2.0-dev0"
 dependencies = [
+ "arrow2",


Why is there a need for this dependency?

not needed anymore, removed

jaychia · 2024-01-23T05:07:33Z

src/daft-core/src/series/ops/is_in.rs

+        // match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>),
+        // attempt to cast self to the same datatype as items, then check if self is in items
+        match items.data_type() {
+            crate::datatypes::DataType::Null => self.is_null(),


nit: you can do a use crate::datatypes::DataType on the top of the file to make these references to DataType less verbose

jaychia · 2024-01-23T05:08:55Z

src/daft-core/src/series/ops/is_in.rs

+        }
+        // match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>),
+        // attempt to cast self to the same datatype as items, then check if self is in items
+        match items.data_type() {


Should this not be the other way around?

Matching on self.data_type()

Attempt to coerce the type of items into self.data_type() for comparison

This is because the data in self is probably always much "better typed", and the data in items is passed in by the user via the Python .is_in(...) API and will need to be coerced.

yep makes sense, i changed my logic to this: https://github.com/Eventual-Inc/Daft/pull/1811/files#diff-2330210aef0d266cd0fdcb54102535bd0bdcf08337fe0da49758239a28730532R18-R63, which follows the same typecasting logic that binary ops use: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/series/array_impl/binary_ops.rs#L115

samster25 · 2024-01-23T06:50:54Z

src/daft-dsl/src/lit.rs

@@ -57,6 +57,8 @@ pub enum LiteralValue {
    Date(i32),
    /// A 64-bit floating point number.
    Float64(f64),
+    /// A list
+    List(Vec<LiteralValue>),


Instead of having a Vec<LiteralValue>, we should just have a SeriesLiteral. This will allow you to avoid performing all the parsing and validation yourself. On the Python side you should just be able to do Series.from_* to support a variety of input types such as a list, series, arrow_array, etc.

awesome! just did this in latest commit

samster25 · 2024-01-23T06:52:01Z

src/daft-dsl/src/python.rs

@@ -71,6 +73,55 @@ pub fn lit(item: &PyAny) -> PyResult<PyExpr> {
    } else if let Ok(pybytes) = item.downcast::<PyBytes>() {
        let bytes = pybytes.as_bytes();
        Ok(crate::lit(bytes).into())
+    } else if let Ok(pylist) = item.downcast::<PyList>() {


See comment above about using a SeriesLiteral. You can then avoid all this parsing code, especially if you end up with some input like [False, 1, 1.0].
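
A small Python sketch of why an input like [False, 1, 1.0] is awkward for hand-written parsing. It only shows standard Python behavior; the helper name `naive_kind` is hypothetical and is not taken from the PR.

```python
mixed = [False, 1, 1.0]

# Three distinct Python types appear in a single list.
kinds = sorted({type(x).__name__ for x in mixed})
print(kinds)  # ['bool', 'float', 'int']


def naive_kind(value: object) -> str:
    """Hypothetical classifier that checks int before bool (a common slip)."""
    if isinstance(value, int):
        return "int"
    if isinstance(value, bool):
        return "bool"
    return type(value).__name__


# bool is a subclass of int, so False is misclassified as an int here.
print(naive_kind(False))  # int
```

Any per-element parser has to pick one target type for the whole list and get the check order right, which is the work a Series constructor would otherwise take on.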

samster25 · 2024-01-24T22:10:59Z

src/daft-core/src/array/ops/is_in.rs

+
+            fn is_in(&self, rhs: &$arr) -> Self::Output {
+                // collect to vec because floats don't implement Hash
+                let vec = rhs.as_arrow().iter().collect::<Vec<_>>();


You can use BTreeSet which has log(n) lookups and doesn't require Hash. However you may have to implement Ord on a struct FloatWrapper that wraps the f32/f64.

https://doc.rust-lang.org/std/collections/struct.BTreeSet.html

just realised that we have a hashable_float_wrapper (https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/utils/hashable_float_wrapper.rs). I could implement Eq and PartialEq for this wrapper and be able to use HashSet, or should we go the safer route with BTreeSet?

I would favor BTreeSet for this since floats don't hold the invariant that something that is equal will give the same hash value.
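
To make the float concern concrete, here is a hedged Python sketch. It assumes a hash built naively from the raw bit pattern (an assumption for illustration, not how any particular wrapper in Daft hashes), and it uses a sorted list with binary search as a rough analogue of an ordered set with log(n) lookups. The names `float_bits` and `sorted_contains` are hypothetical, and the lookup assumes the items contain no NaN.

```python
import bisect
import struct
from typing import List


def float_bits(x: float) -> bytes:
    """Raw IEEE 754 bytes of a 64-bit float."""
    return struct.pack(">d", x)


# 0.0 and -0.0 compare equal but have different bit patterns, so a hash
# derived naively from the bits would disagree with equality.
print(0.0 == -0.0)                          # True
print(float_bits(0.0) == float_bits(-0.0))  # False

# NaN is not equal to itself, which also complicates hashing and equality.
nan = float("nan")
print(nan == nan)  # False


def sorted_contains(sorted_items: List[float], value: float) -> bool:
    """Binary-search membership test on an already sorted list (no NaN)."""
    i = bisect.bisect_left(sorted_items, value)
    return i < len(sorted_items) and sorted_items[i] == value


items = sorted([3.5, 1.0, 2.25])
print([sorted_contains(items, v) for v in [1.0, 2.0, 3.5]])  # [True, False, True]
```

An ordered structure only needs a total order on the elements rather than a hash that agrees with equality, which is why a wrapper implementing Ord is the requirement on the Rust side.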

samster25 · 2024-01-24T22:13:09Z

daft/expressions/expressions.py

+        if not (isinstance(items, Expression) or isinstance(items, list)):
+            raise TypeError(f"expected a python list or Daft Expression, got {type(items)}")
+
+        if isinstance(items, list):


We should support all the types we see here

Daft/daft/table/table.py

Line 151 in 5814922

if isinstance(v, list):

You can probably factor the inner loop into a helper function that takes in an item and always returns a series.
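
A minimal sketch of such a helper, under stated assumptions: `Series` below is a hypothetical stand-in class rather than the real daft.Series, `item_to_series` is an invented name, and only the list and Series cases mentioned in this thread are handled.

```python
from typing import Any, List, Union


class Series:
    """Hypothetical stand-in for daft.Series, used only for this sketch."""

    def __init__(self, data: List[Any]) -> None:
        self.data = list(data)

    @classmethod
    def from_pylist(cls, data: List[Any]) -> "Series":
        return cls(data)


def item_to_series(item: Union[List[Any], Series]) -> Series:
    """Normalize a supported input into a Series, or raise TypeError."""
    if isinstance(item, Series):
        return item  # already a Series, pass it through unchanged
    if isinstance(item, list):
        return Series.from_pylist(item)
    raise TypeError(f"expected a python list or Series, got {type(item)}")


print(item_to_series([1, 2, 3]).data)  # [1, 2, 3]
```

Callers such as is_in could then accept any supported input and work with a single type internally.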

samster25 · 2024-01-24T22:13:43Z

daft/utils.py

+    left_pylist: list,
+    right_pylist: list,
+) -> list:
+    return [elem in right_pylist for elem in left_pylist]


Favor using a set

Will first try a set and fall back to a list if the objects are not hashable; implemented in the latest commit.
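
The set-then-list approach described in this reply can be sketched as follows. The function name `pylist_is_in_sketch` is hypothetical (the real name in daft/utils.py is not shown in the quoted diff), and this is an illustration rather than the merged code.

```python
from typing import Any, List


def pylist_is_in_sketch(left_pylist: List[Any], right_pylist: List[Any]) -> List[bool]:
    """Membership test that prefers a set and falls back to a list scan."""
    try:
        # Fast path: average O(1) lookups when every element is hashable.
        right_set = set(right_pylist)
        return [elem in right_set for elem in left_pylist]
    except TypeError:
        # Unhashable elements on either side (e.g. lists): use equality scans.
        return [elem in right_pylist for elem in left_pylist]


print(pylist_is_in_sketch([1, 2, 3], [2, 3]))    # [False, True, True]
print(pylist_is_in_sketch([[1], [2]], [[1]]))    # [True, False]
```

Keeping the comprehension inside the `try` matters: an unhashable element on the left raises TypeError during the set lookup, not during set construction.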

samster25 · 2024-01-24T22:13:53Z

src/daft-core/src/array/from.rs

@@ -144,6 +144,14 @@ impl From<(&str, &[u8])> for BinaryArray {
    }
 }

+impl From<(&str, &[&[u8]])> for BinaryArray {


Is this still used?

nope, thanks for the catch!

samster25 · 2024-01-24T22:14:56Z

src/daft-core/src/array/ops/is_in.rs

+            .iter()
+            .map(|x| hashset.contains(&x))
+            .collect::<Vec<_>>();
+        Ok(BooleanArray::from(($self.name(), result.as_slice())))


You can write a BooleanArray::from_iter which then doesn't need to materialize the Vec

Daft/src/daft-core/src/array/from_iter.rs

Line 9 in 5814922

pub fn from_iter(

samster25 · 2024-01-24T22:16:34Z

src/daft-core/src/array/ops/is_in.rs

+        let result = $self
+            .as_arrow()
+            .iter()
+            .map(|x| hashset.contains(&x))


what if x is None? maybe this should be x.map(|v| hashset.contains(&v))
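
A Python analogue of the suggested `x.map(...)` change, where a null input stays null instead of being looked up. The name `is_in_nullable` is hypothetical, and this thread does not state what the merged kernel returns for nulls, so treat the null-propagating behavior as an assumption.

```python
from typing import Any, List, Optional


def is_in_nullable(values: List[Optional[Any]], items: List[Any]) -> List[Optional[bool]]:
    """Membership test that propagates None rather than testing it."""
    lookup = set(items)
    return [None if v is None else v in lookup for v in values]


print(is_in_nullable([1, None, 3], [1, 2]))  # [True, None, False]
```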

samster25 · 2024-01-24T22:20:01Z

src/daft-core/src/series/ops/is_in.rs

+
+impl Series {
+    pub fn is_in(&self, items: &Self) -> DaftResult<Series> {
+        let default =


you should only create default if you actually want to return it. Otherwise you always have this allocation

samster25 · 2024-01-24T22:22:07Z

src/daft-core/src/series/ops/mod.rs

@@ -88,3 +89,36 @@ macro_rules! py_binary_op_utilfn {
 }
 #[cfg(feature = "python")]
 pub(super) use py_binary_op_utilfn;
+
+#[cfg(feature = "python")]
+macro_rules! py_membership_op_utilfn {


why does this have to be a macro? Can this be a function?

changed to function

samster25 · 2024-01-24T22:24:53Z

src/daft-dsl/src/lit.rs

@@ -108,6 +117,7 @@ impl Display for LiteralValue {
            Date(val) => write!(f, "{}", display_date32(*val)),
            Timestamp(val, tu, tz) => write!(f, "{}", display_timestamp(*val, tu, tz)),
            Float64(val) => write!(f, "{val:.1}"),
+            Series(series) => write!(f, "{}", series),


What does this repr look like in Python? I believe that this might be multiline and not look great in our expression repr.

yep you're right, i made a separate display function for series literals, which will end up looking like: lit([1, 2, 3]). I could also make it more detailed, like: lit(Series(int64): [1, 2, 3]). What do you think?

samster25 · 2024-01-25T19:10:19Z

src/daft-core/src/utils/display_table.rs

+            "[{}]",
+            (0..series.len())
+                .map(|i| series.str_value(i).unwrap())
+                .collect::<Vec<_>>()


i believe itertools has a join that works without collecting to a vec first
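
As a rough Python analogue of this suggestion, the single-line repr can be built by passing a generator to `join` instead of first building an explicit list. The function name `series_literal_repr` and the ", " separator are assumptions chosen to match the lit([1, 2, 3]) form mentioned earlier in the thread.

```python
from typing import Any, Iterable


def series_literal_repr(values: Iterable[Any]) -> str:
    """Single-line repr of a series literal, e.g. lit([1, 2, 3])."""
    # The generator is handed straight to join; no explicit list is built here.
    body = ", ".join(str(v) for v in values)
    return f"lit([{body}])"


print(series_literal_repr([1, 2, 3]))  # lit([1, 2, 3])
```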

samster25 · 2024-01-25T19:11:14Z

src/daft-core/src/utils/orderable_float_wrapper.rs

@@ -0,0 +1,43 @@
+use std::cmp::Ordering;
+
+pub struct FloatWrapper<T>(pub T);


We should consolidate these impls on the one we use for the hash_float_wrapper

samster25 · 2024-01-25T19:13:06Z

src/daft-core/src/utils/orderable_float_wrapper.rs

+    }
+}
+
+impl Ord for FloatWrapper<f32> {


you could use a macro for this double impl

colin-ho added 2 commits January 22, 2024 17:14

is_in initial impl

3dfdf16

some comments

c2484b6

github-actions bot added the enhancement New feature or request label Jan 23, 2024

make literal list a listarray with one elem

c986c1c

jaychia reviewed Jan 23, 2024


samster25 reviewed Jan 23, 2024


colin-ho added 3 commits January 23, 2024 16:51

changes

4e46fe5

macrooos

b88ff18

fix nulls

a05dd61

colin-ho requested review from samster25 and jaychia January 24, 2024 18:58

add test for doing is_in with another col

26cb43b

samster25 reviewed Jan 24, 2024


colin-ho added 2 commits January 25, 2024 10:01

address commnets

7950d44

add back

3004166

colin-ho requested a review from samster25 January 25, 2024 18:13

samster25 approved these changes Jan 25, 2024


colin-ho added 2 commits January 25, 2024 11:46

consolidate on hashable float wrapper

ef43b76

make codecov happy by adding test for unhashable class

c725306

colin-ho merged commit 21cb2b5 into main Jan 25, 2024
42 checks passed

colin-ho deleted the colin/is_in branch January 25, 2024 20:30
