Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] is_in expression #1811

Merged
merged 11 commits into from
Jan 25, 2024
Merged

[FEAT] is_in expression #1811

merged 11 commits into from
Jan 25, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Jan 23, 2024

Closes #993

The is_in expression checks whether the values of a series are contained in a given list of items, and produces a series of boolean values as the results of this membership test.

Changes:

  • Added a Literal Series so that Series can be passed into the expression
  • Added is_in expression and kernel
  • Added tests

@github-actions github-actions bot added the enhancement New feature or request label Jan 23, 2024
Copy link

codecov bot commented Jan 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4bc8952) 84.73% compared to head (c725306) 84.26%.
Report is 5 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1811      +/-   ##
==========================================
- Coverage   84.73%   84.26%   -0.47%     
==========================================
  Files          55       55              
  Lines        5659     5734      +75     
==========================================
+ Hits         4795     4832      +37     
- Misses        864      902      +38     
Files Coverage Δ
daft/expressions/expressions.py 92.87% <100.00%> (+0.50%) ⬆️
daft/series.py 92.84% <100.00%> (+0.26%) ⬆️
daft/table/table.py 81.87% <100.00%> (-2.01%) ⬇️
daft/utils.py 87.75% <100.00%> (+1.70%) ⬆️

... and 11 files with indirect coverage changes

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not as familiar with the lit machinery, @samster25 should take a look there

@@ -387,6 +387,22 @@ def not_null(self) -> Expression:
expr = self._expr.not_null()
return Expression._from_pyexpr(expr)

def is_in(self, items: object) -> Expression:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have items be well-typed as a union of list[Any] | Series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Cargo.lock Outdated
@@ -1224,6 +1224,7 @@ dependencies = [
name = "daft-dsl"
version = "0.2.0-dev0"
dependencies = [
"arrow2",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a need for this dependency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed anymore, removed

// match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>),
// attept to cast self to the same datatype as items, then check if self is in items
match items.data_type() {
crate::datatypes::DataType::Null => self.is_null(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can do a use crate::datatypes::DataType on the top of the file to make these references to DataType less verbose

}
// match items.data_type() with the datatypes of LiteralValue because items is a List(Vec<LiteralValue>),
// attept to cast self to the same datatype as items, then check if self is in items
match items.data_type() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be the other way around?

  1. Matching on self.data_type()
  2. Attempt to coerce the type of items into self.data_type() for comparison

This is because the data in self is probably always much "better typed", and the data in items is passed in by the user via the Python .is_in(...) API and will need to be coerced.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -57,6 +57,8 @@ pub enum LiteralValue {
Date(i32),
/// A 64-bit floating point number.
Float64(f64),
/// A list
List(Vec<LiteralValue>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a Vec<LiteralValue>, We should instead just have a SeriesLiteral. This will allow you to avoid perform all the parsing and validation yourself. In the python side you should just be able to do Series.from_* to support a variety of input types such as a list, series, arrow_array, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome! just did this in latest commit

@@ -71,6 +73,55 @@ pub fn lit(item: &PyAny) -> PyResult<PyExpr> {
} else if let Ok(pybytes) = item.downcast::<PyBytes>() {
let bytes = pybytes.as_bytes();
Ok(crate::lit(bytes).into())
} else if let Ok(pylist) = item.downcast::<PyList>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above about using a SeriesLiteral. You can then avoid all this parsing code. Espically if you end up with some input like [False, 1, 1.0]


fn is_in(&self, rhs: &$arr) -> Self::Output {
// collect to vec because floats don't implement Hash
let vec = rhs.as_arrow().iter().collect::<Vec<_>>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use BTreeSet which has log(n) lookups and doesn't require Hash. However you may have to implement Ord on a struct FloatWrapper that wraps the f32/f64.

https://doc.rust-lang.org/std/collections/struct.BTreeSet.html

Copy link
Contributor Author

@colin-ho colin-ho Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just realised that we have a hashable_float_wrapper: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-core/src/utils/hashable_float_wrapper.rs , I could implement Eq and PartialEq for this wrapper and be able to use HashSet, or should we go the safer route with BTreeSet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would favor BTreeSet for this since floats don't hold the invariant that something that is equal will give the same hash value.

if not (isinstance(items, Expression) or isinstance(items, list)):
raise TypeError(f"expected a python list or Daft Expression, got {type(items)}")

if isinstance(items, list):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should support all the types we see here

if isinstance(v, list):

You can probably factor the inner loop into a helper function that takes in an item and always returns a series.

daft/utils.py Outdated
left_pylist: list,
right_pylist: list,
) -> list:
return [elem in right_pylist for elem in left_pylist]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Favor using a set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will first try set and fallback to list if the objects are not hashable, implemented in latest commit

@@ -144,6 +144,14 @@ impl From<(&str, &[u8])> for BinaryArray {
}
}

impl From<(&str, &[&[u8]])> for BinaryArray {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, thanks for the catch!

.iter()
.map(|x| hashset.contains(&x))
.collect::<Vec<_>>();
Ok(BooleanArray::from(($self.name(), result.as_slice())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can write a BooleanArray::from_iter which then doesn't need to materialize the Vec

let result = $self
.as_arrow()
.iter()
.map(|x| hashset.contains(&x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if x is None? maybe this should be x.map(|v| hashset.contains(&v))


impl Series {
pub fn is_in(&self, items: &Self) -> DaftResult<Series> {
let default =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should only create default if you actually want to return it. Otherwise you always have this allocation

@@ -88,3 +89,36 @@ macro_rules! py_binary_op_utilfn {
}
#[cfg(feature = "python")]
pub(super) use py_binary_op_utilfn;

#[cfg(feature = "python")]
macro_rules! py_membership_op_utilfn {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this have to be a macro? Can this be a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to function

@@ -108,6 +117,7 @@ impl Display for LiteralValue {
Date(val) => write!(f, "{}", display_date32(*val)),
Timestamp(val, tu, tz) => write!(f, "{}", display_timestamp(*val, tu, tz)),
Float64(val) => write!(f, "{val:.1}"),
Series(series) => write!(f, "{}", series),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this repr look like in python? I believe that this might be multiline and not look great in our expression repr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep you're right, i made a separate display function for series literals, which will end up looking like: lit([1, 2, 3]). I could also make it more detailed, like: `lit(Series(int64): [1, 2, 3]), what do you think?

"[{}]",
(0..series.len())
.map(|i| series.str_value(i).unwrap())
.collect::<Vec<_>>()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe itertools has a join that works without collecting to a vec first

@@ -0,0 +1,43 @@
use std::cmp::Ordering;

pub struct FloatWrapper<T>(pub T);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consolidate these impls on the one we use for the hash_float_wrapper

}
}

impl Ord for FloatWrapper<f32> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use a macro for this double impl

@colin-ho colin-ho merged commit 21cb2b5 into main Jan 25, 2024
42 checks passed
@colin-ho colin-ho deleted the colin/is_in branch January 25, 2024 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an "in" filter predicate
3 participants