Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an "in" filter predicate #993

Closed
jaychia opened this issue Jun 5, 2023 · 4 comments · Fixed by #1811
Closed

Add an "in" filter predicate #993

jaychia opened this issue Jun 5, 2023 · 4 comments · Fixed by #1811
Assignees
Labels
enhancement New feature or request expression A new expression

Comments

@jaychia
Copy link
Contributor

jaychia commented Jun 5, 2023

Is your feature request related to a problem? Please describe.

We should have an "in" filter predicate:

df.where(df["a"] in [1, 2, 3])
df.where(df["a"] in col("b"))

Currently, the fallback to this is to perform an inner join with another dataframe, but this is unnecessarily slow and requires a shuffle.

@jaychia jaychia added the enhancement New feature or request label Jun 5, 2023
@Subhasish-Behera
Copy link

@jaychia I would like to work on this issue as my first issue. Can you point to the files where I need to implement this?

@jaychia
Copy link
Contributor Author

jaychia commented Jul 17, 2023

Hi @Subhasish-Behera!

You can look at the implementation of Expression.__eq__ for some tips. Essentially, we are trying to add an Expression.__contains__ here, which is similar: Compare the left-hand-side expression with the right-hand-side expression (using the in predicate instead of the == predicate) and return a boolean Expression.

Here are the files you'll want to look at and I linked the implementations for == where applicable:

  1. Schema-resolution path: when a user does df["a"] in [1, 2, 3], this creates a new Expression object, which is a Python object that wraps a Rust object. Assigning df.with_column(df["a"] in [1, 2, 3]) triggers schema resolution.
    1. expressions.py Python class for expressions which is the user-facing API
    2. src/dsl/daft-dsl/src/python.rs Rust to Python bindings which binds Rust code to (1)
    3. DataType resolution logic -- this is where we define how datatypes get resolved (in this case, (List[T] | FixedSizeList[T]) contains T -> bool)
    4. Expression resolution logic this is where we trigger the type resolution from an expression
  2. Series/Array (when the dataframe is executed, this is the actual code that is run)
    1. DaftCompare trait -- we need to add a new DaftContains trait here, similar to DaftCompare. Then we can use this trait in places where we've used DaftCompare.
    2. Table evaluation -- this triggers the Series code during evaluation
    3. Series/Array code implementations of DaftContains for each concrete type of series. This is what gets triggered from the Table. Note that only the nested types (List and FixedSizeList) will implement DaftContains.

Feel free also to chat with us on our Slack if you have any other questions!

@jaychia
Copy link
Contributor Author

jaychia commented Jul 19, 2023

Hi @Subhasish-Behera

I actually just did #1174 which implements this functionality for List and FixedSizeList types. This should help this Issue quite a bit, as we now have implementations for both Strings and Lists.

However, we still need a way to enable the Python in syntax. Doing so would require overriding the __contains__ magicmethod on an Expression, and perhaps some refactors in the way we trigger the expression/have our code structured such that this would work for both string and list types.

I.e. it would be nice to go from:

(before)

df["x"].list.contains(1)
df["y"].str.contains("foo")

To

(after)

1 in df["x"]
"foo" in df["y"]

@jaychia jaychia added the expression A new expression label Sep 25, 2023
@colin-ho colin-ho self-assigned this Jan 17, 2024
@colin-ho
Copy link
Contributor

Tried it out and I don't think overriding the__contains__ magic method will work, as the in operator returns a boolean, so it will try to resolve the return value from __contains__ (which should be an Expression object) as a boolean.

See: https://stackoverflow.com/questions/53351829/is-it-possible-for-contains-to-return-non-boolean-value and https://docs.python.org/3.7/reference/expressions.html#membership-test-operations

colin-ho added a commit that referenced this issue Jan 25, 2024
Closes #993 

The `is_in` expression checks whether the values of a series are
contained in a given list of items, and produces a series of boolean
values as the results of this membership test.

Changes:
- Added a Literal Series so that Series can be passed into the
expression
- Added `is_in` expression and kernel
- Added tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request expression A new expression
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants