Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize queries that contain IN subqueries by rewriting as semi join #488

Closed
Tracked by #2248
Dandandan opened this issue Jun 3, 2021 · 7 comments
Closed
Tracked by #2248
Labels
enhancement New feature or request

Comments

@Dandandan
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently IN in queries is not supported.
The recently added semi hash join can be used for those queries.

Describe the solution you'd like
Add this to planner/optimizer.

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added the enhancement New feature or request label Jun 3, 2021
@Dandandan Dandandan changed the title Map in to semi join Map IN to semi join Jun 3, 2021
@msathis
Copy link
Contributor

msathis commented Jun 9, 2021

@Dandandan I would like to work on this ticket. Some pointers would be great, so i will be in the right direction. 😎

@Dandandan
Copy link
Contributor Author

Hey @msathis that would be great.

Effectively it means rewriting queries from:

SELECT a, b
FROM 
x
WHERE a in (select b from t)

Could be written as (minus SQL syntax)

SELECT a, b
FROM
x
SEMI JOIN t ON a=b

So the work will be

  • adding IN as option to the Expr enum and adding it to the sql/planner.
  • extracting applicable IN expression and transforming it to (left and right) columns
  • converting it to a semi join (a join with JoinType::Semi) either directly in the planner, and/or add a optimization rule (e.g. translating a cross join to a semi join). the first would be fine for now.

I think we can return an error in case the logical plan still contains a IN in a expression somewhere.

One complication I saw is that adding a LogicalPlan to the Expr (for encoding IN) is not trivial, because Expr has some derived Eq etc. which the logical plan does not have.

@msathis
Copy link
Contributor

msathis commented Jun 9, 2021

@Dandandan I was debugging to see what happens with the current code. Seems like all IN clauses are converted to Expr::InList. Should we use the InList itself? Or it is intended for different purpose, so better create Expr::In?

@Dandandan
Copy link
Contributor Author

@Dandandan I was debugging to see what happens with the current code. Seems like all IN clauses are converted to Expr::InList. Should we use the InList itself? Or it is intended for different purpose, so better create Expr::In?

The InList is for the IN (1, 2, 3) (a literal list) syntax, whereas this will be for IN (SELECT ....) (in subquery).

This currently results in an error if you try it (something like Unsupported ast node SQLExpr::InSubQuery) in sqltorel) which is defined here: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/sql/planner.rs#L1205

@msathis
Copy link
Contributor

msathis commented Jun 9, 2021

@Dandandan Got it. Now it's all clear 👍

@alamb
Copy link
Contributor

alamb commented Apr 8, 2022

Perhaps now duplicated by #1209

@andygrove andygrove mentioned this issue Apr 16, 2022
15 tasks
@andygrove andygrove changed the title Map IN to semi join Optimize queries that contain IN subqueries by rewriting as semi join Apr 27, 2022
@alamb
Copy link
Contributor

alamb commented May 9, 2022

I think this was done by @korowa in #2421 -- let's open a new ticket with additional specific work as we discover it.

@alamb alamb closed this as completed May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants