Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support more expressions in equality join #4140

Closed
ygf11 opened this issue Nov 8, 2022 · 8 comments · Fixed by #4193
Closed

Support more expressions in equality join #4140

ygf11 opened this issue Nov 8, 2022 · 8 comments · Fixed by #4193
Labels
enhancement New feature or request

Comments

@ygf11
Copy link
Contributor

ygf11 commented Nov 8, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently some equality joins which contain normal expressions will run as cross join.
For example:

❯ explain select * from test0 as t0 inner join test1 as t1 on t0.c0 + 1 = t1.c0;
+---------------+-----------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                      |
+---------------+-----------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t0.c0, t1.c0                                                                                  |
|               |   Filter: CAST(t0.c0 AS Int64) + Int64(1) = CAST(t1.c0 AS Int64)                                          |
|               |     CrossJoin:                                                                                            |
|               |       SubqueryAlias: t0                                                                                   |
|               |         TableScan: test0 projection=[c0]                                                                  |
|               |       SubqueryAlias: t1                                                                                   |
|               |         TableScan: test1 projection=[c0]                                                                  |
| physical_plan | ProjectionExec: expr=[c0@0 as c0, c0@1 as c0]                                                             |
|               |   CoalesceBatchesExec: target_batch_size=4096                                                             |
|               |     FilterExec: CAST(c0@0 AS Int64) + 1 = CAST(c0@1 AS Int64)                                             |
|               |       CrossJoinExec                                                                                       |
|               |         RepartitionExec: partitioning=RoundRobinBatch(32)                                                 |
|               |           ParquetExec: limit=None, partitions=[test0.parquet], projection=[c0] |
|               |         RepartitionExec: partitioning=RoundRobinBatch(32)                                                 |
|               |           ParquetExec: limit=None, partitions=[test1.parquet], projection=[c0] |
|               |                                                                                                           |
+---------------+-----------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.008 seconds.

We can move these to hash-join to improve performance.

Describe the solution you'd like
Move these equality joins from cross join to join in logical and physical plan.

In addition, it also helps to fix:

Describe alternatives you've considered

Additional context

@ygf11 ygf11 added the enhancement New feature or request label Nov 8, 2022
@liukun4515
Copy link
Contributor

maybe it related to the subquery and the type coercion. @andygrove

@liukun4515 liukun4515 added bug Something isn't working and removed bug Something isn't working labels Nov 8, 2022
@ygf11
Copy link
Contributor Author

ygf11 commented Nov 9, 2022

maybe it related to the subquery and the type coercion.

Yes, it is relative to type coercion.

The issue is not very detailed, Let me add a little more.

Currently our LogicalPlan::Join(..) only supports column equality for join, so does physical plan.

https://github.com/apache/arrow-datafusion/blob/a32fb657d1caec634cb53979b2f7ef2fad224905/datafusion/expr/src/logical_plan/plan.rs#L1488

https://github.com/apache/arrow-datafusion/blob/a32fb657d1caec634cb53979b2f7ef2fad224905/datafusion/core/src/physical_plan/joins/hash_join.rs#L125

We can extend column to normal expression, then datafusion can support more equality joins.

That means:

  • If all columns of left side expression are in the left schema, and so does right side, parse as equality joins.
  • Other wise, parse as cross join.

For #2877, since the join unit is expression, we can wrap cast in type coercion phase when data types of two side is not same.

@Dandandan
Copy link
Contributor

Sounds like a good plan. Any expression of the form e1=e2 should be supported in equality joins.

@liukun4515
Copy link
Contributor

maybe it related to the subquery and the type coercion.

Yes, it is relative to type coercion.

The issue is not very detailed, Let me add a little more.

Currently our LogicalPlan::Join(..) only supports column equality for join, so does physical plan.

https://github.com/apache/arrow-datafusion/blob/a32fb657d1caec634cb53979b2f7ef2fad224905/datafusion/expr/src/logical_plan/plan.rs#L1488

https://github.com/apache/arrow-datafusion/blob/a32fb657d1caec634cb53979b2f7ef2fad224905/datafusion/core/src/physical_plan/joins/hash_join.rs#L125

We can extend column to normal expression, then datafusion can support more equality joins.

That means:

  • If all columns of left side expression are in the left schema, and so does right side, parse as equality joins.
  • Other wise, parse as cross join.

For #2877, since the join unit is expression, we can wrap cast in type coercion phase when data types of two side is not same.

cc @mingmwang

@mingmwang
Copy link
Contributor

I can take a look at the issue.
My current preference is to keep the equal join condition as columns, but add another projection to project the expression to normal columns and push down the projection.

The major reason is

  1. Complex equals join conditions will cause duplicate calculation:
    A inner join B on (udf1(A.a) = udf2(B.b))

The first time calculation happens when during the Repartition, Repartition expr will take the exprs now and need to evaluate.
The second time calculation happens when building the hashtable or comparing.

  1. Some guys might write wrong expressions as join conditions
    A inner join B on (A.a = B.b = B.c)
    This expression is actually a valid expression and can be evaluated, but could cause wrong join result, similar to cross join.
A.a = B.b -- result of evaluation is bool type
(A.a = B.b) = cast(B.c as bool)

Maybe we can also take a complex approach and check the expression's complexity

  1. if the expressions are trival cast, support them as equal join conditions

logical plan

A inner join B on (Cast(A.a) = Cast(B.b))
   TableScanA
   TableScanB

physical plan

A inner join B on (Cast(A.a) = Cast(B.b))
   Repartition(Cast(A.a))
       TableScanA
   Repartition(Cast(B.b))
       TableScanB
  1. if the expressions are complex, add projections

logical plan

A inner join B on (udf1(A.a) = udf2(B.b + xxxx))
   TableScanA
   TableScanB

physical plan

A inner join B on (col1 = col2)
   Repartition(col1)
   Projection(udf1(A.a) as col1, A.*)
       TableScanA
   Repartition(col2)
   Projection(udf2(B.b + xxxx) as col2, B.*)
       TableScanB
  1. if the expressions are suspicious, return error in logical plan
    return Error.
    A inner join B on (A.a = B.b = B.c)

@mingmwang
Copy link
Contributor

And I just added the EquivalenceProperties/EquivalentClass to the physical execution plan, if we plan to support expressions as equal join conditions, I need to enhance the EquivalenceProperties/EquivalentClass related logic as well.

@yahoNanJing @alamb

@alamb
Copy link
Contributor

alamb commented Nov 9, 2022

For what it is worth, the existing join logic contains hard coded assumptions that the join is between two columns in several places. Changing the join logic (which is already complicated and will likely only get more so) is likely to be quite challenging

So therefore I agree with @mingmwang's proposal of:

My current preference is to keep the equal join condition as columns, but add another projection to project the expression to normal columns and push down the projection.

if the expressions are trival cast, support them as equal join conditions

I don't think there would be any performance difference between casting in a Projection or in the Join itself and I think it would keep the Join significantly less complicated.

I would suggest not handling casts in the Join but instead work on improving the other optimizer simplification rules to remove the casts. Like the expression

|               |   Filter: CAST(t0.c0 AS Int64) + Int64(1) = CAST(t1.c0 AS Int64)                                          |

Can be rewritten into

|               |   Filter: t0.c0 + Int32(1) = t1.c0                                           |

Which we would still have to have a projection to evaluate t0.c0 + 1 but t1.c0 would need no processing

This type of unwrapping is already done in https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/unwrap_cast_in_comparison.rs

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 10, 2022

Thank you @mingmwang @alamb.

Agree with your suggestion to implement this with another projection. I will work on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants