convert `outer join` to `inner join` to improve performance #1585

xudong963 · 2022-01-16T09:00:32Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Under certain conditions, convert outer join to inner join to improve performance.

Describe the solution you'd like
During predicate pushdown, if the filter condition (where) only references the non-nullable-side table, then it can be pushed down. But if the filter condition (where) references the nullable-side table, it can't be pushed down, things get bad, which leads to performance cracking.

Fortunately, if the filtering conditions are strict, we can convert the outer join to inner join, all filter conditions can be pushed down.

Broadly speaking, a function, operator, or expression is considered strict if the input argument is NULL and the output is either NULL or FALSE.

Describe alternatives you've considered
No
Additional context
AFAIK, Postgres has the feature.

The text was updated successfully, but these errors were encountered:

james727 · 2022-01-20T20:18:22Z

@xudong963 I'd like to pick this up once #1618 is (hopefully) eventually merged. Question re: implementation - do you think this should be a separate optimizer pass or something else?

Apologies in advance if this is all obvious - the reason I ask is that it seems that this rule benefits from filter pushdown occurring both before and after the join rewrite happens. For example, consider the following contrived query:

SELECT * FROM (
  SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.uid
)
WHERE t2.uid IS NOT NULL

If we do not push t2.uid IS NOT NULL into the subquery, we will not know that the join can be rewritten when we optimize the join node. Thus, we should push filters down as far as possible before rewriting the join, giving the following:

-- Push filter into subquery
SELECT * FROM (
  SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.uid
  WHERE t2.uid IS NOT NULL
)

-- Rewrite join
SELECT * FROM (
  SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.uid
  WHERE t2.uid IS NOT NULL
)

However, at this point we would benefit from another pass with the filter pushdown rule, as t2.uid IS NOT NULL can now be pushed down to the underlying scan on t2.

It looks like optimizer passes are sequenced in a vector defined in context.rs - so my question could be reframed as: how would you structure this change to facilitate first pushing filters down as far as possible, then rewriting joins where possible, then pushing down any filters that are now enabled due to outer joins becoming inner joins?

houqp · 2022-01-21T06:40:24Z

My opinion is it would be cleaner to manage it as a separate rule because join rewrite is not really related to predicate pushdowns.

I think we have a guiding rule that plan optimization run should be agnostic to the order of plan rules defined in context.rs. Perhaps now is the time for us to introduce a convergent optimization loop?

xudong963 · 2022-01-21T16:25:47Z

Sorry for the delay @james727 -- I agree with @houqp. We can manage a separate rule for join rewrite. In the future, I think we can introduce more join rewrite rules.

xudong963 · 2022-01-21T16:28:08Z

Perhaps now is the time for us to introduce a convergent optimization loop?

Could you explain more specifically about the convergent optimization loop or open an issue? 😁

alamb · 2022-01-24T20:51:14Z

Relevant discussion: #1618 (comment)

You can see a version of this code in Spark here: https://github.com/apache/spark/blob/aaf0e5e71509a2324e110e45366b753c7926c64b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L119-L135

xudong963 · 2022-11-22T01:05:08Z

I think the issue finished, close it!

xudong963 added the enhancement New feature or request label Jan 16, 2022

xudong963 mentioned this issue Jan 16, 2022

datafusion doesn't process predicate pushdown correctly when there is outer join #1586

Closed

james727 mentioned this issue Jan 20, 2022

Include join type in display implementation for logical plan #1620

Closed

xudong963 mentioned this issue Jan 21, 2022

Fix predicate pushdown for outer joins #1618

Merged

alamb mentioned this issue Jan 24, 2022

Query Optimizer: Add OUTER --> INNER join conversion #1670

Closed

xudong963 mentioned this issue Apr 6, 2022

Optimizer enhancement: convert outer join to inner join #2169

Closed

This was referenced Apr 17, 2022

[Optimizer] Infer is not null predicate from where clause #2254

Closed

Datafusion Optimizer Enhancement #2255

Closed

xudong963 closed this as completed Nov 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convert `outer join` to `inner join` to improve performance #1585

convert `outer join` to `inner join` to improve performance #1585

xudong963 commented Jan 16, 2022

james727 commented Jan 20, 2022

houqp commented Jan 21, 2022

xudong963 commented Jan 21, 2022

xudong963 commented Jan 21, 2022

alamb commented Jan 24, 2022

xudong963 commented Nov 22, 2022

convert outer join to inner join to improve performance #1585

convert outer join to inner join to improve performance #1585

Comments

xudong963 commented Jan 16, 2022

james727 commented Jan 20, 2022

houqp commented Jan 21, 2022

xudong963 commented Jan 21, 2022

xudong963 commented Jan 21, 2022

alamb commented Jan 24, 2022

xudong963 commented Nov 22, 2022

convert `outer join` to `inner join` to improve performance #1585

convert `outer join` to `inner join` to improve performance #1585