Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert outer join to inner join to improve performance #1585

Closed
Tracked by #2255
xudong963 opened this issue Jan 16, 2022 · 6 comments
Closed
Tracked by #2255

convert outer join to inner join to improve performance #1585

xudong963 opened this issue Jan 16, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@xudong963
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Under certain conditions, convert outer join to inner join to improve performance.

Describe the solution you'd like
During predicate pushdown, if the filter condition (where) only references the non-nullable-side table, then it can be pushed down. But if the filter condition (where) references the nullable-side table, it can't be pushed down, things get bad, which leads to performance cracking.

Fortunately, if the filtering conditions are strict, we can convert the outer join to inner join, all filter conditions can be pushed down.

Broadly speaking, a function, operator, or expression is considered strict if the input argument is NULL and the output is either NULL or FALSE.

Describe alternatives you've considered
No
Additional context
AFAIK, Postgres has the feature.

@james727
Copy link
Contributor

@xudong963 I'd like to pick this up once #1618 is (hopefully) eventually merged. Question re: implementation - do you think this should be a separate optimizer pass or something else?

Apologies in advance if this is all obvious - the reason I ask is that it seems that this rule benefits from filter pushdown occurring both before and after the join rewrite happens. For example, consider the following contrived query:

SELECT * FROM (
  SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.uid
)
WHERE t2.uid IS NOT NULL

If we do not push t2.uid IS NOT NULL into the subquery, we will not know that the join can be rewritten when we optimize the join node. Thus, we should push filters down as far as possible before rewriting the join, giving the following:

-- Push filter into subquery
SELECT * FROM (
  SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.uid
  WHERE t2.uid IS NOT NULL
)

-- Rewrite join
SELECT * FROM (
  SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.uid
  WHERE t2.uid IS NOT NULL
)

However, at this point we would benefit from another pass with the filter pushdown rule, as t2.uid IS NOT NULL can now be pushed down to the underlying scan on t2.

It looks like optimizer passes are sequenced in a vector defined in context.rs - so my question could be reframed as: how would you structure this change to facilitate first pushing filters down as far as possible, then rewriting joins where possible, then pushing down any filters that are now enabled due to outer joins becoming inner joins?

@houqp
Copy link
Member

houqp commented Jan 21, 2022

My opinion is it would be cleaner to manage it as a separate rule because join rewrite is not really related to predicate pushdowns.

I think we have a guiding rule that plan optimization run should be agnostic to the order of plan rules defined in context.rs. Perhaps now is the time for us to introduce a convergent optimization loop?

@xudong963
Copy link
Member Author

Sorry for the delay @james727 -- I agree with @houqp. We can manage a separate rule for join rewrite. In the future, I think we can introduce more join rewrite rules.

@xudong963
Copy link
Member Author

Perhaps now is the time for us to introduce a convergent optimization loop?

Could you explain more specifically about the convergent optimization loop or open an issue? 😁

@alamb
Copy link
Contributor

alamb commented Jan 24, 2022

Relevant discussion: #1618 (comment)

You can see a version of this code in Spark here: https://github.com/apache/spark/blob/aaf0e5e71509a2324e110e45366b753c7926c64b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L119-L135

@xudong963
Copy link
Member Author

I think the issue finished, close it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants