Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CommonSubexprEliminate rule with surely and conditionally evaluated stats #11357

Merged
merged 4 commits into from
Jul 12, 2024

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented Jul 9, 2024

Which issue does this PR close?

Part of #11194.

Rationale for this change

Currently CommonSubexprEliminate doesn't recurse into short-circuit expression and misses extracing common expressions from surely evaluated legs. I.e. from the plan:

Projection (a + 1 OR b) AS c1, (a + 1 AND b) AS c2

the expression a + 1 is not extracted despite the fact that it is evaluated 2 times.

Also, it would make sense to extract such expressions that are surely evalueted only once but there is a chance that they are evaluated conditionally as well. I.e. from the plan:

Projection (a + 1 OR b) AS c1, (b AND a + 1) AS c2

it would make sense to extract a + 1.

What changes are included in this PR?

This PR:

  • Extends ExprStats with conditional evaluation counts.
  • Enhances ExprIdentifierVisitor to recurse into children of short-circuit expressions and maintain the state of beeing in a conditional expression branch with a new conditional flag in the visitor.
  • Treats expressions as common if they are surely evaluated at least 2 times or evaluated surely only once but also evaluated conditionally.
  • Fixes a bug in OptimizeProjections rule as it currently merges consecutive projections when there are multiple references to a certain column but they occur in 1 project expression.
    I.e. it currently merges projections:
    Projection: (__common_expr_1 OR random() = Float64(0)) AND (__common_expr_1 OR random() = Float64(1)) AS c1                                                                  
      Projection: t1.a = Float64(1) AS __common_expr_1                                                                                                                          |
        TableScan: t1 projection=[a]                                                                                                                                            |
    
    despite t1.a = Float64(1) is used 2 times.
    Without this bugfix in OptimizeProjections the effect of CommonSubexprEliminate would be reverted in the optimizer.
  • Adds new Expr:column_refs_counts() and Expr::add_column_ref_counts() APIs.

Are these changes tested?

Yes, added new UTs.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jul 9, 2024
@peter-toth peter-toth force-pushed the improve-cse-with-conditional-occurrence branch from 90a2227 to 6ffee44 Compare July 9, 2024 13:26
@peter-toth
Copy link
Contributor Author

cc @alamb, @haohuaijin. This PR fixes #11197 (comment) / #11265 (comment).

@peter-toth peter-toth force-pushed the improve-cse-with-conditional-occurrence branch 3 times, most recently from 75e15d6 to 54eb229 Compare July 9, 2024 17:31
@alamb
Copy link
Contributor

alamb commented Jul 9, 2024

Thank you @peter-toth the CI clippy error was fixed in #11368 so if you merge up from main the tests should now pass

I will review this PR tomorrow

@peter-toth peter-toth force-pushed the improve-cse-with-conditional-occurrence branch from 54eb229 to dad557c Compare July 10, 2024 08:40
@peter-toth
Copy link
Contributor Author

Thank you @peter-toth the CI clippy error was fixed in #11368 so if you merge up from main the tests should now pass

I will review this PR tomorrow

Thank you @alamb, I've rebased the PR on the latest main, clippy looks good now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @peter-toth -- this PR is (like always) a joy to read and review.

I left some documentation suggestions, but the only thing I think that is needed prior to merge is some additional negative testing (I left suggestions)

I also reviewed the plan changes carefully and they looked great

15)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
16)------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/part.tbl]]}, projection=[p_partkey, p_type], has_header=false
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
05)--------ProjectionExec: expr=[l_extendedprice@0 * (Some(1),20,0 - l_discount@1) as __common_expr_1, p_type@2 as p_type]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its interesting here that this plan shows the evaluation done below the aggregate but the aggregate doesn't seem to reflect that fact (e.g. the aggr expres don't refer to __common_expr_1

Copy link
Contributor Author

@peter-toth peter-toth Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is a good catch. I have no idea why because the logical plan looks good.
I will try to look into this after this PR, might be some kind of logical->physical plan conversion bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure that the logical plan looks good as lineitem.l_extendedprice and lineitem.l_discount disappeared from the optimized plan Projection.

Let me look into this before merging this PR.

Copy link
Contributor Author

@peter-toth peter-toth Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sorry, I was wrong. Those 2 appear only in aliases in Aggregate so the Projection below the Aggregate in the optimized logical plan seems correct.

01)Projection: t.y > Int32(0) AND Int64(1) / CAST(t.y AS Int64) < Int64(1) AS t.y > Int64(0) AND Int64(1) / t.y < Int64(1), t.x > Int32(0) AND t.y > Int32(0) AND Int64(1) / CAST(t.y AS Int64) < Int64(1) / CAST(t.x AS Int64) AS t.x > Int64(0) AND t.y > Int64(0) AND Int64(1) / t.y < Int64(1) / t.x
02)--TableScan: t projection=[x, y]
01)Projection: __common_expr_1 AND Int64(1) / CAST(t.y AS Int64) < Int64(1) AS t.y > Int64(0) AND Int64(1) / t.y < Int64(1), t.x > Int32(0) AND __common_expr_1 AND Int64(1) / CAST(t.y AS Int64) < Int64(1) / CAST(t.x AS Int64) AS t.x > Int64(0) AND t.y > Int64(0) AND Int64(1) / t.y < Int64(1) / t.x
02)--Projection: t.y > Int32(0) AS __common_expr_1, t.x, t.y
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I verified that the common expressions do not include the 1 / y term which can potentially generate a runtime error

FROM t1
----
logical_plan
01)Projection: (__common_expr_1 OR random() = Float64(0)) AND __common_expr_1 AS c1, __common_expr_2 AND random() = Float64(0) OR __common_expr_2 AS c2, CASE WHEN __common_expr_3 = Float64(0) THEN __common_expr_3 ELSE Float64(0) END AS c3, CASE WHEN __common_expr_4 = Float64(0) THEN Int64(0) WHEN CAST(__common_expr_4 AS Boolean) THEN Int64(0) ELSE Int64(0) END AS c4, CASE WHEN __common_expr_5 = Float64(0) THEN Float64(0) WHEN random() = Float64(0) THEN __common_expr_5 ELSE Float64(0) END AS c5, CASE WHEN __common_expr_6 = Float64(0) THEN Float64(0) ELSE __common_expr_6 END AS c6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1401,6 +1401,41 @@ impl Expr {
.expect("traversal is infallable");
}

/// Return all references to columns and their occurrence counts in the expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this new API makes sense to me as a parallel set of APIs for column_refs / add_column_refs

/// Adds references to all columns and their occurrence counts in the expression to
/// the map.
///
/// See [`Self::column_refs`] for details
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// See [`Self::column_refs`] for details
/// See [`Self::column_refs_counts`] for details

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed in 17f33f7.

@@ -901,15 +903,15 @@ struct ExprIdentifierVisitor<'a, 'n> {
random_state: &'a RandomState,
// a flag to indicate that common expression found
found_common: bool,
// if we are in a conditional branch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would help to document more what is meant by 'conditional' means -- maybe like this

Suggested change
// if we are in a conditional branch
// if we are in a conditional branch. A conditional
// branch means that the expression **might** not be executed depending
// on the runtime values of other expressions, and thus can not be extracted
// as a common expression .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b79a9a6.

Ok(match expr {
// If we are already in a conditionally evaluated subtree then continue
// traversal.
_ if self.conditional => TreeNodeRecursion::Continue,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a fascinating construct that makes the condition handling uniform 👍

right,
}) => {
left.visit(self)?;
self.conditionally(|visitor| right.visit(visitor).map(|_| ()))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the use of conditionally makes reading this logic quite elegant. Nice work

} else {
*count += 1;
}
if *count > 1 || *count == 1 && *conditional_count > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer explict parenthesis to avoid confusion

In this case, I think this is the same:

Suggested change
if *count > 1 || *count == 1 && *conditional_count > 0 {
if *count > 1 || (*count == 1 && *conditional_count > 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, fixed in b79a9a6.

@@ -171,3 +175,41 @@ logical_plan
physical_plan
01)ProjectionExec: expr=[a@0 = random() AND b@1 = 0 as c1, a@0 = random() AND b@1 = 1 as c2, a@0 = 2 + random() OR b@1 = 4 as c3, a@0 = 2 + random() OR b@1 = 5 as c4, CASE WHEN a@0 = 4 + random() THEN 0 ELSE 1 END as c5, CASE WHEN a@0 = 4 + random() THEN 0 ELSE 2 END as c6]
02)--MemoryExec: partitions=1, partition_sizes=[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe add some negative tests if they aren't already handled

For example, I think these should not be CSE'd:

    (random() = 0 OR a = 1) AND a = 1
    (random() = 0 AND a = 1) OR a = 1
    CASE 
      WHEN a + 10 = 0 THEN 0 
      WHEN random() > 0.5 THEN a+10 
      ELSE 0
    END
    CASE 
      WHEN random() > 0.5 THEN 0
      WHEN a + 10 = 0 THEN 0 
      ELSE a + 10
    END
    CASE 
      WHEN a + 10 = 0 THEN 0 
      WHEN random() > 0.5 
      WHEN random() > 0.5 THEN a+10 
      ELSE 0 
END

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea! I forgot about negative tests. But from the first and 3rd case when examples above we can extract a + 10 as it appears in the first when, so it is surely executed, and it also appear in the conditional subtrees.
I've added new tests in 6e3300f.

@alamb
Copy link
Contributor

alamb commented Jul 10, 2024

cc also @haohuaijin

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @peter-toth . ❤️ -- really nice

@alamb alamb merged commit d542cbd into apache:main Jul 12, 2024
23 checks passed
@peter-toth
Copy link
Contributor Author

Thanks for the review @alamb!

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
…valuated stats (apache#11357)

* Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats

* remove expression tree hashing as no longer needed

* address review comments

* add negative tests
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…valuated stats (apache#11357)

* Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats

* remove expression tree hashing as no longer needed

* address review comments

* add negative tests
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
…valuated stats (apache#11357)

* Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats

* remove expression tree hashing as no longer needed

* address review comments

* add negative tests
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…valuated stats (apache#11357)

* Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats

* remove expression tree hashing as no longer needed

* address review comments

* add negative tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants