Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats #11357

peter-toth · 2024-07-09T13:00:27Z

Which issue does this PR close?

Part of #11194.

Rationale for this change

Currently CommonSubexprEliminate doesn't recurse into short-circuit expressions and misses extracting common expressions from surely evaluated legs. For example, from the plan:

Projection (a + 1 OR b) AS c1, (a + 1 AND b) AS c2

the expression a + 1 is not extracted despite the fact that it is evaluated 2 times.

Also, it would make sense to extract expressions that are surely evaluated only once but may also be evaluated conditionally. For example, from the plan:

Projection (a + 1 OR b) AS c1, (b AND a + 1) AS c2

it would make sense to extract a + 1.
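The counting idea behind the two plans above can be sketched with a toy expression tree. This is only an illustration under stated assumptions: the `Expr` enum, `visit`, `collect` and `is_common` below are hypothetical names, not DataFusion's types, and the toy does not filter out trivial expressions such as plain columns.

```rust
use std::collections::HashMap;

// Toy expression tree for illustration only; not DataFusion's `Expr`.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Expr {
    Col(&'static str),
    Lit(i64),
    Add(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Per-expression (surely evaluated count, conditionally evaluated count).
type Stats = HashMap<Expr, (usize, usize)>;

fn visit(expr: &Expr, conditional: bool, stats: &mut Stats) {
    let entry = stats.entry(expr.clone()).or_insert((0, 0));
    if conditional {
        entry.1 += 1;
    } else {
        entry.0 += 1;
    }
    match expr {
        Expr::Col(_) | Expr::Lit(_) => {}
        Expr::Add(l, r) => {
            visit(l, conditional, stats);
            visit(r, conditional, stats);
        }
        // Short-circuit operators: the left leg keeps the current state,
        // the right leg may be skipped at runtime, so it is conditional.
        Expr::And(l, r) | Expr::Or(l, r) => {
            visit(l, conditional, stats);
            visit(r, true, stats);
        }
    }
}

// Collect stats over all expressions of one (toy) projection.
fn collect(exprs: &[Expr]) -> Stats {
    let mut stats = Stats::new();
    for e in exprs {
        visit(e, false, &mut stats);
    }
    stats
}

// Common if surely evaluated at least twice, or surely evaluated once
// and also evaluated conditionally.
fn is_common(stats: &Stats, expr: &Expr) -> bool {
    let (sure, cond) = stats.get(expr).copied().unwrap_or((0, 0));
    sure > 1 || (sure == 1 && cond > 0)
}

fn b() -> Expr {
    Expr::Col("b")
}
fn a_plus_1() -> Expr {
    Expr::Add(Box::new(Expr::Col("a")), Box::new(Expr::Lit(1)))
}
fn and(l: Expr, r: Expr) -> Expr {
    Expr::And(Box::new(l), Box::new(r))
}
fn or(l: Expr, r: Expr) -> Expr {
    Expr::Or(Box::new(l), Box::new(r))
}

fn main() {
    // (a + 1 OR b), (a + 1 AND b): `a + 1` is surely evaluated twice.
    let first = collect(&[or(a_plus_1(), b()), and(a_plus_1(), b())]);
    // (a + 1 OR b), (b AND a + 1): once surely, once conditionally.
    let second = collect(&[or(a_plus_1(), b()), and(b(), a_plus_1())]);
    // (b OR a + 1), (b AND a + 1): only conditionally, so not extracted.
    let third = collect(&[or(b(), a_plus_1()), and(b(), a_plus_1())]);
    for (name, stats) in [("first", &first), ("second", &second), ("third", &third)] {
        println!("{name}: a + 1 common = {}", is_common(stats, &a_plus_1()));
    }
}
```

The third case is the one that must stay untouched: extracting an expression that only ever appears in conditional legs could force an evaluation that the original plan might have skipped.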

What changes are included in this PR?

This PR:

Extends ExprStats with conditional evaluation counts.
Enhances ExprIdentifierVisitor to recurse into children of short-circuit expressions and maintain the state of being in a conditional expression branch with a new conditional flag in the visitor.
Treats expressions as common if they are surely evaluated at least 2 times or evaluated surely only once but also evaluated conditionally.

Fixes a bug in the OptimizeProjections rule: it currently merges consecutive projections even when a column is referenced multiple times, as long as all the references occur in a single projection expression.
For example, it currently merges these projections:

Projection: (__common_expr_1 OR random() = Float64(0)) AND (__common_expr_1 OR random() = Float64(1)) AS c1
  Projection: t1.a = Float64(1) AS __common_expr_1
    TableScan: t1 projection=[a]

even though t1.a = Float64(1) is used 2 times.
Without this bugfix in OptimizeProjections, the optimizer would revert the effect of CommonSubexprEliminate.

Adds new Expr::column_refs_counts() and Expr::add_column_ref_counts() APIs.
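To illustrate why occurrence counts matter for the projection merge decision, here is a minimal sketch on a toy tree. The `Expr` enum and both functions are hypothetical stand-ins that only borrow the names of the new APIs; the real signatures in DataFusion may differ.

```rust
use std::collections::HashMap;

// Toy expression tree; hypothetical, not DataFusion's `Expr`.
enum Expr {
    Column(String),
    Random,
    Literal(f64),
    Binary(Box<Expr>, &'static str, Box<Expr>),
}

// Adds every column reference in `expr` to `counts`, once per occurrence.
fn add_column_ref_counts(expr: &Expr, counts: &mut HashMap<String, usize>) {
    match expr {
        Expr::Column(name) => *counts.entry(name.clone()).or_insert(0) += 1,
        Expr::Random | Expr::Literal(_) => {}
        Expr::Binary(left, _, right) => {
            add_column_ref_counts(left, counts);
            add_column_ref_counts(right, counts);
        }
    }
}

// Convenience wrapper returning a fresh map for a single expression.
fn column_refs_counts(expr: &Expr) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    add_column_ref_counts(expr, &mut counts);
    counts
}

fn binary(left: Expr, op: &'static str, right: Expr) -> Expr {
    Expr::Binary(Box::new(left), op, Box::new(right))
}

// Builds `__common_expr_1 OR random() = <lit>`.
fn leg(lit: f64) -> Expr {
    binary(
        Expr::Column("__common_expr_1".to_string()),
        "OR",
        binary(Expr::Random, "=", Expr::Literal(lit)),
    )
}

fn main() {
    // (__common_expr_1 OR random() = 0) AND (__common_expr_1 OR random() = 1)
    let c1 = binary(leg(0.0), "AND", leg(1.0));
    let counts = column_refs_counts(&c1);
    // Two occurrences inside a single expression: merging the projections
    // would inline `t1.a = Float64(1)` twice and undo the extraction.
    println!("{:?}", counts.get("__common_expr_1"));
}
```

A set of referenced columns would report `__common_expr_1` only once for this expression, which is presumably how the merge slipped through before the fix.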

Are these changes tested?

Yes, added new UTs.

Are there any user-facing changes?

No.

peter-toth · 2024-07-09T13:30:40Z

cc @alamb, @haohuaijin. This PR fixes #11197 (comment) / #11265 (comment).

alamb · 2024-07-09T21:26:57Z

Thank you @peter-toth the CI clippy error was fixed in #11368 so if you merge up from main the tests should now pass

I will review this PR tomorrow


peter-toth · 2024-07-10T09:03:25Z

> Thank you @peter-toth the CI clippy error was fixed in #11368 so if you merge up from main the tests should now pass
>
> I will review this PR tomorrow

Thank you @alamb, I've rebased the PR on the latest main, clippy looks good now.

alamb

Thanks @peter-toth -- this PR is (like always) a joy to read and review.

I left some documentation suggestions, but the only thing I think that is needed prior to merge is some additional negative testing (I left suggestions)

I also reviewed the plan changes carefully and they looked great

alamb · 2024-07-10T15:50:36Z

datafusion/sqllogictest/test_files/tpch/q14.slt.part

-15)----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
-16)------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/tpch/data/part.tbl]]}, projection=[p_partkey, p_type], has_header=false
+04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(CASE WHEN part.p_type LIKE Utf8("PROMO%") THEN lineitem.l_extendedprice * Int64(1) - lineitem.l_discount ELSE Int64(0) END), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)]
+05)--------ProjectionExec: expr=[l_extendedprice@0 * (Some(1),20,0 - l_discount@1) as __common_expr_1, p_type@2 as p_type]


It's interesting that this plan shows the evaluation done below the aggregate, but the aggregate doesn't seem to reflect that fact (e.g. the aggr exprs don't refer to __common_expr_1).

Wow, this is a good catch. I have no idea why because the logical plan looks good.
I will try to look into this after this PR, might be some kind of logical->physical plan conversion bug?

Actually, I'm not sure that the logical plan looks good as lineitem.l_extendedprice and lineitem.l_discount disappeared from the optimized plan Projection.

Let me look into this before merging this PR.

No sorry, I was wrong. Those 2 appear only in aliases in Aggregate so the Projection below the Aggregate in the optimized logical plan seems correct.

alamb · 2024-07-10T15:51:50Z

datafusion/sqllogictest/test_files/select.slt

-01)Projection: t.y > Int32(0) AND Int64(1) / CAST(t.y AS Int64) < Int64(1) AS t.y > Int64(0) AND Int64(1) / t.y < Int64(1), t.x > Int32(0) AND t.y > Int32(0) AND Int64(1) / CAST(t.y AS Int64) < Int64(1) / CAST(t.x AS Int64) AS t.x > Int64(0) AND t.y > Int64(0) AND Int64(1) / t.y < Int64(1) / t.x
-02)--TableScan: t projection=[x, y]
+01)Projection: __common_expr_1 AND Int64(1) / CAST(t.y AS Int64) < Int64(1) AS t.y > Int64(0) AND Int64(1) / t.y < Int64(1), t.x > Int32(0) AND __common_expr_1 AND Int64(1) / CAST(t.y AS Int64) < Int64(1) / CAST(t.x AS Int64) AS t.x > Int64(0) AND t.y > Int64(0) AND Int64(1) / t.y < Int64(1) / t.x
+02)--Projection: t.y > Int32(0) AS __common_expr_1, t.x, t.y


👍 I verified that the common expressions do not include the 1 / y term which can potentially generate a runtime error
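The reason the `1 / y` term has to stay behind its guard can be shown with plain Rust short-circuiting. This is a sketch only: `guarded` and `eager` are hypothetical helpers, and `checked_div` merely stands in for a runtime division error.

```rust
// `y > 0 AND 1 / y < 1`: the division only runs when the guard holds.
fn guarded(y: i64) -> bool {
    y > 0 && 1 / y < 1
}

// Mimics extracting `1 / y` into a lower projection: the division is
// computed for every row before the guard is consulted. `None` stands
// in for the runtime error that eager evaluation could raise.
fn eager(y: i64) -> Option<bool> {
    let extracted = 1i64.checked_div(y)?;
    Some(y > 0 && extracted < 1)
}

fn main() {
    for y in [0, 1, 2] {
        println!("y = {y}: guarded = {}, eager = {:?}", guarded(y), eager(y));
    }
}
```

For `y = 0` the guarded form simply yields `false`, while the eager form fails, which is why only the surely evaluated `t.y > Int32(0)` is extracted in the plan above.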

alamb · 2024-07-10T15:52:48Z

datafusion/sqllogictest/test_files/cse.slt

+FROM t1
+----
+logical_plan
+01)Projection: (__common_expr_1 OR random() = Float64(0)) AND __common_expr_1 AS c1, __common_expr_2 AND random() = Float64(0) OR __common_expr_2 AS c2, CASE WHEN __common_expr_3 = Float64(0) THEN __common_expr_3 ELSE Float64(0) END AS c3, CASE WHEN __common_expr_4 = Float64(0) THEN Int64(0) WHEN CAST(__common_expr_4 AS Boolean) THEN Int64(0) ELSE Int64(0) END AS c4, CASE WHEN __common_expr_5 = Float64(0) THEN Float64(0) WHEN random() = Float64(0) THEN __common_expr_5 ELSE Float64(0) END AS c5, CASE WHEN __common_expr_6 = Float64(0) THEN Float64(0) ELSE __common_expr_6 END AS c6


alamb · 2024-07-10T15:56:43Z

datafusion/expr/src/expr.rs

@@ -1401,6 +1401,41 @@ impl Expr {
        .expect("traversal is infallable");
    }

+    /// Return all references to columns and their occurrence counts in the expression.


I think this new API makes sense to me as a parallel set of APIs for column_refs / add_column_refs

alamb · 2024-07-10T15:57:12Z

datafusion/expr/src/expr.rs

+    /// Adds references to all columns and their occurrence counts in the expression to
+    /// the map.
+    ///
+    /// See [`Self::column_refs`] for details


Suggested change

-    /// See [`Self::column_refs`] for details
+    /// See [`Self::column_refs_counts`] for details

Thanks, fixed in 17f33f7.

alamb · 2024-07-10T16:05:29Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

@@ -901,15 +903,15 @@ struct ExprIdentifierVisitor<'a, 'n> {
    random_state: &'a RandomState,
    // a flag to indicate that common expression found
    found_common: bool,
+    // if we are in a conditional branch


I think it would help to document more clearly what 'conditional' means, maybe like this:

Suggested change

-    // if we are in a conditional branch
+    // if we are in a conditional branch. A conditional
+    // branch means that the expression **might** not be executed depending
+    // on the runtime values of other expressions, and thus can not be extracted
+    // as a common expression.

Fixed in b79a9a6.

alamb · 2024-07-10T16:06:16Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+        Ok(match expr {
+            // If we are already in a conditionally evaluated subtree then continue
+            // traversal.
+            _ if self.conditional => TreeNodeRecursion::Continue,


That is a fascinating construct that makes the condition handling uniform 👍

alamb · 2024-07-10T16:07:27Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+                right,
+            }) => {
+                left.visit(self)?;
+                self.conditionally(|visitor| right.visit(visitor).map(|_| ()))?;


the use of conditionally makes reading this logic quite elegant. Nice work
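For readers without the diff at hand, a helper in the spirit of `conditionally` can be sketched as follows. The `Visitor` struct, its `log` field and `visit_leaf` are hypothetical; only the idea of setting a flag around a closure and restoring it afterwards is taken from the hunk above.

```rust
// Hypothetical visitor that records whether each visited leaf was seen
// while inside a conditionally evaluated branch.
struct Visitor {
    conditional: bool,
    log: Vec<(String, bool)>,
}

impl Visitor {
    // Runs `f` with the `conditional` flag set, then restores the
    // previous value so that nested calls compose.
    fn conditionally<F>(&mut self, f: F) -> Result<(), String>
    where
        F: FnOnce(&mut Self) -> Result<(), String>,
    {
        let saved = self.conditional;
        self.conditional = true;
        let result = f(&mut *self);
        self.conditional = saved;
        result
    }

    fn visit_leaf(&mut self, name: &str) -> Result<(), String> {
        self.log.push((name.to_string(), self.conditional));
        Ok(())
    }
}

// Visits `left AND right`, then an unrelated sibling expression.
fn run() -> Result<Vec<(String, bool)>, String> {
    let mut visitor = Visitor { conditional: false, log: Vec::new() };
    visitor.visit_leaf("left")?;
    visitor.conditionally(|v| v.visit_leaf("right"))?;
    visitor.visit_leaf("sibling")?;
    Ok(visitor.log)
}

fn main() {
    println!("{:?}", run());
}
```

Saving and restoring the flag, rather than resetting it to `false`, keeps the state correct when a short-circuit expression is itself nested inside a conditional branch.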

alamb · 2024-07-10T16:09:12Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+            } else {
+                *count += 1;
+            }
+            if *count > 1 || *count == 1 && *conditional_count > 0 {


I personally prefer explicit parentheses to avoid confusion

In this case, I think this is the same:

Suggested change

-            if *count > 1 || *count == 1 && *conditional_count > 0 {
+            if *count > 1 || (*count == 1 && *conditional_count > 0) {

Sure, fixed in b79a9a6.
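The two spellings are indeed equivalent, because `&&` binds more tightly than `||` in Rust. A small check, using hypothetical free functions rather than the visitor's fields:

```rust
// Relies on operator precedence: `&&` binds tighter than `||`.
fn without_parens(count: usize, conditional_count: usize) -> bool {
    count > 1 || count == 1 && conditional_count > 0
}

// Same predicate with the grouping spelled out.
fn with_parens(count: usize, conditional_count: usize) -> bool {
    count > 1 || (count == 1 && conditional_count > 0)
}

fn main() {
    for count in 0..4 {
        for conditional_count in 0..4 {
            assert_eq!(
                without_parens(count, conditional_count),
                with_parens(count, conditional_count)
            );
        }
    }
    println!("both forms agree");
}
```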

alamb · 2024-07-10T16:14:30Z

datafusion/sqllogictest/test_files/cse.slt

@@ -171,3 +175,41 @@ logical_plan
 physical_plan
 01)ProjectionExec: expr=[a@0 = random() AND b@1 = 0 as c1, a@0 = random() AND b@1 = 1 as c2, a@0 = 2 + random() OR b@1 = 4 as c3, a@0 = 2 + random() OR b@1 = 5 as c4, CASE WHEN a@0 = 4 + random() THEN 0 ELSE 1 END as c5, CASE WHEN a@0 = 4 + random() THEN 0 ELSE 2 END as c6]
 02)--MemoryExec: partitions=1, partition_sizes=[0]
+


Could we maybe add some negative tests, if these cases aren't already covered?

For example, I think these should not be CSE'd:

(random() = 0 OR a = 1) AND a = 1

(random() = 0 AND a = 1) OR a = 1

CASE WHEN a + 10 = 0 THEN 0 WHEN random() > 0.5 THEN a+10 ELSE 0 END

CASE WHEN random() > 0.5 THEN 0 WHEN a + 10 = 0 THEN 0 ELSE a + 10 END

CASE WHEN a + 10 = 0 THEN 0 WHEN random() > 0.5 WHEN random() > 0.5 THEN a+10 ELSE 0 END

This is a good idea! I forgot about negative tests. But from the first and third CASE WHEN examples above we can extract a + 10, as it appears in the first WHEN, so it is surely evaluated, and it also appears in the conditional subtrees.
I've added new tests in 6e3300f.
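The CASE reasoning in this exchange can be sketched by counting occurrences per slot. The `Case` struct and `count` function are hypothetical simplifications: each slot just lists the sub-expressions of interest as strings, and only the first WHEN condition is treated as surely evaluated.

```rust
use std::collections::HashMap;

// Sub-expressions of interest found in each slot of a CASE expression.
struct Case {
    // (sub-expressions in the WHEN condition, sub-expressions in the THEN result)
    branches: Vec<(Vec<&'static str>, Vec<&'static str>)>,
    else_result: Vec<&'static str>,
}

// Returns (surely evaluated count, conditionally evaluated count) per
// sub-expression. Only the first WHEN condition always runs; every other
// slot depends on the outcome of earlier conditions.
fn count(case: &Case) -> HashMap<&'static str, (usize, usize)> {
    let mut stats = HashMap::new();
    for (i, (when, then)) in case.branches.iter().enumerate() {
        for e in when {
            let s = stats.entry(*e).or_insert((0, 0));
            if i == 0 {
                s.0 += 1;
            } else {
                s.1 += 1;
            }
        }
        for e in then {
            stats.entry(*e).or_insert((0, 0)).1 += 1;
        }
    }
    for e in &case.else_result {
        stats.entry(*e).or_insert((0, 0)).1 += 1;
    }
    stats
}

fn extractable(stats: &HashMap<&'static str, (usize, usize)>, e: &str) -> bool {
    let (sure, cond) = stats.get(e).copied().unwrap_or((0, 0));
    sure > 1 || (sure == 1 && cond > 0)
}

// CASE WHEN a + 10 = 0 THEN 0 WHEN random() > 0.5 THEN a + 10 ELSE 0 END
fn first_example() -> Case {
    Case {
        branches: vec![(vec!["a + 10"], vec![]), (vec![], vec!["a + 10"])],
        else_result: vec![],
    }
}

// CASE WHEN random() > 0.5 THEN 0 WHEN a + 10 = 0 THEN 0 ELSE a + 10 END
fn second_example() -> Case {
    Case {
        branches: vec![(vec![], vec![]), (vec!["a + 10"], vec![])],
        else_result: vec!["a + 10"],
    }
}

fn main() {
    println!("first: {}", extractable(&count(&first_example()), "a + 10"));
    println!("second: {}", extractable(&count(&second_example()), "a + 10"));
}
```

In the first example `a + 10` sits in the first WHEN, so it is evaluated for every row and can be extracted; in the second it appears only behind the `random() > 0.5` condition and is left alone.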

alamb · 2024-07-10T16:15:44Z

cc also @haohuaijin

alamb

Thank you @peter-toth . ❤️ -- really nice

peter-toth · 2024-07-12T12:49:24Z

Thanks for the review @alamb!

…valuated stats (apache#11357) * Improve `CommonSubexprEliminate` rule with surely and conditionally evaluated stats * remove expression tree hashing as no longer needed * address review comments * add negative tests

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jul 9, 2024

peter-toth force-pushed the improve-cse-with-conditional-occurrence branch from 90a2227 to 6ffee44 Compare July 9, 2024 13:26

peter-toth force-pushed the improve-cse-with-conditional-occurrence branch 3 times, most recently from 75e15d6 to 54eb229 Compare July 9, 2024 17:31

alamb mentioned this pull request Jul 9, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 #11334

Closed

9 tasks

Improve CommonSubexprEliminate rule with surely and conditionally e…

dad557c

…valuated stats

peter-toth force-pushed the improve-cse-with-conditional-occurrence branch from 54eb229 to dad557c Compare July 10, 2024 08:40

alamb reviewed Jul 10, 2024


peter-toth added 3 commits July 11, 2024 16:50

remove expression tree hashing as no longer needed

17f33f7

address review comments

b79a9a6

add negative tests

6e3300f

alamb approved these changes Jul 11, 2024


alamb merged commit d542cbd into apache:main Jul 12, 2024
23 checks passed

alamb mentioned this pull request Jul 15, 2024

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 #11474

Closed

7 tasks
