Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve volatile expression handling in CommonSubexprEliminate #11265

Merged
merged 8 commits into from
Jul 8, 2024

Conversation

peter-toth
Copy link
Contributor

Which issue does this PR close?

Part of #11194.

Rationale for this change

Curently CommonSubexprEliminate always skips expressions doesn't allow volatile expressions to be extracted, but this can be improved by allowing non-volatile subexpressions of volatile expressions to be extracted:

I.e. from the plan:

Projection (a + b) + rand() AS c1, (a + b) + rand() AS c2

the expression a + b can be extracted as common expression despite the fact that (a + b) + rand() is volatile.

What changes are included in this PR?

Changes the rule to handle volatile expressions better by adding a new boolean flag to VisitRecord::ExprItem and propagating up volatility (validity for CSE).

Are these changes tested?

Yes, added new UT.

Are there any user-facing changes?

No.

@peter-toth peter-toth changed the title Improve volatile expression handling in CommonSubexprEliminate rule Improve volatile expression handling in CommonSubexprEliminate Jul 4, 2024
@peter-toth peter-toth force-pushed the improve-cse-volatile-handling branch from 4b395e1 to 6532b97 Compare July 4, 2024 12:31
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Jul 4, 2024
@peter-toth
Copy link
Contributor Author

peter-toth commented Jul 4, 2024

cc @alamb, this the 2nd PR of #11194 and removes a TODO.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @peter-toth -- this looks quite cool. I had one question about the test and some comment suggestions. The test is the only thing I think is needed before merging this PR

cc @waynexia and @haohuaijin in case you have comments

It may also be worth writing a few .slt tests showing this in action and how it interacts with multiple subexpressions

datafusion/expr/src/expr.rs Show resolved Hide resolved
/// The flag is propagated up from children to parent. (E.g. volatile expressions
/// are not valid and can't be extracted, but non-volatile children of volatile
/// expressions can be extracted.)
ExprItem(Identifier<'n>, bool),
}

impl<'n> ExprIdentifierVisitor<'_, 'n> {
/// Find the first `EnterMark` in the stack, and accumulates every `ExprItem`
/// before it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you possibly update this doc string to explain the 4-tuple that is now returned? Specifically document the two different bool values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, good idea, I've added the comment in 79b2e02

let plan = LogicalPlanBuilder::from(table_scan.clone())
.project(vec![
not_extracted_volatile_short_circuit_1.clone().alias("c1"),
not_extracted_volatile_short_circuit_1.alias("c2"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I was thinking -- why can't we extract the a = 0 part of a = 0 OR random() = 0 out?

It seems like it would be ok to rewrite

SELECT 
  a = 0 OR random() = 0 as c1,
  a = 0 OR random() = 0 as c2,
 ..

to something like this:

SELECT 
  __subexpr_1 OR random() as c1,
  __subexpr_1 OR random() as c2,
...
FROM (
  SELECT 
   a=0 as  __subexpr_1 
  FROM 
...

Copy link
Contributor Author

@peter-toth peter-toth Jul 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question is similar to #11197 (comment). We can extract the surely evaluated expressions (1st legs) from those short circuiting Or, And and CaseWhen expressions, but we are not there yet with this PR.

My 3rd PR in the #11194 epic will implement that logic, but I want to add the improvements gradually as it will require to recurse into only certain children of a parent and that's not straightforward with the current treenode APIs (i.e. we need to stop recursion at the parent with a Jump and start a new recursion only on the interresting children).

let mut expr_id = None;
let mut is_valid = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please document what "valid" means in this context? I think it means "is valid for CSE" as in "this sub expression could potentially be removed via CSE" but I am not quite sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment to the method in 79b2e02, I think it explains what "valid" means in this case, but let me know if we should document the variable as well.

@peter-toth
Copy link
Contributor Author

It may also be worth writing a few .slt tests showing this in action and how it interacts with multiple subexpressions

Sure, I will add some tests tomorrow.

Copy link
Contributor

@haohuaijin haohuaijin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @peter-toth and @alamb.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jul 6, 2024
@peter-toth
Copy link
Contributor Author

peter-toth commented Jul 6, 2024

I've added CSE tests in 1aa8848, it contais TODOs for #11265 (comment), but I will fix them in the next PR.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @peter-toth and @haohuaijin

I merged up from main and updated the unit test to avoid a new dependency on datafusion-functions.

@alamb
Copy link
Contributor

alamb commented Jul 8, 2024

Thanks again @peter-toth and @haohuaijin

@alamb alamb merged commit b6281b5 into apache:main Jul 8, 2024
23 checks passed
@peter-toth
Copy link
Contributor Author

Thanks @alamb and @haohuaijin for the review!

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…che#11265)

* Improve volatile expression handling in `CommonSubexprEliminate` rule

* fix volatile handling with short circuits

* fix comments

* add slt tests for CSE

* Avoid adding datafusion function dependency

* revert changes to datafusion-cli.lock

---------

Co-authored-by: Andrew Lamb <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…che#11265)

* Improve volatile expression handling in `CommonSubexprEliminate` rule

* fix volatile handling with short circuits

* fix comments

* add slt tests for CSE

* Avoid adding datafusion function dependency

* revert changes to datafusion-cli.lock

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants