Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-35078][SQL] Add tree traversal pruning in expression rules #32280

Closed
wants to merge 12 commits into from

Conversation

sigmod
Copy link
Contributor

@sigmod sigmod commented Apr 22, 2021

What changes were proposed in this pull request?

Added the following TreePattern enums:

  • AND_OR
  • BINARY_ARITHMETIC
  • BINARY_COMPARISON
  • CASE_WHEN
  • CAST
  • CONCAT
  • COUNT
  • IF
  • LIKE_FAMLIY
  • NOT
  • NULL_CHECK
  • UNARY_POSITIVE
  • UPPER_OR_LOWER

Used them in the following rules:

  • ConstantPropagation
  • ReorderAssociativeOperator
  • BooleanSimplification
  • SimplifyBinaryComparison
  • SimplifyCaseConversionExpressions
  • SimplifyConditionals
  • PushFoldableIntoBranches
  • LikeSimplification
  • NullPropagation
  • SimplifyCasts
  • RemoveDispensableExpressions
  • CombineConcats

Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label Apr 22, 2021
@sigmod sigmod changed the title [WIP][SPARK-35078] Support tree traversal pruning in expression rules [WIP][SPARK-35078] Add tree traversal pruning in expression rules Apr 22, 2021
@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42284/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42284/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137757 has finished for PR 32280 at commit 5a09050.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42309/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42309/

@sigmod sigmod changed the title [WIP][SPARK-35078] Add tree traversal pruning in expression rules [WIP][SPARK-35078][SQL] Add tree traversal pruning in expression rules Apr 22, 2021
@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42324/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42324/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137780 has finished for PR 32280 at commit 3083338.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137795 has finished for PR 32280 at commit 549d976.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sigmod sigmod changed the title [WIP][SPARK-35078][SQL] Add tree traversal pruning in expression rules [SPARK-35078][SQL] Add tree traversal pruning in expression rules Apr 22, 2021
@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42349/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42349/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42353/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42353/

@SparkQA
Copy link

SparkQA commented Apr 22, 2021

Test build #137819 has finished for PR 32280 at commit e9fec51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Test build #137823 has finished for PR 32280 at commit 63df6ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sigmod
Copy link
Contributor Author

sigmod commented Apr 23, 2021

@hvanhovell @gengliangwang @maryannxue @dbaliafroozeh this PR is ready for review. Let me know if I missed anything.

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsUp {
def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
_.containsAnyPattern(NULL_CHECK, NULL_LITERAL, COUNT, CAST), ruleId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need CAST here?

Copy link
Contributor Author

@sigmod sigmod Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to capture the first case:
case e @ WindowExpression(Cast(Literal(0L, _), _, _), _)

I'm not sure why this pattern is related to NullPropagation, but that seems a separate issue.
I updated the condition to be more restrictive since CAST may be common in certain queries:

t.containsAnyPattern(NULL_CHECK, NULL_LITERAL, COUNT)
|| t.containsAllPatterns(WINDOW_EXPRESSION, CAST, LITERAL)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. ​I think now developers need to spend extra effort on choosing the Bits here. And there seems no good solution for avoiding regressions except for asking developers to add more unit tests.
Do you have any idea to improve the framework for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems no good solution for avoiding regressions except for asking developers to add more unit tests

  • For an existing rule that we're adding this pruning, yes, we need to be careful, particularly when a rule has complex case patterns, although many rules are simple and intuitive.

  • For a new rule or a new case branch in an existing rule, it seems that the regression risk doesn't change much? if a case pattern branch is not covered by unit tests, there's also a regression risk even without bits pruning, e.g., the case pattern could be wrong too.

@@ -29,7 +29,7 @@ import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning, RangePartitioning, RoundRobinPartitioning, SinglePartition}
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.catalyst.trees.TreePattern.{
INNER_LIKE_JOIN, JOIN, LEFT_SEMI_OR_ANTI_JOIN, NATURAL_LIKE_JOIN, OUTER_JOIN, TreePattern
FILTER, INNER_LIKE_JOIN, JOIN, LEFT_SEMI_OR_ANTI_JOIN, NATURAL_LIKE_JOIN, OUTER_JOIN, TreePattern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
import org.apache.spark.sql.catalyst.trees.TreePattern._

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except two comments

@gengliangwang
Copy link
Member

@sigmod could you resolve the conflicts?

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42374/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42374/

Copy link
Contributor Author

@sigmod sigmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sigmod could you resolve the conflicts?

Done.

def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case q: LogicalPlan => q transformExpressionsUp {
def apply(plan: LogicalPlan): LogicalPlan = plan.transformWithPruning(
_.containsAnyPattern(NULL_CHECK, NULL_LITERAL, COUNT, CAST), ruleId) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems no good solution for avoiding regressions except for asking developers to add more unit tests

  • For an existing rule that we're adding this pruning, yes, we need to be careful, particularly when a rule has complex case patterns, although many rules are simple and intuitive.

  • For a new rule or a new case branch in an existing rule, it seems that the regression risk doesn't change much? if a case pattern branch is not covered by unit tests, there's also a regression risk even without bits pruning, e.g., the case pattern could be wrong too.

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42376/

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42376/

@gengliangwang
Copy link
Member

Thanks, merging to master

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Test build #137844 has finished for PR 32280 at commit bc6c103.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 23, 2021

Test build #137846 has finished for PR 32280 at commit cd1d9ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants