[SPARK-35155][SQL] Add rule id pruning to Analyzer rules #32425

Closed
wants to merge 7 commits into from

sigmod
Contributor

@sigmod sigmod commented May 3, 2021

What changes were proposed in this pull request?

Added rule-id-based pruning to the following Analyzer rules in fixed-point batches:

  • org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions
  • org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveBinaryArithmetic
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveEncodersInUDF
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveInsertInto
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast
  • org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUserSpecifiedColumns
  • org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution
  • org.apache.spark.sql.catalyst.analysis.DeduplicateRelations
  • org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
  • org.apache.spark.sql.catalyst.analysis.EliminateUnions
  • org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct
  • org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints
  • org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveJoinStrategyHints
  • org.apache.spark.sql.catalyst.analysis.ResolveInlineTables
  • org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables
  • org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
  • org.apache.spark.sql.catalyst.analysis.ResolveUnion
  • org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals
  • org.apache.spark.sql.catalyst.analysis.TimeWindowing

Subsequent PRs will add tree-bit-based pruning to these rules. The work was split out of one big PR to reduce review load.

Why are the changes needed?

Reduce the number of tree traversals and hence improve query compilation latency.
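As a rough illustration of how rule id pruning can cut traversals, here is a minimal, self-contained sketch. It does not use Spark's classes: `Node`, `Leaf`, `Op`, `transformWithPruning`, and the `visits` counter are all hypothetical names invented for this example. The assumption is that each node remembers the ids of rules that were already observed to leave its subtree unchanged, so a later invocation of the same rule can return immediately.

```scala
import scala.collection.mutable

object RuleIdPruningSketch {
  // Hypothetical miniature plan tree; NOT Spark's TreeNode.
  sealed trait Node {
    // Ids of rules already observed to leave this subtree unchanged.
    val ineffectiveRules: mutable.BitSet = mutable.BitSet.empty
    def children: Seq[Node]
    def withChildren(newChildren: Seq[Node]): Node
  }
  final case class Leaf(name: String) extends Node {
    def children: Seq[Node] = Nil
    def withChildren(newChildren: Seq[Node]): Node = this
  }
  final case class Op(name: String, children: Seq[Node]) extends Node {
    def withChildren(newChildren: Seq[Node]): Node = copy(children = newChildren)
  }

  // Counts visited nodes so the saving is observable.
  var visits = 0

  // Bottom-up transform that skips any subtree already marked for this rule id.
  def transformWithPruning(node: Node, ruleId: Int)(rule: PartialFunction[Node, Node]): Node =
    if (node.ineffectiveRules.contains(ruleId)) node
    else {
      visits += 1
      val newChildren = node.children.map(c => transformWithPruning(c, ruleId)(rule))
      val childrenChanged = newChildren.zip(node.children).exists { case (a, b) => !(a eq b) }
      val rebuilt = if (childrenChanged) node.withChildren(newChildren) else node
      val result = rule.applyOrElse(rebuilt, (n: Node) => n)
      // Same instance back means nothing changed in this subtree: remember that.
      if (result eq node) node.ineffectiveRules += ruleId
      result
    }

  // Applies one rule three times and reports the cumulative visit count after each pass.
  def demo(): (Int, Int, Int) = {
    visits = 0
    val rule: PartialFunction[Node, Node] = { case Leaf("t") => Leaf("resolved_t") }
    val plan = Op("project", Seq(Op("filter", Seq(Leaf("t")))))
    val first = transformWithPruning(plan, ruleId = 0)(rule)
    val afterFirst = visits
    val second = transformWithPruning(first, ruleId = 0)(rule)
    val afterSecond = visits
    transformWithPruning(second, ruleId = 0)(rule)
    (afterFirst, afterSecond, visits)
  }
}
```

In this sketch the first pass rewrites the tree, the second pass traverses the new tree once more and finds nothing to do, and the third pass is skipped at the root. A rule that runs only once never reaches the point where the marker pays off.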

How was this patch tested?

Existing tests.

@github-actions github-actions bot added the SQL label May 3, 2021
@sigmod sigmod changed the title [WIP][SPARK-35155][SQL] Add rule id pruning to Resolve rules [WIP][SPARK-35155][SQL] Add rule id pruning to Analyzer rules May 3, 2021
@sigmod sigmod changed the title [WIP][SPARK-35155][SQL] Add rule id pruning to Analyzer rules [SPARK-35155][SQL] Add rule id pruning to Analyzer rules May 3, 2021
@sigmod sigmod changed the title [SPARK-35155][SQL] Add rule id pruning to Analyzer rules [WIP][SPARK-35155][SQL] Add rule id pruning to Analyzer rules May 3, 2021
@sigmod sigmod changed the title [WIP][SPARK-35155][SQL] Add rule id pruning to Analyzer rules [SPARK-35155][SQL] Add rule id pruning to Analyzer rules May 4, 2021
@sigmod
Contributor Author

sigmod commented May 4, 2021

@hvanhovell @gengliangwang @dbaliafroozeh @maryannxue, this PR is ready for review. Changes in this PR are mostly mechanical: I only added rule id pruning to Analyzer rules. I plan to add tree bit pruning in a subsequent PR to limit PR size and reduce review load.

@SparkQA

SparkQA commented May 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42671/

@SparkQA

SparkQA commented May 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42671/

@gengliangwang
Member

@sigmod Does this include all the analyzer rules?

@sigmod
Contributor Author

sigmod commented May 5, 2021

> @sigmod Does this include all the analyzer rules?

It includes most rules in the fixed-point batches listed below (rule id pruning only helps rules that are invoked multiple times, e.g., in a fixed-point batch):

ResolveUserSpecifiedColumns ::
ResolveInsertInto ::
ResolveRelations ::
ResolveTables ::
ResolvePartitionSpec ::
AddMetadataColumns ::
DeduplicateRelations ::
ResolveReferences ::
ResolveCreateNamedStruct ::
ResolveDeserializer ::
ResolveNewInstance ::
ResolveUpCast ::
ResolveGroupingAnalytics ::
ResolvePivot ::
ResolveOrdinalInOrderByAndGroupBy ::
ResolveAggAliasInGroupBy ::
ResolveMissingReferences ::
ExtractGenerator ::
ResolveGenerate ::
ResolveFunctions ::
ResolveAliases ::
ResolveSubquery ::
ResolveSubqueryColumnAliases ::
ResolveWindowOrder ::
ResolveWindowFrame ::
ResolveNaturalAndUsingJoin ::
ResolveOutputRelation ::
ExtractWindowExpressions ::
GlobalAggregates ::
ResolveAggregateFunctions ::
TimeWindowing ::
ResolveInlineTables ::
ResolveHigherOrderFunctions(catalogManager) ::
ResolveLambdaVariables ::
ResolveTimeZone ::
ResolveRandomSeed ::
ResolveBinaryArithmetic ::
ResolveUnion ::

OptimizeUpdateFields,
CTESubstitution,
WindowsSubstitution,
EliminateUnions,
SubstituteUnresolvedOrdinals),

ResolveHints.ResolveJoinStrategyHints,
ResolveHints.ResolveCoalesceHints),
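To make the point about repeated invocation concrete, the following sketch models a fixed-point batch in a heavily simplified form. It is not Spark's `RuleExecutor`: `Plan`, `Rule`, and `runToFixedPoint` are hypothetical names, and the "plan" is just a list of strings. The loop reruns every rule until a whole iteration changes nothing, so the final iteration consists only of no-op invocations, which is the kind of work rule id pruning aims to skip.

```scala
object FixedPointSketch {
  // Hypothetical stand-ins; a "plan" here is just a list of names.
  type Plan = List[String]
  final case class Rule(name: String, run: Plan => Plan)

  // Reruns all rules until one full iteration leaves the plan unchanged.
  // Returns the final plan, the total rule invocations, and how many were no-ops.
  def runToFixedPoint(rules: Seq[Rule], plan: Plan, maxIterations: Int = 100): (Plan, Int, Int) = {
    var current = plan
    var invocations = 0
    var noOps = 0
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      iteration += 1
      val before = current
      for (rule <- rules) {
        val next = rule.run(current)
        invocations += 1
        if (next == current) noOps += 1
        current = next
      }
      changed = current != before
    }
    (current, invocations, noOps)
  }

  // ruleB can only fire after ruleA has produced "A", forcing a second iteration.
  val ruleB: Rule = Rule("resolveB", p => if (p.contains("A")) p.map(s => if (s == "b") "B" else s) else p)
  val ruleA: Rule = Rule("resolveA", p => p.map(s => if (s == "a") "A" else s))

  def demo(): (Plan, Int, Int) = runToFixedPoint(Seq(ruleB, ruleA), List("b", "a"))
}
```

With these two toy rules the batch needs three iterations, and four of the six rule invocations change nothing. A batch that runs each rule exactly once has no such repeated no-op invocations to save.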

Two rules are currently not included:

  • TypeCoercionRule, which currently uses hand-written recursion instead of calling resolve/transform;
  • CTESubstitution, which has somewhat complex logic with multiple transform calls.

I plan to address them in subsequent PRs.

Three rules that rely on potentially changing external state are currently not included either (although they are probably fine for the current use cases):

  • ResolveTableValuedFunctions(v1SessionCatalog)
  • ResolveNamespace(catalogManager)
  • ResolveCatalogs(catalogManager)
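The concern with external state can be sketched as follows. This is a hypothetical scenario, not a claim that any of the three rules above misbehaves: `Plan`, `catalog`, and `resolveFromCatalog` are invented names, and the one-node "plan" stands in for a real tree. The assumption is that a rule's "ineffective" marker is only valid while everything the rule reads stays the same.

```scala
import scala.collection.mutable

object ExternalStateSketch {
  // Hypothetical one-node "plan": a name, a resolved flag, and the pruning bits.
  final class Plan(val name: String, val resolved: Boolean) {
    val ineffectiveRules: mutable.BitSet = mutable.BitSet.empty
  }

  // Stand-in for external state such as a catalog; it can change between invocations.
  val catalog: mutable.Set[String] = mutable.Set.empty

  val ResolveFromCatalogId = 0

  // Resolves the plan if the catalog knows its name; optionally honors the pruning bit.
  def resolveFromCatalog(plan: Plan, usePruning: Boolean): Plan =
    if (usePruning && plan.ineffectiveRules.contains(ResolveFromCatalogId)) plan
    else if (!plan.resolved && catalog.contains(plan.name)) new Plan(plan.name, resolved = true)
    else {
      // No change this time: with pruning enabled, remember that for later invocations.
      if (usePruning) plan.ineffectiveRules += ResolveFromCatalogId
      plan
    }

  // Returns (resolved with pruning, resolved without pruning) after the catalog changes.
  def demo(): (Boolean, Boolean) = {
    catalog.clear()
    val pruned = new Plan("f", resolved = false)
    val unpruned = new Plan("f", resolved = false)
    resolveFromCatalog(pruned, usePruning = true)    // catalog empty: no-op, bit is set
    resolveFromCatalog(unpruned, usePruning = false) // catalog empty: no-op
    catalog += "f"                                   // external state changes
    (resolveFromCatalog(pruned, usePruning = true).resolved,
      resolveFromCatalog(unpruned, usePruning = false).resolved)
  }
}
```

In the sketch, the pruned invocation stays unresolved after the catalog changes because the stale marker short-circuits the rule, while the unpruned one resolves. Whether this can actually occur depends on whether the external state changes between invocations on the same plan.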

@SparkQA

SparkQA commented May 5, 2021

Test build #138150 has finished for PR 32425 at commit 58923f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait ExtractValue extends Expression

@gengliangwang
Member

Thanks, merging to master

@sigmod sigmod deleted the analyzer branch May 27, 2021 18:49