Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34701][SQL] Remove analyzing temp view again in CreateViewCommand #31933

Closed

Conversation

imback82
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes to remove re-analyzing the already analyzed plan for CreateViewCommand as discussed https://github.com/apache/spark/pull/31273/files#r581592786.

Why are the changes needed?

No need to analyze the plan if it's already analyzed.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests should cover this.

@imback82 imback82 changed the title [W[SPARK-34701][SQL] Remove analyzing temp view again in CreateViewCommand [WIP][SPARK-34701][SQL] Remove analyzing temp view again in CreateViewCommand Mar 22, 2021
@imback82 imback82 marked this pull request as draft March 22, 2021 19:06
@SparkQA
Copy link

SparkQA commented Mar 22, 2021

Test build #136359 has finished for PR 31933 at commit ff7fabb.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40943/

@github-actions github-actions bot added the SQL label Mar 22, 2021
@imback82 imback82 changed the title [WIP][SPARK-34701][SQL] Remove analyzing temp view again in CreateViewCommand [SPARK-34701][SQL] Remove analyzing temp view again in CreateViewCommand Mar 22, 2021
@imback82 imback82 marked this pull request as ready for review March 22, 2021 23:39
@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Test build #136368 has finished for PR 31933 at commit 5829e48.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40952/

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40952/

allowExisting: Boolean,
replace: Boolean,
viewType: ViewType)
extends RunnableCommand {

import ViewHelper._

override def innerChildren: Seq[QueryPlan[_]] = Seq(child)
override def plansToCheckAnalysis: Seq[LogicalPlan] = Seq(analyzedPlan)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to run checkAnalysis on the analyzed plan, otherwise, for the following:

sql("CREATE TABLE view_base_table (key int, data varchar(20)) USING PARQUET")
sql("CREATE VIEW key_dependent_view AS SELECT * FROM view_base_table GROUP BY key")

, view creation works fine, whereas it should have failed with:

org.apache.spark.sql.AnalysisException
expression 'spark_catalog.default.view_base_table.data' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.

@@ -167,6 +167,9 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
case _: ShowTableExtended =>
throw new AnalysisException("SHOW TABLE EXTENDED is not supported for v2 tables.")

case c: Command =>
c.plansToCheckAnalysis.foreach(checkAnalysis)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems hacky? But we cannot make the analyzed plan as children for CreateViewCommaned. The reason is that the View will be optimized away (in the optimizer), and the verification that checks if a permanent view references temp views will fail.

@@ -28,6 +28,7 @@ trait Command extends LogicalPlan {
override def output: Seq[Attribute] = Seq.empty
override def producedAttributes: AttributeSet = outputSet
override def children: Seq[LogicalPlan] = Seq.empty
def plansToCheckAnalysis: Seq[LogicalPlan] = Seq.empty
Copy link
Contributor

@cloud-fan cloud-fan Mar 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse innerChildren instead of adding this?

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40963/

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40963/

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Test build #136379 has finished for PR 31933 at commit 6fdd9e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2021

Test build #136418 has started for PR 31933 at commit e04fbdd.


// Check analysis on internal nodes.
plan match {
case c: Command =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move it to L682?

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Test build #136471 has finished for PR 31933 at commit a14b3b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41055/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41055/

@imback82
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41070/

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41070/

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Test build #136486 has finished for PR 31933 at commit a14b3b9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41080/

@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41080/

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in da04f1f Mar 25, 2021
@SparkQA
Copy link

SparkQA commented Mar 25, 2021

Test build #136496 has finished for PR 31933 at commit 507b00c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class GetViewColumnByNameAndOrdinal(
  • case class MultiplyYMInterval(
  • case class LocalLimitExec(limit: Int, child: SparkPlan) extends BaseLimitExec

@@ -94,14 +94,21 @@ case class CacheTableAsSelectExec(
override lazy val relationName: String = tempViewName

override lazy val planToCache: LogicalPlan = {
// CacheTableAsSelectExec.query is not resolved yet (e.g., not a child of CacheTableAsSelect)
// in order to skip optimizing it; note that we need to pass an analyzed plan to
// CreateViewCommand for the cache to work correctly. Thus, the query is analyzed below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we further clean this up by creating CreateViewStatement instead of CreateViewCommand below?

// If the plan cannot be analyzed, throw an exception and don't proceed.
val qe = sparkSession.sessionState.executePlan(child)
qe.assertAnalyzed()
val analyzedPlan = qe.analyzed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is a regression. CreateViewCommand can be created from CreateViewStatement in ResolveSessionCatalog. However, ResolveSessionCatalog is inside extendedResolutionRules, which means the analyzedPlan doesn't go through postHocResolutionRules.

I think we need to make CreateViewCommand.analyzedPlan a real child, so that it can go through all the analyzer rules.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. One issue with "make CreateViewCommand.analyzedPlan a real child" is that it will become an optimized plan, which will affect the caching. I can try to make certain command's children to skip optimizer. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be great to have such a mechanism to skip optimizer, useful to CacheTableAsSelect as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will give it a shot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I created #32032 to handle this scenario. Please let me know what you think. TIA!

@cloud-fan
Copy link
Contributor

I've reverted it due to the regression mentioned in #31933 (comment)

cloud-fan pushed a commit that referenced this pull request Apr 14, 2021
…dren to be removed once the command is marked as analyzed

### What changes were proposed in this pull request?

This PR proposes to introduce the `AnalysisOnlyCommand` trait such that a command that extends this trait can have its children only analyzed, but not optimized. There is a corresponding analysis rule `HandleAnalysisOnlyCommand` that marks the command as analyzed after all other analysis rules are run.

This can be useful if a logical plan has children where they need to be only analyzed, but not optimized - e.g., `CREATE VIEW` or `CACHE TABLE AS`. This also addresses the issue found in #31933.

This PR also updates `CreateViewCommand`, `CacheTableAsSelect`, and `AlterViewAsCommand` to use the new trait / rule such that their children are only analyzed.

### Why are the changes needed?

To address the issue where the plan is unnecessarily re-analyzed in `CreateViewCommand`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover the changes.

Closes #32032 from imback82/skip_transform.

Authored-by: Terry Kim <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants