Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16839][SQL] Simplify Struct creation code path #15718

Closed
wants to merge 47 commits into from

Conversation

hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

Simplify struct creation, especially the aspect of CleanupAliases which missed some aliases when handling trees created by CreateStruct.

This PR includes:

  1. A failing test (create struct with nested aliases, some of the aliases survive CleanupAliases).
  2. A fix that transforms CreateStruct into a CreateNamedStruct constructor, effectively eliminating CreateStruct from all expression trees.
  3. A NamePlaceHolder used by CreateStruct when column names cannot be extracted from unresolved NamedExpression.
  4. A new Analyzer rule that resolves NamePlaceHolder into a string literal once the NamedExpression is resolved.
  5. CleanupAliases code was simplified as it no longer has to deal with CreateStruct's top level columns.

How was this patch tested?

Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.

Modified few tests that expected CreateStruct which is now transformed into CreateNamedStruct.

eyal farago and others added 30 commits August 2, 2016 01:08
…ruct into CreateNamedStruct, this maintains field names and simplifies Alias removal.
…>namedStruct loginc into its own Analyzer rule.
…eStruct: CreateStruct is now unevaluable, direct test is marked as ignored.
…eStruct: reorg analyzer rules according to cloud-fan's comment.
…teStruct: ExpressionEncoder used CreateStruct directly, fixed.
…eStruct: refactored ResolveCreateStruct, move the conversion logic to CrateStruct and CreateStructUnsafe, introduce a helper mixin.
…eStruct: few tester fixes, few tests are still failing.
…teStruct: it turns out we're calling toCreateNameStruct during AST construction (and some tests), in these cases the tree is not resolved yet which causes the dataType method to explode. instead of relying on the dataType method, we examine the children names directly, as a manner of code reuse dataType is now implemented in terms of toCreateNamedStruct.
…esulted with an invalid URI, URI considered file:c:/... as a relative path, modified to file:c:/...).
…miss any expressions 'hiding' in projections or other corners of the logical plan.
…ast-builder no longer 'resolves' createStruct into createNamedStruct.
…ruct and CreateStructUnsafe. restructure test to meet project's standards.
…en file for a test affected by CreateStruct becoming a CreateNamedStruct.
…pAliases

Conflicts:
	sql/hive/src/test/resources/sqlgen/subquery_in_having_2.sql
fix R tests that relied on the generated alias for CreateStruct expressions (thanks to @HyukjinKwon)
…pAliases

Conflicts:
	sql/hive/src/test/resources/sqlgen/subquery_in_having_2.sql
eyalfa and others added 15 commits September 18, 2016 22:59
…y avoids calling name() on unresolved named expressions.
… contribute minimal fix to get (some of) hive tests running on windows.
…mes are not available when CreateStruct.apply executes, we now identify this during construction and plant a place holder for the attrubute name (leaving the resulting CreateNamesStruct unresolved). once the Analyzer resolves the relevant child expressions we deduce the attirbute names from the values
… step to get (some of) hive tests running on Windows.
[SPARK-16839][SQL] Use unresolved CreateStruct instead
…cases for org.apache.spark.sql.SQLQueryTestSuite, basically add an explicit alias over an aggregated struct in projection, this is basically best practice any way as there's no guarantee that 'generated' aliases are stable.
@hvanhovell
Copy link
Contributor Author

cc @eyalfa

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67906 has finished for PR 15718 at commit af277ee.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/core/src/test/resources/sql-tests/inputs/group-by.sql
#	sql/core/src/test/resources/sql-tests/results/group-by.sql.out
@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67918 has finished for PR 15718 at commit 29ccf4e.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 2, 2016

Test build #67925 has finished for PR 15718 at commit c0263d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

Merging to master/2.1.

asfgit pushed a commit that referenced this pull request Nov 2, 2016
## What changes were proposed in this pull request?

Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.

This PR includes:

1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.

## How was this patch tested?
Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.

Modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.

Author: eyal farago <eyal farago>
Author: Herman van Hovell <[email protected]>
Author: eyal farago <[email protected]>
Author: Eyal Farago <[email protected]>
Author: Hyukjin Kwon <[email protected]>
Author: eyalfa <[email protected]>

Closes #15718 from hvanhovell/SPARK-16839-2.

(cherry picked from commit f151bd1)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit asfgit closed this in f151bd1 Nov 2, 2016
StructType(fields)
}

case object NamePlaceholder extends LeafExpression with Unevaluable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell Although this is merged, I have a question about this NamePlaceholder.

Is it necessary? Actually it has no function and will be replaced with the field name later in analysis.

The resolution of CreateNamedStruct is depended on the field expressions, actually. So if we directly use the field name, seems it is no harm too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. I got it. nvm. Some unresolved named expressions cannot be accessed of name property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking the same thing at first and I tried a few things. However this was the most straightforward way of getting it to work.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.

This PR includes:

1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.

## How was this patch tested?
Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.

Modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.

Author: eyal farago <eyal farago>
Author: Herman van Hovell <[email protected]>
Author: eyal farago <[email protected]>
Author: Eyal Farago <[email protected]>
Author: Hyukjin Kwon <[email protected]>
Author: eyalfa <[email protected]>

Closes apache#15718 from hvanhovell/SPARK-16839-2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants