Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4296][SQL] Trims aliases when resolving and checking aggregate expressions #3910

Closed
wants to merge 1 commit into from

Conversation

liancheng
Copy link
Contributor

This PR is a follow-up of PR #3248. We should not only trim Alias around GetField but also all unnamed expressions when resolving and checking aggregate expressions. Trimming all aliases is safe because GROUP BY doesn't allow aliasing, thus all aliases added to an aggregate expression which is also a grouping expression must be introduced while resolving unnamed expressions like GetField and UDF calls.

Review on Reviewable

@SparkQA
Copy link

SparkQA commented Jan 6, 2015

Test build #25100 has started for PR 3910 at commit 60d10fb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 6, 2015

Test build #25100 has finished for PR 3910 at commit 60d10fb.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25100/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25123 has started for PR 3910 at commit 807e13e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 7, 2015

Test build #25123 has finished for PR 3910 at commit 807e13e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
    • class BinaryClassificationMetrics(

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25123/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Jan 11, 2015

Test build #563 has started for PR 3910 at commit 807e13e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Jan 11, 2015

Test build #563 has finished for PR 3910 at commit 807e13e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • SparkSubmit.printErrorAndExit(s"Cannot load main class from JAR $primaryResource")
    • class BinaryClassificationMetrics(
    • case class Sort(
    • case class BroadcastLeftSemiJoinHash(

// Should trim aliases. These aliases can only be introduced while resolving unnamed
// expressions like `GetField` and UDF calls, because GROUP BY clause doesn't allow
// aliasing.
.get(e.transform { case a: Alias => a.child })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this line can remove needed aliases. For example, in subq2.q

SELECT a.k, a.c
FROM (SELECT b.key as k, count(1) as c FROM src b GROUP BY b.key) a
WHERE a.k >= 90;

We have the following plans.

[info]   == Parsed Logical Plan ==
[info]   'Project ['a.k,'a.c]
[info]    'Filter ('a.k >= 90)
[info]     'Subquery a
[info]      'Aggregate ['b.key], ['b.key AS k#84130,COUNT(1) AS c#84131L]
[info]       'UnresolvedRelation [src], Some(b)
[info]   
[info]   == Analyzed Logical Plan ==
[info]   Project [k#84130,c#84131L]
[info]    Filter (k#84130 >= 90)
[info]     Aggregate [key#84132], [key#84132 AS k#84130,COUNT(1) AS c#84131L]
[info]      MetastoreRelation default, src, Some(b)
[info]   
[info]   == Optimized Logical Plan ==
[info]   Filter (k#84130 >= 90)
[info]    Aggregate [key#84132], [key#84132 AS k#84130,COUNT(1) AS c#84131L]
[info]     Project [key#84132]
[info]      MetastoreRelation default, src, Some(b)
[info]   
[info]   == Physical Plan ==
[info]   !Filter (k#84130 >= 90)
[info]    Aggregate false, [key#84132], [key#84132,Coalesce(SUM(PartialCount#84135L),0) AS c#84131L]
[info]     Exchange (HashPartitioning [key#84132], 2)
[info]      Aggregate true, [key#84132], [key#84132,COUNT(1) AS PartialCount#84135L]
[info]       HiveTableScan [key#84132], (MetastoreRelation default, src, Some(b)), None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this issue puzzled me for a loooong time! I realized this solution is wrong. But I didn't figure out the right one until you pointed out that #3987 (accidentally?) fixes this. Thanks!

@liancheng
Copy link
Contributor Author

Closing this as @yhuai had pointed out in #4010 that #3987 already fixed this issue.

@liancheng liancheng closed this Jan 13, 2015
@liancheng liancheng deleted the spark-4296 branch January 13, 2015 16:54
asfgit pushed a commit that referenced this pull request Jan 29, 2015
… expressions

I believe that SPARK-4296 has been fixed by 3684fd2. I am adding tests based #3910 (change the udf to HiveUDF instead).

Author: Yin Huai <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #4010 from yhuai/SPARK-4296-yin and squashes the following commits:

6343800 [Yin Huai] Merge remote-tracking branch 'upstream/master' into SPARK-4296-yin
6cfadd2 [Yin Huai] Actually, this issue has been fixed by 3684fd2.
d42b707 [Yin Huai] Update comment.
8b3a274 [Yin Huai] Since expressions in grouping expressions can have aliases, which can be used by the outer query block,     revert this change.
443538d [Cheng Lian] Trims aliases when resolving and checking aggregate expressions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants