-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19981][SQL] Respect aliases in output partitioning of projects and aggregates #17400
Conversation
Test build #75104 has finished for PR 17400 at commit
|
Just curious, how come the fix is not in this code? So anywhere we compare expressions for semantic equality, we can say |
Here is a sort example with 1 partition. I believe the extra sort on
|
ISTM the solution you suggested does not work because the planner actually compares references (that is, |
I suggest the following code for
|
@allengeorge yea, we could there. But, I think we should first make sure about how to fix this issue. I'm not sure that the approach of this pr is the best. cc: @gatorsmile |
91a412e
to
0492c0f
Compare
Test build #76738 has finished for PR 17400 at commit
|
Test build #76740 has finished for PR 17400 at commit
|
ping |
@@ -36,6 +36,12 @@ import org.apache.spark.util.random.{BernoulliCellSampler, PoissonSampler} | |||
case class ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan) | |||
extends UnaryExecNode with CodegenSupport { | |||
|
|||
@transient private lazy val aliasesInProjects = projectList.flatMap(_.collectFirst { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we fix the issue in EnsureRequirements
? Aggregate operators can also introduce alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, ok. I'll consider again.
Test build #77030 has finished for PR 17400 at commit
|
Jenkins, retest this please. |
Test build #77053 has finished for PR 17400 at commit
|
@gatorsmile ping |
ping |
Test build #79520 has finished for PR 17400 at commit
|
Test build #85034 has finished for PR 17400 at commit
|
@maropu , any reason why this is on hold for so long? |
@@ -321,6 +321,58 @@ case class EnsureRequirements(conf: SQLConf) extends Rule[SparkPlan] { | |||
} | |||
} | |||
|
|||
private def updatePartitioningByAliases(exprs: Seq[NamedExpression], partioning: Partitioning) | |||
: Partitioning = { | |||
val aliasSeq = exprs.flatMap(_.collectFirst { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might do more than you'd like it to (at least if it behaves the way I understand collect first), i.e.
df.select($"x" as "x1, struct($"a" as "a1", $"b" as "b1") as "s1")
x1 and s1 are aliases, a1 and b1 are not. it could even get more complicated if there was an a1 alias in the top level projections list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pr only focuses on aliases, so the point you described above is out-of-scope in this pr. IMO more complicated cases should be fixed in follow-ups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu , I didn't aim for supporting complex partitioning expressions (which deserves its own separate PR), I meant that this code could introduce regressions by 'over-capturing' nested aliases.
- my specific example is wrong since struct is transformed into a named struct (alias is replaced by an explicit name).
I think that's because the priority is not much high. This issue causes any problem in your query? |
@maropu , yes it does :-) |
If possible, could you describe that problem in your case to encourage this work? |
in my use case, I aggregate a dataset, the use select to align columns with a case-class. I later try to join the resulting dataset based on the same columns used for aggregattion. |
Test build #94999 has finished for PR 17400 at commit
|
Test build #95009 has finished for PR 17400 at commit
|
e288288
to
ec3e6d9
Compare
Test build #95079 has finished for PR 17400 at commit
|
Test build #95078 has finished for PR 17400 at commit
|
Test build #95080 has finished for PR 17400 at commit
|
Test build #95097 has finished for PR 17400 at commit
|
Test build #95102 has finished for PR 17400 at commit
|
retest this please |
Test build #96061 has finished for PR 17400 at commit
|
retest this please |
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression} | ||
import org.apache.spark.sql.catalyst.plans.physical._ | ||
|
||
trait AliasAwareOutputPartitioning extends UnaryExecNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a general utility class for this. cc @maryannxue She did the similar things for the other projects in the past. Maybe @maryannxue can help deliver such a utility class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll wait for @maryannxue suggestion.
Test build #96161 has finished for PR 17400 at commit
|
retest this please |
Test build #96164 has finished for PR 17400 at commit
|
Test build #97746 has finished for PR 17400 at commit
|
Test build #97753 has finished for PR 17400 at commit
|
Test build #97788 has finished for PR 17400 at commit
|
What changes were proposed in this pull request?
The current master might wrongly add shuffle operations when projects and aggregates in physical plans have aliases in output expressions. A concrete example is as follows;
In the query, the second
Exchange
is not necessary. The root cause is that the planner wrongly regardskey
andk
as different attributes because they have differentexprId
. Then, it fails distribution requirement checks inEnsureRequirements
. This pr proposes to handle these aliases inEnsureRequirements
so as to check if the operators satisfy their output distribution requirements.How was this patch tested?
Added tests in
SQLQueryTestSuite
.