[SQL] Improve SparkSQL Aggregates #683

marmbrus · 2014-05-07T20:45:53Z

Add native min/max (was using hive before).
Handle nulls correctly in Avg and Sum.
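
For illustration, the null behaviour SQL expects from these aggregates can be sketched as below. This is a minimal sketch, not the Catalyst implementation: it models a SQL NULL as `None`, and the object name `NullAwareAggregates` and its helper methods are hypothetical.

```scala
// Hypothetical helpers that model SQL NULL as None.
// These are NOT the Catalyst classes touched by this PR.
object NullAwareAggregates {
  // SUM skips NULL inputs and yields NULL when no non-NULL input was seen.
  def sum(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum)
  }

  // AVG divides by the number of non-NULL inputs, not by the row count.
  def avg(values: Seq[Option[Double]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // MIN and MAX ignore NULL inputs and yield NULL for an all-NULL group.
  def min(values: Seq[Option[Double]]): Option[Double] =
    values.flatten.reduceOption((a, b) => math.min(a, b))

  def max(values: Seq[Option[Double]]): Option[Double] =
    values.flatten.reduceOption((a, b) => math.max(a, b))
}
```

With the inputs 1.0, NULL, 3.0 the sketch gives `Some(4.0)` for `sum` and `Some(2.0)` for `avg`; an all-NULL group gives `None` for every aggregate.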


AmplabJenkins · 2014-05-07T20:47:58Z

Merged build triggered.

AmplabJenkins · 2014-05-07T20:48:05Z

Merged build started.

rxin · 2014-05-07T21:28:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala

+  override def newInstance() = new MinFunction(child, this)
+}
+
+case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction {


This is unrelated to this PR, but I just realized that the way we store the aggregation buffer in Spark SQL uses much more memory than needed, because each buffer carries two extra pointers to expr/base, which are identical for every tuple.
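
A rough sketch of the overhead described here, using hypothetical stand-in classes (`Expr`, `MinBufferWithRefs`, `SharedMinAggregator`) rather than the real Catalyst types. It assumes one buffer object per group and contrasts that with keeping expr/base once, outside the per-group state; it is not how the code gen version is implemented.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an expression tree node; not a Catalyst class.
final case class Expr(name: String)

// Layout resembling the one discussed: each per-group buffer object holds
// references to expr and base in addition to its actual running state.
final class MinBufferWithRefs(val expr: Expr, val base: Expr) {
  var current: Option[Int] = None
  def update(v: Int): Unit = {
    current = Some(current.fold(v)(c => math.min(c, v)))
  }
}

// Alternative sketch: expr and base are stored once, and only the running
// minimum is kept per group.
final class SharedMinAggregator(val expr: Expr, val base: Expr) {
  private val state = mutable.HashMap.empty[String, Int]
  def update(group: String, v: Int): Unit = {
    state.update(group, state.get(group).fold(v)(c => math.min(c, v)))
  }
  def result(group: String): Option[Int] = state.get(group)
}
```

In the first layout the two references are repeated in every buffer object; in the second they exist once per aggregator regardless of the number of groups.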

Good point, though this is not an issue in the code gen version.
On May 7, 2014 2:28 PM, "Reynold Xin" wrote:

In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala:

@@ -86,6 +86,67 @@ abstract class AggregateFunction
   override def newInstance() = makeCopy(productIterator.map { case a: AnyRef => a }.toArray)
 }
 
+case class Min(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] {
+  override def references = child.references
+  override def nullable = child.nullable
+  override def dataType = child.dataType
+  override def toString = s"MIN($child)"
+
+  override def asPartial: SplitEvaluation = {
+    val partialMin = Alias(Min(child), "PartialMin")()
+    SplitEvaluation(Min(partialMin.toAttribute), partialMin :: Nil)
+  }
+
+  override def newInstance() = new MinFunction(child, this)
+}
+
+case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction {


rxin · 2014-05-07T21:29:16Z

LGTM

AmplabJenkins · 2014-05-07T22:06:02Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-05-07T22:06:02Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14784/

marmbrus · 2014-05-08T00:13:44Z

@pwendell, this should probably go in 1.0.

rxin · 2014-05-08T05:08:31Z

Merged.

* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.

Author: Michael Armbrust

Closes #683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.

(cherry picked from commit 19c8fb0)
Signed-off-by: Reynold Xin


Improve SparkSQL Aggregates

64fe30b

* Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.

rxin reviewed May 7, 2014

asfgit closed this in 19c8fb0 May 8, 2014

marmbrus deleted the aggFixes branch June 6, 2014 05:26
