Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL] Improve SparkSQL Aggregates #683

Closed
wants to merge 1 commit into from
Closed

Conversation

marmbrus
Copy link
Contributor

@marmbrus marmbrus commented May 7, 2014

  • Add native min/max (was using hive before).
  • Handle nulls correctly in Avg and Sum.

* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

override def newInstance() = new MinFunction(child, this)
}

case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unrelated to this pr - but I just realized the way we are storing the aggregation buffer in Spark SQL uses much more memory than needed, because there are two extra pointers to expr/base, which is identical for every tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, though this is not an issue in the code gen version.
On May 7, 2014 2:28 PM, "Reynold Xin" [email protected] wrote:

In
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala:

@@ -86,6 +86,67 @@ abstract class AggregateFunction
override def newInstance() = makeCopy(productIterator.map { case a: AnyRef => a }.toArray)
}

+case class Min(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] {

  • override def references = child.references
  • override def nullable = child.nullable
  • override def dataType = child.dataType
  • override def toString = s"MIN($child)"
  • override def asPartial: SplitEvaluation = {
  • val partialMin = Alias(Min(child), "PartialMin")()
  • SplitEvaluation(Min(partialMin.toAttribute), partialMin :: Nil)
  • }
  • override def newInstance() = new MinFunction(child, this)
    +}

+case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction {

this is unrelated to this pr - but I just realized the way we are storing
the aggregation buffer in Spark SQL uses much more memory than needed,
because there are two extra pointers to expr/base, which is identical for
every tuple.


Reply to this email directly or view it on GitHubhttps://github.com//pull/683/files#r12404003
.

@rxin
Copy link
Contributor

rxin commented May 7, 2014

LGTM

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14784/

@marmbrus
Copy link
Contributor Author

marmbrus commented May 8, 2014

@pwendell, this should probably go in 1.0.

@rxin
Copy link
Contributor

rxin commented May 8, 2014

Merged.

@asfgit asfgit closed this in 19c8fb0 May 8, 2014
asfgit pushed a commit that referenced this pull request May 8, 2014
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.

Author: Michael Armbrust <[email protected]>

Closes #683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.

(cherry picked from commit 19c8fb0)
Signed-off-by: Reynold Xin <[email protected]>
@marmbrus marmbrus deleted the aggFixes branch June 6, 2014 05:26
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
* Add native min/max (was using hive before).
* Handle nulls correctly in Avg and Sum.

Author: Michael Armbrust <[email protected]>

Closes apache#683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants