[SPARK-18505][SQL] Simplify AnalyzeColumnCommand #15933

rxin · 2016-11-18T22:00:09Z

What changes were proposed in this pull request?

I'm spending more time at the design & code level for the cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I would like to address.

This is a small pull request to clean up AnalyzeColumnCommand:

1. Removed the warning on duplicated columns. Warnings in log messages are useless since most users who run SQL don't see them.
2. Removed the nested updateStats function by inlining it.
3. Renamed a few functions to better reflect what they do.
4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use an apply method that returns an instance of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
6. Added more documentation explaining some of the non-obvious return types and code blocks.
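
Item 4 can be illustrated with a minimal sketch that does not depend on Spark. The names `NamedStruct`, `ColumnStatStructOld`, and `ColumnStatFactory` are hypothetical stand-ins, not the real Catalyst classes. The point is only that `Foo(...)` reads like a constructor call, so a companion `apply` that returns a different type surprises the reader, while a named factory method states what it produces.

```scala
// Hypothetical stand-in for Catalyst's CreateNamedStruct; not the real class.
final case class NamedStruct(fields: Seq[(String, String)])

// Before: the call site `ColumnStatStructOld("price")` looks like it constructs
// a ColumnStatStructOld, but it actually returns a NamedStruct.
object ColumnStatStructOld {
  def apply(column: String): NamedStruct =
    NamedStruct(Seq(
      "distinctCount" -> s"approx_count_distinct($column)",
      "nullCount" -> s"count_nulls($column)"))
}

// After: a named factory method whose name and return type say what is built.
object ColumnStatFactory {
  def createColumnStatStruct(column: String): NamedStruct =
    NamedStruct(Seq(
      "distinctCount" -> s"approx_count_distinct($column)",
      "nullCount" -> s"count_nulls($column)"))
}
```

Both objects build the same value; only the call site changes, from `ColumnStatStructOld("price")` to `ColumnStatFactory.createColumnStatStruct("price")`.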

In follow-up pull requests, I'd like to address the following:

1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
4. Clearly document the data representation stored in the catalog for statistics.
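
To make the concern in follow-up items 3 and 4 concrete, here is one possible shape for a version-stable representation: each statistic is stored under an explicit, documented key as a plain string, rather than as an opaque binary blob. This is only a sketch under assumptions. `SimpleColumnStat`, `ColumnStatCodec`, and the `stats.<column>.<field>` key scheme are hypothetical and are not taken from this pull request or from Spark.

```scala
// Hypothetical, simplified column statistics; not Spark's ColumnStat.
final case class SimpleColumnStat(distinctCount: Long, nullCount: Long, maxLen: Long)

object ColumnStatCodec {
  // Each field is written under an explicit key as a decimal string, so the
  // stored form does not depend on any in-memory row layout.
  def toProperties(colName: String, stat: SimpleColumnStat): Map[String, String] = Map(
    s"stats.$colName.distinctCount" -> stat.distinctCount.toString,
    s"stats.$colName.nullCount" -> stat.nullCount.toString,
    s"stats.$colName.maxLen" -> stat.maxLen.toString)

  // Returns None if any expected key is missing, instead of failing.
  // Malformed numbers would still throw; a real codec would handle that too.
  def fromProperties(colName: String, props: Map[String, String]): Option[SimpleColumnStat] =
    for {
      ndv <- props.get(s"stats.$colName.distinctCount")
      nulls <- props.get(s"stats.$colName.nullCount")
      maxLen <- props.get(s"stats.$colName.maxLen")
    } yield SimpleColumnStat(ndv.toLong, nulls.toLong, maxLen.toLong)
}
```

A round trip through `toProperties` and `fromProperties` returns the original value, and the stored keys are readable without knowing anything about Spark's internal row format.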

How was this patch tested?

Affected test cases have been updated.

rxin · 2016-11-18T22:00:28Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala

@@ -97,7 +97,7 @@ private[hive] class HiveClientImpl(
  }

  // Create an internal session state for this HiveClientImpl.
-  val state = {
+  val state: SessionState = {


It was extremely confusing to work out what the return type was, given the size of the block.
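
A small Spark-free sketch of the same idea follows. `FakeSessionState` and the placeholder path are hypothetical and stand in for the real Hive `SessionState` setup, which is much longer.

```scala
object TypeAnnotationSketch {
  // Hypothetical stand-in for Hive's SessionState; not the real class.
  final class FakeSessionState(val user: String, val warehouse: String)

  // With the annotation, the type is visible on the first line, and the
  // compiler checks that the last expression of the block has that type.
  val state: FakeSessionState = {
    val user = sys.props.getOrElse("user.name", "unknown")
    val warehouse = "/tmp/warehouse" // placeholder path for this sketch
    // ... imagine many more lines of setup here ...
    new FakeSessionState(user, warehouse)
  }
}
```

Without the annotation, a reader has to scroll to the final expression of the block to learn the type of `state`.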

rxin · 2016-11-18T22:00:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala


    // Collect statistics per column.
    // The first element in the result will be the overall row count, the following elements
    // will be structs containing all column stats.
    // The layout of each struct follows the layout of the ColumnStats.
    val ndvMaxErr = sparkSession.sessionState.conf.ndvMaxError
    val expressions = Count(Literal(1)).toAggregateExpression() +:
-      attributesToAnalyze.map(ColumnStatStruct(_, ndvMaxErr))
+      attributesToAnalyze.map(AnalyzeColumnCommand.createColumnStatStruct(_, ndvMaxErr))


I also want to move type validation out of createColumnStatStruct.
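
The result layout described in the code comment above (overall row count first, then one struct per analyzed column) can be sketched without Spark as follows. `StatStruct` and `splitResult` are hypothetical names for illustration only, and columns are keyed by plain strings purely for brevity.

```scala
object ResultLayoutSketch {
  // Hypothetical stand-in for one column's statistics struct.
  final case class StatStruct(distinctCount: Long, nullCount: Long)

  // result(0) is the overall row count; result(i + 1) is the struct for columns(i).
  def splitResult(columns: Seq[String], result: Seq[Any]): (Long, Map[String, StatStruct]) = {
    require(result.length == columns.length + 1, "expected one row count plus one struct per column")
    val rowCount = result.head match {
      case n: Long => n
      case other => throw new IllegalArgumentException(s"row count is not a Long: $other")
    }
    val structs = result.tail.map {
      case s: StatStruct => s
      case other => throw new IllegalArgumentException(s"not a stats struct: $other")
    }
    (rowCount, columns.zip(structs).toMap)
  }
}
```

The positional pairing is what makes the order of `expressions` significant: prepending the count aggregate shifts every column's struct by one position in the result.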

SparkQA · 2016-11-19T00:28:11Z

Test build #68872 has finished for PR 15933 at commit 1a713fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-11-19T00:31:09Z

LGTM

rxin · 2016-11-19T00:32:59Z

Merging in master/branch-2.1.

Author: Reynold Xin
Closes #15933 from rxin/SPARK-18505.
(cherry picked from commit 6f7ff75)
Signed-off-by: Reynold Xin



asfgit closed this in 6f7ff75 Nov 19, 2016
