
[SPARK-2883][SQL] Spark Support for ORCFile with New Framework #6135

Closed
wants to merge 7 commits

Conversation

@zhzhan (Contributor) commented May 13, 2015

Major features:

  1. New data source API support.
  2. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an ORC-format file; the latter imports an ORC-format file into a Spark SQL table.
  3. Column pruning, partitioning, etc.
  4. Self-contained schema support: the ORC support is fully functional and independent of the Hive metastore. The table schema is maintained by the ORC file itself.
  5. To use the ORC support, the user needs to import org.apache.spark.sql.hive.orc._ to bring it into context.
  6. ORC files are operated on in HiveContext, solely because of a packaging issue: we don't want to bring the Hive dependency into Spark SQL. Note that ORC operations do not rely on the Hive metastore.
  7. It supports the full range of complex data types in Spark SQL, for example list, seq, and nested data types.

The current code also integrates work from @scwf, as we both worked on the same JIRA and our work was previously consolidated.
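A minimal usage sketch of the operators described above (an illustration, not code from this PR: the method names saveAsOrcFile and orcFile follow the feature list and the implicit classes this patch adds, and `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.hive.HiveContext
// Bring the ORC support (saveAsOrcFile / orcFile) into scope.
import org.apache.spark.sql.hive.orc._

val hiveContext = new HiveContext(sc) // `sc` is an existing SparkContext
import hiveContext.implicits._

// Save a DataFrame as an ORC file. The schema is stored in the file
// itself, so no Hive metastore is involved.
val people = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")
people.saveAsOrcFile("people.orc")

// Load the ORC file back; the schema is read from the file, and
// column pruning applies to queries over it.
val loaded = hiveContext.orcFile("people.orc")
loaded.registerTempTable("people")
hiveContext.sql("SELECT name FROM people WHERE age > 26").collect()
```

Note that only the import of org.apache.spark.sql.hive.orc._ is needed to enable the operators; no Hive installation is required.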

@AmplabJenkins
Merged build triggered, started, and finished. Test FAILed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32654/

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32659 has started for PR 6135 at commit a76d5b8.

@scwf (Contributor) commented May 14, 2015

Hi @zhzhan, instead of making such a big PR, I think we'd better split it into several smaller ones for easier review, as I suggested in #3753.

@zhzhan (Contributor, Author) commented May 14, 2015

@scwf Thanks for the comments. With the new framework, the production code is not that big; most of the code is for testing purposes. I should also acknowledge that this code integrates some of your work.

@SparkQA commented May 14, 2015
Test build #32659 timed out for PR 6135 at commit a76d5b8 after a configured wait of 150m.

@AmplabJenkins
Merged build finished. Test FAILed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32659/

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32674 has started for PR 6135 at commit dc1bfa1.

Review thread on:

```scala
case _: JavaHiveDecimalObjectInspector =>
  (o: Any) => HiveShim.createDecimal(o.asInstanceOf[BigDecimal].underlying())

case soi: StandardStructObjectInspector =>
```

Contributor: Maybe use SettableStructObjectInspector instead?

Author: Will reuse wrapperFor directly in the next push to remove this part of the code entirely.

Review thread on:

```scala
val orcDefaultCompressVar = "hive.exec.orc.default.compress"
var ORC_FILTER_PUSHDOWN_ENABLED = true
val SARG_PUSHDOWN = "sarg.pushdown"
val INDEX_FILTER = "hive.optimize.index.filter"
```

Contributor: This can be removed.

@liancheng (Contributor)

I haven't examined the test code in detail; the other parts generally look good. Thanks for working so hard on this! Most of my comments are about styling and code simplification. Beyond those, there are also some known missing features:

  1. Metastore table conversion
  2. Schema merging
  3. HDFS-style globbing (provided by the data sources API framework, but disabled in this PR, as commented above)

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32729 has started for PR 6135 at commit 8b885d6.

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32730 has started for PR 6135 at commit 4dbea6e.

@SparkQA commented May 14, 2015
Test build #32729 has finished for PR 6135 at commit 8b885d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class OrcContext(sqlContext: HiveContext)
    • implicit class OrcSchemaRDD(dataFrame: DataFrame)

@AmplabJenkins
Merged build finished. Test PASSed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32729/

@SparkQA commented May 14, 2015
Test build #32730 has finished for PR 6135 at commit 4dbea6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class OrcContext(sqlContext: HiveContext)
    • implicit class OrcSchemaRDD(dataFrame: DataFrame)

@AmplabJenkins
Merged build finished. Test PASSed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32730/

Review thread on:

```scala
this.path = path
taskAttemptContext = context
val orcSchema = HiveMetastoreTypes.toMetastoreType(dataSchema)
serializer = new OrcSerde
```

Contributor: It seems that we also need to call serializer.initialize(...) here?

Author: Here we initialize the ObjectInspector on a per-file basis. The other approach is to send the schema from the driver side; in that case, it may become complicated if we want to support schema merging in the future.

Author: Just saw your new code; my misunderstanding. The ObjectInspector is also specified in serialize, but doing the initialization does look more elegant.

@liancheng (Contributor)

Hey @zhzhan, I finished rebasing and updating the ORC data source. However, I just realized that I can't open a PR against this PR's branch since I rebased your code, so I have to open a new PR for the updated code.

PS: Our merge script records the account with the most commits in the PR as the primary author, so you'll still be recorded as the author of the new PR in the Git log after it gets merged.

@zhzhan (Contributor, Author) commented May 15, 2015

@liancheng Thanks for taking care of it. I much appreciate your help.

@liancheng (Contributor)

Opened #6194 for the rebased and updated version.

asfgit pushed a commit that referenced this pull request May 18, 2015
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require existing Hive installation to access ORC files.

1.  Saving/loading ORC files without contacting Hive metastore

1.  Support for complex data types (i.e. array, map, and struct)

1.  Aware of common optimizations provided by Spark SQL:

    - Column pruning
    - Partitioning pruning
    - Filter push-down

1.  Schema evolution support
1.  Hive metastore table conversion

This PR also includes initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support

(cherry picked from commit aa31e43)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request May 18, 2015
@marmbrus (Contributor)

Thanks for your work on this! Can we close this issue now that #6194 has been merged?

@zhzhan closed this May 18, 2015
@andrewor14 (Contributor)

By the way, this caused a build break in master.

```
[warn]                                           ^
[error] /Users/andrew/Documents/dev/spark/andrew-spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:174: overriding method buildScan in class HadoopFsRelation of type (requiredColumns: Array[String], filters: Array[org.apache.spark.sql.sources.Filter], inputPaths: Array[String])org.apache.spark.rdd.RDD[org.apache.spark.sql.catalyst.expressions.Row];
[error]  method buildScan cannot override final member
[error]   override def buildScan(requiredColumns: Array[String],
[error]                ^
[warn] two warnings found
[error] one error found
[warn] 12 warnings found
```

@liancheng (Contributor)

@andrewor14 Not this one. It was caused by #6194 and #6225 combined.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
7 participants