[SPARK-2883] [SQL] ORC data source for Spark SQL #6194

liancheng · 2015-05-15T17:09:40Z

This PR updates PR #6135 authored by @zhzhan from Hortonworks.

This PR implements a Spark SQL data source for accessing ORC files.

NOTE

Although ORC is now an Apache top-level project (TLP), its codebase is still tightly coupled with Hive. That is why the new ORC data source lives in the org.apache.spark.sql.hive package and must be used with HiveContext. However, it doesn't require an existing Hive installation to access ORC files.

New Features

1. Saving/loading ORC files without contacting the Hive metastore
2. Support for complex data types (i.e. array, map, and struct)
3. Awareness of common optimizations provided by Spark SQL:
   - Column pruning
   - Partition pruning
   - Filter push-down

Future Work

Schema evolution support
Hive metastore table conversion

Acknowledgements

This PR also includes initial work done by @scwf from Huawei (PR #3753).

AmplabJenkins · 2015-05-15T17:12:11Z

Merged build triggered.

AmplabJenkins · 2015-05-15T17:12:19Z

Merged build started.

SparkQA · 2015-05-15T17:13:10Z

Test build #32839 has started for PR 6194 at commit 4bc937f.

liancheng · 2015-05-15T17:25:51Z

@zhzhan Here is a rough list of my updates:

1. Rebased to PR #6150 ([SPARK-7591] [SQL] Partitioning support API tweaks), which updated the newly introduced partitioning support API

   Made corresponding changes according to the new API changes.

2. OrcFilters updates

   - I worked around the builder state inconsistency issue by employing a double-checking mechanism. Now we can convert a single child of an And filter even if the other child is inconvertible.
   - Added data type checking, as ORC doesn't accept all Spark SQL atomic data types (e.g. timestamp).

3. Added new tests

   Extracted some useful testing utility methods to SQLTestUtils, and added NewOrcQuerySuite based on ParquetQuerySuite.

4. Some regular refactoring

   Mainly styling issues and code simplifications.
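For readers who want to see the shape of the double-checking workaround mentioned under the OrcFilters updates, here is a minimal, self-contained sketch. It is not the code in this PR: `ToyBuilder`, the filter case classes, and `isSupported` are all hypothetical stand-ins for the ORC `SearchArgument` builder and Spark's filter types, and the list of supported value types is an assumption made only for illustration. The idea shown is to try each child against a throwaway builder first, and touch the real builder only once the outcome is known.

```scala
import scala.collection.mutable.ArrayBuffer

object OrcFilterSketch {
  // Toy filter tree (hypothetical; not Spark's filter classes).
  sealed trait Filter
  case class EqualTo(attribute: String, value: Any) extends Filter
  case class LessThan(attribute: String, value: Any) extends Filter
  case class And(left: Filter, right: Filter) extends Filter
  case class Or(left: Filter, right: Filter) extends Filter

  // Stateful builder standing in for the real one: once startAnd() or
  // startOr() has been called, the call cannot be rolled back.
  final class ToyBuilder {
    private val tokens = ArrayBuffer.empty[String]
    def startAnd(): ToyBuilder = { tokens += "(and"; this }
    def startOr(): ToyBuilder = { tokens += "(or"; this }
    def leaf(s: String): ToyBuilder = { tokens += s; this }
    def end(): ToyBuilder = { tokens += ")"; this }
    def build(): String = tokens.mkString(" ")
  }

  // Assumed set of value types the toy format accepts (timestamps are not).
  private def isSupported(value: Any): Boolean = value match {
    case _: Int | _: Long | _: Double | _: String | _: Boolean => true
    case _ => false
  }

  def createFilter(filter: Filter): Option[String] =
    buildSearchArgument(filter, new ToyBuilder).map(_.build())

  private def buildSearchArgument(filter: Filter, builder: ToyBuilder): Option[ToyBuilder] = {
    // Dry run: convert against a brand new builder so the real one stays clean.
    def convertible(f: Filter): Boolean =
      buildSearchArgument(f, new ToyBuilder).isDefined

    filter match {
      case And(left, right) =>
        (convertible(left), convertible(right)) match {
          case (true, true) =>
            for {
              lhs <- buildSearchArgument(left, builder.startAnd())
              rhs <- buildSearchArgument(right, lhs)
            } yield rhs.end()
          // Only one side converts: push that side down on its own.
          case (true, false) => buildSearchArgument(left, builder)
          case (false, true) => buildSearchArgument(right, builder)
          case _ => None
        }

      case Or(left, right) =>
        // A disjunction is only usable when both sides convert.
        if (convertible(left) && convertible(right)) {
          for {
            lhs <- buildSearchArgument(left, builder.startOr())
            rhs <- buildSearchArgument(right, lhs)
          } yield rhs.end()
        } else None

      case EqualTo(attribute, value) if isSupported(value) =>
        Some(builder.leaf(s"(= $attribute $value)"))
      case LessThan(attribute, value) if isSupported(value) =>
        Some(builder.leaf(s"(< $attribute $value)"))
      case _ => None
    }
  }
}
```

Dropping one side of a conjunction only widens the set of rows that pass, so it is safe for push-down provided the full filter is still evaluated on the returned rows. That reasoning does not carry over to a conjunction nested under a negation, which is why this sketch leaves negation out entirely.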

liancheng · 2015-05-15T17:28:58Z

Some TODO items related to testing:

Clean up the current ORC test suites, as most of them are based on old Parquet test code, which has been deprecated and removed.
Add more tests for filter push-down

zhzhan · 2015-05-15T17:58:44Z

@liancheng Thanks for the follow-up. Feel free to assign the future work to me.

zhzhan · 2015-05-15T18:17:09Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala

+          .orElse(tryLeft.flatMap(_ => buildSearchArgument(left, builder)))
+          .orElse(tryRight.flatMap(_ => buildSearchArgument(right, builder)))
+
+      case And(left, right) =>


Should be Or?

Oops, thanks!

SparkQA · 2015-05-15T19:07:54Z

Test build #32839 has finished for PR 6194 at commit 4bc937f.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-15T19:07:58Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-15T19:07:59Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32839/
Test FAILed.

zhzhan · 2015-05-15T19:20:35Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala

+    // children with brand new builders, and only do the actual conversion with the right builder
+    // instance when the children are proven to be convertible.
+    //
+    // P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only.


I checked with the Hive team. External users are generally expected to use the current builder approach, although Hive internally builds an XML file via ExprNodeGenericFuncDesc.

Thanks. Do you know of any other projects that use the ORC SearchArgument builder API? I'm looking for examples. I think the problem we faced should be pretty general, and I would like to see how other projects solve it.

Is the SearchArgument builder API stable/compatible across different Hive versions?

@scwf Good question. @zhzhan Would you mind helping to confirm the maturity of this API?

@scwf BTW, with the help of the newly introduced isolated classloader mechanism, Spark SQL can always depend on the most recent version of Hive. Meanwhile, users can specify an arbitrary Hive metastore version to use. So even if this API changes across Hive versions, we don't need shim code to ensure compatibility.

Got it, thanks for the explanation.

zhzhan · 2015-05-15T19:56:07Z

Jenkins, test this please.

AmplabJenkins · 2015-05-15T19:57:11Z

Merged build triggered.

AmplabJenkins · 2015-05-15T19:57:21Z

Merged build started.

SparkQA · 2015-05-15T19:57:45Z

Test build #32848 has started for PR 6194 at commit 4bc937f.

zhzhan · 2015-05-15T21:09:12Z

@liancheng FYI: Regarding schema merging, I checked with some ORC experts, and filter push-down is probably not supported if the column is not present in that specific ORC file (I haven't checked the implementation myself yet). In the meantime, separating ORC from Hive is an ongoing effort. We can separate ORC from Hive afterwards and upgrade ORC support to the latest version, which I think will improve performance a lot and remove potential version mismatches due to Hive versions.

SparkQA · 2015-05-15T21:50:41Z

Test build #32848 has finished for PR 6194 at commit 4bc937f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-15T21:50:46Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-15T21:50:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32848/
Test PASSed.

AmplabJenkins · 2015-05-16T01:57:10Z

Merged build triggered.

AmplabJenkins · 2015-05-16T01:57:18Z

Merged build started.

SparkQA · 2015-05-16T01:59:40Z

Test build #32874 has started for PR 6194 at commit 563ee1a.

liancheng · 2015-05-16T02:17:29Z

@zhzhan Thanks for the information.

AmplabJenkins · 2015-05-16T03:22:10Z

Merged build triggered.

AmplabJenkins · 2015-05-16T03:22:19Z

Merged build started.

SparkQA · 2015-05-16T03:22:40Z

Test build #32881 has started for PR 6194 at commit eda453d.

SparkQA · 2015-05-16T03:52:44Z

Test build #32874 has finished for PR 6194 at commit 563ee1a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- implicit class OrcContext(sqlContext: HiveContext)
- implicit class OrcDataFrame(dataFrame: DataFrame)

AmplabJenkins · 2015-05-16T03:52:48Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-16T03:52:49Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32874/
Test PASSed.

AmplabJenkins · 2015-05-16T14:27:18Z

Merged build started.

SparkQA · 2015-05-16T14:29:27Z

Test build #32907 has started for PR 6194 at commit d4afeed.

SparkQA · 2015-05-16T16:24:30Z

Test build #32907 has finished for PR 6194 at commit d4afeed.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sys.error(s"Failed to load class for data source: $provider")

AmplabJenkins · 2015-05-16T16:24:34Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-16T16:24:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32907/
Test PASSed.

AmplabJenkins · 2015-05-17T04:37:11Z

Merged build triggered.

AmplabJenkins · 2015-05-17T04:37:19Z

Merged build started.

SparkQA · 2015-05-17T04:39:13Z

Test build #32926 has started for PR 6194 at commit 55ecd96.

rxin · 2015-05-17T06:25:00Z

LGTM with respect to API change (there isn't any).

SparkQA · 2015-05-17T06:36:10Z

Test build #32926 has finished for PR 6194 at commit 55ecd96.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- sys.error(s"Failed to load class for data source: $provider")

AmplabJenkins · 2015-05-17T06:36:15Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-17T06:36:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32926/
Test PASSed.

liancheng · 2015-05-18T01:53:19Z

In the last few commits, I added "orc" as a built-in data source name, so that we can have

hiveContext.read.format("orc").load("hdfs://...")

and

df.write.format("orc").save("hdfs://...")

Note that the ORC data source is coupled with Hive. If users try to use it with SQLContext, an error is raised asking them to use HiveContext instead.
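As a rough illustration of how a short name such as "orc" could be resolved, and how the Hive requirement could surface as an error, here is a small self-contained sketch. It is not Spark's implementation: `DataSourceLookupSketch`, the alias table, the placeholder class names, and the `hiveSupport` flag are all hypothetical, and treating unknown names as fully qualified class names is an assumption.

```scala
object DataSourceLookupSketch {
  // Hypothetical alias table; the class names are placeholders, not Spark's.
  private val builtinSources: Map[String, String] = Map(
    "json"    -> "example.sql.json.DefaultSource",
    "parquet" -> "example.sql.parquet.DefaultSource",
    "orc"     -> "example.sql.hive.orc.DefaultSource"
  )

  // Aliases that are only usable when Hive support is available.
  private val hiveOnly: Set[String] = Set("orc")

  // Returns the class name to load, or an error message for the user.
  def lookup(provider: String, hiveSupport: Boolean): Either[String, String] =
    builtinSources.get(provider) match {
      case Some(_) if hiveOnly(provider) && !hiveSupport =>
        Left(s"The '$provider' data source requires Hive support; " +
          "please use HiveContext instead of SQLContext.")
      case Some(className) => Right(className)
      // Assumption: anything else is treated as a fully qualified class name.
      case None => Right(provider)
    }
}
```

With this sketch, `DataSourceLookupSketch.lookup("orc", hiveSupport = false)` yields a `Left` carrying the message, while the same call with `hiveSupport = true` yields the placeholder class name.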

liancheng · 2015-05-18T15:37:17Z

@marmbrus This should be ready to go.

This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files.

1. Saving/loading ORC files without contacting Hive metastore
1. Support for complex data types (i.e. array, map, and struct)
1. Aware of common optimizations provided by Spark SQL:
   - Column pruning
   - Partitioning pruning
   - Filter push-down

1. Schema evolution support
1. Hive metastore table conversion

This PR also include initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang
Author: Cheng Lian

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support

(cherry picked from commit aa31e43)
Signed-off-by: Michael Armbrust

marmbrus · 2015-05-18T19:04:20Z

Thanks guys! Merged to master and 1.4.

Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

(cherry picked from commit fcf90b7)
Signed-off-by: Andrew Or


liancheng mentioned this pull request May 15, 2015

[SPARK-2883][SQL] Spark Support for ORCFile with New Framework #6135

Closed

zhzhan reviewed May 15, 2015

Reorganizes ORC test suites

55ecd96

asfgit closed this in aa31e43 May 18, 2015

marmbrus mentioned this pull request May 18, 2015

[HOTFIX] Fix ORC build break #6244

Closed

liancheng deleted the polishing-orc branch May 19, 2015 02:24

marmbrus mentioned this pull request May 21, 2015

[SQL] Move test code into test #6335

Closed

dongjoon-hyun mentioned this pull request Aug 4, 2017

[SPARK-21422][BUILD] Depend on Apache ORC 1.4.0 #18640

Closed
