[SPARK-2883] [SQL] ORC data source for Spark SQL #6194

Closed
wants to merge 12 commits into from


liancheng
Contributor

This PR updates PR #6135 authored by @zhzhan from Hortonworks.


This PR implements a Spark SQL data source for accessing ORC files.

NOTE

Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under the org.apache.spark.sql.hive package, and must be used with HiveContext. However, it doesn't require an existing Hive installation to access ORC files.

New Features

  1. Saving/loading ORC files without contacting Hive metastore
  2. Support for complex data types (i.e. array, map, and struct)
  3. Aware of common optimizations provided by Spark SQL:
  • Column pruning
  • Partitioning pruning
  • Filter push-down

Future Work

  1. Schema evolution support
  2. Hive metastore table conversion

Acknowledgements

This PR also includes initial work done by @scwf from Huawei (PR #3753).

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 15, 2015

Test build #32839 has started for PR 6194 at commit 4bc937f.

@liancheng
Contributor Author

@zhzhan Here is a rough list of my updates:

  1. Rebased to PR #6150 ([SPARK-7591] [SQL] Partitioning support API tweaks), which updated the newly introduced partitioning support API

     Made corresponding changes according to the new API changes.

  2. OrcFilters updates

     • I worked around the builder state inconsistency issue by employing a double-checking mechanism. Now we can convert a single child of an And filter even if the other child is inconvertible.
     • Added data type checking, as ORC doesn't accept all Spark SQL atomic data types (e.g. timestamp).

  3. Added new tests

     Extracted some useful testing utility methods to SQLTestUtils, and added NewOrcQuerySuite based on ParquetQuerySuite.

  4. Some regular refactoring

     Mainly styling issues and code simplifications.
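
The double-checking idea in item 2 can be sketched roughly as below. This is a self-contained illustration, not the code in this PR: `Filter`, `EqualTo`, `LessThan`, `And`, `Or`, `ArgBuilder`, and `OrcFilterSketch` are hypothetical stand-ins for Spark SQL's filter classes and ORC's stateful SearchArgument builder, and the list of supported value types is an assumption. Each child is first converted against a throwaway builder; only when that dry run succeeds is the real builder touched, so a failed conversion never leaves it half-written.

```scala
// Hypothetical stand-ins for data source filters (illustration only).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class LessThan(attribute: String, value: Any) extends Filter
case class And(left: Filter, right: Filter) extends Filter
case class Or(left: Filter, right: Filter) extends Filter

// Hypothetical stateful builder: every call mutates it and nothing can be undone.
final class ArgBuilder {
  private val out = new StringBuilder
  def startAnd(): ArgBuilder = { out.append(" (and"); this }
  def startOr(): ArgBuilder = { out.append(" (or"); this }
  def end(): ArgBuilder = { out.append(")"); this }
  def leaf(op: String, attr: String, value: Any): ArgBuilder = {
    out.append(s" ($op $attr $value)"); this
  }
  def build(): String = out.toString.trim
}

object OrcFilterSketch {
  // Assumption: only these value types are accepted; anything else (e.g. timestamps) is rejected.
  private def isSupported(value: Any): Boolean = value match {
    case _: Int | _: Long | _: Double | _: String | _: Boolean => true
    case _ => false
  }

  // Negation is deliberately left out: dropping one side of an And beneath a Not
  // would no longer be a safe over-approximation of the original predicate.
  def convert(filter: Filter, builder: ArgBuilder): Option[ArgBuilder] = filter match {
    case And(left, right) =>
      // Dry run: try each child against a brand new builder first.
      val tryLeft = convert(left, new ArgBuilder)
      val tryRight = convert(right, new ArgBuilder)
      val conjunction = for {
        _ <- tryLeft
        _ <- tryRight
        lhs <- convert(left, builder.startAnd())
        rhs <- convert(right, lhs)
      } yield rhs.end()
      // If only one child is convertible, push that child alone.
      conjunction
        .orElse(tryLeft.flatMap(_ => convert(left, builder)))
        .orElse(tryRight.flatMap(_ => convert(right, builder)))

    case Or(left, right) =>
      // Both children must be convertible, otherwise nothing is pushed.
      for {
        _ <- convert(left, new ArgBuilder)
        _ <- convert(right, new ArgBuilder)
        lhs <- convert(left, builder.startOr())
        rhs <- convert(right, lhs)
      } yield rhs.end()

    case EqualTo(attr, value) if isSupported(value) => Some(builder.leaf("=", attr, value))
    case LessThan(attr, value) if isSupported(value) => Some(builder.leaf("<", attr, value))
    case _ => None
  }
}
```

With these stand-ins, converting `And(EqualTo("a", 1), EqualTo("ts", new java.sql.Timestamp(0L)))` against a fresh `ArgBuilder` keeps only the first child, while the same two children under `Or` convert to `None` and leave the builder untouched.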

@liancheng
Contributor Author

Some TODO items related to testing:

  • Clean up current ORC test suites, as most of them are based on old Parquet test code, which has been deprecated and removed.
  • More tests on filter push-down

@zhzhan
Contributor

zhzhan commented May 15, 2015

@liancheng Thanks for the follow-up. As for the future work, feel free to assign it to me.

.orElse(tryLeft.flatMap(_ => buildSearchArgument(left, builder)))
.orElse(tryRight.flatMap(_ => buildSearchArgument(right, builder)))

case And(left, right) =>
Contributor

Should be Or?

Contributor Author

Oops, thanks!

@SparkQA

SparkQA commented May 15, 2015

Test build #32839 has finished for PR 6194 at commit 4bc937f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32839/
Test FAILed.

// children with brand new builders, and only do the actual conversion with the right builder
// instance when the children are proven to be convertible.
//
// P.S.: Hive seems to use `SearchArgument` together with `ExprNodeGenericFuncDesc` only.
Contributor

I checked with the Hive team. External users are expected to use the current builder approach, although Hive internally builds an XML file from ExprNodeGenericFuncDesc.

Contributor Author

Thanks. Do you know whether any other projects use the ORC SearchArgument builder API? I'm looking for examples. I think the problem we faced should be pretty general, and I would like to see how other projects solve it.

Contributor

Is the SearchArgument builder API stable and compatible across different Hive versions?

Contributor Author

@scwf Good question. @zhzhan Would you mind helping confirm the maturity of this API?

Contributor Author

@scwf BTW, with the help of the newly introduced isolated classloader mechanism, Spark SQL can always depend on the most recent version of Hive. Meanwhile, users can specify an arbitrary Hive metastore version to use. So even if this API changes across Hive versions, we don't need shim code to ensure compatibility.
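
As a rough illustration of selecting a metastore version, the fragment below sets it through SparkConf. The two property names are assumed to be the ones documented for Spark 1.4, and the application name and values are examples only; check them against the documentation of the Spark release in use.

```scala
import org.apache.spark.SparkConf

// Assumption: these are the Spark 1.4 settings that choose the Hive metastore client version
// and where its jars come from. The values shown are illustrative.
val conf = new SparkConf()
  .setAppName("orc-metastore-version-sketch")        // hypothetical application name
  .set("spark.sql.hive.metastore.version", "0.12.0") // metastore version to talk to
  .set("spark.sql.hive.metastore.jars", "maven")     // how the matching client jars are obtained
```

The built-in Hive classes that Spark SQL uses for execution (including the ORC reader and writer) are unaffected by this choice, which is why no shim layer is needed.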

Contributor

Got it, thanks for the explanation.

@zhzhan
Contributor

zhzhan commented May 15, 2015

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 15, 2015

Test build #32848 has started for PR 6194 at commit 4bc937f.

@zhzhan
Contributor

zhzhan commented May 15, 2015

@liancheng FYI: For schema merging, I checked with some ORC experts, and filter push-down is probably not supported when the column is not in that specific ORC file (I haven't checked the implementation myself yet). In the meantime, separating ORC from Hive is an ongoing effort. We can separate ORC from Hive afterwards and upgrade ORC support to the latest version, which I think will improve performance a lot and remove potential version mismatches due to Hive versions.

@SparkQA

SparkQA commented May 15, 2015

Test build #32848 has finished for PR 6194 at commit 4bc937f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32848/
Test PASSed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 16, 2015

Test build #32874 has started for PR 6194 at commit 563ee1a.

@liancheng
Contributor Author

@zhzhan Thanks for the information.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 16, 2015

Test build #32881 has started for PR 6194 at commit eda453d.

@SparkQA

SparkQA commented May 16, 2015

Test build #32874 has finished for PR 6194 at commit 563ee1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class OrcContext(sqlContext: HiveContext)
    • implicit class OrcDataFrame(dataFrame: DataFrame)

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32874/
Test PASSed.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 16, 2015

Test build #32907 has started for PR 6194 at commit d4afeed.

@SparkQA

SparkQA commented May 16, 2015

Test build #32907 has finished for PR 6194 at commit d4afeed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32907/
Test PASSed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 17, 2015

Test build #32926 has started for PR 6194 at commit 55ecd96.

@rxin
Contributor

rxin commented May 17, 2015

LGTM with respect to API change (there isn't any).

@SparkQA

SparkQA commented May 17, 2015

Test build #32926 has finished for PR 6194 at commit 55ecd96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32926/
Test PASSed.

@liancheng
Contributor Author

In the last few commits, I added "orc" as a built-in data source name, so that we can have

hiveContext.read.format("orc").load("hdfs://...")

and

df.write.format("orc").save("hdfs://...")

Note that the ORC data source is coupled with Hive. If users try to use it with SQLContext, an error is thrown asking them to use HiveContext instead.
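
A slightly fuller sketch of the two calls above, assuming Spark 1.4 with the Hive module on the classpath. `OrcRoundTrip`, the column names, the sample rows, and `path` are hypothetical and only serve the illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object OrcRoundTrip {
  // `path` is a caller-supplied location (hypothetical); `sc` is an existing SparkContext.
  def run(sc: SparkContext, path: String): Long = {
    // The ORC data source has to be used through HiveContext, not a plain SQLContext.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Write a small DataFrame as ORC without going through the Hive metastore.
    val people = sc.parallelize(Seq(("alice", 30), ("bob", 17))).toDF("name", "age")
    people.write.format("orc").mode(SaveMode.Overwrite).save(path)

    // Read it back. Selecting one column and filtering on another is the kind of
    // query that column pruning and filter push-down are meant to help with.
    val adults = hiveContext.read.format("orc").load(path)
      .filter($"age" >= 18)
      .select("name")
    adults.count()
  }
}
```

Whether the filter is actually evaluated inside the ORC reader depends on the Spark configuration and on the predicate being convertible; Spark still applies the filter itself, so the result does not depend on it.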

@liancheng
Contributor Author

@marmbrus This should be ready to go.

asfgit pushed a commit that referenced this pull request May 18, 2015
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require existing Hive installation to access ORC files.

1.  Saving/loading ORC files without contacting Hive metastore

1.  Support for complex data types (i.e. array, map, and struct)

1.  Aware of common optimizations provided by Spark SQL:

    - Column pruning
    - Partitioning pruning
    - Filter push-down

1.  Schema evolution support
1.  Hive metastore table conversion

This PR also include initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang
Author: Cheng Lian

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support

(cherry picked from commit aa31e43)
Signed-off-by: Michael Armbrust
@marmbrus
Contributor

Thanks guys! Merged to master and 1.4.

@asfgit asfgit closed this in aa31e43 May 18, 2015
asfgit pushed a commit that referenced this pull request May 18, 2015
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
asfgit pushed a commit that referenced this pull request May 18, 2015
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

(cherry picked from commit fcf90b7)
Signed-off-by: Andrew Or
@liancheng liancheng deleted the polishing-orc branch May 19, 2015 02:24
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015