
[SPARK-2883][SQL] Spark Support for ORCFile with New Framework #6135

Closed
wants to merge 7 commits

Conversation

@zhzhan (Contributor) commented May 13, 2015

Major features:

  1. New data source API support.
  2. Basic operators: saveAsOrcFile and orcFile. The former saves a table as an ORC-format file; the latter imports an ORC-format file into a Spark SQL table.
  3. Column pruning, partitioning, etc.
  4. Self-contained schema support: the ORC support is fully functional and independent of the Hive metastore. The table schema is maintained by the ORC file itself.
  5. To use the ORC support, the user needs to import org.apache.spark.sql.hive.orc._ to bring it into context.
  6. ORC files are operated on in HiveContext, solely because of a packaging issue: we don't want to bring the Hive dependency into Spark SQL. Note that ORC operations do not rely on the Hive metastore.
  7. It supports the full range of complex data types in Spark SQL, for example list, seq, and nested data types.

The current code also integrates work from @scwf, as we both worked on the same JIRA and our work was previously consolidated.
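A minimal usage sketch of the operators described above (an illustration, not code from this PR: the method names saveAsOrcFile and orcFile follow the feature list and the implicit classes this patch adds, and `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.hive.HiveContext
// Bring the ORC support (saveAsOrcFile / orcFile) into scope.
import org.apache.spark.sql.hive.orc._

val hiveContext = new HiveContext(sc) // `sc` is an existing SparkContext
import hiveContext.implicits._

// Save a DataFrame as an ORC file. The schema is stored in the file
// itself, so no Hive metastore is involved.
val people = sc.parallelize(Seq(("Alice", 30), ("Bob", 25))).toDF("name", "age")
people.saveAsOrcFile("people.orc")

// Load the ORC file back; the schema is read from the file, and
// column pruning applies to queries over it.
val loaded = hiveContext.orcFile("people.orc")
loaded.registerTempTable("people")
hiveContext.sql("SELECT name FROM people WHERE age > 26").collect()
```

Note that only the import of org.apache.spark.sql.hive.orc._ is needed to enable the operators; no Hive installation is required.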

@AmplabJenkins
Merged build triggered, started, and finished. Test FAILed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32654/

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32659 has started for PR 6135 at commit a76d5b8.

@scwf (Contributor) commented May 14, 2015

Hi @zhzhan, instead of making such a big PR, I think we'd better split it into several smaller ones for easier review, as I suggested in #3753.

@zhzhan (Contributor, Author) commented May 14, 2015

@scwf Thanks for the comments. With the new framework, the production code is not that big; most of the code is for testing purposes. I should also acknowledge that this code integrates some of your work.

@SparkQA commented May 14, 2015
Test build #32659 timed out for PR 6135 at commit a76d5b8 after a configured wait of 150m.

@AmplabJenkins
Merged build finished. Test FAILed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32659/

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32674 has started for PR 6135 at commit dc1bfa1.

Review thread on:

```scala
case _: JavaHiveDecimalObjectInspector =>
  (o: Any) => HiveShim.createDecimal(o.asInstanceOf[BigDecimal].underlying())

case soi: StandardStructObjectInspector =>
```

Contributor: Maybe use SettableStructObjectInspector instead?

Author: Will reuse wrapperFor directly in the next push to remove this part of the code entirely.

Review thread on:

```scala
val orcDefaultCompressVar = "hive.exec.orc.default.compress"
var ORC_FILTER_PUSHDOWN_ENABLED = true
val SARG_PUSHDOWN = "sarg.pushdown"
val INDEX_FILTER = "hive.optimize.index.filter"
```

Contributor: This can be removed.

@liancheng (Contributor)

I haven't examined the test code in detail; the other parts generally look good. Thanks for working so hard on this! Most of my comments are about styling and code simplification. Beyond those, there are also some known missing features:

  1. Metastore table conversion
  2. Schema merging
  3. HDFS-style globbing (provided by the data sources API framework, but disabled in this PR, as commented above)

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32729 has started for PR 6135 at commit 8b885d6.

@AmplabJenkins
Merged build triggered and started.

@SparkQA commented May 14, 2015
Test build #32730 has started for PR 6135 at commit 4dbea6e.

@SparkQA commented May 14, 2015
Test build #32729 has finished for PR 6135 at commit 8b885d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class OrcContext(sqlContext: HiveContext)
    • implicit class OrcSchemaRDD(dataFrame: DataFrame)

@AmplabJenkins
Merged build finished. Test PASSed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32729/

@SparkQA commented May 14, 2015
Test build #32730 has finished for PR 6135 at commit 4dbea6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • implicit class OrcContext(sqlContext: HiveContext)
    • implicit class OrcSchemaRDD(dataFrame: DataFrame)

@AmplabJenkins
Merged build finished. Test PASSed.
Build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32730/

Review thread on:

```scala
this.path = path
taskAttemptContext = context
val orcSchema = HiveMetastoreTypes.toMetastoreType(dataSchema)
serializer = new OrcSerde
```

Contributor: It seems that we also need to call serializer.initialize(...) here?

Author: Here we initialize the ObjectInspector on a per-file basis. The other approach is to send the schema from the driver side; in that case, it may become complicated if we want to support schema merging in the future.

Author: Just saw your new code; my misunderstanding. The ObjectInspector is also specified in serialize, but doing the initialization does look more elegant.

@liancheng (Contributor)

Hey @zhzhan, I finished rebasing and updating the ORC data source. However, I just realized that I can't open a PR against this PR's branch since I rebased your code, so I have to open a new PR for the updated code.

PS: Our merge script records the account with the most commits in the PR as the primary author, so you'll still be recorded as the author of the new PR in the Git log after it gets merged.

@zhzhan (Contributor, Author) commented May 15, 2015

@liancheng Thanks for taking care of it. I much appreciate your help.

@liancheng (Contributor)

Opened #6194 for the rebased and updated version.

asfgit pushed a commit that referenced this pull request May 18, 2015
This PR updates PR #6135 authored by zhzhan from Hortonworks.

----

This PR implements a Spark SQL data source for accessing ORC files.

> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require existing Hive installation to access ORC files.

1.  Saving/loading ORC files without contacting Hive metastore

1.  Support for complex data types (i.e. array, map, and struct)

1.  Aware of common optimizations provided by Spark SQL:

    - Column pruning
    - Partitioning pruning
    - Filter push-down

1.  Schema evolution support
1.  Hive metastore table conversion

This PR also includes initial work done by scwf from Huawei (PR #3753).

Author: Zhan Zhang <[email protected]>
Author: Cheng Lian <[email protected]>

Closes #6194 from liancheng/polishing-orc and squashes the following commits:

55ecd96 [Cheng Lian] Reorganizes ORC test suites
d4afeed [Cheng Lian] Addresses comments
21ada22 [Cheng Lian] Adds @since and @Experimental annotations
128bd3b [Cheng Lian] ORC filter bug fix
d734496 [Cheng Lian] Polishes the ORC data source
2650a42 [Zhan Zhang] resolve review comments
3c9038e [Zhan Zhang] resolve review comments
7b3c7c5 [Zhan Zhang] save mode fix
f95abfd [Zhan Zhang] reuse test suite
7cc2c64 [Zhan Zhang] predicate fix
4e61c16 [Zhan Zhang] minor change
305418c [Zhan Zhang] orc data source support

(cherry picked from commit aa31e43)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request May 18, 2015
@marmbrus (Contributor)

Thanks for your work on this! Can we close this issue now that #6194 has been merged?

@zhzhan closed this May 18, 2015
@andrewor14 (Contributor)

By the way, this caused a build break in master.

```
[warn]                                           ^
[error] /Users/andrew/Documents/dev/spark/andrew-spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala:174: overriding method buildScan in class HadoopFsRelation of type (requiredColumns: Array[String], filters: Array[org.apache.spark.sql.sources.Filter], inputPaths: Array[String])org.apache.spark.rdd.RDD[org.apache.spark.sql.catalyst.expressions.Row];
[error]  method buildScan cannot override final member
[error]   override def buildScan(requiredColumns: Array[String],
[error]                ^
[warn] two warnings found
[error] one error found
[warn] 12 warnings found
```

@liancheng (Contributor)

@andrewor14 Not this one. It was caused by #6194 and #6225 combined.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
7 participants