-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2883][SQL] Spark Support for ORCFile with New Framework #6135
Conversation
Merged build triggered. |
Merged build started. |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #32659 has started for PR 6135 at commit |
@scwf Thanks for the comments. With the new framework, the production code is not so big. Most of the code are for testing purpose. Also I should acknowledge that the code also integrate some work from yours. |
Test build #32659 timed out for PR 6135 at commit |
Merged build finished. Test FAILed. |
Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #32674 has started for PR 6135 at commit |
case _: JavaHiveDecimalObjectInspector => | ||
(o: Any) => HiveShim.createDecimal(o.asInstanceOf[BigDecimal].underlying()) | ||
|
||
case soi: StandardStructObjectInspector => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use SettableStructObjectInspector
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will reuse wrapperFor directly in the next push to remove this part of code totally.
val orcDefaultCompressVar = "hive.exec.orc.default.compress" | ||
var ORC_FILTER_PUSHDOWN_ENABLED = true | ||
val SARG_PUSHDOWN = "sarg.pushdown"; | ||
val INDEX_FILTER = "hive.optimize.index.filter" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed.
Haven't examine test code in detail, the other part generally looks good. Thanks for working so hard on this! Most comments are about styling and code simplification. Other than those, there are also some known missing features:
|
Merged build triggered. |
Merged build started. |
Test build #32729 has started for PR 6135 at commit |
Merged build triggered. |
Merged build started. |
Test build #32730 has started for PR 6135 at commit |
Test build #32729 has finished for PR 6135 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
Test build #32730 has finished for PR 6135 at commit
|
Merged build finished. Test PASSed. |
Test PASSed. |
this.path = path | ||
taskAttemptContext = context | ||
val orcSchema = HiveMetastoreTypes.toMetastoreType(dataSchema) | ||
serializer = new OrcSerde |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that we also need to call serialize.initialize(...)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we initialize the ObjectInspector per file bases. The other approach is that we send the schema from the driver side. In that case, it may become complicated if we want to support schema merge in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw your new code.My misunderstanding. ObjectInspector is also specified in serialize, but doing initialization does look more elegant.
Hey @zhzhan, I finished rebasing and updating the ORC data source. However, I just realized that I can't make a PR to this PR branch since I rebased your code. So I have to open a new PR for the updated code. PS: Our merge script specifies the account with the most commits in the PR as the primary author, so you'll still be recorded as the author of the new PR in Git log after that one gets merged. |
@liancheng Thanks for taking care of it. Much appreciate your help. |
Opened #6194 for the rebased and updated version. |
This PR updates PR #6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > **NOTE** > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR #3753). Author: Zhan Zhang <[email protected]> Author: Cheng Lian <[email protected]> Closes #6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @SInCE and @experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support (cherry picked from commit aa31e43) Signed-off-by: Michael Armbrust <[email protected]>
This PR updates PR #6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > **NOTE** > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR #3753). Author: Zhan Zhang <[email protected]> Author: Cheng Lian <[email protected]> Closes #6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @SInCE and @experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support
Thanks for you work on this! Can we close this issue now that #6194 has been merged? |
By the way this caused a build break in master.
|
@andrewor14 Not this one. It was caused by #6194 and #6225 combined together. |
This PR updates PR apache#6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > **NOTE** > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR apache#3753). Author: Zhan Zhang <[email protected]> Author: Cheng Lian <[email protected]> Closes apache#6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @SInCE and @experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support
This PR updates PR apache#6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > **NOTE** > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR apache#3753). Author: Zhan Zhang <[email protected]> Author: Cheng Lian <[email protected]> Closes apache#6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @SInCE and @experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support
This PR updates PR apache#6135 authored by zhzhan from Hortonworks. ---- This PR implements a Spark SQL data source for accessing ORC files. > **NOTE** > > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive. That's why the new ORC data source is under `org.apache.spark.sql.hive` package, and must be used with `HiveContext`. However, it doesn't require existing Hive installation to access ORC files. 1. Saving/loading ORC files without contacting Hive metastore 1. Support for complex data types (i.e. array, map, and struct) 1. Aware of common optimizations provided by Spark SQL: - Column pruning - Partitioning pruning - Filter push-down 1. Schema evolution support 1. Hive metastore table conversion This PR also include initial work done by scwf from Huawei (PR apache#3753). Author: Zhan Zhang <[email protected]> Author: Cheng Lian <[email protected]> Closes apache#6194 from liancheng/polishing-orc and squashes the following commits: 55ecd96 [Cheng Lian] Reorganizes ORC test suites d4afeed [Cheng Lian] Addresses comments 21ada22 [Cheng Lian] Adds @SInCE and @experimental annotations 128bd3b [Cheng Lian] ORC filter bug fix d734496 [Cheng Lian] Polishes the ORC data source 2650a42 [Zhan Zhang] resolve review comments 3c9038e [Zhan Zhang] resolve review comments 7b3c7c5 [Zhan Zhang] save mode fix f95abfd [Zhan Zhang] reuse test suite 7cc2c64 [Zhan Zhang] predicate fix 4e61c16 [Zhan Zhang] minor change 305418c [Zhan Zhang] orc data source support
Major feature:
The current code also integrate the work from @scwf, as we both work on the same jira and the work used to be consolidated.