[SPARK-3007][SQL] Adds dynamic partitioning support #2616
Conversation
Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
QA tests have started for PR 2616 at commit
QA tests have started for PR 2616 at commit
QA tests have finished for PR 2616 at commit
Reverted the accidental trailing space change. However, since this is really dangerous, fixed it in #2619.
QA tests have finished for PR 2616 at commit
Test PASSed.
MD5 hashes of query strings in `createQueryTest` calls are used to generate golden files, so leaving trailing spaces there can be really dangerous. Got bitten by this while working on #2616: my "smart" IDE automatically removed a trailing space and made Jenkins fail. (Really should add "no trailing space" to our coding style guidelines!)

Author: Cheng Lian <[email protected]>

Closes #2619 from liancheng/kill-trailing-space and squashes the following commits:

034f119 [Cheng Lian] Kill dangerous trailing space in query string
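To illustrate why a stripped trailing space breaks golden-file lookup, here is a minimal sketch. The helper `queryFileName` is hypothetical (not the actual `createQueryTest` implementation); it only assumes, as the commit message states, that the golden file name is derived from an MD5 digest of the raw query string.

```scala
import java.security.MessageDigest

// Hypothetical helper mirroring the idea behind createQueryTest's golden
// file naming: an MD5 hex digest of the raw query string.
def queryFileName(query: String): String =
  MessageDigest.getInstance("MD5")
    .digest(query.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

val withSpace    = queryFileName("SELECT key FROM src ") // trailing space
val withoutSpace = queryFileName("SELECT key FROM src")

// A single stripped trailing space changes the digest, so the golden file
// recorded for the original query string can no longer be found.
println(withSpace == withoutSpace) // false
```

Since any one-character change flips the digest, an IDE silently trimming whitespace is enough to make the test harness look for a golden file that does not exist.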
@marmbrus Let's try to merge this one to master and see whether Jenkins accepts it.
Tried merging but it failed :( @kayousterhout what did you end up doing to merge this the first time?
Comment out the print statement in merge_pr that causes the failure.
Hmmm, still failing with:
Hi @liancheng, the master branch test failed on my machine for all dynamic partition cases. Detail log: Am I missing something? My test command is as follows:
@scwf Can you elaborate on what configuration you're using? Details like compilation flags, environment variables and the build process can be helpful. I've been tracking this failure over the last few days but couldn't reproduce it either locally or on the Jenkins PR builder.
@scwf Or could you please describe the steps to reproduce this failure from a freshly checked out master branch? I guess once you can reproduce it, it happens deterministically.
Ah, just found out that I can reproduce it with
Yes, I will use -Phive,hadoop-2.4 to see whether it has the problem
Using -Phive,hadoop-2.4 also works fine on my local machine
So this bug can be triggered by lower versions of Hadoop, e.g. 1.0.3. I haven't validated the exact range yet. Within Hive,
@scwf Thanks for all the information you provided offline :) |
Got it.
The root cause is a semantics difference in `FileSystem.globStatus()` between Hadoop versions. Test code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```

Target directory structure:

```
/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0
```

Hadoop 2.4.1 result:

```
file:/tmp/wh/dir0/dir1/level2
```

Hadoop 1.0.4 result:

```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```
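The fix the PR actually takes is to skip writing the `_SUCCESS` marker. As a purely illustrative alternative (hypothetical, not the PR's code), one could instead filter marker and hidden files out of the glob results before treating them as partition data; the paths below are the ones from the Hadoop 1.0.4 run above, plus an assumed leaked `_SUCCESS` entry.

```scala
// Illustrative only: on Hadoop 1.0.x, globStatus over /tmp/wh/*/*/* also
// returns paths shallower than three levels, so a _SUCCESS marker written
// at the output root can leak into the partition file listing and trip
// Hive.loadDynamicPartitions()'s partition spec check. A defensive sketch
// is to drop marker/hidden files (names starting with "_" or ".").
val globbed = Seq(
  "file:/tmp/wh/dir0/dir1/level2",
  "file:/tmp/wh/dir0/level1",
  "file:/tmp/wh/_SUCCESS"          // hypothetical leaked marker file
)

def isDataFile(path: String): Boolean = {
  val name = path.split('/').last
  !name.startsWith("_") && !name.startsWith(".")
}

val partitionFiles = globbed.filter(isDataFile)
partitionFiles.foreach(println)
```

Filtering on the leading `_`/`.` follows the same convention Hadoop itself uses for hidden output files; the PR prefers suppressing the marker entirely so no such post-filtering is needed.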
… versions

This is a follow-up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x). The root cause is the semantics difference of `FileSystem.globStatus()` between different versions of Hadoop, as illustrated by the following test code:

```scala
object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}
```

Target directory structure:

```
/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0
```

Hadoop 2.4.1 result:

```
file:/tmp/wh/dir0/dir1/level2
```

Hadoop 1.0.4 result:

```
file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0
```

In #2226 and #2616, we call `FileOutputCommitter.commitJob()` at the end of the job, and the `_SUCCESS` mark file is written. When working with lower Hadoop versions, due to the `globStatus()` semantics issue, `_SUCCESS` is included as a separate partition data file by `Hive.loadDynamicPartitions()` and fails partition spec checking.

The fix introduced in this PR is kind of a hack: when inserting data with dynamic partitioning, we intentionally avoid writing the `_SUCCESS` marker to work around this issue.

Hive doesn't suffer from this issue because `FileSinkOperator` doesn't call `FileOutputCommitter.commitJob()`; instead, it calls `Utilities.mvFileToFinalPath()` to clean up the output directory and then loads it into the Hive warehouse with `loadDynamicPartitions()`/`loadPartition()`/`loadTable()`. This approach is better because it handles failed jobs and speculative tasks properly. We should add this step to `InsertIntoHiveTable` in another PR.

Author: Cheng Lian <[email protected]>

Closes #2663 from liancheng/dp-hadoop-1-fix and squashes the following commits:

0177dae [Cheng Lian] Fixes dynamic partitioning support for lower Hadoop versions
PR #2226 was reverted because it broke Jenkins builds for an unknown reason. This debugging PR aims to fix the Jenkins build.

This PR also fixes two bugs:

1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake. The `FileSinkDesc` object passed to the writer container doesn't have compression related configurations. These configurations are not taken care of until `saveAsHiveFile` is called. This PR moves the compression code forward, right after instantiation of the `FileSinkDesc` object.
2. `PreInsertionCasts` doesn't take table partitions into account. In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for a partitioned table `childOutputDataTypes` never equals `tableOutputDataTypes`. This results in a funny analyzed plan like this:

Awful though this logical plan looks, it's harmless because all projects will be eliminated by the optimizer. Guess that's why this issue hasn't been caught before.
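The second bug can be sketched with plain type lists. This is a hypothetical, simplified illustration (the real code compares Catalyst `DataType`s, not strings): because `table.attributes` carries only the data columns while the child plan's output also carries the dynamic partition columns, a naive list comparison never matches for partitioned tables.

```scala
// Hypothetical sketch of the PreInsertionCasts issue: the table's attribute
// list covers only non-partition (data) columns, while the child plan's
// output also includes the dynamic partition columns, so a direct type-list
// comparison always fails for partitioned tables.
val tableDataTypes       = Seq("int", "string")   // data columns only
val partitionColumnTypes = Seq("string")          // e.g. a partition column `ds`
val childOutputDataTypes = Seq("int", "string", "string")

val buggyComparison = childOutputDataTypes == tableDataTypes
val fixedComparison = childOutputDataTypes == (tableDataTypes ++ partitionColumnTypes)

println(s"buggy: $buggyComparison, fixed: $fixedComparison") // buggy: false, fixed: true
```

Appending the partition column types before comparing (the `fixedComparison` line) is one way to make the check meaningful for partitioned tables; since the spurious casts only add no-op projections, the optimizer eliminates them either way, which is why the bug stayed hidden.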