[SPARK-3007][SQL] Fixes dynamic partitioning support for lower Hadoop versions #2663

Closed
wants to merge 1 commit

Conversation

liancheng
Contributor

This is a follow up of #2226 and #2616 to fix Jenkins master SBT build failures for lower Hadoop versions (1.0.x and 2.0.x).

The root cause is a difference in the semantics of FileSystem.globStatus() between Hadoop versions, as illustrated by the following test code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobExperiments extends App {
  val conf = new Configuration()
  val fs = FileSystem.getLocal(conf)
  // Print every path matching a three-level glob under /tmp/wh.
  fs.globStatus(new Path("/tmp/wh/*/*/*")).foreach { status =>
    println(status.getPath)
  }
}

Target directory structure:

/tmp/wh
├── dir0
│   ├── dir1
│   │   └── level2
│   └── level1
└── level0

Hadoop 2.4.1 result:

file:/tmp/wh/dir0/dir1/level2

Hadoop 1.0.4 result:

file:/tmp/wh/dir0/dir1/level2
file:/tmp/wh/dir0/level1
file:/tmp/wh/level0

In #2226 and #2616, we call FileOutputCommitter.commitJob() at the end of the job, which writes the _SUCCESS marker file. With lower Hadoop versions, because of the globStatus() semantics issue described above, Hive.loadDynamicPartitions() picks up _SUCCESS as a separate partition data file, and partition spec checking fails. The fix introduced in this PR is admittedly a hack: when inserting data with dynamic partitioning, we intentionally skip writing the _SUCCESS marker to work around this issue.
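A minimal sketch of one way to suppress the marker (an illustration of the idea under the assumption that FileOutputCommitter checks the mapreduce.fileoutputcommitter.marksuccessfuljobs flag before writing _SUCCESS, not necessarily the exact change in this PR):

import org.apache.hadoop.mapred.JobConf

// Sketch only: disable the _SUCCESS marker for dynamic-partition inserts.
// FileOutputCommitter consults this flag before writing the marker; whether
// the flag is honored depends on the Hadoop version in use.
def disableSuccessMarker(jobConf: JobConf): Unit = {
  jobConf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false)
}

With the marker gone, globStatus() on lower Hadoop versions no longer returns a spurious _SUCCESS entry, so Hive.loadDynamicPartitions() only sees real partition directories.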

Hive doesn't suffer from this issue because FileSinkOperator doesn't call FileOutputCommitter.commitJob(). Instead, it calls Utilities.mvFileToFinalPath() to clean up the output directory and then loads it into the Hive warehouse with loadDynamicPartitions()/loadPartition()/loadTable(). This approach is better because it handles failed jobs and speculative tasks properly. We should add this step to InsertIntoHiveTable in another PR.


SparkQA commented Oct 5, 2014

QA tests have started for PR 2663 at commit 0177dae.

  • This patch merges cleanly.

@liancheng
Contributor Author

@yhuai Would you mind leaving some suggestions/comments? I reached the conclusion in the PR description by combing through the insertion- and loading-related code in Hive, and I'm wondering whether there is a better approach.


SparkQA commented Oct 5, 2014

QA tests have finished for PR 2663 at commit 0177dae.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

marmbrus commented Oct 5, 2014

I'm going to merge this so we can fix Jenkins. If @yhuai has comments, they can be addressed in a follow-up.

asfgit closed this in 1b97a94 on Oct 5, 2014
Contributor

yhuai commented Oct 5, 2014

I think it is good.

Just a note: it seems Hive also sets the output committer to its NullOutputCommitter, and this is done before the MR job is submitted.
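As a rough illustration of that idea (the class below is a hypothetical stand-in, not Hive's actual NullOutputCommitter), a no-op committer can be registered on the JobConf before job submission so that no job-level output such as the _SUCCESS marker is ever produced:

import org.apache.hadoop.mapred.{JobConf, JobContext, OutputCommitter, TaskAttemptContext}

// Hypothetical no-op committer in the spirit of the one Hive installs before
// submitting its MR jobs; it commits nothing at either task or job level.
class NoOpOutputCommitter extends OutputCommitter {
  override def setupJob(context: JobContext): Unit = ()
  override def setupTask(context: TaskAttemptContext): Unit = ()
  override def needsTaskCommit(context: TaskAttemptContext): Boolean = false
  override def commitTask(context: TaskAttemptContext): Unit = ()
  override def abortTask(context: TaskAttemptContext): Unit = ()
}

// Registered on the JobConf before the job is submitted:
// jobConf.setOutputCommitter(classOf[NoOpOutputCommitter])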

@liancheng
Contributor Author

@yhuai Thanks for pointing out the NullOutputCommitter part; that's the missing piece I was looking for :)

liancheng deleted the dp-hadoop-1-fix branch on October 6, 2014 at 09:24