[SPARK-19631][CORE] OutputCommitCoordinator should not allow commits for already failed tasks #16959

pwoody · 2017-02-16T16:51:25Z

What changes were proposed in this pull request?

Previously it was possible for there to be a race between a task failure and committing the output of a task. For example, the driver may mark a task attempt as failed due to an executor heartbeat timeout (possibly due to GC), but the task attempt actually ends up coordinating with the OutputCommitCoordinator once the executor recovers and committing its result. This will lead to any retry attempt failing because the task result has already been committed despite the original attempt failing.

This ensures that any previously failed task attempts cannot enter the commit protocol.

How was this patch tested?

Added a unit test

pwoody · 2017-02-17T10:54:38Z

@JoshRosen @mccheah to review

ash211

Looks good @pwoody -- I can't imagine a case where we want the coordinator to authorize a task to commit that just came back as failed.

ash211 · 2017-02-17T14:09:40Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

@@ -48,25 +48,28 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
  private type StageId = Int
  private type PartitionId = Int
  private type TaskAttemptNumber = Int
+  private case class StageState(authorizedCommitters: Array[TaskAttemptNumber],


can you put a comment here that the index into the authorizedCommitters array is the partitionId ?

vanzin · 2017-02-22T19:31:37Z

ok to test

SparkQA · 2017-02-22T22:02:00Z

Test build #73294 has finished for PR 16959 at commit b0ac2a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ash211 · 2017-02-24T23:33:06Z

@vanzin are you right person to review this?

vanzin

Can you expand the explanation of the race in the commit message? It's not clear how the race can happen with just what you wrote. The bug adds a little bit of info but it'd be better to have it properly explained here.

vanzin · 2017-02-27T19:51:06Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

@@ -48,25 +48,28 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
  private type StageId = Int
  private type PartitionId = Int
  private type TaskAttemptNumber = Int
+  private case class StageState(authorizedCommitters: Array[TaskAttemptNumber],


Parameter indentation is incorrect. See "Indentation" section at http://spark.apache.org/contributing.html

vanzin · 2017-02-27T19:56:05Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

@@ -137,10 +141,15 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        logInfo(s"Task was denied committing, stage: $stage, partition: $partition, " +
          s"attempt: $attemptNumber")
      case otherReason =>
-        if (authorizedCommitters(partition) == attemptNumber) {
+        // Mark the attempt as failed to blacklist from future commit protocol
+        stageState.failures.get(partition) match {


better:

stageState.failures.getOrElseUpdate(partition, mutable.Set()) += attemptNumber

vanzin · 2017-02-27T19:57:00Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

        false
    }
  }
+
+  private def attemptFailed(stage: StageId,


Same about parameter indentation.

vanzin · 2017-02-27T19:58:32Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

-        authorizedCommitters(partition) match {
+    stageStates.get(stage) match {
+      case Some(state) if attemptFailed(stage, partition, attemptNumber) =>
+        logWarning(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage," +


warning seems a little strong here; maybe info.

pwoody · 2017-02-27T20:45:40Z

Thanks for the feedback @vanzin , I've updated the PR and the description

SparkQA · 2017-02-27T20:51:24Z

Test build #73532 has started for PR 16959 at commit ed3ab09.

vanzin · 2017-02-28T00:02:51Z

Looks ok to me, but let me ping some others @squito @kayousterhout

kayousterhout

I left one small comment inline, but other than that, this looks good.

@squito This commit makes me worried there are more bugs related to #16620. For example, what if a task was OK'ed to commit, but then DAGScheduler decides to ignore it because of the epoch. The DAGScheduler / TaskSetManager will attempt to re-run the task, but the output commit will never be OK'ed, which will cause the task to fail a bunch of times and the stage to get aborted. Maybe this is OK because it's unlikely a stage will both be a shuffle map stage and also save output to HDFS? Thoughts?

kayousterhout · 2017-02-28T00:22:28Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

@@ -48,25 +48,29 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
  private type StageId = Int
  private type PartitionId = Int
  private type TaskAttemptNumber = Int
+  private case class StageState(
+      authorizedCommitters: Array[TaskAttemptNumber],
+      failures: mutable.Map[PartitionId, mutable.Set[TaskAttemptNumber]])


Why not define failures as an member variable (and initialize it there with an empty map), rather than forcing the caller to pass in an empty map?

kayousterhout · 2017-02-28T01:10:03Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

@@ -111,13 +115,13 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
    val arr = new Array[TaskAttemptNumber](maxPartitionId + 1)
    java.util.Arrays.fill(arr, NO_AUTHORIZED_COMMITTER)
    synchronized {
-      authorizedCommittersByStage(stage) = arr
+      stageStates(stage) = new StageState(arr)


ah sorry now that I see this I realized it probably makes sense to initialize arr in the StageState constructor too (so this line would look like new StageState(maxPartitionId +1), and the StageState constructor just takes in numPartitions). Would you mind making that change too?

Yep sure thing, just pushed up the change.

kayousterhout

LGTM assuming tests pass. Let's wait to see if @squito has any comments here before merging.

SparkQA · 2017-02-28T03:34:06Z

Test build #73544 has finished for PR 16959 at commit 20f028a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-28T04:07:36Z

Test build #73547 has finished for PR 16959 at commit 1fcbd5d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito

lgtm

squito · 2017-02-28T20:13:24Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

+      partition: PartitionId,
+      attempt: TaskAttemptNumber): Boolean = synchronized {
+    stageStates.get(stage).exists { state =>
+      state.failures.get(partition).exists(_.contains(attempt))


minor: the one place this is called, you've already looked up the state. so you could take the StageState instead of the StageId (or even just inline the function entirely, though I think it may be a bit easier to understand as a helper method).

squito · 2017-02-28T20:37:24Z

@kayousterhout

This commit makes me worried there are more bugs related to #16620. For example, what if a task was OK'ed to commit, but then DAGScheduler decides to ignore it because of the epoch. The DAGScheduler / TaskSetManager will attempt to re-run the task, but the output commit will never be OK'ed, which will cause the task to fail a bunch of times and the stage to get aborted. Maybe this is OK because it's unlikely a stage will both be a shuffle map stage and also save output to HDFS? Thoughts?

yes, I think you are right, both about the bug, and that its pretty unlikely. It looks like SparkHadoopMapRedUtil is a public class, so a user could write to hdfs inside a map stage, but that would be pretty weird.

squito · 2017-02-28T20:52:57Z

core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala

+    outputCommitCoordinator.stageStart(stage, maxPartitionId = 1)
+    outputCommitCoordinator.taskCompleted(stage, partition, attemptNumber = failedAttempt,
+      reason = ExecutorLostFailure("0", exitCausedByApp = true, None))
+    assert(!outputCommitCoordinator.canCommit(stage, partition, failedAttempt))


I just realized there is still a potential race here -- what if canCommit happens, and then the Executor dies? we don't know if the output has been written or not. This change allows another task to commit its output. That is good if the first task hadn't ever written its output.

But what if the earlier attempt had committed? I think its OK for it to let another task commit its output over the original output. (If it didn't, then we'd be back to the original scenario, with all future tasks failing b/c they couldn't commit their output.)

If that reasoning sounds correct, I think there should also be a test case where canCommit() comes before the task failure.

But what if the earlier attempt had committed?

I think there's a really tricky race here that I don't know if Spark is even able to fix. Basically:

E1 starts to commit

E1 loses connectivity with the driver, still committing

E2 gets permission to commit from driver

both E1 and E2 are committing their output files

Depending on how those output files are generated, you may either end up with corrupt output (i.e. output files from both executors) or task failures with partial output (E2 would fail to commit, you'd have some output from E1 in the final directory, task would be retried).

I believe that both tasks in this case would have the same set of output files, so one would overwrite the other; so the real problem is if E1 "wins" the race but leaves incomplete output around.

BTW this is very unlikely to happen with filesystems that have atomic moves (or close enough to that), such as HDFS. If all attempts have the same set of outputs, you'd end up with the correct output eventually.

It might be trickier to reason about with filesystems that don't have that feature (hello S3), depending on the actual semantics of how (and if) tasks fail.

Yeah - this should be fine and is how it currently works as well.

If a failure comes in after the authorization, then the task may or may not commit before failure. The coordinator will release the authorization once it gets the failure, and the next task attempt will check and possibly delete any existing data left over while attempting its own commit.

Happy to add the test though.

sorry that was in response to @squito

@vanzin Task attempt 1 could very well remove task attempt 2s output even after the second attempt has reported success back to the driver and then get forcibly killed during its own commit.

Yes, that seems accurate. In my example above, this change wouldn't change anything, since it's solving a different race (driver would not allow E1 to commit if it thinks it's dead).

sorry if I am being really dense, but it still seems to me like in this particular scenario, we're taking broken behavior, fixing it in some cases, and making it worse in others.

suppose E1 got permission to commit, then lost connectivity to the driver (or missed heartbeats etc), but continued to try to commit. then E2 asks to commit.

Before, we might have ended up with an infinite loop, where E1 never finishes committing, and E2 never gets to commit. Similarly, all future attempts don't get to commit, but we don't even fail the task set because taskcommitdenied doesn't count towards failing a taskset, so it just keeps retrying.

After this change, E2 gets to commit immediately after E1 loses connectivity to the driver. E1 may or may not commit at any time. If E1 doesn't commit, great. If E1 does commit, then in most scenarios, things will still be fine. But sometimes, the two commits will stomp on each other.

so we've narrowed the scenarios with incorrect behavior -- but the behavior has gone from an infinite loop (bad), to jobs appearing to succeed when they have actually written corrupt data (worse, IMO).

@squito This PR does not affect anything after E1 gets permission to commit, the race you describe is definitely possible and has existed before. This change here only makes it such that if E1 fails before asking to commit, then it is blacklisted from being authorized to commit.

ok, I finally get it. I was thinking this change was doing something different. Sorry it took me a while.

That said, I realize there is another issue here. I tried to update the test to confirm that no other task would commit once there was an executor failure, like so:

test("SPARK-19631: Do not allow failed attempts to be authorized for committing") { val stage: Int = 1 val partition: Int = 1 val failedAttempt: Int = 0 outputCommitCoordinator.stageStart(stage, maxPartitionId = 1) outputCommitCoordinator.taskCompleted(stage, partition, attemptNumber = failedAttempt, reason = ExecutorLostFailure("0", exitCausedByApp = true, None)) // if we get a request to commit after we learn the executor failed, we don't authorize // the task to commit, so another attempt can commit. assert(!outputCommitCoordinator.canCommit(stage, partition, failedAttempt)) assert(outputCommitCoordinator.canCommit(stage, partition, failedAttempt + 1)) // but if we get an executor failure *after* we authorize a task to commit, we never let // another task commit. Unfortunately, we just don't know what the status is of the first task, // so we can't safely let any other task proceed. outputCommitCoordinator.taskCompleted(stage, partition, attemptNumber = failedAttempt + 1, reason = ExecutorLostFailure("0", exitCausedByApp = true, None)) assert(!outputCommitCoordinator.canCommit(stage, partition, failedAttempt + 2)) }

This test fails at the final assert, because the executor failure does clear the authorized commiter. But this PR doesn't change that at all -- an equivalent check in master would also fail.

https://issues.apache.org/jira/browse/SPARK-19790

SparkQA · 2017-03-01T00:18:15Z

Test build #73613 has finished for PR 16959 at commit 7d12b78.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2017-03-01T00:47:21Z

retest this please

SparkQA · 2017-03-01T03:29:33Z

Test build #73629 has finished for PR 16959 at commit 7d12b78.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito · 2017-03-01T17:30:32Z

lgtm, sorry for the noise

kayousterhout · 2017-03-02T05:35:22Z

core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala

+    stageStates.get(stage) match {
+      case Some(state) if attemptFailed(state, partition, attemptNumber) =>
+        logInfo(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage," +
+        s" partition=$partition as task attempt $attemptNumber has already failed.")


can you fix the indentation here? (should be +2 spaces)

SparkQA · 2017-03-02T18:14:50Z

Test build #73774 has finished for PR 16959 at commit 9e3c8fe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ash211 · 2017-03-02T21:35:55Z

Any last changes before merging?

kayousterhout · 2017-03-02T23:56:54Z

LGTM -- I merged this into master. Thanks for fixing this @pwoody!

…for already failed tasks ## What changes were proposed in this pull request? Previously it was possible for there to be a race between a task failure and committing the output of a task. For example, the driver may mark a task attempt as failed due to an executor heartbeat timeout (possibly due to GC), but the task attempt actually ends up coordinating with the OutputCommitCoordinator once the executor recovers and committing its result. This will lead to any retry attempt failing because the task result has already been committed despite the original attempt failing. This ensures that any previously failed task attempts cannot enter the commit protocol. ## How was this patch tested? Added a unit test Author: Patrick Woody <[email protected]> Closes apache#16959 from pwoody/pw/recordFailuresForCommitter.

Patrick Woody added 2 commits February 16, 2017 16:48

Record failed attempts in the OutputCommitCoordinator

ce17e02

minor style changes

b0ac2a7

ash211 mentioned this pull request Feb 17, 2017

OutputCommitCoordinator should be resilient to task failures palantir/spark#103

Closed

ash211 reviewed Feb 17, 2017

View reviewed changes

onursatici mentioned this pull request Feb 22, 2017

Os/commit denied race condition palantir/spark#94

Closed

vanzin reviewed Feb 27, 2017

View reviewed changes

PR feedback

ed3ab09

kayousterhout suggested changes Feb 28, 2017

View reviewed changes

feedback

20f028a

kayousterhout reviewed Feb 28, 2017

View reviewed changes

refactor constructor

1fcbd5d

kayousterhout reviewed Feb 28, 2017

View reviewed changes

squito approved these changes Feb 28, 2017

View reviewed changes

squito reviewed Feb 28, 2017

View reviewed changes

pass state into helper

7d12b78

kayousterhout reviewed Mar 2, 2017

View reviewed changes

style

9e3c8fe

asfgit closed this in 433d9eb Mar 2, 2017

[SPARK-19631][CORE] OutputCommitCoordinator should not allow commits for already failed tasks #16959

[SPARK-19631][CORE] OutputCommitCoordinator should not allow commits for already failed tasks #16959

Conversation

pwoody commented Feb 16, 2017 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

pwoody commented Feb 17, 2017

ash211 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vanzin commented Feb 22, 2017

SparkQA commented Feb 22, 2017

ash211 commented Feb 24, 2017

vanzin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pwoody commented Feb 27, 2017 • edited Loading

SparkQA commented Feb 27, 2017

vanzin commented Feb 28, 2017

kayousterhout left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kayousterhout left a comment

Choose a reason for hiding this comment

SparkQA commented Feb 28, 2017

SparkQA commented Feb 28, 2017

squito left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

squito commented Feb 28, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Mar 1, 2017

vanzin commented Mar 1, 2017

SparkQA commented Mar 1, 2017

squito commented Mar 1, 2017

Choose a reason for hiding this comment

SparkQA commented Mar 2, 2017

ash211 commented Mar 2, 2017

kayousterhout commented Mar 2, 2017 • edited Loading

pwoody commented Feb 16, 2017 •

edited

Loading

pwoody commented Feb 27, 2017 •

edited

Loading

kayousterhout commented Mar 2, 2017 •

edited

Loading