Os/commit denied race condition #94
Conversation
This seems pretty reasonable to handle the timeouts portion.
I think we need the two-phase approach though, so we're more resilient to the problems I laid out with speculation. One downside of that, though, is that it adds an extra round-trip on every commit.
@@ -49,6 +52,8 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
  private type TaskAttemptNumber = Int

  private val NO_AUTHORIZED_COMMITTER: TaskAttemptNumber = -1
  // TODO: get below from config?
  private val MAX_WAIT_FOR_COMMIT = 120000L
Yes, this should come from config, and it should specify what units this number is in.
logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
  s"partition=$partition")
authorizedCommitters(partition) = attemptNumber
authorizedCommitters(partition) = CommitState(
  attemptNumber, System.currentTimeMillis())
nit: this should fit in one line
case CommitState(existingCommitter, startTime)
    if System.currentTimeMillis() - startTime > MAX_WAIT_FOR_COMMIT =>
  logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
    s"partition=$partition; maxWaitTime reached for attempId=$existingCommitter")
Include the maxWaitTime config setting and how much the elapsed time has exceeded that threshold.
This should be warn level, since a lock expired on a committer -- mention something about the prior lock being expired too.
typo: attempId -> attemptId
private case object StopCoordinator extends OutputCommitCoordinationMessage
private case class AskPermissionToCommitOutput(stage: Int, partition: Int, attemptNumber: Int)

private case class CommitState(attempt: Int, time: Long)
should have a third entry here for whether the committer returned back that it completed the commit. This lets us distinguish between a repeated request on the same partition that should be authorized (because the prior one timed out) vs shouldn't be authorized (because the prior one completed successfully).
This two-phase commit would defend against issues with speculation (and unintentional speculation due to network partition) where: attempt 1 starts, attempt 2 starts, attempt 1 starts to commit, attempt 1 commits, **attempt 1 reports that it committed**, attempt 2 starts to commit.
The bold step is required so that in the "attempt 2 starts to commit" step the OCC is able to deny the commit attempt.
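The two-phase bookkeeping described above could be sketched roughly like this. This is a hypothetical, simplified model of the coordinator's decision logic, not the actual patch; the names `TwoPhaseSketch`, `askPermission`, and `informDone` are illustrative.

```scala
// Hypothetical sketch of two-phase commit tracking. A CommitState records
// which attempt holds the lock, when it took it, and whether it finished.
object TwoPhaseSketch {
  sealed trait CommitStatus
  case object Committing extends CommitStatus
  case object Committed extends CommitStatus

  case class CommitState(attempt: Int, startTime: Long, status: CommitStatus)

  // 120 seconds expressed in nanoseconds.
  val maxWaitNanos: Long = 120000L * 1000000L

  // Phase 1: a task asks for permission to commit a partition.
  def askPermission(
      states: scala.collection.mutable.Map[Int, CommitState],
      partition: Int,
      attempt: Int,
      now: Long): Boolean = states.get(partition) match {
    case None =>
      states(partition) = CommitState(attempt, now, Committing)
      true // no one holds the lock
    case Some(CommitState(_, _, Committed)) =>
      false // a prior attempt already committed; always deny
    case Some(CommitState(_, start, Committing)) if now - start > maxWaitNanos =>
      states(partition) = CommitState(attempt, now, Committing)
      true // prior lock expired; hand the lock to the new attempt
    case Some(_) =>
      false // another attempt holds an unexpired lock
  }

  // Phase 2: the authorized task reports back that its commit finished.
  def informDone(
      states: scala.collection.mutable.Map[Int, CommitState],
      partition: Int,
      attempt: Int): Unit = states.get(partition) match {
    case Some(CommitState(a, start, Committing)) if a == attempt =>
      states(partition) = CommitState(a, start, Committed)
    case _ => // ignore reports from attempts that don't hold the lock
  }
}
```

The `Committed` case is what distinguishes a repeated request that should be re-authorized (prior lock timed out) from one that must be denied (prior attempt completed), which is the scenario in the timeline above.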
s"attemptNumber=$attemptNumber to commit for stage=$stage, partition=$partition; " +
  s"existingCommitter = $existingCommitter. This can indicate dropped network traffic.")
case CommitState(existingCommitter, startTime)
    if System.currentTimeMillis() - startTime > MAX_WAIT_FOR_COMMIT =>
Need to think through what happens when currentTimeMillis() goes backwards, like it does every few years for leap seconds. Does this code handle that ok?
  true
case existingCommitter =>
case CommitState(existingCommitter, _) =>
add another case in here so we can get better logging for commit attempts where another attempt has the lock:
case CommitState(existingCommitter, startTime) =>
logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
s"partition=$partition; existingCommitter = $existingCommitter with startTime=$startTime and currentTime=${System.currentTimeMillis()}")
false
I chatted with @onursatici offline and he's planning to address these comments on Monday
I think we should use nanoTime instead of currentTimeMillis
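The point behind this suggestion can be sketched as follows. `System.currentTimeMillis` is wall-clock time and can jump backwards under NTP or leap-second adjustment, while `System.nanoTime` is monotonic within a single JVM, which matches the driver-only coordinator. The helper name below is illustrative, not from the patch.

```scala
// Sketch: measuring elapsed time with nanoTime. Because nanoTime is
// monotonic, the elapsed difference can never go negative due to a clock
// adjustment, so a lock can't be held longer than intended.
object MonotonicClockSketch {
  def hasLockExpired(startNanos: Long, maxWaitNanos: Long): Boolean =
    System.nanoTime() - startNanos > maxWaitNanos
}
```

Note that nanoTime values are only meaningful as differences within one process, which is fine here since the OutputCommitCoordinator runs only on the driver.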
s"attemptNumber=$attemptNumber to commit for stage=$stage, partition=$partition; " +
  s"existingCommitter = $existingCommitter. This can indicate dropped network traffic.")
case CommitState(existingCommitter, _, Committed) =>
  logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
Can you make it warn, or info? Seems like useful information if you're analyzing failures.
@@ -25,8 +25,17 @@ import org.apache.spark.rpc.{RpcCallContext, RpcEndpoint, RpcEndpointRef, RpcEnv

private sealed trait OutputCommitCoordinationMessage extends Serializable

unnecessary whitespace
  s"partition=$partition; it is already committed")
  false
case CommitState(existingCommitter, startTime, Committing)
    if System.currentTimeMillis() - startTime > MAX_WAIT_FOR_COMMIT =>
This can lead to unexpected issues. System.currentTimeMillis isn't guaranteed to be consistent across calls, and in case of an NTP adjustment this can cause unnecessary delays in releasing the lock here. Since this is only ever called from one machine for the lifetime of a job, this should be System.nanoTime everywhere to guarantee monotonically increasing results.
Spoke offline. It seems everything in Spark uses System.currentTimeMillis, so we can leave this as is. Worst case scenario we hold the lock for longer than the setting.
Upstream added a (partial) fix for this and some tests at the PR linked from https://issues.apache.org/jira/browse/SPARK-18113. Maybe we can use that as a base for adding tests here?
Looks good @onursatici ! I put a bunch of nitpicky things here but I think this is the direction we want to be going.
Can you also look at the PR linked from https://issues.apache.org/jira/browse/SPARK-18113 and the test that it added? I'm hoping we can use those tests to make sure that this is working properly as well.
// Timeout to release the lock on a task in milliseconds, defaults to 120 seconds
private val MAX_WAIT_FOR_COMMIT = conf.getLong(
  "spark.scheduler.outputCommitCoordinator.maxWaitTime", 120000L
) * 1e6.toLong
Can you please name this in a way where the unit is in the name, for both the config value and the variable? So MAX_WAIT_FOR_COMMIT_NANOS and spark.scheduler.outputCommitCoordinator.maxWaitTimeMillis.
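The unit-suffixed naming asked for here might look like the sketch below. SparkConf is stubbed with a plain Map so the sketch is self-contained; in the real patch this would be `conf.getLong(...)`, and the config key is the reviewer's proposal, not necessarily the final one.

```scala
// Sketch: config is specified in millis, the variable stores nanos, and
// both names carry their unit so the conversion can't be misread.
class OutputCommitCoordinatorSketch(conf: Map[String, Long]) {
  val maxWaitForCommitNanos: Long = conf.getOrElse(
    "spark.scheduler.outputCommitCoordinator.maxWaitTimeMillis", 120000L) * 1000000L
}
```

With the default of 120000 ms, `maxWaitForCommitNanos` comes out to 120 seconds in nanoseconds, and both the reader and the `System.nanoTime()` comparison see the same unit.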
 * lost), then a subsequent task attempt may be authorized to commit its output.
 * If a task attempt has been authorized to commit, then all other attempts to commit
 * the same task within spark.scheduler.outputCommitCoordinator.maxWaitTime
 * will be denied. If the authorized task attempt fails (e.g. due to its executor being lost),
or preemption
@@ -97,6 +112,31 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
  }

  /**
   * Called by tasks to inform their commit is done.
to inform the OutputCommitCoordinator
private case class InformCommitDone(stage: Int, partition: Int, attemptNumber: Int)

object CommitStatus extends Enumeration {
  val NotCommitted, Committing, Committed = Value
Change to more distinct names, say Uncommitted, MidCommit, Committed.
  s"partition=$partition; it is already committed")
  false
case CommitState(existingCommitter, startTime, Committing)
    if System.nanoTime() - startTime > MAX_WAIT_FOR_COMMIT =>
Can you tab this in? I think `./dev/scala-style` will flag you on this, and it's unclear that the `if` goes with the `case` and not the following statements.
logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
  s"partition=$partition")
authorizedCommitters(partition) = attemptNumber
authorizedCommitters(partition) = CommitState(
make this one line?
Can't do, that exceeds 100 characters.
  s"partition=$partition; maxWaitTime=$MAX_WAIT_FOR_COMMIT " +
  s"reached and prior lock released for attemptId=$existingCommitter")
authorizedCommitters(partition) = CommitState(
  attemptNumber, System.nanoTime(), Committing
make this one line?
logDebug(s"Marking attemptNumber=$attemptNumber for stage=$stage, " +
  s"partition=$partition as committed")
authorizedCommitters(partition) = CommitState(
  attemptNumber, startTime, Committed
make this one line?
)
true
case CommitState(committer, startTime, status) =>
  logWarning(s"Bad state on attemptNumber=$attemptNumber for stage=$stage, " +
Can you break this out to separately handle when a commit attempt happens when the partition has already been committed vs a commit attempt by an unauthorized committer vs when the partition is not yet in MidCommit state?
This has been tricky to debug so if we see it again I want to get really nice and verbose logs
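The split asked for here might look like the sketch below: one distinct, verbose message per denial reason instead of a single catch-all. The names mirror the diff under review, but this is an illustrative sketch, not the actual patch.

```scala
// Hypothetical breakdown of the catch-all "bad state" case into three
// separately logged denial reasons, to make failures easier to debug.
object DenialSketch {
  sealed trait Status
  case object Uncommitted extends Status
  case object MidCommit extends Status
  case object Committed extends Status
  case class CommitState(attempt: Int, startTime: Long, status: Status)

  def denialReason(state: CommitState, attemptNumber: Int): String = state match {
    case CommitState(_, _, Committed) =>
      s"denying attemptNumber=$attemptNumber: partition already committed"
    case CommitState(holder, start, MidCommit) =>
      s"denying attemptNumber=$attemptNumber: attemptNumber=$holder holds an " +
        s"unexpired lock taken at startTime=$start"
    case CommitState(holder, _, Uncommitted) =>
      s"denying attemptNumber=$attemptNumber: partition not yet in MidCommit " +
        s"state (recorded committer=$holder)"
  }
}
```

In the real coordinator each branch would call `logWarning` directly; returning the string here just keeps the sketch testable.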
  true
case existingCommitter =>
case CommitState(existingCommitter, startTime, _) =>
  logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
can you put a reason for why this was denied here?
@onursatici is this something you guys can work through this week?
@ash211 can we merge this?
We hit a TaskCommitDenied exception once after having this PR on our stack yesterday; this is less than what we would have otherwise, but still I want to investigate a bit next week.
It turns out the exception we got was unrelated to this PR, and a fix for it is here: apache#16959
Hey @onursatici is anything blocking this? Would like to see this fixed if possible.
We have a correct fix already merged upstream. Will have a release within a week.
…wo plans are the same

### What changes were proposed in this pull request?

This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers:

Before
```scala
AdaptiveSparkPlan isFinalPlan=false
+- == Current Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=palantir#94]
   ...
+- == Initial Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=palantir#94]
   ...
```

After
```scala
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=palantir#94]
   ...
```

For SQL `EXPLAIN` output:

Before
```scala
AdaptiveSparkPlan (8)
+- == Current Plan ==
   Sort (7)
   +- Exchange (6)
   ...
+- == Initial Plan ==
   Sort (7)
   +- Exchange (6)
   ...
```

After
```scala
AdaptiveSparkPlan (8)
+- Sort (7)
   +- Exchange (6)
   ...
```

### Why are the changes needed?

To simplify the AQE plan string by removing the redundant plan information.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Modified the existing unit test.

Closes apache#29915 from allisonwang-db/aqe-explain.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Xiao Li <[email protected]>
To prevent the race condition when the executor gets preempted after being authorized to commit.