
[SPARK-19263] DAGScheduler should avoid sending conflicting task set. #16620

Closed
jinxing64 wants to merge 20 commits into master from the SPARK-19263 branch

Conversation


@jinxing64 jinxing64 commented Jan 17, 2017

What changes were proposed in this pull request?

In the current DAGScheduler handleTaskCompletion code, when event.reason is Success, it first does stage.pendingPartitions -= task.partitionId, which may be a bug when FetchFailed happens.

Consider the scenario below:

  1. Stage 0 runs and generates shuffle output data.
  2. Stage 1 reads the output from stage 0 and generates more shuffle data. It has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are launched on executorA.
  3. ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to the driver. The driver marks executorA as lost and updates failedEpoch;
  4. The driver resubmits stage 0 so the missing output can be re-generated, and then once it completes, resubmits stage 1 with ShuffleMapTask1x and ShuffleMapTask2x.
  5. ShuffleMapTask2 (from the original attempt of stage 1) successfully finishes on executorA and sends Success back to driver. This causes DAGScheduler::handleTaskCompletion to remove partition 2 from stage.pendingPartitions (line 1149), but it does not add the partition to the set of output locations (line 1192), because the task’s epoch is less than the failure epoch for the executor (because of the earlier failure on executor A); see the sketch after this list.
  6. ShuffleMapTask1x successfully finishes on executorB, causing the driver to remove partition 1 from stage.pendingPartitions. Combined with the previous step, this means that there are no more pending partitions for the stage, so the DAGScheduler marks the stage as finished (line 1196). However, the shuffle stage is not available (line 1215) because the completion for ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler resubmits the stage.
  7. ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks is called for the re-submitted stage, it throws an error, because there’s an existing active task set
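
For readers following along, here is a small, self-contained toy model of the bookkeeping in steps 5 and 6 (illustrative only: the names mirror the real DAGScheduler fields such as failedEpoch and pendingPartitions, but none of this is Spark's actual code):

```scala
import scala.collection.mutable

// Toy model: pendingPartitions is always cleared on Success, but the map output is
// only registered when the task's epoch is newer than the executor's failure epoch.
object PendingVsOutputLocs {
  val failedEpoch = mutable.Map[String, Long]()   // executorId -> epoch of last known failure
  val pendingPartitions = mutable.Set(0, 1)       // the stage's two partitions
  val outputLocs = mutable.Map[Int, String]()     // partition -> executor holding its output

  def handleSuccess(partition: Int, execId: String, taskEpoch: Long): Unit = {
    pendingPartitions -= partition                // mirrors stage.pendingPartitions -= task.partitionId
    if (failedEpoch.get(execId).exists(taskEpoch <= _)) {
      println(s"ignoring possibly bogus completion of partition $partition from $execId")
    } else {
      outputLocs(partition) = execId              // only a trusted completion registers output
    }
  }

  def main(args: Array[String]): Unit = {
    failedEpoch("executorA") = 1                                       // step 3: executorA marked failed
    handleSuccess(partition = 1, execId = "executorA", taskEpoch = 0)  // step 5: late Success from the old attempt
    handleSuccess(partition = 0, execId = "executorB", taskEpoch = 2)  // step 6: Success from the resubmitted attempt
    // pendingPartitions is now empty, yet one output location is still missing, so the
    // stage is marked finished but not available and gets resubmitted (step 7's conflict).
    println(s"pending = $pendingPartitions, available = ${outputLocs.keySet == Set(0, 1)}")
  }
}
```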

In this fix

Add a check for an already-active (non-zombie) TaskSetManager for the stage before resubmission.
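
A hedged reconstruction of the shape of that check, assembled from the diff excerpts quoted in the review comments below (not the exact patch; note also that the fix that was finally merged took a different approach, described in the closing commit message at the end of this thread):

```scala
// Inside DAGScheduler, before resubmitting the shuffle stage (sketch only):
val activeTaskSetManagerExists =
  taskScheduler.rootPool != null &&
    taskScheduler.rootPool.getSortedTaskSetQueue.exists { tsm =>
      tsm.stageId == stageId && !tsm.isZombie
    }

if (!activeTaskSetManagerExists) {
  // No live (non-zombie) TaskSetManager is running tasks for this stage, so
  // resubmitting cannot create a conflicting task set.
  submitStage(shuffleStage)
}
```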

How was this patch tested?

Added a unit test in DAGSchedulerSuite.

@markhamstra
Contributor

Thanks for the work thus far, @jinxing64, but this really needs updated test coverage before we can consider merging it.

@squito

@markhamstra
Contributor

ok to test

@markhamstra
Contributor

Beyond the lack of new tests, this patch is causing a couple of existing DAGSchedulerSuite tests to fail for me locally: "run trivial shuffle with out-of-band failure and retry" and "map stage submission with executor failure late map task completions"

@squito
Contributor

squito commented Jan 17, 2017

Thanks for pointing out this issue, and the nice description. Still looking but sounds like a legitimate issue. I think SchedulerIntegrationSuite should be able to replicate the exact scenario you have outlined for adding a test case. @jinxing64 can you look at adding a test case that way? I can try to help there as well, but will take me a few days to get to it.

@markhamstra
Contributor

Jenkins, test this please

@SparkQA

SparkQA commented Jan 18, 2017

Test build #3540 has finished for PR 16620 at commit 9e4aab2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@squito
SchedulerIntegrationSuite is very helpful, and I can now reproduce this issue in it.
Fixing this issue is more complicated than I thought; I'll make some changes and add a unit test.

@jinxing64 jinxing64 force-pushed the SPARK-19263 branch 3 times, most recently from d3b6ebb to b20d316 on January 20, 2017 07:53
@jinxing64 jinxing64 changed the title from "[SPARK-19263] DAGScheduler should handle stage's pendingPartitions properly in handleTaskCompletion." to "[SPARK-19263] DAGScheduler should avoid sending conflicting task set." on Jan 20, 2017
@jinxing64
Author

@markhamstra @squito
Thanks a lot for your helpful comments.
I added a unit test for this fix and updated the patch. It now passes all unit tests for me locally.
In this fix: add a check for an already-active (non-zombie) TaskSetManager before resubmission:

  taskScheduler.rootPool.getSortedTaskSetQueue.exists { tsm =>
    tsm.stageId == stageId && !tsm.isZombie
  }
} else false
Contributor

The if...else is unnecessary:

val activeTaskSetManagerExist =
  taskScheduler.rootPool != null &&
  taskScheduler.rootPool.getSortedTaskSetQueue.exists { tsm =>
    tsm.stageId == stageId && !tsm.isZombie
  }

@@ -1193,7 +1193,15 @@ class DAGScheduler(
}

if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
  markStageAsFinished(shuffleStage)
  val activeTaskSetManagerExist =
Contributor

nit: should be activeTaskSetManagerExists

Contributor

And since it is being used as !activeTaskSetManagerExists, you could reverse the sense, avoid needing the !, and call it something like noActiveTaskSetManager.

@jinxing64
Author

@markhamstra
Thanks a lot for your comment. I've refined it; please take another look ~

@squito
Contributor

squito commented Jan 23, 2017

Jenkins, ok to test

@SparkQA

SparkQA commented Jan 23, 2017

Test build #71874 has finished for PR 16620 at commit be8bfe5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@squito squito left a comment

Thanks @jinxing64 for working on this. I'm sorry that at the moment my comments are mostly critical, without providing very constructive advice. I'll keep thinking about this but I thought I'd share my feedback now.

This is a really important fix, and the work you are doing on it is great -- but it's also tricky enough that I want to make sure we put in a change that improves the clarity of the code and that we feel confident in.

@@ -1218,7 +1225,9 @@ class DAGScheduler(
logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
") because some of its tasks had failed: " +
shuffleStage.findMissingPartitions().mkString(", "))
submitStage(shuffleStage)
if (noActiveTaskSetManager) {
Contributor

Shouldn't this condition go into the surrounding if (!shuffleStage.isAvailable)? The logInfo is very confusing in this case otherwise.
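
In other words, something like the following restructuring (a sketch of the reviewer's suggestion, not code from the patch):

```scala
if (!shuffleStage.isAvailable) {
  // Only log and resubmit when no non-zombie TaskSetManager is still running tasks
  // for this stage; otherwise the "Resubmitting ..." message would be misleading.
  if (noActiveTaskSetManager) {
    logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
      ") because some of its tasks had failed: " +
      shuffleStage.findMissingPartitions().mkString(", "))
    submitStage(shuffleStage)
  }
}
```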

case (1, 1) =>
  // Wait long enough until Success of task(stageAttempt=1 and partition=0)
  // is handled by DAGScheduler.
  Thread.sleep(5000)
Contributor

Hmm, this is a nuisance. I don't see any good way to get rid of this sleep... but now that I think about it, why can't you do this in DAGSchedulerSuite? It seems like this can be entirely contained to the DAGScheduler and doesn't require tricky interactions with other parts of the scheduler. (I'm sorry I pointed you in the wrong direction earlier -- I thought perhaps you had tried to copy the examples of DAGSchedulerSuite but there was some reason you couldn't.)

assert(results === (0 until 2).map { _ -> 10}.toMap)
}

def waitUntilConditionBecomeTrue(condition: => Boolean, timeout: Long, msg: String): Unit = {
Contributor

nit: rename to waitForCondition (maybe irrelevant given other comments)

}
if (shuffleStage.isAvailable || noActiveTaskSetManager) {
  markStageAsFinished(shuffleStage)
}
Contributor

I have to admit, though this passes all the tests, this is really confusing to me. I only somewhat understand why your original version didn't work, and why this should be used instead. Perhaps some more commenting here would help? The condition under which you do markStageAsFinished seems very broad, so perhaps it's worth a comment on the case when you do not (and perhaps even a logInfo in an else branch). The discrepancy between pendingPartitions and availableOutputs is also surprising -- perhaps that is worth extra comments on Stage, on how the meanings of those two differ.

@jinxing64 jinxing64 force-pushed the SPARK-19263 branch 2 times, most recently from 217aa44 to 3f0ebb8 on January 25, 2017 08:51
@jinxing64
Author

jinxing64 commented Jan 25, 2017

@squito
Thanks a lot for your comments, they are very helpful. I've already refined the code, please take another look : )

In the current ShuffleMapStage, pendingPartitions.size() == 0 doesn't mean the stage is available, because the succeeded task can be bogus and out of date: the task's epoch may be older than the corresponding executor's failed epoch in the DAGScheduler.

When handling Success of a ShuffleMapTask, what I want to do is check whether there are still tasks running for the same stage; if so, do not resubmit when pendingPartitions.isEmpty && !stage.isAvailable. There are two benefits to this:

  1. Success of the running tasks has a chance to update the map status of the ShuffleMapStage and turn it available;
  2. Avoid submitting a conflicting taskSet.

@jinxing64
Author

jinxing64 commented Jan 25, 2017

> Hmm, this is a nuisance. I don't see any good way to get rid of this sleep... but now that I think about it, why can't you do this in DAGSchedulerSuite? It seems like this can be entirely contained to the DAGScheduler and doesn't require tricky interactions with other parts of the scheduler. (I'm sorry I pointed you in the wrong direction earlier -- I thought perhaps you had tried to copy the examples of DAGSchedulerSuite but there was some reason you couldn't.)

@squito
DAGSchedulerSuite is quite hard for me, because this bug happens during the interaction between DAGScheduler and TaskSchedulerImpl; the conflicting-task-set exception is actually thrown in TaskSchedulerImpl when submitTasks is called from the DAGScheduler. DAGSchedulerSuite only provides a very simple TaskScheduler; of course I can check for the conflict in it, but I don't think it is convincing enough.

I don't like the Thread.sleep(5000) either, but I didn't find a better way. I added TestDAGScheduler in SchedulerIntegrationSuite, just like TestTaskScheduler, for tracking more state; perhaps it can also be used in the future. If that is not preferred, I'm sorry.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71979 has finished for PR 16620 at commit 3f0ebb8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

Failed to pass unit tests. I will keep working on this.

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72023 has finished for PR 16620 at commit be7e701.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"in stage ${taskSet.id} (TID ${attemptInfo.taskId}) on ${attemptInfo.host} " +
s"as the attempt ${info.attemptNumber} succeeded on ${info.host}")
sched.backend.killTask(attemptInfo.taskId, attemptInfo.executorId, true)
}
Author

@jinxing64 jinxing64 Jan 26, 2017

The Success is handled by the DAGScheduler in a different thread. The DAGScheduler perhaps needs to check the TaskSetManager's status, e.g. isZombie. Moving the code here makes it safe for the DAGScheduler to check the TaskSetManager's status when handling Success.

Contributor

Could this be moved before maybeFinishTaskSet(), if you only need isZombie=true? For performance it's helpful to hand off to the DAGScheduler thread as soon as we can. Probably not a huge impact, but we should try to avoid impacting performance where possible.

Author

@jinxing64 jinxing64 Jan 30, 2017

@squito
Yes, it makes sense to move this part before maybeFinishTaskSet(); I will refine it.

@SparkQA

SparkQA commented Jan 26, 2017

Test build #72028 has finished for PR 16620 at commit de19333.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@squito
Could you please take another look at this? : )

@jinxing64
Author

@squito
ping for review~~

Contributor

@kayousterhout kayousterhout left a comment

This looks great -- I just left some nits on improving test commenting.

@@ -2161,6 +2161,58 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
}
}

test("[SPARK-19263] DAGScheduler should not submit multiple active tasksets," +
" even with late completions from earlier stage attempts") {
Contributor

nit: indent two more spaces

taskSets(1).tasks(0),
FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0,
"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"),
null))
Contributor

can you pass null as a named parameter here? ("parameterName = null")

(Success, makeMapStatus("hostB", 2)),
(Success, makeMapStatus("hostB", 2))))

// Task succeeds on a failed executor. The success is bogus.
Contributor

can you change the 2nd sentence to "The success should be ignored because the task started before the executor failed, so the output may have been lost."

runEvent(makeCompletionEvent(
taskSets(1).tasks(1), Success, makeMapStatus("hostA", 2)))

assert(taskSets(3).stageId === 1 && taskSets(2).stageAttemptId === 1)
Contributor

should the second part have taskSets(3) instead of taskSets(2)?


submit(rddC, Array(0, 1))

assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
Contributor

can you comment this like I suggested in your other PR?

runEvent(makeCompletionEvent(
taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))

// There should be no new attempt of stage submitted.
Contributor

can you add "because task 1 is still running in the current attempt (and hasn't completed successfully in any earlier attempts)."

taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))

// There should be no new attempt of stage submitted.
assert(taskSets.size === 4)
Contributor

is this the line that would fail without your change? (just verifying my understanding)

Author

Yes, I think so : )

runEvent(makeCompletionEvent(
taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))

// ResultStage submitted.
Contributor

"Now the ResultStage should be submitted, because all of the tasks to generate rddB have completed successfully on alive executors."

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72849 has finished for PR 16620 at commit ab8d13e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jinxing64
Author

@kayousterhout
I've refined accordingly, please take another look : )

Contributor

@squito squito left a comment

Thanks for the careful analysis and writeup, Kay. This version makes sense to me.

// completed successfully from the perspective of the TaskSetManager, mark it as
// no longer pending (the TaskSetManager may consider the task complete even
// when the output needs to be ignored because the task's epoch is too small below).
shuffleStage.pendingPartitions -= task.partitionId
Contributor

I think it's worth also explaining how this inconsistency between pendingPartitions and outputLocations gets resolved. IIUC, it's that when pendingPartitions is empty, the scheduler will check outputLocations, realize something is missing, and resubmit this stage.
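
For reference, the resolution path being described is roughly the following (a paraphrased sketch of the surrounding DAGScheduler logic, not the exact code):

```scala
if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
  // Every task of the current attempt has reported back, so this attempt is done...
  markStageAsFinished(shuffleStage)
  if (!shuffleStage.isAvailable) {
    // ...but some map outputs were never registered (e.g. an ignored "bogus"
    // completion from a failed executor), so the stage has to be submitted again.
    submitStage(shuffleStage)
  }
}
```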

@jinxing64
Author

@squito
Thanks a lot. I've refined the comment, please take another look.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72912 has finished for PR 16620 at commit e34cd85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72913 has finished for PR 16620 at commit d225565.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
The pendingPartitions instance variable should be moved to ShuffleMapStage,
because it is only used by ShuffleMapStages. This change is purely refactoring
and does not change functionality.

I fixed this in an attempt to clarify some of the discussion around apache#16620, which I was having trouble reasoning about.  I stole the helpful comment Imran wrote for pendingPartitions and used it here.

cc squito markhamstra jinxing64

Author: Kay Ousterhout <[email protected]>

Closes apache#16876 from kayousterhout/SPARK-19537.
// when the output needs to be ignored because the task's epoch is too small below,
// if so, this can result in inconsistency between pending partitions and output
// locations of stage. When pending partitions is empty, the scheduler will check
// output locations, if there is missing, the stage will be resubmitted.
Contributor

one more proposal to improve this comment:

...epoch is too small below. In this case, when pending partitions is empty, there will still be missing output locations, which will cause the DAGScheduler to resubmit the stage below.)

@kayousterhout
Contributor

LGTM pending one last comment improvement

@jinxing64
Author

Yes, refined : )

@SparkQA

SparkQA commented Feb 16, 2017

Test build #72974 has finished for PR 16620 at commit 6809d1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout
Contributor

LGTM! Thanks for finding this subtle bug and all of the hard work to fix it @jinxing64. I'll wait until tomorrow to merge this to give Mark and Imran a chance for any last comments.

@jinxing64
Author

@kayousterhout @squito @markhamstra
Thanks for all of your work for this patch. Really appreciate your help : )

@markhamstra
Contributor

LGTM

@asfgit asfgit closed this in 729ce37 Feb 18, 2017
@kayousterhout
Contributor

Thanks all for the work on this! I've merged this into master.

asfgit pushed a commit that referenced this pull request Feb 24, 2017
This commit improves the tests that check the case when a
ShuffleMapTask completes successfully on an executor that has
failed.  This commit improves the commenting around the existing
test for this, and adds some additional checks to make it more
clear what went wrong if the tests fail (the fact that these
tests are hard to understand came up in the context of markhamstra's
proposed fix for #16620).

This commit also removes a test that I realized tested exactly
the same functionality.

markhamstra, I verified that the new version of the test still fails (and
in a more helpful way) for your proposed change for #16620.

Author: Kay Ousterhout <[email protected]>

Closes #16892 from kayousterhout/SPARK-19560.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Oct 19, 2017
dosoft pushed a commit to WANdisco/spark that referenced this pull request Jun 25, 2018
In the current `DAGScheduler handleTaskCompletion` code, when event.reason is `Success`, it will first do `stage.pendingPartitions -= task.partitionId`, which may be a bug when `FetchFailed` happens.

**Consider the scenario below**

1.  Stage 0 runs and generates shuffle output data.
2. Stage 1 reads the output from stage 0 and generates more shuffle data. It has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are launched on executorA.
3. ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to the driver. The driver marks executorA as lost and updates failedEpoch;
4. The driver resubmits stage 0 so the missing output can be re-generated, and then once it completes, resubmits stage 1 with ShuffleMapTask1x and ShuffleMapTask2x.
5. ShuffleMapTask2 (from the original attempt of stage 1) successfully finishes on executorA and sends Success back to driver. This causes DAGScheduler::handleTaskCompletion to remove partition 2 from stage.pendingPartitions (line 1149), but it does not add the partition to the set of output locations (line 1192), because the task’s epoch is less than the failure epoch for the executor (because of the earlier failure on executor A)
6. ShuffleMapTask1x successfully finishes on executorB, causing the driver to remove partition 1 from stage.pendingPartitions. Combined with the previous step, this means that there are no more pending partitions for the stage, so the DAGScheduler marks the stage as finished (line 1196). However, the shuffle stage is not available (line 1215) because the completion for ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler resubmits the stage.
7. ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks is called for the re-submitted stage, it throws an error, because there’s an existing active task set

**In this fix**

If a task completion is from a previous stage attempt and the epoch is too low
(i.e., it was from a failed executor), don't remove the corresponding partition
from pendingPartitions.

Author: jinxing <[email protected]>
Author: jinxing <[email protected]>

Closes apache#16620 from jinxing64/SPARK-19263.
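
To make the merged behavior concrete, here is a hedged sketch of the resulting Success handling for a ShuffleMapTask (paraphrased from the code comments quoted in the review above; not the exact merged diff):

```scala
// Paraphrase of DAGScheduler.handleTaskCompletion after this change (sketch only):
case smt: ShuffleMapTask =>
  val status = event.result.asInstanceOf[MapStatus]
  val execId = status.location.executorId

  if (stageIdToStage(task.stageId).latestInfo.attemptId == task.stageAttemptId) {
    // The task belongs to the attempt that is currently running, so the TaskSetManager
    // considers it complete; mark the partition as no longer pending even if the output
    // below ends up ignored. In that case output locations stay incomplete and the
    // stage is resubmitted once pendingPartitions empties out.
    shuffleStage.pendingPartitions -= task.partitionId
  }

  if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
    logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
  } else {
    // The epoch is new enough, so the output can be trusted: register the location and
    // clear the pending partition (this also covers late completions from earlier
    // stage attempts, which were skipped by the check above).
    shuffleStage.addOutputLoc(smt.partitionId, status)
    shuffleStage.pendingPartitions -= task.partitionId
  }
```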