
[SPARK-19868] Conflicting TaskSetManagers lead to Spark being stopped #17208

Closed · wants to merge 4 commits

Conversation

@liujianhuiouc commented Mar 8, 2017

What changes were proposed in this pull request?

We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.
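
A minimal sketch of the reordering this describes, loosely based on TaskSetManager.handleSuccessfulTask (heavily simplified, with result handling and speculation bookkeeping omitted, so treat it as an illustration rather than the literal patch):

```scala
// Illustrative sketch only: simplified from TaskSetManager.handleSuccessfulTask.
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  if (!successful(index)) {
    tasksSuccessful += 1
    successful(index) = true
    // The fix: flip isZombie BEFORE notifying the DAGScheduler. If the
    // taskEnded event causes the DAGScheduler to resubmit the stage (e.g.
    // because map output was lost), the new TaskSetManager is then created
    // only after this one is already a zombie, so no conflict is detected.
    if (tasksSuccessful == numTasks) {
      isZombie = true
    }
  }
  // Notify the DAGScheduler only after our own state is consistent.
  sched.dagScheduler.taskEnded(tasks(index), Success, result.value(), result.accumUpdates, info)
  maybeFinishTaskSet()
}
```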

@srowen (Member) commented Mar 8, 2017

CC @kayousterhout or @squito

@squito (Contributor) commented Mar 8, 2017

This looks like the right change. In fact, I could have sworn we had recently merged in something like this -- maybe there is another PR still in flight which includes this? @jinxing64 perhaps this is in one of your open PRs?

The description needs to be updated, and we really should have a unit test (though with a very quick look I don't see a good way to test; I'll need to think about that part). Here is my suggestion for the description:

We must set the taskset to zombie before the DAGScheduler handles the taskEnded event, because that event may cause the DAGScheduler to launch another task attempt. If that happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.

The code worked before this change because dagScheduler.taskEnded() is async, so the task-ended event was almost always processed after the zombie status had been updated. However, that left a race, which would occasionally go the wrong way.
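
For context on the async point above: dagScheduler.taskEnded does not handle the completion inline; it posts an event that the DAGScheduler's single event-loop thread picks up later. A simplified sketch of that hand-off (not the exact source):

```scala
// Simplified sketch: taskEnded is fire-and-forget. The CompletionEvent is
// processed later on the DAGScheduler's event-loop thread, which is why the
// old ordering usually saw the zombie flag already set, but not always.
def taskEnded(
    task: Task[_],
    reason: TaskEndReason,
    result: Any,
    accumUpdates: Seq[AccumulatorV2[_, _]],
    taskInfo: TaskInfo): Unit = {
  eventProcessLoop.post(CompletionEvent(task, reason, result, accumUpdates, taskInfo))
}
```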

@squito (Contributor) commented Mar 8, 2017

Jenkins, ok to test

@SparkQA commented Mar 8, 2017

Test build #74212 has finished for PR 17208 at commit 6c40b9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout (Contributor)

Looks good. Expanding on Imran's comment, how about:

We must set the taskset to zombie before the DAGScheduler handles the taskEnded event. It's possible the taskEnded event will cause the DAGScheduler to launch a new stage attempt (this happens when map output data was lost), and if this happens before the taskSet has been set to zombie, it will appear that we have conflicting task sets.

@jinxing64

@squito Thanks for the notification :) This is not in my PR.

@kayousterhout (Contributor)

@liujianhuiouc do you have time to update the comment here? It would be great to get this in soon.

@squito (Contributor) commented Mar 15, 2017

To be clear, I agree with Kay's rewording (in particular, I meant stage attempt, not task attempt).

Also, I think it's worth including a test. You can use this: squito@aac8d98

I know it's very narrowly focused, but it seems worth including.
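
The linked commit isn't reproduced here, but the invariant it checks could look roughly like this hypothetical sketch in the style of TaskSetManagerSuite (fixtures such as FakeTaskScheduler, FakeDAGScheduler, FakeTask, and createTaskResult are assumed helpers, not guaranteed APIs):

```scala
test("SPARK-19868: taskset is zombie before DAGScheduler sees the last taskEnded") {
  val sc = new SparkContext("local", "test")
  val sched = new FakeTaskScheduler(sc, ("exec1", "host1"))
  val taskSet = FakeTask.createTaskSet(1)
  val manager = new TaskSetManager(sched, taskSet, maxTaskFailures = 1)

  // Stub DAGScheduler that records the zombie flag at notification time.
  var zombieWhenNotified = false
  sched.dagScheduler = new FakeDAGScheduler(sc, sched) {
    override def taskEnded(
        task: Task[_],
        reason: TaskEndReason,
        result: Any,
        accumUpdates: Seq[AccumulatorV2[_, _]],
        taskInfo: TaskInfo): Unit = {
      zombieWhenNotified = manager.isZombie
    }
  }

  // Run the single task to completion; its success finishes the task set.
  val taskDesc = manager.resourceOffer("exec1", "host1", TaskLocality.ANY).get
  manager.handleSuccessfulTask(taskDesc.taskId, createTaskResult(0))
  assert(zombieWhenNotified, "isZombie must be set before notifying the DAGScheduler")
}
```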

@liujianhuiouc (Author)

OK, I will update that.

@kayousterhout (Contributor)

@liujianhuiouc have you had time to fix this up yet?

@liujianhuiouc (Author)

@kayousterhout I have already updated the comments and fixed this issue. Do you mean I should merge the test case by @squito?

@kayousterhout (Contributor)

Yes, can you also merge @squito's test case?

@liujianhuiouc (Author)

@kayousterhout Done

@squito (Contributor) commented Mar 28, 2017

LGTM assuming tests pass

@SparkQA commented Mar 28, 2017

Test build #75290 has finished for PR 17208 at commit 17acd55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liujianhuiouc (Author)

@squito the tests failed

@SparkQA commented Mar 28, 2017

Test build #75299 has started for PR 17208 at commit fd67392.

@liujianhuiouc (Author)

@squito updated the no-args ManualClock constructor with an initialized time

@squito (Contributor) commented Mar 28, 2017

Jenkins, retest this please

@squito (Contributor) commented Mar 28, 2017

Looks like the tests were manually killed (-9).

Thanks for catching that and fixing it, @liujianhuiouc.

@SparkQA commented Mar 28, 2017

Test build #75312 has finished for PR 17208 at commit fd67392.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kayousterhout (Contributor)

LGTM, merged this to master.

@asfgit closed this in 92e385e Mar 28, 2017
@liujianhuiouc deleted the spark-19868 branch June 2, 2017 05:31
@zsxwing (Member) commented Feb 15, 2018

I think handleFailedTask has a similar issue, right?

@squito (Contributor) commented Feb 15, 2018

Hmm, I think you're right @zsxwing that we should be updating isZombie before sched.dagScheduler.taskEnded and sched.dagScheduler.taskSetFailed are called, just to keep state consistent. I don't think you'll actually hit the bug described here, because (a) if it was from a fetch failure, isZombie is already set first, and (b) if it's just a regular task failure that leads to the stage getting aborted, then there aren't any more retries of the stage anyway.
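
A hypothetical sketch of that ordering in handleFailedTask (simplified and not an actual Spark patch; accumulator and blacklist plumbing are omitted):

```scala
// Hypothetical sketch: keep the TaskSetManager's own state (including
// isZombie) consistent before any DAGScheduler callback, mirroring the
// handleSuccessfulTask fix from this PR.
def handleFailedTask(tid: Long, state: TaskState, reason: TaskFailedReason): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  copiesRunning(index) -= 1
  reason match {
    case _: FetchFailed =>
      // (a) a fetch failure already flips the zombie flag before notifying
      isZombie = true
    case _ =>
      // (b) a regular failure counts toward maxTaskFailures below
  }
  if (!isZombie && reason.countTowardsTaskFailures) {
    numFailures(index) += 1
    if (numFailures(index) >= maxTaskFailures) {
      // abort() ends up calling dagScheduler.taskSetFailed, so local state
      // must already be consistent at this point.
      abort(s"Task $index in stage ${taskSet.id} failed $maxTaskFailures times")
      return
    }
  }
  // Notify the DAGScheduler only after local state is settled.
  sched.dagScheduler.taskEnded(tasks(index), reason, null, Seq.empty, info)
  maybeFinishTaskSet()
}
```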

asfgit pushed a commit that referenced this pull request Mar 6, 2019
…a stage

## What changes were proposed in this pull request?

This is another attempt to fix the more-than-one-active-task-set-managers bug.

#17208 is the first attempt. It marks the TSM as zombie before sending a task completion event to the DAGScheduler. This is necessary, because when the DAGScheduler gets the task completion event, and it's for the last partition, the stage is finished. However, if it's a shuffle stage and it has missing map outputs, the DAGScheduler will resubmit it (see the [code](https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1416-L1422)) and create a new TSM for this stage. This leads to more than one active TSM for a stage, and the conflict check fails.

This fix has a hole: let's say a stage has 10 partitions and 2 task set managers: TSM1 (zombie) and TSM2 (active). TSM1 has a running task for partition 10, and it completes. TSM2 finishes tasks for partitions 1-9 and thinks it is still active because it hasn't finished partition 10 yet. However, the DAGScheduler gets task completion events for all 10 partitions and thinks the stage is finished. Then the same problem occurs: the DAGScheduler may resubmit the stage and cause the more-than-one-active-TSM error.

#21131 fixed this hole by notifying all the task set managers when a task finishes. For the above case, TSM2 will know that partition 10 is already completed, so it can mark itself as zombie after partitions 1-9 are completed.

However, #21131 still has a hole: TSM2 may be created after the task from TSM1 has completed. Then TSM2 can't be notified of the task completion, which leads to the more-than-one-active-TSM error.

#22806 and #23871 were created to fix this hole. However, the fix is complicated and there are still ongoing discussions.

This PR proposes a simple fix, which is easy to backport: mark all existing task set managers as zombie when creating a new task set manager.
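
A sketch of that approach in TaskSchedulerImpl.submitTasks might look like the following (simplified; the conflict checks and the rest of the method are elided, so read it as an illustration of the idea rather than the exact change):

```scala
override def submitTasks(taskSet: TaskSet): Unit = {
  this.synchronized {
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stageTaskSets = taskSetsByStageIdAndAttempt
      .getOrElseUpdate(taskSet.stageId, new HashMap[Int, TaskSetManager])
    // Mark every existing TaskSetManager of this stage as zombie: their
    // remaining tasks belong to older, superseded stage attempts, so at most
    // one TSM per stage stays active.
    stageTaskSets.foreach { case (_, ts) => ts.isZombie = true }
    stageTaskSets(taskSet.stageAttemptId) = manager
    schedulableBuilder.addTaskSetManager(manager, manager.properties)
  }
  backend.reviveOffers()
}
```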

After this PR, #21131 is still necessary to avoid launching unnecessary tasks and to fix [SPARK-25250](https://issues.apache.org/jira/browse/SPARK-25250). #22806 and #23871 are its follow-ups to fix the hole.

## How was this patch tested?

Existing tests.

Closes #23927 from cloud-fan/scheduler.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Imran Rashid <[email protected]>
(cherry picked from commit cb20fbc)
Signed-off-by: Imran Rashid <[email protected]>
asfgit pushed a commit that referenced this pull request Mar 6, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 25, 2019
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
zhongjinhan pushed a commit to zhongjinhan/spark-1 that referenced this pull request Sep 3, 2019
sumwale pushed a commit to TIBCOSoftware/snappy-spark that referenced this pull request Jun 27, 2021