[SPARK-14658][CORE] DAGScheduler may submit a task set for a stage even though another task set for that stage is still active, which can cause the Spark context to exit. #12524
Conversation
When an executor is lost, the DAGScheduler may submit one stage twice even though the first running task set for that stage is not finished, because tasks that finished as part of the failed stage attempt should not be removed from the pending partition list.
@seayi Could you provide a proper title for the PR? It should contain the JIRA ticket, the Spark component, and a descriptive title. Something like this: Manual tests are typically not sufficient. Please think of a way of capturing this in a test. |
is this related to #12436 ? |
Yes I think this PR should be closed, and discussion merged to #12436 |
@srowen Thanks for your attention, but I think it is not the same. The SPARK-14649 JIRA is about running duplicate tasks, while this JIRA is about submitting a task set for a stage even though another task set for that stage is still active, which can cause the Spark context to exit. |
@hvanhovell Thanks for your attention. OK, I will try to write the test. |
Can you refer to this and have a look: #8927 |
Can one of the admins verify this patch? |
@seayi any progress on this ? Would be good to add this in if consistently reproducible. |
Per my comment on the JIRA, I believe that this is not a duplicate of #12436 as was originally suggested, so I'd propose that we revive discussion and review of this. @mridulm, I have logs from a reproduction which occurred on a Spark 2.1.0 production cluster, which I posted on the JIRA (https://issues.apache.org/jira/browse/SPARK-14658). I'm still not entirely sure what's happening here, but one clue comes from the fact that it's the third submission of the task set which is failing. My hunch is that there's an invariant regarding overlapping original attempts and re-attempts which is violated when a re-attempt itself fails and is re-attempted again. /cc @kayousterhout and @markhamstra for review of this scheduler-related patch. |
@JoshRosen I haven't tried to walk through the logs in your JIRA comment, but it wouldn't surprise me at all if this is the same issue that we've been working through in #16620 |
@JoshRosen This is interesting : thanks for the details ! |
I just closed the JIRA as a duplicate and agree with @markhamstra that this duplicates #16620 (let's move discussion about whether this is a duplicate to the JIRA so it's recorded) |
Also, the approach in this PR was discussed and rejected in #16620 (see #16620 (comment) for a description of why; the approach here will also fail the DAGSchedulerSuite unit tests). |
Can you update the PR description here to have the JIRA number (SPARK-14658), not the PR number? |
What changes were proposed in this pull request?
When a task finishes as part of a failed stage attempt, its partition should not be removed from the pending RDD partition list.
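The idea above can be sketched with a simplified model. All names here (`StageState`, `pendingPartitions`, `latestAttemptId`, `onTaskEnd`) are hypothetical and do not match Spark's actual `DAGScheduler` internals; this is only an illustration of the proposed guard, which was ultimately rejected in favor of the approach in #16620.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the race. The point is the guard in
// onTaskEnd: a task completion from a stale (failed) stage attempt must not
// shrink the pending-partition set, or the scheduler may think the stage is
// complete while an active task set is still running, and later submit an
// overlapping task set for the same stage.
case class TaskEnd(partitionId: Int, stageAttemptId: Int)

class StageState(numPartitions: Int) {
  val pendingPartitions: mutable.Set[Int] = mutable.Set(0 until numPartitions: _*)
  var latestAttemptId: Int = 0

  // A fetch failure or executor loss triggers a resubmission with a new attempt id.
  def resubmit(): Unit = {
    latestAttemptId += 1
    pendingPartitions ++= (0 until numPartitions) // simplified: retry everything
  }

  def onTaskEnd(event: TaskEnd): Unit = {
    // Proposed guard: ignore completions belonging to a superseded attempt.
    if (event.stageAttemptId == latestAttemptId) {
      pendingPartitions -= event.partitionId
    }
  }
}
```

Without the attempt-id check, the stale `TaskEnd` from the failed attempt would empty the pending set prematurely and allow a second, overlapping task set to be submitted.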
How was this patch tested?
manual tests