
[SPARK-12524][Core] DAGScheduler may submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit #12524

Closed
wants to merge 1 commit

Conversation

seayi
Contributor

@seayi seayi commented Apr 20, 2016

What changes were proposed in this pull request?

When a task from a failed stage attempt finishes, it should not be removed from the pending RDD partition list.

How was this patch tested?

manual tests

When an executor is lost, the DAGScheduler may submit one stage twice even though the first running task set for that stage has not finished, because finished tasks from the failed stage attempt are removed from the pending partition list when they should not be.
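To make the proposed change concrete, here is a minimal sketch of the idea, with simplified stand-in types rather than the real DAGScheduler/Stage/Task classes (this is not the actual diff in this PR):

```scala
import scala.collection.mutable

// Sketch only (not the actual PR diff): ignore task completions that belong
// to a stale (failed) stage attempt when clearing pending partitions.
// StageState and CompletedTask are simplified stand-ins for Spark internals.
final case class CompletedTask(partitionId: Int, stageAttemptId: Int)

final class StageState(val latestAttemptNumber: Int) {
  val pendingPartitions: mutable.Set[Int] = mutable.HashSet[Int]()
}

def onShuffleMapTaskSuccess(stage: StageState, task: CompletedTask): Unit = {
  // Only the stage's latest attempt may mark a partition as done. A late
  // success from an earlier, failed attempt is dropped: its map output may
  // sit on a lost executor, and counting it could make the stage look
  // finished and trigger a second, conflicting task-set submission.
  if (task.stageAttemptId == stage.latestAttemptNumber) {
    stage.pendingPartitions -= task.partitionId
  }
}
```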

@seayi
Contributor Author

seayi commented Apr 20, 2016

@hvanhovell
Contributor

@seayi Could you provide a proper title for the PR? It should contain the JIRA ticket, the Spark component, and a descriptive title. Something like this: [SPARK-14658][Core] Resubmit tasks on failure

Manual tests are typically not sufficient. Please think of a way of capturing this in a test.

@jodersky
Member

Is this related to #12436?

@srowen
Member

srowen commented Apr 21, 2016

Yes, I think this PR should be closed and the discussion merged into #12436.

@seayi
Contributor Author

seayi commented Apr 22, 2016

@srowen thanks for your attention, but I think it is not the same. The SPARK-14649 JIRA is about running duplicate tasks, while this JIRA is about submitting a task set for a stage even though another task set for that stage is still active, which can cause the SparkContext to exit.

@seayi seayi changed the title from Update DAGScheduler.scala to [SPARK-12524][Core] Submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit on Apr 22, 2016
@seayi seayi changed the title to [SPARK-12524][Core] DAGScheduler may submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit on Apr 22, 2016
@seayi
Contributor Author

seayi commented Apr 22, 2016

@hvanhovell thanks for your attention; OK, I will try to write the test.
This happened in our Spark cluster: it caused our Spark Thrift Server's SparkContext to exit after running for a few days, since executor loss happens often there. After changing the code, it has not happened again.

@suyanNone
Contributor

Can you have a look at #8927 for reference?

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Oct 11, 2016

@seayi any progress on this? It would be good to add this in if it is consistently reproducible.

@JoshRosen
Contributor

Per my comment on the JIRA, I believe that this is not a duplicate of #12436 as was originally suggested, so I'd propose that we revive discussion and review of this.

@mridulm, I have logs from a reproduction which occurred on a Spark 2.1.0 production cluster, which I posted on the JIRA (https://issues.apache.org/jira/browse/SPARK-14658). I'm still not entirely sure what's happening here, but one clue comes from the fact that it's the third submission of the task set which is failing. My hunch is that there's an invariant regarding overlapping original attempts and re-attempts which is violated when a re-attempt itself fails and is re-attempted again.

/cc @kayousterhout and @markhamstra for review of this scheduler-related patch.
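For readers following along, the "SparkContext exit" in the title comes from a sanity check performed when a task set is submitted. The following is a paraphrase of that check with simplified stand-in types (the real logic lives in Spark's TaskSchedulerImpl and is not reproduced verbatim here):

```scala
// Paraphrase of the submit-time sanity check (heavily simplified; TaskSetRef
// and Manager are stand-ins for Spark internals). If a new task set arrives
// for a stage that still has a live, non-zombie manager, the scheduler
// throws, and that IllegalStateException is what brings the SparkContext
// down in the reports above.
final case class TaskSetRef(id: String)
final class Manager(val taskSet: TaskSetRef, var isZombie: Boolean)

def checkNoConflictingTaskSet(stageId: Int,
                              managersByAttempt: Map[Int, Manager],
                              incoming: TaskSetRef): Unit = {
  val conflicting = managersByAttempt.values.exists { tsm =>
    tsm.taskSet != incoming && !tsm.isZombie
  }
  if (conflicting) {
    throw new IllegalStateException(
      s"more than one active taskSet for stage $stageId: " +
        managersByAttempt.values.map(_.taskSet.id).mkString(","))
  }
}
```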

@markhamstra
Contributor

@JoshRosen I haven't tried to walk through the logs in your JIRA comment, but it wouldn't surprise me at all if this is the same issue that we've been working through in #16620.

@mridulm
Contributor

mridulm commented Feb 17, 2017

@JoshRosen This is interesting: thanks for the details!
On the face of it, I think @markhamstra's comment about #16620 should apply, but given the additional details, it might be possible to reproduce it consistently?
I am hoping we can create a repeatable test to trigger this, which should greatly speed up the debugging. The earlier case was not reproducible when I tried, but we have more information now.
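If it helps, the repeatable test being asked for would roughly follow the event ordering below. This is only an outline in DAGSchedulerSuite's spirit: the numbered steps are comments standing in for the suite's real helpers, which are not used here.

```scala
import org.scalatest.funsuite.AnyFunSuite

// Outline of a regression test for this bug; the steps are placeholders
// describing the event ordering, not calls into DAGSchedulerSuite.
class StaleAttemptCompletionOutline extends AnyFunSuite {
  test("late completion from a failed attempt must not finish the retry") {
    // 1. Submit a job with a shuffle map stage; attempt 0 launches all tasks.
    // 2. Inject a FetchFailed / executor loss so attempt 0 becomes a zombie
    //    and the stage is resubmitted as attempt 1 with its missing
    //    partitions marked pending.
    // 3. Deliver a late Success from an attempt-0 task whose map output was
    //    on the lost executor.
    // 4. Assert the stage is not treated as finished and no second task set
    //    is submitted while attempt 1 is still active (i.e. no "more than
    //    one active taskSet for stage" IllegalStateException).
  }
}
```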

@kayousterhout
Contributor

kayousterhout commented Feb 23, 2017

I just closed the JIRA as a duplicate and agree with @markhamstra that this duplicates #16620. (Let's move discussion about whether this is a duplicate to the JIRA so it's recorded.)

@kayousterhout
Contributor

Also, the approach in this PR was discussed and rejected in #16620 (see #16620 (comment) for a description of why; the approach here will also fail the DAGSchedulerSuite unit tests).

@kayousterhout
Contributor

Can you update the PR description here to have the JIRA number (SPARK-14658), not the PR number?

@kayousterhout
Contributor

@seayi -- can you close this PR, since it's a duplicate of #16620?

@srowen srowen mentioned this pull request Mar 22, 2017
@asfgit asfgit closed this in b70c03a Mar 23, 2017