
[SPARK-12524][Core] DAGScheduler may submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit #12524

Closed
wants to merge 1 commit

Conversation

seayi
Contributor

@seayi seayi commented Apr 20, 2016

What changes were proposed in this pull request?

When a task from a failed stage attempt finishes, it should not be removed from the pending RDD partition list.

How was this patch tested?

manual tests

When an executor is lost, the DAGScheduler may submit one stage twice even though the first running task set for that stage has not finished, because finished tasks from the failed stage attempt are removed from the pending partition list when they should not be.
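To make the proposed change concrete, here is a minimal sketch of the idea, with simplified stand-in types rather than the real DAGScheduler/Stage/Task classes (this is not the actual diff in this PR):

```scala
import scala.collection.mutable

// Sketch only (not the actual PR diff): ignore task completions that belong
// to a stale (failed) stage attempt when clearing pending partitions.
// StageState and CompletedTask are simplified stand-ins for Spark internals.
final case class CompletedTask(partitionId: Int, stageAttemptId: Int)

final class StageState(val latestAttemptNumber: Int) {
  val pendingPartitions: mutable.Set[Int] = mutable.HashSet[Int]()
}

def onShuffleMapTaskSuccess(stage: StageState, task: CompletedTask): Unit = {
  // Only the stage's latest attempt may mark a partition as done. A late
  // success from an earlier, failed attempt is dropped: its map output may
  // sit on a lost executor, and counting it could make the stage look
  // finished and trigger a second, conflicting task-set submission.
  if (task.stageAttemptId == stage.latestAttemptNumber) {
    stage.pendingPartitions -= task.partitionId
  }
}
```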

@seayi
Contributor Author

seayi commented Apr 20, 2016

@hvanhovell
Contributor

@seayi Could you provide a proper title for the PR? It should contain the JIRA ticket, the Spark component, and a descriptive title. Something like this: [SPARK-14658][Core] Resubmit tasks on failure

Manual tests are typically not sufficient. Please think of a way of capturing this in a test.

@jodersky
Member

Is this related to #12436?

@srowen
Member

srowen commented Apr 21, 2016

Yes, I think this PR should be closed and the discussion merged into #12436.

@seayi
Contributor Author

seayi commented Apr 22, 2016

@srowen thanks for your attention, but I think it is not the same. The SPARK-14649 JIRA is about running duplicate tasks, while this JIRA is about submitting a task set for a stage even though another task set for that stage is still active, which can cause the SparkContext to exit.

@seayi seayi changed the title from Update DAGScheduler.scala to [SPARK-12524][Core] Submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit on Apr 22, 2016
@seayi seayi changed the title to [SPARK-12524][Core] DAGScheduler may submit a task set for a stage even if another task set for that stage is still active, which can cause the SparkContext to exit on Apr 22, 2016
@seayi
Contributor Author

seayi commented Apr 22, 2016

@hvanhovell thanks for your attention; OK, I will try to write the test.
This happened in our Spark cluster: it caused our Spark Thrift Server's SparkContext to exit after running for a few days, since executor loss happens often there. After changing the code, it has not happened again.

@suyanNone
Contributor

Can you have a look at #8927 for reference?

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Oct 11, 2016

@seayi any progress on this? It would be good to add this in if it is consistently reproducible.

@JoshRosen
Contributor

Per my comment on the JIRA, I believe that this is not a duplicate of #12436 as was originally suggested, so I'd propose that we revive discussion and review of this.

@mridulm, I have logs from a reproduction which occurred on a Spark 2.1.0 production cluster, which I posted on the JIRA (https://issues.apache.org/jira/browse/SPARK-14658). I'm still not entirely sure what's happening here, but one clue comes from the fact that it's the third submission of the task set which is failing. My hunch is that there's an invariant regarding overlapping original attempts and re-attempts which is violated when a re-attempt itself fails and is re-attempted again.

/cc @kayousterhout and @markhamstra for review of this scheduler-related patch.
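For readers following along, the "SparkContext exit" in the title comes from a sanity check performed when a task set is submitted. The following is a paraphrase of that check with simplified stand-in types (the real logic lives in Spark's TaskSchedulerImpl and is not reproduced verbatim here):

```scala
// Paraphrase of the submit-time sanity check (heavily simplified; TaskSetRef
// and Manager are stand-ins for Spark internals). If a new task set arrives
// for a stage that still has a live, non-zombie manager, the scheduler
// throws, and that IllegalStateException is what brings the SparkContext
// down in the reports above.
final case class TaskSetRef(id: String)
final class Manager(val taskSet: TaskSetRef, var isZombie: Boolean)

def checkNoConflictingTaskSet(stageId: Int,
                              managersByAttempt: Map[Int, Manager],
                              incoming: TaskSetRef): Unit = {
  val conflicting = managersByAttempt.values.exists { tsm =>
    tsm.taskSet != incoming && !tsm.isZombie
  }
  if (conflicting) {
    throw new IllegalStateException(
      s"more than one active taskSet for stage $stageId: " +
        managersByAttempt.values.map(_.taskSet.id).mkString(","))
  }
}
```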

@markhamstra
Contributor

@JoshRosen I haven't tried to walk through the logs in your JIRA comment, but it wouldn't surprise me at all if this is the same issue that we've been working through in #16620.

@mridulm
Contributor

mridulm commented Feb 17, 2017

@JoshRosen This is interesting: thanks for the details!
On the face of it, I think @markhamstra's comment about #16620 should apply, but given the additional details, it might be possible to reproduce it consistently?
I am hoping we can create a repeatable test to trigger this, which should greatly speed up the debugging. The earlier case was not reproducible when I tried, but we have more information now.
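If it helps, the repeatable test being asked for would roughly follow the event ordering below. This is only an outline in DAGSchedulerSuite's spirit: the numbered steps are comments standing in for the suite's real helpers, which are not used here.

```scala
import org.scalatest.funsuite.AnyFunSuite

// Outline of a regression test for this bug; the steps are placeholders
// describing the event ordering, not calls into DAGSchedulerSuite.
class StaleAttemptCompletionOutline extends AnyFunSuite {
  test("late completion from a failed attempt must not finish the retry") {
    // 1. Submit a job with a shuffle map stage; attempt 0 launches all tasks.
    // 2. Inject a FetchFailed / executor loss so attempt 0 becomes a zombie
    //    and the stage is resubmitted as attempt 1 with its missing
    //    partitions marked pending.
    // 3. Deliver a late Success from an attempt-0 task whose map output was
    //    on the lost executor.
    // 4. Assert the stage is not treated as finished and no second task set
    //    is submitted while attempt 1 is still active (i.e. no "more than
    //    one active taskSet for stage" IllegalStateException).
  }
}
```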

@kayousterhout
Contributor

kayousterhout commented Feb 23, 2017

I just closed the JIRA as a duplicate and agree with @markhamstra that this duplicates #16620. (Let's move discussion about whether this is a duplicate to the JIRA so it's recorded.)

@kayousterhout
Contributor

Also, the approach in this PR was discussed and rejected in #16620 (see #16620 (comment) for a description of why; the approach here will also fail the DAGSchedulerSuite unit tests).

@kayousterhout
Contributor

Can you update the PR description here to have the JIRA number (SPARK-14658), not the PR number?

@kayousterhout
Contributor

@seayi -- can you close this PR, since it's a duplicate of #16620?

@srowen srowen mentioned this pull request Mar 22, 2017
@asfgit asfgit closed this in b70c03a Mar 23, 2017