
[SPARK-1397] Notify SparkListeners when stages fail or are cancelled. #309

Closed
wants to merge 2 commits

Conversation

kayousterhout
Contributor

[I wanted to post this for folks to comment on, but it depends on (and thus includes the changes in) a currently outstanding PR, #305. You can look at just the second commit, https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac, to see only the changes relevant to this PR.]

Previously, when stages fail or get cancelled, the SparkListener is only notified
indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
stage that failed. This worked before job cancellation, because jobs would only fail
due to a single stage failure. However, with job cancellation, multiple running stages
can fail when a job gets cancelled. Right now, this is not handled correctly, which
results in stages that get stuck in the “Running Stages” window in the UI even
though they’re dead.

This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
event, and uses this event to tell SparkListeners when stages fail in addition to when
they complete successfully. This change is NOT publicly backward compatible for two
reasons. First, it changes the SparkListener interface. We could alternatively add a new event,
SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted. However,
that would be less consistent with the listener events for tasks and jobs ending, and would result in some
code duplication for listeners (because failed and completed stages are handled in similar
ways). Note that I haven’t finished updating the JSON code to correctly handle the new event
because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).

It is also not backward compatible because it changes the publicly visible JobWaiter.jobFailed()
method to no longer include a stage that caused the failure. I think this change should definitely
stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
single stage.
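
To make the trade-off concrete, here is a rough sketch of the two options from a listener author's point of view (the names below are simplified stand-ins, not the actual SparkListener API):

```scala
// Rough sketch only: simplified stand-ins for the real SparkListener types.

// Option A (this PR): a single stage-end event carrying an optional failure reason,
// so a listener cleans up its "running stages" bookkeeping in one place.
case class StageEndSketch(stageId: Int, failureReason: Option[String])

trait ListenerSketchA {
  def onStageEnded(event: StageEndSketch): Unit
}

class UISketchA extends ListenerSketchA {
  override def onStageEnded(event: StageEndSketch): Unit = {
    // The stage leaves "Running Stages" whether it succeeded or failed;
    // the failure reason only decides which table it lands in.
    val table = if (event.failureReason.isDefined) "Failed Stages" else "Completed Stages"
    println(s"stage ${event.stageId} -> $table")
  }
}

// Option B (the alternative): keep SparkListenerStageCompleted and add a separate
// SparkListenerStageFailed. Every listener then needs two callbacks that perform
// nearly the same cleanup, which is the duplication described above.
trait ListenerSketchB {
  def onStageCompleted(stageId: Int): Unit
  def onStageFailed(stageId: Int, reason: String): Unit
}
```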

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13707/

@pwendell
Contributor

pwendell commented Apr 3, 2014

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13717/

@markhamstra
Contributor

Just a nomenclature note: The division between StageCompleted and StageFailed or StageEnded is not really consistent with Akka/Scala Futures, where Completed doesn't imply success, but rather completed futures are instances of either Success or Failure -- http://docs.scala-lang.org/overviews/core/futures.html. It wouldn't be the worst thing if we adopted our own semantics, but on the other hand, it would be less confusing to be consistent across Futures and Stage completion.
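
For reference, this is the Futures behavior being described: a Future counts as completed once it holds either a Success or a Failure, so "completed" does not imply "succeeded". A minimal example:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object FutureCompletionExample extends App {
  // A Future is "completed" once it holds any result, successful or not.
  val f: Future[Int] = Future { 40 + 2 }
  Await.ready(f, 5.seconds)  // block until the future completes, one way or the other

  f.value.get match {
    case Success(value) => println(s"completed successfully with $value")
    case Failure(error) => println(s"completed with a failure: ${error.getMessage}")
  }
}
```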

* Fails a job and all stages that are only used by that job, and cleans up relevant state.
*
* @param resultStage The result stage for the job, if known. Used to cleanup state for the job
* slightly more efficiently than when not specified.
Contributor

Not implemented

Contributor Author

Which part of this did you think should say 'not implemented'? resultStage is an optional parameter, so I don't think it makes sense to talk about it not being implemented.

Contributor

Sorry, too brief. I just meant that resultStage isn't being used in failJobAndIndependentStages, so the commented optimization isn't implemented.

Contributor Author

Ah, good catch! Thanks for pointing that out!
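
For readers following this thread, here is a purely hypothetical sketch of the optimization the scaladoc advertises, i.e. roughly what using resultStage could look like (simplified types and assumed helper lookups, not the real DAGScheduler code):

```scala
// Purely hypothetical sketch: Stage and the helper functions passed in below are
// simplified stand-ins for DAGScheduler internals.
object FailJobSketch {
  case class Stage(id: Int, parents: Seq[Stage])

  def failJobAndIndependentStages(
      jobId: Int,
      reason: String,
      resultStage: Option[Stage],
      resultStageForJob: Int => Stage,      // assumed lookup: the slower path the scaladoc alludes to
      jobsNeedingStage: Stage => Set[Int],  // assumed lookup: which jobs still depend on a stage
      abortStage: (Stage, String) => Unit): Unit = {
    // The advertised optimization: if the caller already knows the job's result stage,
    // skip re-deriving it from the job id.
    val root = resultStage.getOrElse(resultStageForJob(jobId))

    // Walk the stage graph and fail every stage that only this job still needs.
    val visited = scala.collection.mutable.Set[Int]()
    def visit(stage: Stage): Unit = {
      if (visited.add(stage.id)) {
        if (jobsNeedingStage(stage) == Set(jobId)) {
          abortStage(stage, reason)
        }
        stage.parents.foreach(visit)
      }
    }
    visit(root)
  }
}
```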

@markhamstra
Contributor

Aside from the naming issue, the few nits I noted, and the in-progress elements, 305 + 309 is looking pretty good to me.

@kayousterhout
Contributor Author

Cool, thanks @markhamstra! I'll make the changes you suggested.

Given what you said about the naming, and also the pain of changing the name (both for me to do and for others who have written SparkListeners), it sounds like it makes sense to keep SparkListenerStageCompleted as the event for both when a stage ends successfully and when it fails. Does this seem reasonable to you, @pwendell?

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Apr 7, 2014
SPARK-544. Migrate configuration to a SparkConf class

@pwendell
Contributor

pwendell commented Apr 8, 2014

Sounds good to me.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@kayousterhout
Contributor Author

OK, this is now ready: I fixed the things you commented on, @markhamstra, and rebased on master (so this no longer includes PR 305, which has been merged). I'll merge later today unless anyone has any further comments! Thanks for the review!

kayousterhout changed the title from "[SPARK-1397] [WIP] Notify SparkListeners when stages fail or are cancelled." to "[SPARK-1397] Notify SparkListeners when stages fail or are cancelled." on Apr 8, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13902/

@kayousterhout
Contributor Author

Jenkins, retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13904/

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13906/

@pwendell
Contributor

pwendell commented Apr 8, 2014

@kayousterhout merging this per our offline discussion.

asfgit closed this in fac6085 on Apr 8, 2014
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

Author: Kay Ousterhout <[email protected]>

Closes apache#309 from kayousterhout/stage_cancellation and squashes the following commits:

5533ecd [Kay Ousterhout] Fixes in response to Mark's review
320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Change the service name field of osb config according to osb-checker schema
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020