
[SPARK-4012] stop SparkContext when the exception is thrown from an infinite loop #5004

Closed
wants to merge 5 commits

Conversation

CodingCat
Contributor

https://issues.apache.org/jira/browse/SPARK-4012

This patch is a resubmission for #2864

What I am proposing in this patch is that _when an exception is thrown from an infinite loop, we should stop the SparkContext instead of letting the JVM throw exceptions forever_.

So, in the infinite loops that we originally wrapped with logUncaughtExceptions, I changed the wrapper to tryOrStopSparkContext, so that the Spark component is stopped.
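As a sketch of the call-site change (the loop body and the `running` flag are stand-ins, and `sc` is the active SparkContext; this is illustrative, not the literal diff):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.util.Utils

// Sketch: a long-running dispatch loop on the driver. Before the patch this
// body was wrapped in Utils.logUncaughtExceptions { ... }, which logged the
// failure but left the JVM alive with a dead dispatch thread.
def runDispatchLoop(sc: SparkContext): Unit = {
  var running = true
  Utils.tryOrStopSparkContext(sc) {
    while (running) {
      // poll and dispatch one event; an uncaught exception here now stops sc
      running = false // stand-in so the sketch terminates
    }
  }
}
```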

Stopping the JVM process early is helpful for HA scheme design. For example, a user may have a script that checks for the pid of the Spark Streaming driver to monitor availability; with the code before this patch, the JVM process is still alive, but not functional, after the exceptions are thrown.

@andrewor14, @srowen, would you mind giving the change further consideration?

@SparkQA

SparkQA commented Mar 12, 2015

Test build #28525 has started for PR 5004 at commit 6322959.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 12, 2015

Test build #28525 has finished for PR 5004 at commit 6322959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28525/

@zsxwing
Member

zsxwing commented Mar 13, 2015

Do you think calling Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler) on the driver side is acceptable? At the least, I think we should make sure that every thread created by Spark sets SparkUncaughtExceptionHandler.
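Roughly what that suggestion amounts to (a sketch; SparkUncaughtExceptionHandler is Spark's existing handler, which logs the throwable and exits the JVM):

```scala
import org.apache.spark.util.SparkUncaughtExceptionHandler

// Any thread that never installs a handler of its own now falls back to this
// JVM-wide default, which logs the error and exits with a nonzero status.
Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler)
```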

@CodingCat
Contributor Author

Hi @zsxwing, Thread.setDefaultUncaughtExceptionHandler(SparkUncaughtExceptionHandler) is fine, though it's equivalent to the current patch in terms of functionality,

but I'm not sure we need to set the exception handler of all threads to SparkUncaughtExceptionHandler, because that means that once there is an uncaught exception, we stop the program...

@zsxwing
Member

zsxwing commented Mar 13, 2015

once there is an uncaught exception, we stop the program...

Yes. So I suggest that only threads created by Spark set SparkUncaughtExceptionHandler. I think that if a Spark internal thread throws an uncaught exception, it often means some Spark internal module has crashed. Just my 2 cents about improving robustness.

Of course, your changes look good to me.

@CodingCat
Contributor Author

@zsxwing, sounds reasonable.

Let's wait for more eyes on this.

@srowen
Member

srowen commented Mar 13, 2015

I tend to favor this change for ContextCleaner and AsynchronousListenerBus, because an exception there kills a thread that is clearly intended to run forever. I'm not as sure about FsHistoryProvider, since it is not as clear (yet) that the Runnable should cause the worker to die if it hits an exception. Why this one?

I'd like it if, say, @aarondav might comment, since he added one of the lines being changed.

@CodingCat
Contributor Author

@srowen, thanks for the comments.

The reason to change FsHistoryProvider is that the Runnable generated by this function is essentially executed by a thread at a fixed rate:

https://github.com/CodingCat/spark/blob/SPARK-4012-1/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L138
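For context, this matters because a ScheduledExecutorService suppresses all future executions of a fixed-rate task once one run throws. A sketch of the pattern (a hypothetical pool and task, not FsHistoryProvider's actual code):

```scala
import java.util.concurrent.{Executors, TimeUnit}

val pool = Executors.newScheduledThreadPool(1)
pool.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = {
    // poll the filesystem for new event logs; if this throws, the JDK
    // suppresses every subsequent run and the polling loop silently dies
  }
}, 0, 10, TimeUnit.SECONDS)
```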

@aarondav
Contributor

Is it really OK to System.exit() the driver JVM? This may be user code that has an embedded SparkContext. The SparkUncaughtExceptionHandler is suitable for Executors, where we have full control over the JVM, and AppClient for the same reason, but I'm not sure TaskSchedulerImpl should be using it, or ContextCleaner, for instance.

For driver shutdowns, it seems safer just to stop() the SparkContext.
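To make the distinction concrete, a sketch with hypothetical handler names (not from the patch):

```scala
import org.apache.spark.SparkContext

// Executor side: Spark owns the whole JVM, so terminating it is acceptable.
def onUncaughtInExecutor(t: Throwable): Unit = {
  System.err.println(s"Uncaught exception, exiting JVM: $t")
  System.exit(50) // illustrative exit code
}

// Driver side: the JVM may be the user's own application with an embedded
// SparkContext, so only the context itself should be shut down.
def onUncaughtInDriver(sc: SparkContext)(t: Throwable): Unit = {
  System.err.println(s"Uncaught exception, stopping SparkContext: $t")
  sc.stop()
}
```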

@SparkQA

SparkQA commented Mar 14, 2015

Test build #28613 has started for PR 5004 at commit 6087864.

  • This patch merges cleanly.

@CodingCat
Contributor Author

@aarondav, thanks for the insightful suggestion.

I just updated the patch.

The change has become a bit bigger, as I needed to create a new method in Utils, tryOrStopSparkContext, which is a curried method in the current patch.

Also, I needed to pass a SparkContext reference to AsynchronousListenerBus so that I can pass sparkContext as the parameter, so I changed start() to start(sparkContext: SparkContext), in which I set the value of the newly added sparkContext member of AsynchronousListenerBus.
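A simplified sketch of the two changes just described (the real helper also logs the error before stopping):

```scala
import org.apache.spark.SparkContext

// New curried helper in Utils: currying lets call sites pass the code block
// with brace syntax, i.e. tryOrStopSparkContext(sc) { ... }.
def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
  try {
    block
  } catch {
    case t: Throwable =>
      sc.stop()
  }
}

// And AsynchronousListenerBus.start now receives the context it may need to
// stop, roughly: def start(sparkContext: SparkContext): Unit = {
//   this.sparkContext = sparkContext; /* start the dispatch thread */ }
```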

@CodingCat CodingCat changed the title [SPARK-4012] exit JVM process when the exception is thrown from an infinite loop [SPARK-4012] stop SparkContext when the exception is thrown from an infinite loop Mar 14, 2015
@SparkQA

SparkQA commented Mar 14, 2015

Test build #28613 has finished for PR 5004 at commit 6087864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28613/

@@ -1156,6 +1156,18 @@ private[spark] object Utils extends Logging {
}

/**
* Execute a block of code that evaluates to Unit, stop SparkContext is any uncaught exception
Contributor

Add a comment contrasting this to tryOrExit, saying that this method is suitable for the driver while tryOrExit should be used for other JVMs started by Spark, over which we have full control. Also, the second part should say something like "stopping the SparkContext if there is any uncaught exception."
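One possible wording for that comment (a sketch, not the merged text):

```scala
/**
 * Execute a block of code that evaluates to Unit, stopping the SparkContext
 * if there is any uncaught exception. Unlike tryOrExit, this is suitable for
 * the driver, where the JVM may be running user code; tryOrExit should be
 * used in the other JVMs started by Spark, over which we have full control.
 */
```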

@CodingCat
Contributor Author

@aarondav, thanks for the comments. I just updated the patch.

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28619 has started for PR 5004 at commit 3c72cd8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 15, 2015

Test build #28619 has finished for PR 5004 at commit 3c72cd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28619/

@aarondav
Contributor

Alright, this looks good to me. @srowen and @zsxwing, any comments?

logError(s"uncaught error in thread ${Thread.currentThread().getName}, stopping " +
"SparkContext", t)
sc.stop()
}
Member

How about throwing t again here, so that the user can use an UncaughtExceptionHandler to monitor the uncaught exception? If not, the user won't be aware that sc has been shut down until the next call to runJob.
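That is, something like the following sketch, where sc and block stand in for the surrounding code:

```scala
import org.apache.spark.SparkContext

def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
  try {
    block
  } catch {
    case t: Throwable =>
      sc.stop()
      // re-throw so that a user-installed UncaughtExceptionHandler
      // also observes the failure that stopped the context
      throw t
  }
}
```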

Contributor Author

Hi @zsxwing, thanks for the comments.

I personally prefer a more conservative approach here (the current one).

The throwable thrown from here can vary in type, and I'm concerned that a Throwable thrown from here, like an OOM, would be mixed up with instances of the same type generated by other components in the user's program; on the other hand, our goal is just to let the user know that the SparkContext has been stopped.

So I prefer letting the user call SparkContext.runJob and get an IllegalStateException("SparkContext has been shutdown"), which (hopefully) will be handled precisely.

@srowen @aarondav, your thoughts?

Contributor

What if we catch NonFatal(e) and re-throw other Throwables? Basically, fatal errors should be re-thrown, but lesser ones can just stop here; they should only be application-level exceptions, which are our code's concern.
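A sketch of that compromise (logging omitted; note that NonFatal deliberately does not match VirtualMachineError, InterruptedException, and the like, so those fall through to the re-throw):

```scala
import org.apache.spark.SparkContext
import scala.util.control.NonFatal

def tryOrStopSparkContext(sc: SparkContext)(block: => Unit): Unit = {
  try {
    block
  } catch {
    case NonFatal(e) =>
      // application-level failure: stopping the context is enough
      sc.stop()
    case t: Throwable =>
      // fatal error (OOM etc.): stop the context, then let it propagate
      sc.stop()
      throw t
  }
}
```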

@srowen
Member

srowen commented Mar 18, 2015

I don't have a strong opinion on this one. I suppose there's a question of what may happen if you stop the SparkContext but then immediately kill the calling thread with an uncaught Throwable. Does that undermine the action of stop() in some cases?

@CodingCat
Contributor Author

@srowen, I checked the code; one suspicious part might be the asynchronous shutdown of the ActorSystem, but that should be OK in Spark's case...

@aarondav?

@zsxwing
Member

zsxwing commented Mar 18, 2015

Since tryOrStopSparkContext is in Utils, I think it's better to check whether it's reasonable in the general case. I'm concerned that swallowing fatal errors (OOM) or InterruptedException may cause issues.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28825 has started for PR 5004 at commit 589276a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28825 has finished for PR 5004 at commit 589276a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28825/

@CodingCat
Contributor Author

@zsxwing @srowen @aarondav, how about the current version?

@zsxwing
Member

zsxwing commented Mar 19, 2015

LGTM

@aarondav
Contributor

LGTM

@CodingCat
Contributor Author

thanks, guys~

@aarondav
Contributor

Cool, merging this into master. Thanks!
