
[SPARK-6560][CORE] Do not suppress exceptions from writer.write. #5223

Closed

Conversation

stephenh
Contributor

If there is a failure in the Hadoop backend while calling
writer.write, we should remember this original exception,
and try to call writer.close(), but if that fails as well,
still report the original exception.

Note that if writer.write fails, the writer was likely left
in an invalid state, which makes it more likely that
writer.close will also fail, which in turn increases the
chances of writer.write's exception being suppressed.

This patch introduces an admittedly potentially too cute
Utils.tryWithSafeFinally method to handle the try/finally
gyrations.
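
For readers following along, a minimal sketch of the shape such a helper could take, assuming the two-argument-list form used at the call sites below; the exact naming and logging in the patch may differ:

def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
  var originalThrowable: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      // Remember the original failure before the finally logic runs.
      originalThrowable = t
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      // If both the body and the finally block threw, rethrow the original
      // so a knock-on close() failure cannot mask what really went wrong.
      case t: Throwable if originalThrowable != null =>
        System.err.println(s"Suppressing exception in finally: ${t.getMessage}")
        throw originalThrowable
    }
  }
}

At the call site this reads as tryWithSafeFinally { writer.write(record) } { writer.close() }, so the write failure, not any knock-on close failure, is what surfaces.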

@SparkQA

SparkQA commented Mar 27, 2015

Test build #29280 has started for PR 5223 at commit f42e92d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 27, 2015

Test build #29280 has finished for PR 5223 at commit f42e92d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29280/

@srowen
Member

srowen commented Mar 27, 2015

I like the idea. What I slightly don't like is making a utility method but then using it to solve just one instance of the issue. I think this doesn't need to apply to all finally blocks, just to situations like this one, where a knock-on exception is likely to hide the useful original one. Could you survey the codebase and see if there are other instances that could benefit from this treatment? I think it's a good improvement.

@stephenh
Contributor Author

@srowen I waffled on the utility method, but went with it only because putting the logic inline with the writer.write/writer.close logic made it pretty horrible to read. That finally/try/finally is not exactly pretty.

Sure, I can scan for other places in the codebase that are obvious candidates for this; although I'm tempted to think the opposite: I'd almost always want this "don't suppress the real exception" behavior, rather than hiding what really happened.

@srowen
Member

srowen commented Mar 28, 2015

@stephenh how about other exactly similar cases in the code base, at least -- a writer which may fail with an exception, which often causes its closing to fail too, since it often triggers a flush? I don't think there are a load of them so it wouldn't be too disruptive. Use your judgment.

@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch from f42e92d to d7949a3 Compare March 28, 2015 23:02
@SparkQA

SparkQA commented Mar 28, 2015

Test build #29354 has started for PR 5223 at commit d7949a3.

  • This patch merges cleanly.

@stephenh
Contributor Author

Okay, just grepping for "writer.write" in core/src/main, I found another place in PairRDDFunctions, and then an existing "safe finally" usage in ShuffleMapTask.

I've glanced around at other finally blocks, and, yeah, most of them seem fine/more innocent, so this is probably fine for now. I'll keep poking around, but this is what I have so far. Let me know if there are places more suspect than others.

@SparkQA

SparkQA commented Mar 29, 2015

Test build #29354 has finished for PR 5223 at commit d7949a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29354/

log.debug("Could not stop writer", e)
}
throw e
} {
Member

The code was not closing the writer during normal execution before, so this is good. However, this change also means that it will stop with success = false whether an error occurred or not, and will do so even after stop was called with success = true. This one may not fit the pattern.

Contributor Author

Ah, yes, sorry, good catch.
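
To make the concern concrete, a hedged sketch of preserving the success flag for a writer exposing stop(success: Boolean), as ShuffleMapTask's does; the trait and names below are illustrative, not Spark's actual API:

trait ShuffleWriterLike {
  def write(records: Iterator[(Any, Any)]): Unit
  def stop(success: Boolean): Unit
}

def runTask(writer: ShuffleWriterLike, records: Iterator[(Any, Any)]): Unit = {
  var success = false
  try {
    writer.write(records)
    success = true  // only reached if write completed without throwing
  } finally {
    writer.stop(success)  // stop(true) on the happy path, stop(false) on failure
  }
}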

@srowen
Member

srowen commented Mar 29, 2015

Grepping for something coarse like out.*\.close\( found about 30 potential instances. Unfortunately a lot of them have a slightly more significant problem: they don't even close() in a finally block. No exception masking, but also no cleanup on the error path.

In many of the cases (tests, utility helper code) it probably doesn't matter, but some of them look worth fixing with this nice utility method, which gets both the finally and the exception-masking situations right.
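
As a hedged illustration of the two failure modes described above, using the tryWithSafeFinally shape sketched earlier (the stream type and method names are chosen for the example):

import java.io.FileOutputStream

// Unsafe: if write() throws, close() is never reached and the stream leaks.
def writeBytesLeaky(path: String, bytes: Array[Byte]): Unit = {
  val out = new FileOutputStream(path)
  out.write(bytes)
  out.close()
}

// Safer: close() runs on both paths, and if it throws after a failed write(),
// the original write failure is still the exception that propagates.
def writeBytesSafely(path: String, bytes: Array[Byte]): Unit = {
  val out = new FileOutputStream(path)
  tryWithSafeFinally {
    out.write(bytes)
  } {
    out.close()
  }
}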

@stephenh
Contributor Author

Good idea about out.close; I'll look into that, but it might take a day or two to find some time. Thanks for the suggestion.

@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch 2 times, most recently from 7e9cd5d to 9ce88d4 Compare March 31, 2015 16:01
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29487 has started for PR 5223 at commit 9ce88d4.

@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch from 9ce88d4 to 3e72362 Compare March 31, 2015 16:05
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29489 has started for PR 5223 at commit 3e72362.

@stephenh
Contributor Author

Okay, I put back the ShuffleMapTask change and found a few other "out.close" places. I was only looking in core/src/main. Were there any you'd specifically seen that I've missed?

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29489 has finished for PR 5223 at commit 3e72362.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29489/

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29487 has finished for PR 5223 at commit 9ce88d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29487/

val out = new DataOutputStream(conn.getOutputStream)
out.write(json.getBytes(Charsets.UTF_8))
out.close()
var out: DataOutputStream = null
Member

Let's see if there are other changes that are needed, but I think this pattern is slightly inconsistent with the others. out can be a val that is declared before the try block. I suppose if the object to be closed is never instantiated, there's nothing to call close() on anyway. If we were really picky we'd have to handle the case of conn.getOutputStream succeeding while the constructor fails, but I think that is not needed here.

Contributor Author

Er, right... I had looked at this and I'm not sure why I thought it needed to be a var... will change.
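
For concreteness, a hedged sketch of the val form being suggested; StandardCharsets.UTF_8 stands in for Guava's Charsets.UTF_8 to keep the example dependency-free (it requires JDK 7+, while the codebase then still targeted 1.6):

import java.io.DataOutputStream
import java.net.HttpURLConnection
import java.nio.charset.StandardCharsets

def postJson(conn: HttpURLConnection, json: String): Unit = {
  // out as a val created up front; if getOutputStream or the constructor
  // throws, there is nothing to close yet anyway.
  val out = new DataOutputStream(conn.getOutputStream)
  tryWithSafeFinally {
    out.write(json.getBytes(StandardCharsets.UTF_8))
  } {
    out.close()
  }
}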

@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch from 3e72362 to 15c4a6f Compare March 31, 2015 20:47
@SparkQA

SparkQA commented Mar 31, 2015

Test build #29500 has started for PR 5223 at commit 15c4a6f.

@srowen
Member

srowen commented Mar 31, 2015

Hm, I think there are maybe 10 more examples of this pattern that could be touched up in the code base. MapOutputTracker has one, another in PythonRDD, HttpBroadcast has one, etc. In some cases the stream isn't cleaned up.

Maybe it's a good time to pause to ask if anyone else has thoughts either way?

I favor cleaning this up, at least the non-test, non-example, core instances.

@stephenh
Contributor Author

Ah, yeah... just grepping for ".close()" shows a lot more hits. I've poked at a few that you mentioned.

I'm all for getting other feedback...will someone notice our chatter on the PR, or do you want to specifically loop someone in?

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29500 has finished for PR 5223 at commit 15c4a6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29500/

@srowen
Member

srowen commented Apr 1, 2015

Yeah this is looking good. I identified some more instances that might use the same treatment; have a look and see what you think:

  • Utils.RedirectThread.run()
  • PythonRDD.PythonBroadcast.readObject()
  • HttpBroadcast.write()
  • MapOutputTracker.serializeMapStatuses()
  • DiskBlockObjectWriter.close(), .revertPartialWritesAndClose()
  • CheckpointRDD.writeToFile()

CC @pwendell @rxin for thoughts on this which touches bits of code all over the place, but in core in particular. I think it improves error reporting and cleanup and is a clean, targeted change everywhere.

// We could do originalThrowable.addSuppressed(t), but it's
// not available in JDK 1.6.
logWarning(s"Skipping exception that happening in finally", t)
throw originalThrowable
Contributor

Should we do some of our own munging here to append to the original message in order to indicate the swallowed exception as well? Or maybe it's too verbose and not worth it?

Contributor Author

Sure, will add.
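
A hedged sketch of what that munging might look like, given that JDK 1.6 lacks Throwable.addSuppressed; the helper name is illustrative and the merged code may phrase this differently:

def rethrowPreferringOriginal(original: Throwable, fromFinally: Throwable): Nothing = {
  // The swallowed exception can't be attached via addSuppressed on JDK 1.6,
  // so surface it through the log message alongside the rethrown original.
  System.err.println(
    s"Exception in finally: ${fromFinally.getMessage} " +
      s"(suppressed in favor of original: ${original.getMessage})")
  throw original
}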

@pwendell
Contributor

pwendell commented Apr 1, 2015

This LGTM with one minor comment - I looked closely at the external sorter code, which seemed like the trickiest bit, and from what I can tell the existing behavior is preserved.

Thanks, this is a nice construct we can use throughout the codebase.

@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch from 15c4a6f to 6092217 Compare April 2, 2015 17:06
@stephenh stephenh force-pushed the do_not_suppress_writer_exception branch from 6092217 to c7ad53f Compare April 2, 2015 17:07
@SparkQA

SparkQA commented Apr 2, 2015

Test build #29614 has started for PR 5223 at commit 6092217.

@stephenh
Contributor Author

stephenh commented Apr 2, 2015

Okay, @srowen I believe I've addressed each of those additional locations.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29615 has started for PR 5223 at commit c7ad53f.

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29614 has finished for PR 5223 at commit 6092217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29614/

@SparkQA

SparkQA commented Apr 2, 2015

Test build #29615 has finished for PR 5223 at commit c7ad53f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29615/

@asfgit asfgit closed this in b0d884f Apr 3, 2015