
[SPARK-7689] Deprecate spark.cleaner.ttl #6220

Closed
JoshRosen wants to merge 1 commit

Conversation

JoshRosen
Contributor

With the introduction of ContextCleaner (in #126), I think there's no longer any reason for users to enable the MetadataCleaner / spark.cleaner.ttl. This patch removes the last remaining documentation for spark.cleaner.ttl and logs a deprecation warning if it is used.

This configuration used to be relevant for Spark Streaming jobs, but that no longer seems to be the case: the latest Streaming docs have removed all mentions of spark.cleaner.ttl (see https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817, for example). TTL-based cleaning is not safe and may prematurely clean resources that are still being used, leading to confusing errors (such as https://issues.apache.org/jira/browse/SPARK-5594), so it generally should not be enabled (see http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html for an older, related discussion).

The only use-case that I can think of is super-long-lived Spark REPLs where you're worried about orphaning RDDs or broadcast variables in your REPL history and having them never get cleaned up, but I don't know that anyone uses spark.cleaner.ttl for this in practice.
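For context, here is a minimal sketch (not part of this patch) contrasting the setting being deprecated with the ContextCleaner-based default; the master, app names, and TTL value are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deprecated by this patch: TTL-based cleanup, where MetadataCleaner drops any
// metadata older than the TTL on a timer, even if the corresponding RDD or
// broadcast variable is still in use.
val ttlConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("ttl-example")
  .set("spark.cleaner.ttl", "3600") // seconds; unsafe, may clean live data

// Preferred: set nothing. ContextCleaner automatically cleans up shuffle data,
// RDDs, broadcasts, and accumulators once their driver-side references have
// been garbage-collected.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("context-cleaner-example"))
```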

@JoshRosen
Contributor Author

/cc @tdas

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32941/
Test FAILed.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32943/
Test FAILed.

@SparkQA

SparkQA commented May 17, 2015

Test build #818 has started for PR 6220 at commit 608cdc9.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32968 has started for PR 6220 at commit 608cdc9.

@SparkQA

SparkQA commented May 18, 2015

Test build #32968 has finished for PR 6220 at commit 608cdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32968/
Test FAILed.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32971 has started for PR 6220 at commit 608cdc9.

@tdas
Contributor

tdas commented May 18, 2015

This is tricky, actually. The problem is that the reference-tracking mechanism we have is not bullet-proof, since it depends on GC behavior in the driver. This is a problem I have seen in Spark Streaming programs as well. For drivers with large heaps, an RDD that is no longer in scope may not get dereferenced for a long time, until a full GC occurs, and in the meantime nothing will get cleaned. The solution to that is to call System.gc() at some interval, say every hour.

So I am wondering whether this deprecation message should cover that aspect or not.
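For concreteness, a rough sketch of the driver-side workaround being described; the one-hour interval and the standalone scheduler are illustrative assumptions, not something this patch adds:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative only: periodically force a full GC on the driver so that
// out-of-scope RDDs and broadcast variables become weakly reachable and
// ContextCleaner can clean them up, even when a large driver heap means a
// full GC would otherwise not happen for a long time.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 1, 1, TimeUnit.HOURS)
```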

@SparkQA

SparkQA commented May 18, 2015

Test build #32971 has finished for PR 6220 at commit 608cdc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32971/
Test PASSed.

@JoshRosen
Contributor Author

It's probably worth documenting the driver System.gc() trick somewhere in the main documentation. There's a nice writeup at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html that Chris and I wrote; maybe we can repurpose some of that text.

The TTL-based mechanism won't work in many cases, such as streaming jobs that join streaming and historical data. Given that there are so many corner cases where the TTL might not work as expected, I'm in favor of removing the documentation. There are probably only a handful of power users who would be able to use this safely while understanding all of the corner cases. We can still leave the setting in, but I'd like to avoid having a documented setting that's so unsafe to use. If you feel strongly that it should be documented, then I can see about updating its doc to give more warnings about the corner cases.

@andrewor14
Contributor

I wonder if we should do the System.gc() internally ourselves every hour or so to trigger clean up. I am in full favor of deprecating or even removing spark.cleaner.ttl completely, but maybe we should introduce the periodic driver GC mechanism as an alternative for streaming applications.

spark.referenceTracking.gcInterval or something? @tdas @JoshRosen What do you think about defaulting it to 1 hour?
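If such a knob were added (to be clear, spark.referenceTracking.gcInterval is only the name proposed above, not an existing config), usage might look roughly like this:

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch only: neither this key nor its default exists in Spark
// at this point; it is just the name proposed in the comment above.
val conf = new SparkConf().set("spark.referenceTracking.gcInterval", "30min")
val gcIntervalSeconds = conf.getTimeAsSeconds("spark.referenceTracking.gcInterval", "1h")
```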

@JoshRosen
Contributor Author

An internal System.gc() is a pretty good idea. I was wondering whether a one-hour default might be too long, but maybe not:

If the driver fills up with too much in-memory metadata, then GC will kick in and clean it up, so I guess we're only worried about cases where we run out of a non-memory resource, such as disk space, because GC wasn't run on the driver. You can probably back-of-the-envelope calculate the right GC interval based on your disk capacity and the maximum write throughput of your disks: if you have 100 gigabytes of temporary space for shuffle files and can only write at a maximum speed of 100 MB/s, then running GC at least once every 15 minutes should be sufficient to prevent the disks from filling up (since 100 gigabytes / (100 megabytes / second) = 1000 seconds, or roughly 16.7 minutes, to fill the disks).
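The same back-of-the-envelope calculation, spelled out with the illustrative numbers above:

```scala
// Back-of-the-envelope: how long until the scratch disks fill up if nothing is cleaned?
val scratchBytes = 100e9                              // 100 GB of shuffle scratch space
val writeBytesPerSec = 100e6                          // 100 MB/s sustained write throughput
val secondsToFill = scratchBytes / writeBytesPerSec   // 1000 seconds
println(secondsToFill / 60)                           // ~16.7 minutes until the disks are full
```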

@andrewor14
Contributor

Yeah, since it's a best-effort thing maybe it makes sense to do it more frequently. 15 minutes sounds fine to me.

@tdas
Contributor

tdas commented May 21, 2015

Sounds good to me. The question is: do we add it to Spark 1.4?

TD

@JoshRosen
Contributor Author

We might be able to add this to 1.4 if we feature-flag it as off-by-default. We can recommend this as a replacement for spark.cleaner.ttl in 1.4, then fully remove MetadataCleaner in 1.5 and enable this by default.
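A rough sketch of the off-by-default gating being suggested; the config key below is made up for illustration and is not part of this patch:

```scala
import org.apache.spark.SparkConf

// Hypothetical flag name, made up for illustration: keep the periodic driver GC
// disabled unless explicitly enabled, so 1.4 behavior is unchanged by default.
val conf = new SparkConf()
val periodicGcEnabled = conf.getBoolean("spark.cleaner.periodicGC.enabled", false)
if (periodicGcEnabled) {
  // start the scheduled System.gc() task sketched earlier in this thread
}
```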

@tdas
Contributor

tdas commented May 21, 2015

I am not sure it is a good idea to completely remove MetadataCleaner. Old workloads that relied on it could break completely if it were removed.

@andrewor14
Contributor

Ok, but we should have ugly paragraph warnings that explain why it's a bad idea.

@@ -478,7 +478,12 @@ private[spark] object SparkConf extends Logging {
       DeprecatedConfig("spark.kryoserializer.buffer.mb", "1.4",
         "Please use spark.kryoserializer.buffer instead. The default value for " +
         "spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values " +
-        "are no longer accepted. To specify the equivalent now, one may use '64k'.")
+        "are no longer accepted. To specify the equivalent now, one may use '64k'."),
+      DeprecatedConfig("spark.cleaner.ttl", "1.4",
Contributor

oops looks like this needs to be 1.5 now

Contributor Author

yeah, this patch is out of date; I was waiting until I had time to do the periodic GC timer feature; feel free to work-steal if you want to pick this up :)

Contributor

nope, it's all yours...

@JoshRosen
Contributor Author

I'm not actively working on this and won't have time to get to it for a while, so I'm going to close this PR and unassign the JIRA in the hopes that someone else can take over.
