
[SPARK-7689] Deprecate spark.cleaner.ttl #6220

Closed
JoshRosen wants to merge 1 commit

Conversation

JoshRosen
Contributor

With the introduction of ContextCleaner (in #126), I think there's no longer any reason for users to enable the MetadataCleaner / spark.cleaner.ttl. This patch removes the last remaining documentation for spark.cleaner.ttl and logs a deprecation warning if it is used.

This configuration used to be relevant for Spark Streaming jobs, but that no longer seems to be the case: the latest Streaming docs have removed all mentions of spark.cleaner.ttl (see https://github.com/apache/spark/pull/4956/files#diff-dbee746abf610b52d8a7cb65bf9ea765L1817, for example). TTL-based cleaning is not safe and may prematurely clean resources that are still being used, leading to confusing errors (such as https://issues.apache.org/jira/browse/SPARK-5594), so it generally should not be enabled (see http://apache-spark-user-list.1001560.n3.nabble.com/is-spark-cleaner-ttl-safe-td2557.html for an older, related discussion).

The only use-case that I can think of is super-long-lived Spark REPLs where you're worried about orphaning RDDs or broadcast variables in your REPL history and having them never get cleaned up, but I don't know that anyone uses spark.cleaner.ttl for this in practice.
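For context, here is a minimal sketch (not part of this patch) contrasting the setting being deprecated with the ContextCleaner-based default; the master, app names, and TTL value are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deprecated by this patch: TTL-based cleanup, where MetadataCleaner drops any
// metadata older than the TTL on a timer, even if the corresponding RDD or
// broadcast variable is still in use.
val ttlConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("ttl-example")
  .set("spark.cleaner.ttl", "3600") // seconds; unsafe, may clean live data

// Preferred: set nothing. ContextCleaner automatically cleans up shuffle data,
// RDDs, broadcasts, and accumulators once their driver-side references have
// been garbage-collected.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("context-cleaner-example"))
```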

@JoshRosen
Contributor Author

/cc @tdas

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32941/
Test FAILed.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32943/
Test FAILed.

@SparkQA

SparkQA commented May 17, 2015

Test build #818 has started for PR 6220 at commit 608cdc9.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32968 has started for PR 6220 at commit 608cdc9.

@SparkQA

SparkQA commented May 18, 2015

Test build #32968 has finished for PR 6220 at commit 608cdc9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32968/
Test FAILed.

@JoshRosen
Contributor Author

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32971 has started for PR 6220 at commit 608cdc9.

@tdas
Contributor

tdas commented May 18, 2015

This is tricky, actually. The problem is that the reference-tracking mechanism we have is not bullet-proof, since it depends on GC behavior in the driver. This is a problem I have seen in Spark Streaming programs as well. For drivers with large heaps, an RDD that is no longer in scope may not get dereferenced for a long time, until a full GC occurs, and in the meantime nothing will get cleaned. The solution to that is to call System.gc() at some interval, say every hour.

So I am wondering whether this deprecation message should cover that aspect or not.
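For concreteness, a rough sketch of the driver-side workaround being described; the one-hour interval and the standalone scheduler are illustrative assumptions, not something this patch adds:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative only: periodically force a full GC on the driver so that
// out-of-scope RDDs and broadcast variables become weakly reachable and
// ContextCleaner can clean them up, even when a large driver heap means a
// full GC would otherwise not happen for a long time.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 1, 1, TimeUnit.HOURS)
```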

@SparkQA

SparkQA commented May 18, 2015

Test build #32971 has finished for PR 6220 at commit 608cdc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32971/
Test PASSed.

@JoshRosen
Contributor Author

It's probably worth documenting the driver System.gc() trick somewhere in the main documentation. There's a nice writeup at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html that Chris and I wrote; maybe we can repurpose some of that text.

The TTL-based mechanism won't work in many cases, such as streaming jobs that join streaming and historical data. Given that there are so many corner cases where the TTL might not work as expected, I'm in favor of removing the documentation. There are probably only a handful of power users who would be able to use this safely while understanding all of the corner cases. We can still leave the setting in, but I'd like to avoid having a documented setting that's so unsafe to use. If you feel strongly that it should be documented, then I can see about updating its doc to give more warnings about the corner cases.

@andrewor14
Contributor

I wonder if we should do the System.gc() internally ourselves every hour or so to trigger clean up. I am in full favor of deprecating or even removing spark.cleaner.ttl completely, but maybe we should introduce the periodic driver GC mechanism as an alternative for streaming applications.

spark.referenceTracking.gcInterval or something? @tdas @JoshRosen What do you think about defaulting it to 1 hour?
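If such a knob were added (to be clear, spark.referenceTracking.gcInterval is only the name proposed above, not an existing config), usage might look roughly like this:

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch only: neither this key nor its default exists in Spark
// at this point; it is just the name proposed in the comment above.
val conf = new SparkConf().set("spark.referenceTracking.gcInterval", "30min")
val gcIntervalSeconds = conf.getTimeAsSeconds("spark.referenceTracking.gcInterval", "1h")
```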

@JoshRosen
Contributor Author

An internal System.gc() is a pretty good idea. I was wondering whether a one-hour default might be too long, but maybe not:

If the driver fills up with too much in-memory metadata, then GC will kick in and clean it up, so I guess we're only worried about cases where we run out of a non-memory resource, such as disk space, because GC wasn't run on the driver. You can probably back-of-the-envelope calculate the right GC interval based on your disk capacity and the maximum write throughput of your disks: if you have 100 gigabytes of temporary space for shuffle files and can only write at a maximum speed of 100 MB/s, then running GC at least once every 15 minutes should be sufficient to prevent the disks from filling up (since 100 gigabytes / (100 megabytes / second) = 1000 seconds, or roughly 16.7 minutes, to fill the disks).
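The same back-of-the-envelope calculation, spelled out with the illustrative numbers above:

```scala
// Back-of-the-envelope: how long until the scratch disks fill up if nothing is cleaned?
val scratchBytes = 100e9                              // 100 GB of shuffle scratch space
val writeBytesPerSec = 100e6                          // 100 MB/s sustained write throughput
val secondsToFill = scratchBytes / writeBytesPerSec   // 1000 seconds
println(secondsToFill / 60)                           // ~16.7 minutes until the disks are full
```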

@andrewor14
Contributor

Yeah, since it's a best-effort thing maybe it makes sense to do it more frequently. 15 minutes sounds fine to me.

@tdas
Contributor

tdas commented May 21, 2015

Sounds good to me. The question is: do we add it to Spark 1.4?

TD

@JoshRosen
Contributor Author

We might be able to add this to 1.4 if we feature-flag it as off-by-default. We can recommend this as a replacement for spark.cleaner.ttl in 1.4, then fully remove MetadataCleaner in 1.5 and enable this by default.
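A rough sketch of the off-by-default gating being suggested; the config key below is made up for illustration and is not part of this patch:

```scala
import org.apache.spark.SparkConf

// Hypothetical flag name, made up for illustration: keep the periodic driver GC
// disabled unless explicitly enabled, so 1.4 behavior is unchanged by default.
val conf = new SparkConf()
val periodicGcEnabled = conf.getBoolean("spark.cleaner.periodicGC.enabled", false)
if (periodicGcEnabled) {
  // start the scheduled System.gc() task sketched earlier in this thread
}
```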

@tdas
Contributor

tdas commented May 21, 2015

I am not sure it is a good idea to completely remove MetadataCleaner. Old workloads that relied on it could break completely if it were removed.

@andrewor14
Contributor

Ok, but we should have ugly paragraph warnings that explain why it's a bad idea.

@@ -478,7 +478,12 @@ private[spark] object SparkConf extends Logging {
       DeprecatedConfig("spark.kryoserializer.buffer.mb", "1.4",
         "Please use spark.kryoserializer.buffer instead. The default value for " +
         "spark.kryoserializer.buffer.mb was previously specified as '0.064'. Fractional values " +
-        "are no longer accepted. To specify the equivalent now, one may use '64k'.")
+        "are no longer accepted. To specify the equivalent now, one may use '64k'."),
+      DeprecatedConfig("spark.cleaner.ttl", "1.4",
Contributor

oops looks like this needs to be 1.5 now

Contributor Author

yeah, this patch is out of date; I was waiting until I had time to do the periodic GC timer feature; feel free to work-steal if you want to pick this up :)

Contributor

nope, it's all yours...

@JoshRosen
Contributor Author

I'm not actively working on this and won't have time to get to it for a while, so I'm going to close this PR and unassign the JIRA in the hopes that someone else can take over.
