Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-59587] Use weakValues on the inner caches in SandboxResolvingClassLoader.parentClassCache instead of on the outer cache #265

Merged
merged 4 commits into from
Oct 1, 2019

Conversation

dwnusbaum
Copy link
Member

@dwnusbaum dwnusbaum commented Sep 23, 2019

See JENKINS-59587.

While investigating a user-reported performance issue, I noticed among other things that their Pipelines were spending a ton of time blocked waiting to acquire the lock for PluginManager$UberClassLoader while loading classes in SandboxResolvingClassLoader. This was a little surprising, because this user has a single shared library that is used by all of their Pipeline builds, so I would've thought that they would have a very high cache hit ratio, but that did not seem to be the case.

Looking into it, I noticed that in SandboxResolvingClassLoader, we were using weakValues on parentClassCache itself as of #252. This didn't make sense to me, because the values of parentClassCache are Cache<String, Class<?>>, and from what I can tell, these caches are only referenced internally by parentClassCache, which meant that the inner caches would always be removed when the garbage collector runs, making the cache not very useful when Jenkins is under significant memory pressure (which was definitely case for this user for other reasons).

This PR switches to using weakValues on parentClassCache's inner caches, rather than parentClassCache itself. This way, the classes referenced by the inner cache are still only weakly referenced, so that their class loader, which may itself be a key of parentClassCache, is still able to be garbage collected. Looking into it some more, I think #252 was doing this already before 36d3b3a, but that commit changed the semantics from Cache<ClassLoader, Cache<String, Optional<WeakReference<Class<?>>>>> to Cache<ClassLoader, WeakReference<Cache<String, Optional<Class<?>>>>>, and unfortunately we didn't notice that change in semantics during review.

I added a basic smoke test that fails if you revert the changes in src/main. GroovyMemoryLeakTest still passes, but I'm not sure what other testing would be useful to make sure this doesn't cause any regressions.

…entClassCache instead of on the outer caches
}

return builder.build(parentLoader -> Caffeine.newBuilder()./* allow new plugins to be used, and clean up memory */expireAfterWrite(Duration.ofMinutes(15)).build());
return outerBuilder.build(parentLoader -> innerBuilder.build());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that reusing innerBuilder.build should be safe according to its Javadoc.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT

@dwnusbaum
Copy link
Member Author

One other thing that seemed a bit concerning in the thread dumps I saw was that some class loading threads held locks on a ConcurrentHashMap$Node while they were waiting for the lock for PluginManager$UberClassLoader, and other threads were then blocked waiting for that same ConcurrentHashMap$Node. Caffeine's Cache.get Javadoc states that "Some attempted update operations on this cache by other threads may be blocked while the computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this cache".

Looking into the code, I think BoundedLocalCache.computeIfAbsent first tries to do a lookup without locking, and the locking only occurs if the value needs to be computed. In our case, I don't think the inner cache can load more than value concurrently from PluginManager$UberClassLoader anyway, so I don't know if this actually matters. IIUC, the preferred approach for slow mapping functions is to use AsyncCache, but it seems like in our case that would just move some of the locking from ConcurrentHashMap$Node to PluginManager$UberClassLoader for cache misses while cache hits would be unaffected because of the no-lock happy path.

Example thread holding a ConcurrentHashMap$Node lock:

java.lang.Thread.State: BLOCKED (on object monitor)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
  - waiting to lock <0x00000003c21e35f0> (a hudson.PluginManager$UberClassLoader)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader.lambda$loadClass$0(SandboxResolvingClassLoader.java:51)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$$Lambda$269/265156624.get(Unknown Source)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader.lambda$load$2(SandboxResolvingClassLoader.java:85)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader$$Lambda$271/201356224.apply(Unknown Source)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$14(BoundedLocalCache.java:2337)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$270/1614184557.apply(Unknown Source)
  at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892)
  - locked <0x000000078a3ed640> (a java.util.concurrent.ConcurrentHashMap$Node)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2335)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2318)
  at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:111)
  at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:54)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader.load(SandboxResolvingClassLoader.java:79)

Example thread blocked waiting for that ConcurrentHashMap$Node:

java.lang.Thread.State: BLOCKED (on object monitor)
  at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1868)
  - waiting to lock <0x000000078a3ed640> (a java.util.concurrent.ConcurrentHashMap$Node)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2335)
  at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2318)
  at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:111)
  at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:54)
  at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader.load(SandboxResolvingClassLoader.java:79)

@dwnusbaum dwnusbaum changed the title Use weakValues on the inner caches in SandboxResolvingClassLoader.parentClassCache instead of on the outer cache [JENKINS-59587] Use weakValues on the inner caches in SandboxResolvingClassLoader.parentClassCache instead of on the outer cache Sep 30, 2019
@dwnusbaum
Copy link
Member Author

I enabled statistics collection so that we can easily evaluate cache performance through the script console with a snippet like this:

import com.github.benmanes.caffeine.cache.stats.CacheStats;
import org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoader;
ClassLoader loader = Jenkins.getInstance().getPluginManager().uberClassLoader;
CacheStats stats = SandboxResolvingClassLoader.parentClassCache.get(loader).stats();
println(stats.hitRate());

@jglick
Copy link
Member

jglick commented Sep 30, 2019

java.lang.AssertionError: 

Expected: <2L>
     but: was <795L>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxResolvingClassLoaderTest.classCacheDoesNotHoldClassValuesTooWeakly(SandboxResolvingClassLoaderTest.java:64)

@jglick
Copy link
Member

jglick commented Sep 30, 2019

evaluate cache performance through the script console

Can this be done via metrics, or support-core, or something?

@dwnusbaum
Copy link
Member Author

dwnusbaum commented Sep 30, 2019

Can this be done via metrics, or support-core, or something?

Sure, now that we are recording stats, we could hook them up to whatever extension points might use them, although I'd prefer to not add any dependencies to this plugin if possible.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If CI passes!

@dwnusbaum
Copy link
Member Author

I explored the things mentioned in #265 (comment) in #271. In summary, unless JENKINS-23784 was addressed, using an asynchronous cache makes no difference, so I think it makes sense to keep using the manual cache for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants