Reworked the way allocated memory is tracked #1239
Conversation
Replaced the WeakHashMap with a doubly linked list of WeakReferences to improve performance by reducing the time spent in synchronized blocks.
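The approach described above can be sketched roughly as follows. This is a minimal illustration of the technique, not JNA's actual `Memory`/`LinkedReference` code: the class name, single class-level lock, and `liveCount` helper are my own assumptions. The key point is that both linking and unlinking are O(1) pointer updates inside the lock, instead of hash-map operations.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Sketch: track live instances with an intrusive doubly linked list of
// WeakReferences instead of a WeakHashMap. Insertion and unlinking touch
// only a few pointers inside the synchronized block.
public class TrackedNode extends WeakReference<Object> {
    private static TrackedNode HEAD; // head of the doubly linked list
    static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();

    private TrackedNode prev;
    private TrackedNode next;

    private TrackedNode(Object referent) {
        super(referent, QUEUE);
    }

    // Link a new node at the head; O(1) while holding the lock.
    static synchronized TrackedNode track(Object instance) {
        TrackedNode node = new TrackedNode(instance);
        node.next = HEAD;
        if (HEAD != null) {
            HEAD.prev = node;
        }
        HEAD = node;
        return node;
    }

    // Unlink a node; O(1), no iteration over all tracked entries.
    static synchronized void unlink(TrackedNode node) {
        if (node.prev != null) {
            node.prev.next = node.next;
        } else if (HEAD == node) {
            HEAD = node.next;
        }
        if (node.next != null) {
            node.next.prev = node.prev;
        }
        node.prev = null;
        node.next = null;
    }

    // Walks the list; only used here to make the sketch observable.
    static synchronized int liveCount() {
        int n = 0;
        for (TrackedNode r = HEAD; r != null; r = r.next) {
            n++;
        }
        return n;
    }
}
```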
This performance improvement looks great! I've left a few comments/suggestions inline for your consideration.
src/com/sun/jna/Memory.java (outdated):

```java
private static LinkedReference HEAD; // the head of the doubly linked list used for instance tracking
```
I'd generally prefer to see the variable definition before it's used in the code.
This could be done, but then `LinkedReference` would not be defined at that point. I think this is a chicken-and-egg situation; how best to solve it?
It's a matter of preference/style. I'm not sure there's a right answer.
In this case, `LinkedReference` is a class, and there's not generally any need to define a class first (I suppose an `import` statement would normally do that) or in any specific order, compared to a field that's accessed, which does have an impact for later fields. When reading code, if I see a class reference, I know "that class is defined somewhere; I can eventually find it either in this file or another one," whereas when I see a variable referenced, I generally scroll up to see where it's defined.
Not a big deal, just expressing my preference! :)
The suggested changes should be measured first. The suggested finer-grained locking adds additional complexity to the code, and nested locks could lead to deadlocks if not all callers follow the same lock order. I would also not care too much about the performance of …
The nesting only goes one way, so I don't see how a deadlock is possible. It will be interesting to see if this helps or hurts (or has no impact) on the performance.
I wasn't suggesting to remove the locks, just the re-linking only to destroy the links. But since no new objects are involved, even that process would be a tiny impact, so I agree the current implementation is sufficient.
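The one-way-nesting argument above can be illustrated with a toy example (hypothetical lock fields, not JNA's actual code): deadlock requires two threads acquiring the same pair of locks in opposite orders, so if every path that takes both locks takes them in the same direction, the wait-for graph stays acyclic.

```java
public class LockOrdering {
    private static final Object QUEUE_LOCK = new Object(); // always outer
    private static final Object HEAD_LOCK = new Object();  // always inner
    private static int tracked;

    // Every caller that needs both locks nests in the same direction:
    // QUEUE_LOCK first, then HEAD_LOCK. With a single global acquisition
    // order there can be no lock cycle, hence no deadlock.
    static void track() {
        synchronized (QUEUE_LOCK) {
            synchronized (HEAD_LOCK) {
                tracked++;
            }
        }
    }

    static void unlink() {
        synchronized (QUEUE_LOCK) {
            synchronized (HEAD_LOCK) {
                tracked--;
            }
        }
    }

    // Taking only the inner lock on its own is safe; what must never
    // happen is taking HEAD_LOCK and then waiting for QUEUE_LOCK.
    static int count() {
        synchronized (HEAD_LOCK) {
            return tracked;
        }
    }
}
```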
Well, I did some benchmarking of my suggestion tonight and got results that, in retrospect, are probably exactly what one should expect. Of relevance, this is an 8 core (16 hyperthreaded) machine.
Thank you for the measurements. For the first measurement (one vs. two locks): we should stay with the currently proposed version from @joerg1985. For the second measurement (array of locked lists): I assume this is what you mean: …

This comes at a price though: the …
I agree. I'm ok to merge it as is, although after playing with these classes for a bit I'm still unhappy with the style with regards to ordering of things. But that's just me. :)
All correct, except I stored the list number in the …

I'm not suggesting that we merge the array version, though. This is mostly observational regarding the impact of locking that we really only need to enable …
For fun, I tried to "improve" the performance by getting rid of the AtomicInteger increment, using some internal value in the object to choose which map to use. Neither the …
So I think there is something to the round-robin nature of the "add to head" locks that's the limiting performance factor here, up to a certain point (4?). Anyway, I don't think we should hold up this PR; it's just an interesting experiment that may inform future attempts to tweak this.
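The striped-list experiment discussed in this exchange can be sketched as follows. This is hypothetical code, not from the PR: the stripe count, the `AtomicInteger` round-robin selection, and storing the stripe index for later unlinking are all assumptions reconstructed from the conversation. Using a power-of-two stripe count lets a bitmask replace the modulo, as mentioned further down the thread.

```java
import java.util.LinkedList;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: spread tracked instances across N independently locked lists
// to reduce contention on a single head lock.
public class StripedTracker {
    private static final int STRIPES = 4; // power of two, so mask == modulo
    private final LinkedList<Object>[] lists;
    private final Object[] locks;
    private final AtomicInteger counter = new AtomicInteger();

    @SuppressWarnings("unchecked")
    public StripedTracker() {
        lists = new LinkedList[STRIPES];
        locks = new Object[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            lists[i] = new LinkedList<>();
            locks[i] = new Object();
        }
    }

    // Round-robin stripe selection; only the chosen stripe's lock is held.
    public int track(Object instance) {
        int stripe = counter.getAndIncrement() & (STRIPES - 1);
        synchronized (locks[stripe]) {
            lists[stripe].addFirst(instance);
        }
        return stripe; // caller remembers the stripe for unlinking
    }

    public void untrack(int stripe, Object instance) {
        synchronized (locks[stripe]) {
            lists[stripe].remove(instance);
        }
    }

    // Sums the per-stripe sizes; used here only to observe the sketch.
    public int size() {
        int n = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                n += lists[i].size();
            }
        }
        return n;
    }
}
```

Note the trade-off the thread points out: the unlink path must know which stripe an instance went into, which is why the list number has to be stored somewhere on the tracked object.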
@dbwiddis I think I know why using …
Ok - if no one objects, I would merge in a day or so. Indeed we can bikeshed about the right place for the definition of the inner class, but in the end, all options (extract the class to an external file, move it to the end, move it before the usage) have their benefits and drawbacks. @joerg1985, do you have changes in flight for this, or are you ok to merge as-is?
Please hold off on merging for a bit; I think I might have identified a memory leak. Less importantly, a note on the earlier modulo conversation: I did bitshift the …
@matthiasblaesing I am done with this, except that a memory leak is confirmed by @dbwiddis
No, I haven't confirmed it. I just have a suspicion that I'm trying to confirm with data. I'm seeing iterations slow down a lot as the length of the benchmark increases, and I'm not sure the performance testing has properly accounted for the full impact of GC when max heap is a limiting factor.
Well, I'm not sure whether there's a memory leak or just an error in how I'm running the profiler. Both the existing method and the new method eventually cause a …

However, I am concerned that the benchmarking I have previously done was not in a memory-limited (GC) scenario and thus was heavily biased toward insertion (head) performance. By setting smaller heap limits (…)

@joerg1985 can you consider redoing your own benchmarks with longer iterations and/or a more limiting heap size?
I will rerun the benchmarks on Fedora 32 / OpenJDK 1.8.0_262 with the following settings and post the results here.
Dispose or unlink entries as early as possible to avoid OOM exceptions under high load with low memory.
I've been playing around a bit more with the benchmarks. I think running close to max heap ends up creating too much garbage, but that's not realistic in a real scenario when many more objects than the …
It's not perfect but I think it's "fair". The new LinkedList still performs well. (JDK14, macOS)
I see another push, though, and will test that as well.
What is this magic you are working?
Here's the non-memory-limited case (shorter runs). Still better than current implementation.
@dbwiddis I was digging around why the WeakHashMap was performing better with limited memory. I can't even tell exactly why this is faster... Well, I think there is one thing to check: if a lot of instances are released and no new ones are created, there might be references in the queue waiting to be processed. I need to check this, and hopefully it will not destroy all the benefits.
Okay, I think the described effect, that an unprocessed ReferenceQueue is stealing memory, doesn't exist. This is how I unsuccessfully tried to get into this state: `public static void main(String[] args) throws InterruptedException { …`
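The test snippet above was truncated in this page. The idea it describes might be reconstructed roughly as follows (my own sketch of the experiment, not the author's code; the allocation counts and sleep are arbitrary): allocate many weakly referenced payloads, never poll the queue, and check whether the enqueued-but-unprocessed references pin the payload memory. They don't — once cleared, the referents are collectable regardless of whether the queue has been drained.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class QueueBacklogTest {
    // Returns how many of the weakly referenced payloads were collected.
    static int run() {
        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        List<WeakReference<byte[]>> refs = new ArrayList<>();

        // Keep the reference objects alive but never poll the queue, so
        // cleared references pile up enqueued and unprocessed.
        for (int i = 0; i < 100_000; i++) {
            refs.add(new WeakReference<>(new byte[1024], queue));
        }
        System.gc();
        try {
            Thread.sleep(200); // give the GC a chance to clear and enqueue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // The payloads (1 KiB arrays) are collectable as soon as GC runs;
        // the unpolled backlog holds only the small reference objects.
        int cleared = 0;
        for (WeakReference<byte[]> r : refs) {
            if (r.get() == null) {
                cleared++;
            }
        }
        return cleared;
    }

    public static void main(String[] args) {
        System.out.println(run() + " of 100000 referents collected");
    }
}
```

The exact count is GC-dependent, but the heap does not grow with the backlog, which matches the author's conclusion.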
Thanks for this analysis! I'm not quite sure what it all means. Can you summarize what you think is the best path for this PR and why? |
I think this PR is working fine and no side effects are obvious. The numbers of the benchmarks are looking good and should now be confirmed by some real world numbers. It would be great if the author of #1211 (@Karasiq) could build and benchmark the master and the branch of this PR to see if it helps there. I don't have any concurrent running code using JNA to get real world numbers. |
I left two inline comments. My numbers on OpenJDK 11:
| Benchmark | Mode | Threads | Samples | Score | Score Error (99,9%) | Unit |
|---|---|---|---|---|---|---|
| 5.6.0 | thrpt | 4 | 25 | 1418278,296963 | 79754,059403 | ops/s |
| 5.7.0-SNAPSHOT | thrpt | 4 | 25 | 1639614,860428 | 301953,00327 | ops/s |
The benchmark is simple:
```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

import com.sun.jna.Memory;

public class MemoryBenchmark {
    @Benchmark
    public void allocateWithMemory(Blackhole blackhole) {
        blackhole.consume(new Memory(1));
    }
}
```
Run with:
```
java -jar target/benchmarks.jar -t max -gc true -rff result.csv
```
The big error margin for the updated code surprises me though.
```java
static LinkedReference track(Memory instance) {
    // use a different lock here to allow the finalizer to unlink elements too
    synchronized (QUEUE) {
```
Is this synchronization needed? While the class lacks documentation regarding threading, the OpenJDK implementation protects the reference queue. There are fixed bugs regarding multithreaded enqueue, so I assume that ReferenceQueue should be safe. From my POV it does not matter which thread unlinks, as the unlink will be protected again by the HEAD lock.
I agree this lock is not needed to avoid concurrency effects with the current OpenJDK implementations. Removing it had a negative effect on performance in my digging around why the WeakHashMap was performing better with low memory.
The references returned by `ReferenceQueue.poll()` will always return `null` on `get()`. Just unlink, without trying to get the referent.
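A drain loop exploiting that guarantee might look like this (a sketch, not the actual JNA code; the `QueueDrain` class and the placement of the unlink call are assumptions). `poll()` only hands out references whose referent has already been cleared, so the loop can unlink immediately without inspecting the referent.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class QueueDrain {
    // Drains the queue; returns how many references were processed.
    static int drain(ReferenceQueue<Object> queue) {
        int drained = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            // In the real tracking structure, unlink(ref) would go here.
            // get() is guaranteed to be null for anything poll() returns.
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        WeakReference<Object> ref = new WeakReference<>(new Object(), queue);
        System.gc();
        Thread.sleep(100); // give the GC a chance to enqueue the reference
        System.out.println("drained " + drain(queue));
    }
}
```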
I remeasured on a desktop-class machine and it showed about a 25% improvement. Even in the worst case an improvement was visible. I also did a totally unscientific experiment with a smaller heap (512MB) and noticed that the OOME happened later with the new code than with the weak hashmap.

This change was squashed and merged as: …

Thank you.