
Reworked the way allocated memory is tracked #1239

Closed

Conversation

joerg1985
Contributor

Replaced the WeakHashMap with a doubly linked list of WeakReferences to
improve performance by reducing the time spent in synchronized blocks.

Benchmark                           Branch            Java version  Mode   Cnt  Score        Error         Units
MemoryBenchmark.allocateWithMemory  master            1.8           thrpt  25   1420406.020  ±  46239.151  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  1.8           thrpt  25   2020881.615  ±  33498.284  ops/s
MemoryBenchmark.allocateWithMemory  master            11            thrpt  25   1526824.044  ± 106555.120  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  11            thrpt  25   2615836.523  ± 171918.750  ops/s
MemoryBenchmark.allocateWithMemory  master            14            thrpt  25   1877610.032  ± 108837.116  ops/s
MemoryBenchmark.allocateWithMemory  linked-reference  14            thrpt  25   2289198.917  ± 175390.679  ops/s
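
For orientation, here is a minimal sketch of the tracking scheme this PR introduces. The names (LinkedReference, HEAD, track) mirror the diff discussed below, but the bodies are illustrative, not the merged code:

    import java.lang.ref.WeakReference;

    class LinkedReference extends WeakReference<Object> {

        private static LinkedReference HEAD; // head of the doubly linked list

        private LinkedReference prev;
        private LinkedReference next;

        private LinkedReference(Object referent) {
            super(referent);
        }

        // insert at the head; a single short synchronized block per allocation
        static LinkedReference track(Object instance) {
            LinkedReference ref = new LinkedReference(instance);
            synchronized (LinkedReference.class) {
                if (HEAD != null) {
                    ref.next = HEAD;
                    HEAD.prev = ref;
                }
                HEAD = ref;
            }
            return ref;
        }

        // O(1) removal, called when the tracked instance is disposed or collected
        void unlink() {
            synchronized (LinkedReference.class) {
                if (prev != null) {
                    prev.next = next;
                } else if (HEAD == this) {
                    HEAD = next;
                }
                if (next != null) {
                    next.prev = prev;
                }
                prev = null;
                next = null;
            }
        }
    }

Both insert and remove touch only a few pointers inside one short synchronized block, which is where the throughput gain over the WeakHashMap in the table above comes from.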


@dbwiddis left a comment


This performance improvement looks great! I've left a few comments/suggestions inline for your consideration.

    private static LinkedReference HEAD; // the head of the doubly linked list used for instance tracking
Contributor

I'd generally prefer to see the variable definition before it's used in the code.

Contributor Author

This could be done easily, but then LinkedReference would not be defined at that point. I think this is a chicken-and-egg situation; what is the best way to solve it?

Contributor

It's a matter of preference/style. I'm not sure there's a right answer.

In this case, LinkedReference is a class, and there's generally no need to define a class first (an import statement would normally do that) or in any specific order, unlike a field, whose position does matter for later fields that access it. When reading code, if I see a class referenced, I know "that class is defined somewhere, I can eventually find it either in this file or another one," whereas when I see a variable referenced, I generally scroll up to see where it's defined.

Not a big deal, just expressing my preference! :)

src/com/sun/jna/Memory.java: 3 resolved inline comment threads
@matthiasblaesing
Member

The suggested changes should be measured first.

The suggested finer-grained locking adds additional complexity to the code, and nested locks could lead to deadlocks if not all callers follow the same lock order.

I would also not care too much about the performance of disposeAll. That method is a last resort, and if it is in a lock-free hot path there are bigger problems in the application than performance. The nested synchronized block should not matter, as the object's monitor is already taken by the calling thread.
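
As an aside, the lock-ordering hazard mentioned here is the classic one. A generic, minimal demonstration (not JNA code; the class name is hypothetical) in which two threads take the same two locks in opposite order and deadlock:

    public class LockOrderDemo {
        private static final Object LOCK_A = new Object();
        private static final Object LOCK_B = new Object();

        public static void main(String[] args) {
            new Thread(() -> {
                synchronized (LOCK_A) {
                    pause();                  // widen the race window
                    synchronized (LOCK_B) { } // waits forever for thread 2
                }
            }).start();
            new Thread(() -> {
                synchronized (LOCK_B) {
                    pause();
                    synchronized (LOCK_A) { } // waits forever for thread 1
                }
            }).start();
        }

        private static void pause() {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
            }
        }
    }

As @dbwiddis notes below, the nesting in the proposed change only goes one way, which rules out this particular interleaving.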

@dbwiddis
Contributor

dbwiddis commented Jul 29, 2020

The suggested finer-grained locking adds additional complexity to the code, and nested locks could lead to deadlocks if not all callers follow the same lock order.

The nesting only goes one way, so I don't see how a deadlock is possible. It will be interesting to see if this helps or hurts (or has no impact) on the performance.

I would also not care too much about the performance of disposeAll. That method is a last resort, and if it is in a lock-free hot path there are bigger problems in the application than performance.

I wasn't suggesting removing the locks, just the re-linking that is done only to destroy the links. But since no new objects are involved, even that process would have a tiny impact, so I agree the current implementation is sufficient.

@dbwiddis
Contributor

Well, I did some benchmarking of my suggestion tonight and got results that, in retrospect, are probably exactly what one should expect. Of relevance, this is an 8-core (16 hyperthreaded) machine.

  • With only 1 thread, obviously, the additional lock doesn't help, and hurts slightly.
  • With a small number of threads (2-3) the added benefit is visible in a benchmark. However, that doesn't translate to better system performance, as that's not the limiting case we're trying to fix.
  • Around 4-7 threads there's very little difference.
  • At 8 threads (fully utilized CPUs, where the allocation tracking becomes more limiting), the benefit disappears and maintaining multiple locks once again hurts performance.
  • Results for threads above 8 were very close to the results for 8 threads.

[chart: throughput vs. thread count for the one-lock and two-lock variants]

@dbwiddis
Copy link
Contributor

dbwiddis commented Aug 1, 2020

Played around with a slightly different variation. Instead of a single linked list, I created an array of linked lists, using an atomic integer incremented for each new item added to cycle through the lists. Each list has its own lock. This also showed improved performance, maxing out at about 3-4 lists in the array. This somewhat matches the (unmeasured, but observed on a profiler) improvement from the experiments I did using the hashmap, creating multiple hashmaps to reduce lock contention.

These are all using 8 threads; the X axis is the number of linked lists/locks.
[chart: throughput vs. number of linked lists/locks, at 8 threads]

@matthiasblaesing
Member

Thank you for the measurements.

For the first measurement (one vs. two locks): We should stay with the currently proposed version from @joerg1985.

The second measurement (array of locked lists): I assume this is what you mean (see the sketch below):

  • when a new memory object is created, a static AtomicInteger is incremented via incrementAndGet()
  • the fetched value is taken modulo the number of lists to use
  • the calculated value is used as a list index, and the list is fetched and stored with the Memory object
  • the tracking of allocated memory is done in the local list; only disposeAll needs to lock all lists

This comes at a price though: the Memory object literally doubles in size, from a single long to a long and a reference.
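
If that reading is right, a minimal sketch of the striped variant could look like this (StripedTracker and its members are illustrative names, not code from this PR):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Consumer;

    class StripedTracker<T> {
        private final List<List<T>> stripes = new ArrayList<>();
        private final AtomicInteger next = new AtomicInteger();

        StripedTracker(int stripeCount) {
            for (int i = 0; i < stripeCount; i++) {
                stripes.add(new LinkedList<>());
            }
        }

        // round-robin stripe selection; the returned index is what the tracked
        // object has to remember, which is the size cost discussed above
        int track(T instance) {
            int idx = Math.floorMod(next.incrementAndGet(), stripes.size());
            List<T> stripe = stripes.get(idx);
            synchronized (stripe) { // each stripe is its own lock
                stripe.add(instance);
            }
            return idx;
        }

        void untrack(T instance, int idx) {
            List<T> stripe = stripes.get(idx);
            synchronized (stripe) {
                stripe.remove(instance);
            }
        }

        // only disposeAll() has to visit (and lock) every stripe
        void disposeAll(Consumer<T> disposer) {
            for (List<T> stripe : stripes) {
                synchronized (stripe) {
                    for (T t : stripe) {
                        disposer.accept(t);
                    }
                    stripe.clear();
                }
            }
        }
    }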

@dbwiddis
Contributor

dbwiddis commented Aug 1, 2020

For the first measurement (one vs. two locks): We should stay with the currently proposed version from @joerg1985.

I agree. I'm OK to merge it as is, although after playing with these classes for a bit I'm still unhappy with the style with regard to the ordering of things. But that's just me. :)

The second measurement (array of locked lists): I assume this is what you mean ...

All correct except I stored the list number in the WeakReference object as an int, alongside the prev and next pointers, all of which are ultimately associated with the Memory object.

I'm not suggesting that we merge the array version, though. This is mostly observational, regarding the impact of the locking that we really only need in order to enable disposeAll(). My preferred solution is to provide a configurable option that completely disables the memory tracking functionality.

@dbwiddis
Contributor

dbwiddis commented Aug 1, 2020

For fun, I tried to "improve" the performance by getting rid of the AtomicInteger increment, using some internal value in the object to choose which list to use. Neither the peer nor the hashcode helps; in fact, they make it worse. Results for 8 threads, array size 4:

Benchmark                                   Mode  Cnt        Score        Error  Units
BenchmarkLoop.linkedListLockArray          thrpt   25  1655971.711 ±  92967.766  ops/s
BenchmarkLoop.linkedListLockArrayHashcode  thrpt   25  1411952.577 ± 114018.178  ops/s
BenchmarkLoop.linkedListOneLock            thrpt   25  1514096.333 ±  35480.170  ops/s

So I think there is something to the round-robin nature of the "add to head" locks that is the limiting performance factor here, up to a certain point (4?). Anyway, I don't think we should hold up this PR; it's just an interesting experiment that may inform future attempts to tweak this.

@matthiasblaesing
Member

@dbwiddis I think I know why using peer or hashCode does not help. The memory is allocated aligned, and in my test the alignment was at least 32 bits, so running it through the modulo operation will always yield the same value (e.g., 4-byte-aligned peers such as 0x1000, 0x1004, 0x1008 are all 0 modulo 4). So the multi-list case degrades to the single-list case with additional overhead.
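
A tiny demonstration of that degradation (the peer values are hypothetical aligned addresses):

    public class AlignmentDemo {
        public static void main(String[] args) {
            // with 32-bit (4-byte) alignment, every peer is a multiple of 4,
            // so peer % 4 picks the same list for every allocation
            long[] peers = { 0x1000, 0x1004, 0x1008, 0x100C };
            for (long peer : peers) {
                System.out.println(Long.toHexString(peer) + " % 4 = " + (peer % 4)); // always 0
            }
        }
    }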

@matthiasblaesing
Member

Ok - if no one objects, I would merge in a day or so. Indeed we can bikeshed about the right place for the definition of the inner class, but in the end all options (extract the class to an external file, move it to the end, move it before the usage) have their benefits and drawbacks. @joerg1985, do you have changes in flight for this, or are you OK to merge as-is?

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

Please hold off on merging for a bit, I think I might have identified a memory leak.

Less importantly, a note on the earlier modulo conversation: I did bit-shift the peer value by 8 because of alignment expectations, and the hashcode calculation involves enough internal math (involving primes) that it distributes well across the values; e.g., hashcode % 4 yields the sequence 0, 3, 2, 0, 2, 1, 3, 1, 0, 1, 3, 2, 2, 0.

@joerg1985
Contributor Author

@matthiasblaesing I am done with this, unless a memory leak is confirmed by @dbwiddis.

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

a memory leak is confirmed

No, I haven't confirmed it; I just have a suspicion that I'm trying to confirm with data. I'm seeing iterations slow down a lot as the length of the benchmark increases, and I'm not sure the performance testing has properly accounted for the full impact of GC when max heap is a limiting factor.

@dbwiddis
Contributor

dbwiddis commented Aug 2, 2020

Well, I'm not sure whether there's a memory leak or just an error in how I'm running the profiler. Both the existing method and the new method eventually cause a java.lang.OutOfMemoryError if allowed to run long enough. That "long enough" point arrives faster for the linked-list variant due to the additional tracking the linked list requires (prev and next pointers).

However, I am concerned that the benchmarking I have previously done was not in a memory-limited (GC) scenario and thus was heavily biased toward insertion (head) performance. By setting smaller heap limits (@Fork(value = 1, jvmArgs = { "-Xms500M", "-Xmx500M" })) I see iterations slow considerably when the limit is reached, and in that scenario, the added delay due to GC ends up causing worse performance on average.

@joerg1985 can you consider redoing your own benchmarks with longer iterations and/or more limiting heap size?

@joerg1985
Contributor Author

@joerg1985 can you consider redoing your own benchmarks with longer iterations and/or more limiting heap size?

I will rerun the benchmarks on Fedora 32 / OpenJDK 1.8.0_262 with the following settings (a complete class is sketched below) and post the results here.

    @Benchmark
    @Fork(value = 1, jvmArgs = { "-Xms500M", "-Xmx500M" })
    @Timeout(time = 32, timeUnit = TimeUnit.MINUTES)
    @Measurement(time = 30, timeUnit = TimeUnit.MINUTES)
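
Assembled into a complete class, those settings might look like the following; the method body is an assumption based on the allocateWithMemory benchmark quoted later in this thread:

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Timeout;
    import org.openjdk.jmh.infra.Blackhole;

    import com.sun.jna.Memory;

    public class MemoryBenchmark {
        @Benchmark
        @Fork(value = 1, jvmArgs = { "-Xms500M", "-Xmx500M" })
        @Timeout(time = 32, timeUnit = TimeUnit.MINUTES)
        @Measurement(time = 30, timeUnit = TimeUnit.MINUTES)
        public void allocateWithMemory(Blackhole blackhole) {
            // allocate and immediately drop a 1-byte Memory, so throughput is
            // dominated by the allocation-tracking overhead under test
            blackhole.consume(new Memory(1));
        }
    }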

Dispose or unlink entries as early as possible to avoid OOM exceptions 
under high load with low memory.
@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

I've been playing around a bit more with the benchmarks. I think running close to max heap ends up creating too much garbage, which isn't realistic in a real scenario where many more objects than the Memory will be created and disposed of. I tried to create a "middle ground" scenario that would still measure the impact of "a little more GC" while staying away from heap limits. I set the max to 1G but manually forced a GC at 500M, basically this:

    // counter is an AtomicInteger field on the benchmark class (assumed)
    private final AtomicInteger counter = new AtomicInteger();

    @Benchmark
    public void weakHashMap(Blackhole blackhole) {
        Memory foo = new Memory(1);
        blackhole.consume(foo);
        // every 5000th allocation, force a GC once used heap exceeds ~500M
        if (counter.incrementAndGet() % 5000 == 0
                && (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) > 500_000_000) {
            System.gc();
        }
    }

It's not perfect but I think it's "fair". The new LinkedList still performs well. (JDK14, macOS)

Benchmark                         Mode  Cnt      Score     Error  Units
BenchmarkLoop.linkedListOneLock  thrpt   25  45210.650 ± 302.708  ops/s
BenchmarkLoop.weakHashMap        thrpt   25  36944.042 ± 216.455  ops/s

I see another push, though, and will test that as well.

@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

What is this magic you are working?

Benchmark                          Mode  Cnt       Score        Error  Units
BenchmarkLoop.linkedListOneLock   thrpt   25   45388.427 ±    285.035  ops/s
BenchmarkLoop.linkedListRefQueue  thrpt   25  459204.674 ± 266253.633  ops/s
BenchmarkLoop.weakHashMap         thrpt   25   37046.341 ±    159.251  ops/s

@dbwiddis
Contributor

dbwiddis commented Aug 4, 2020

Here's the non-memory-limited case (shorter runs). Still better than current implementation.

Benchmark                          Mode  Cnt        Score       Error  Units
BenchmarkLoop.linkedListOneLock   thrpt   25  1497384.617 ± 32869.763  ops/s
BenchmarkLoop.linkedListRefQueue  thrpt   25  1319160.163 ± 75901.557  ops/s
BenchmarkLoop.weakHashMap         thrpt   25  1267802.499 ± 78902.609  ops/s

@joerg1985
Contributor Author

@dbwiddis I was digging around to find out why the WeakHashMap was performing better with limited memory. I can't even tell exactly why this is faster...

Well, I think there is one thing to check: if a lot of instances are released and no new ones are created, there might be references in the queue waiting to be processed. I need to check this, and hopefully it will not destroy all the benefits.

@joerg1985
Contributor Author

Okay, I think the described effect of an unprocessed ReferenceQueue stealing memory doesn't exist. I found nothing online (except for PhantomReferences) and could not reproduce this effect on JDK 8 or 11.

This is how I unsuccessfully tried to get into this state:

    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;

    public class QueueTest { // wrapper class added so the snippet compiles
        public static void main(String[] args) throws InterruptedException {
            List<WeakReference<String>> refs = new ArrayList<>();
            ReferenceQueue<String> queue = new ReferenceQueue<>();

            for (long i = 0; i < 10000000; i++) {
                // keep a strong reference to the WeakReference
                refs.add(new WeakReference<>(new String("add some weight"), queue));

                if (i % 1000 == 0) {
                    // release the strong references, like the unlink call does in the finalizer
                    refs.clear();
                }
            }

            long found = 0;
            List<String> waste = new ArrayList<>();

            while (true) {
                if (queue.poll() != null) {
                    if (++found % 1000 == 0) {
                        System.out.println("FOUND " + found);
                    }
                } else if (waste.size() >= 100000) {
                    // cap the garbage list so it can be collected and regrown
                    // (the original used waste.size() % 100000 == 0, which clears
                    // the empty list immediately and never adds any garbage)
                    waste.clear();
                } else {
                    waste.add(found + " force some GC " + found);
                }
            }
        }
    }

@dbwiddis
Contributor

dbwiddis commented Aug 6, 2020

Thanks for this analysis! I'm not quite sure what it all means. Can you summarize what you think is the best path for this PR and why?

@joerg1985
Contributor Author

I think this PR is working fine and no side effects are obvious.

The numbers from the benchmarks look good and should now be confirmed by some real-world numbers.

It would be great if the author of #1211 (@Karasiq) could build and benchmark the master and the branch of this PR to see if it helps there.

I don't have any concurrently running code using JNA to get real-world numbers.

@matthiasblaesing left a comment
Member

I left two inline comments. My numbers on OpenJDK 11:

Benchmark       Mode   Threads  Samples  Score           Score Error (99.9%)  Unit
5.6.0           thrpt  4        25       1418278.296963  79754.059403         ops/s
5.7.0-SNAPSHOT  thrpt  4        25       1639614.860428  301953.00327         ops/s

The benchmark is simple:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.infra.Blackhole;

    import com.sun.jna.Memory;

    public class MemoryBenchmark {
        @Benchmark
        public void allocateWithMemory(Blackhole blackhole) {
            blackhole.consume(new Memory(1));
        }
    }

Run with:

java -jar target/benchmarks.jar -t max -gc true -rff result.csv

The big error margin for the updated code surprises me though.

src/com/sun/jna/Memory.java: 1 outdated inline comment thread (resolved)
    static LinkedReference track(Memory instance) {
        // use a different lock here to allow the finalizer to unlink elements too
        synchronized (QUEUE) {
Member

Is this synchronization needed? While the class lacks documentation regarding threading, the OpenJDK implementation protects the reference queue. There are fixed bugs regarding multithreaded enqueue, so I assume that ReferenceQueue should be safe. From my POV it does not matter which thread unlinks, as the unlink will be protected again by the HEAD lock.

Contributor Author

I agree this lock is not needed to avoid concurrency effects with the current OpenJDK implementations, but removing it had a negative effect on performance while I was digging into why the WeakHashMap was performing better with low memory.

The references returned by ReferenceQueue.poll() will always return
null on get(). Just unlink, without trying to get the referent.
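
A sketch of the eager-unlink idea this commit describes: track() first drains the ReferenceQueue and unlinks dead nodes, so cleared references cannot pile up under allocation pressure. The names (TrackedRef, QUEUE, HEAD) are illustrative, not the PR's exact code:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;

    final class TrackedRef extends WeakReference<Object> {

        private static final ReferenceQueue<Object> QUEUE = new ReferenceQueue<>();
        private static TrackedRef HEAD;

        private TrackedRef prev;
        private TrackedRef next;

        private TrackedRef(Object referent) {
            super(referent, QUEUE); // enqueued automatically once the referent is collected
        }

        static TrackedRef track(Object instance) {
            // a different lock than HEAD, so the finalizer can still unlink concurrently
            synchronized (QUEUE) {
                Reference<?> dead;
                while ((dead = QUEUE.poll()) != null) {
                    // the referent is already gone; get() would return null,
                    // so just unlink without touching the referent
                    ((TrackedRef) dead).unlink();
                }
            }
            TrackedRef ref = new TrackedRef(instance);
            synchronized (TrackedRef.class) { // the HEAD lock
                if (HEAD != null) {
                    ref.next = HEAD;
                    HEAD.prev = ref;
                }
                HEAD = ref;
            }
            return ref;
        }

        void unlink() {
            synchronized (TrackedRef.class) {
                if (prev != null) {
                    prev.next = next;
                } else if (HEAD == this) {
                    HEAD = next;
                }
                if (next != null) {
                    next.prev = prev;
                }
                prev = null;
                next = null;
            }
        }
    }

The QUEUE-then-HEAD nesting only goes one way, matching the earlier observation that one-way nesting avoids lock-order deadlocks.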
@matthiasblaesing
Member

I remeasured on a desktop-class machine and it showed about a 25% improvement. Even in the worst case an improvement was visible. I also did a totally unscientific experiment with a smaller heap (512MB) and noticed that the OOME happened later with the new code than with the WeakHashMap.

This change was squashed and merged as:
59b052a

Thank you.
