
Synchronized map poor performance #1211

Closed
Karasiq opened this issue Jun 16, 2020 · 20 comments

@Karasiq

Karasiq commented Jun 16, 2020

We have issues with performance because of this: https://github.com/java-native-access/jna/blob/master/src/com/sun/jna/Memory.java#L57-L58
Lock waiting can take up to 99% of the time under heavy load. The synchronized map is very slow in a multi-threaded environment. I suggest replacing it with a concurrent map implementation.
[Screenshot 2020-06-11 at 12:16:18]
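For reference, the tracking map those lines declare is essentially the following (a paraphrase based on the description later in this thread, not a verbatim copy; note the single lock inside the Collections.synchronizedMap wrapper that every allocation and disposal goes through):

    // Paraphrase of the map the issue links to: weak keys let unreferenced
    // Memory objects fall out of the map, but the synchronizedMap wrapper
    // serializes every put and remove on one global lock.
    private static final Map<Memory, Reference<Memory>> allocatedMemory =
        Collections.synchronizedMap(new WeakHashMap<Memory, Reference<Memory>>());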

@matthiasblaesing
Member

This is a valid request. I see two options:

a) drop disposeAll
b) find a better map implementation. Requirements: self-contained (JNA has no dependencies and it should stay that way), a compatible license (must be ALv2 + LGPL compatible, or better yet donated to the project), and faster :-)

@dbwiddis
Contributor

dbwiddis commented Jun 16, 2020

There is a backport project, jsr166e, of the java.util.concurrent classes that brings Java 8's ConcurrentHashMap implementation to Java 6 (even Java 5). It's CC0/public domain, so it could be dropped into the JNA source and used as-is, I believe.

@dbwiddis
Contributor

Another possible idea: maintain multiple maps. The number of maps can be determined from Runtime.getRuntime().availableProcessors() (an estimate of the number of threads simultaneously wanting access), with the choice of which one to use based on the peer's hash value. This distributes the locking in proportion to the potential thread count but preserves the weak references.
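A minimal sketch of that striping idea, assuming a hypothetical StripedTracker helper (not JNA API):

    import java.lang.ref.Reference;
    import java.lang.ref.WeakReference;
    import java.util.Collections;
    import java.util.Map;
    import java.util.WeakHashMap;

    // One synchronized WeakHashMap per available processor, chosen by the
    // tracked object's hash, so threads rarely contend on the same lock
    // while the weak keys are preserved.
    class StripedTracker<T> {
        private final Map<T, Reference<T>>[] stripes;

        @SuppressWarnings("unchecked")
        StripedTracker() {
            stripes = new Map[Runtime.getRuntime().availableProcessors()];
            for (int i = 0; i < stripes.length; i++) {
                stripes[i] = Collections.synchronizedMap(new WeakHashMap<T, Reference<T>>());
            }
        }

        private Map<T, Reference<T>> stripeFor(T obj) {
            // mask the sign bit so the index is always non-negative
            return stripes[(obj.hashCode() & 0x7fffffff) % stripes.length];
        }

        void track(T obj) {
            stripeFor(obj).put(obj, new WeakReference<T>(obj));
        }

        void untrack(T obj) {
            stripeFor(obj).remove(obj);
        }
    }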

@matthiasblaesing
Member

ConcurrentHashMap is already in the JDK (since 1.5). However, the naive exchange to:

   private static final Map<Long, Reference<Memory2>> allocatedMemory
        = new ConcurrentHashMap<Long, Reference<Memory2>>();

does not help. The weak-key part is important here. In my naive test, performance regressed by around 50%.

Hot methods could use Native#malloc directly and wrap the result into a Pointer. That way the memory has to be manually managed, but there is also no contended path.
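A sketch of that manually managed path (ManualBuffer is a hypothetical helper class; Native.malloc, Native.free, and the public Pointer(long) constructor are existing JNA API, while AutoCloseable needs Java 7+):

    import com.sun.jna.Native;
    import com.sun.jna.Pointer;

    // Allocate with Native#malloc, hand out a plain Pointer, free explicitly.
    // No tracking map is touched, so there is no contended lock - but every
    // allocation must be paired with exactly one close().
    class ManualBuffer implements AutoCloseable {
        private long peer;

        ManualBuffer(long size) {
            peer = Native.malloc(size);
            if (peer == 0) {
                throw new OutOfMemoryError("Cannot allocate " + size + " bytes");
            }
        }

        Pointer pointer() {
            return new Pointer(peer);
        }

        @Override
        public void close() {
            if (peer != 0) {
                Native.free(peer);
                peer = 0;
            }
        }
    }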

@dbwiddis
Contributor

The class name is the same, but JDK 8 made significant performance improvements to the map, using trees instead of linked lists for heavily populated bins.

But as you say, we'd have to modify the code to handle weak references. The triple combination of weak references, concurrency, and Java 6 doesn't leave many options.

@matthiasblaesing
Member

My measurement happened on JDK 11, so the worst case is not even hit.

@neilcsmith-net
Contributor

neilcsmith-net commented Jun 19, 2020

@matthiasblaesing what was your replacement exactly? Why do you need a weak key when you already have a weak reference as the value? You could have a WeakReference (or, without disposeAll, maybe PhantomReference) subclass, directly held by Memory, and keyed in a concurrent map by Pointer or Long. You then use the reference queue to remove the reference from the map, and ideally (longer term) replace finalization the same way. And you still have explicit memory management as well (which really is the better way to handle this).

It's the change we made in gst1-java-core in https://github.com/gstreamer-java/gst1-java-core/blob/master/src/org/freedesktop/gstreamer/glib/NativeObject.java, although the requirements may be somewhat more complex there due to native ref counting - the pointer is in a handle class that is held by both the weak reference and the main object.

EDIT: incidentally, it's loosely based on the Cleaner approach, obviously, but on Java 8, and I'm not sure Cleaner would do everything required in Memory anyway - https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ref/Cleaner.html
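A rough sketch of that shape with illustrative names (PeerRef and MemoryTracker are not JNA classes): the WeakReference subclass keeps the raw peer address, a ConcurrentHashMap keyed by that address holds the live references, and draining the reference queue both frees the native memory and prunes the map - no finalize() involved.

    import com.sun.jna.Native;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    final class MemoryTracker {
        static final class PeerRef extends WeakReference<Object> {
            final long peer;   // survives even after the referent is collected
            PeerRef(Object referent, long peer, ReferenceQueue<Object> q) {
                super(referent, q);
                this.peer = peer;
            }
        }

        private final ConcurrentMap<Long, PeerRef> live =
                new ConcurrentHashMap<Long, PeerRef>();
        private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();

        void register(Object owner, long peer) {
            live.put(peer, new PeerRef(owner, peer, queue));
        }

        void unregister(long peer) {
            live.remove(peer);   // explicit dispose path: the owner frees the memory itself
        }

        // Run periodically or from a dedicated thread: free anything whose
        // owner became unreachable, and drop its map entry.
        void drain() {
            Reference<?> r;
            while ((r = queue.poll()) != null) {
                PeerRef ref = (PeerRef) r;
                if (live.remove(ref.peer, ref)) {
                    Native.free(ref.peer);
                }
            }
        }
    }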

@Karasiq
Author

Karasiq commented Jun 19, 2020

What if you simply drop disposeAll (and therefore the map)?
Is it used anywhere except the JNA finalizer? It looks like the Memory object's finalizer can do the job by itself.

@twall
Contributor

twall commented Jun 19, 2020 via email

@matthiasblaesing
Member

So I measured and here are the results:

This adds four new memory implementations: matthiasblaesing@40a3cf7

The benchmark is here:

https://github.com/matthiasblaesing/jnamemorybenchmark

The repository contains the JNA binary; if you want to modify it, replace jna-5.6.0-SNAPSHOT.jar. The benchmark is built with:

mvn package

and is run with:

java -jar target/benchmarks.jar -t max -gc true -rff result.csv

The implementations:

  • allocateWithMemory uses plain Memory (usage tracking via a WeakHashMap wrapped in Collections.synchronizedMap), as it is today in JNA
  • allocateWithMemory2 uses Memory2 (usage tracking via ConcurrentHashMap<Long, Reference<Memory2>>)
  • allocateWithMemory3 uses Memory3 (usage tracking via ThreadLocals holding ConcurrentHashMap<Long, Reference<Memory3>>)
  • allocateWithMemory4 uses Memory4 (usage tracking via 4 WeakHashMaps wrapped in Collections.synchronizedMap)
  • allocateWithMemory5 uses Memory5 and drops usage tracking

The results:

Benchmark            Mode   Threads  Samples  Score           Score Error (99.9%)  Unit
allocateWithMemory   thrpt  4        25       1231814.614716  109362.973487        ops/s
allocateWithMemory2  thrpt  4        25        926192.069017   94586.318779        ops/s
allocateWithMemory3  thrpt  4        25       1184987.524322   50791.888379        ops/s
allocateWithMemory4  thrpt  4        25       1187076.022115   91644.314653        ops/s
allocateWithMemory5  thrpt  4        25       2262093.522131   56417.696075        ops/s


The numbers are also in the benchmark repository.

@neilcsmith-net
Contributor

Thanks for sharing the code and benchmark. Interesting results. I wonder whether it shows quite a different scenario to the issue reporter's, or whether the map lock really isn't the bottleneck it appears to be.

I wonder what Memory2 using HashMap<Long, Reference> wrapped in Collections.synchronizedMap is like - does the map payload or the map implementation make the most difference?

Does the benchmark reflect real-world memory pressures enough? I'm just wondering if GC is clearing the values often enough to give a real-world idea of contention?

Will try and find some time to play with this over the next few weeks - I'm more interested in using references as values in the map and the benefits (or not) of removing finalization completely.

@matthiasblaesing
Member

> Thanks for sharing the code and benchmark. Interesting results. I wonder whether it shows quite a different scenario to the issue reporter's, or whether the map lock really isn't the bottleneck it appears to be.

Have a look at allocateWithMemory5, which scores about twice the current implementation. It could be that another limit (GC?) is now being hit.

> Does the benchmark reflect real-world memory pressures enough? I'm just wondering if GC is clearing the values often enough to give a real-world idea of contention.

The allocated memory is 1 byte, so GC pressure will start before native memory becomes an issue. The objects become eligible for GC directly after they are allocated and consumed by the blackhole. The idea is that we are interested in the memory allocation hot path. I admit that the dispose method might be another bottleneck.

> Will try and find some time to play with this over the next few weeks - I'm more interested in using references as values in the map and the benefits (or not) of removing finalization completely.

From my POV removing finalization can't be done without changing the API. This was my first try (after the measurements, I would implement it differently today):

https://github.com/matthiasblaesing/jna/commits/remove_finalizer

It changes the meaning of the dispose method. And I think it would be better to drop dispose and move to a closeable implementation. The big question: Do we need to enable users to add custom deallocators?

@neilcsmith-net
Contributor

> Have a look at allocateWithMemory5, which scores about twice the current implementation.

Well, that has no map at all. I'm not suggesting that usage tracking isn't going to add overhead. I meant a proper comparison of usage tracking and map locking with the same data, <Long, Reference>. The benchmark might suggest that there isn't much benefit to ConcurrentHashMap in this case - not really enough contention, always modifying rather than just retrieving, etc.? But it would be useful to prove whether a synchronized HashMap<Long, Reference> also performs better.

> The allocated memory is 1 byte, so GC pressure will start before native memory becomes an issue. ... I admit that the dispose method might be another bottleneck.

That's what I meant - isn't there very little GC pressure full stop, so the ratio of calls to dispose (and potentially the size of the maps) may be quite different to typical usage?

> From my POV removing finalization can't be done without changing the API. ... It changes the meaning of the dispose method. And I think it would be better to drop dispose and move to a closeable implementation. The big question: Do we need to enable users to add custom deallocators?

How does it change the meaning of dispose()? I can see it slightly changes the API if people have overridden finalize() rather than dispose(), although if they've done that they should probably expect it to break! 😄

When doing this myself I implemented AutoCloseable by delegating to dispose() - that seems a more useful terminology for deallocating memory to me. Can they not implement custom deallocation by overriding dispose()? Anyway, that's possibly a wider concern than this issue and better discussed on the mailing list? I'm interested in all this because it relates to work I'm doing at the moment.
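For illustration, that delegation can be as small as this (CloseableMemory is a hypothetical subclass; Memory#dispose is accessible to subclasses):

    import com.sun.jna.Memory;

    // Wires try-with-resources to dispose() for deterministic deallocation.
    class CloseableMemory extends Memory implements AutoCloseable {
        CloseableMemory(long size) {
            super(size);
        }

        @Override
        public void close() {
            dispose();
        }
    }

Used as try (CloseableMemory m = new CloseableMemory(1024)) { ... }, the memory is freed at the end of the block regardless of GC.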

@joerg1985
Contributor

> Well, that has no map at all. I'm not suggesting that usage tracking isn't going to add overhead. I meant a proper comparison of usage tracking and map locking with the same data, <Long, Reference>. The benchmark might suggest that there isn't much benefit to ConcurrentHashMap in this case - not really enough contention, always modifying rather than just retrieving, etc.? But it would be useful to prove whether a synchronized HashMap<Long, Reference> also performs better.

I think there is no need to maintain a map; a list could be used here. The allocatedMemory.remove(this) call is only needed if disposeAll is called and could be moved to the loop in disposeAll. If the dispose method is called from the finalizer, the Memory instance is no longer reachable through the allocatedMemory map; the only thing happening there is that the map's dead entries are removed. A call to allocatedMemory.size() would remove these entries with less overhead.

A proof-of-concept implementation using a linked list can be found here; I need to double-check that everything is still working before running the benchmarks: https://github.com/joerg1985/jna/blob/memory-map/src/com/sun/jna/Memory6.java
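Condensed, the idea looks something like this (illustrative names, loosely following the Memory6 proof of concept rather than copying it):

    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;

    // Doubly-linked list of weak references: track/untrack are O(1) under a
    // short lock, and a disposeAll-style sweep just walks the nodes.
    final class TrackingList {
        static final class Node extends WeakReference<Object> {
            Node prev, next;
            Node(Object referent) { super(referent); }
        }

        private Node head;

        synchronized Node track(Object obj) {
            Node n = new Node(obj);
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
            return n;   // the tracked object keeps its node for O(1) removal
        }

        synchronized void untrack(Node n) {
            if (n.prev != null) n.prev.next = n.next;
            if (n.next != null) n.next.prev = n.prev;
            if (head == n) head = n.next;
            n.prev = n.next = null;
        }

        // Snapshot of the still-reachable objects, e.g. for disposeAll.
        synchronized List<Object> liveObjects() {
            List<Object> out = new ArrayList<Object>();
            for (Node n = head; n != null; n = n.next) {
                Object o = n.get();
                if (o != null) out.add(o);
            }
            return out;
        }
    }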

> When doing this myself I implemented AutoCloseable by delegating to dispose() - that seems a more useful terminology for deallocating memory to me. Can they not implement custom deallocation by overriding dispose()? Anyway, that's possibly a wider concern than this issue and better discussed on the mailing list? I'm interested in all this because it relates to work I'm doing at the moment.

Explicitly calling the dispose method will not stop the finalizer from calling dispose again; this will again call allocatedMemory.remove(this) and increase the number of calls to the synchronized map.
This could be avoided by adding if (peer == 0) return; to the dispose method, see Memory6.
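In other words dispose becomes idempotent, roughly like this (a paraphrase of Memory#dispose plus the suggested guard, not the exact Memory6 code):

    protected synchronized void dispose() {
        if (peer == 0) return;   // already disposed, e.g. explicitly before the finalizer ran
        try {
            free(peer);                    // release the native allocation
        } finally {
            allocatedMemory.remove(this);  // only needed to support disposeAll
            peer = 0;
        }
    }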

@dbwiddis
Contributor

> The allocatedMemory.remove(this) call is only needed if disposeAll is called

Which is why one of the available options is removing disposeAll. Or at least permitting users to somehow disable tracking if they will never use disposeAll.

@neilcsmith-net
Contributor

> Explicitly calling the dispose method will not stop the finalizer from calling dispose again,

It does if you get rid of finalize(), which was the context of that comment!

> I think there is no need to maintain a map; a list could be used here.

Again, until you want to get rid of finalize(). Which I would prioritise over arguing over list vs map.

@joerg1985
Contributor

My intention was not to disrupt the "finalizer or not in future versions" discussion, and I messed up the context of the comments; sorry for this.

The only thing I would like to contribute is the idea of a faster implementation of the current behaviour.
As mentioned above, I gave it a try in Memory6, and now I have some numbers to share; these numbers are based on revision 72dffcb.

Benchmark Mode Cnt Score Error Units
MemoryBenchmark.allocateWithMemory thrpt 25 1432560.188 ± 105192.179 ops/s
MemoryBenchmark.allocateWithMemory2 thrpt 25 1318910.324 ± 153444.187 ops/s
MemoryBenchmark.allocateWithMemory3 thrpt 25 1412166.053 ± 157307.070 ops/s
MemoryBenchmark.allocateWithMemory4 thrpt 25 1441608.320 ± 111822.872 ops/s
MemoryBenchmark.allocateWithMemory5 thrpt 25 3244573.067 ± 154981.355 ops/s
MemoryBenchmark.allocateWithMemory6 thrpt 25 2702059.448 ± 303447.446 ops/s

I hope I did not break anything.

@matthiasblaesing
Member

My numbers are not as good as yours, but Memory6 still looks like a good improvement that might be worth merging. Would you be willing to create a PR with the suggested patch? I looked at a diff between Memory and Memory6 and noticed many unrelated changes that made it difficult to see the real changes; a cleaner patch would make review easier.

@joerg1985
Contributor

> My numbers are not as good as yours, but Memory6 still looks like a good improvement that might be worth merging. Would you be willing to create a PR with the suggested patch? I looked at a diff between Memory and Memory6 and noticed many unrelated changes that made it difficult to see the real changes; a cleaner patch would make review easier.

I will create a new branch without these unrelated changes, try to find better names for the internal class/methods, and create some unit tests.

The different numbers might be related to the Java version; my default runtime is 11. I did not consider this when running the benchmark; before opening the PR I will benchmark Java 8, 11, and 14.

@matthiasblaesing
Copy link
Member

Closing, as the improved implementation was merged with 59b052a.
