
Synchronized map poor performance #1211

Closed
Karasiq opened this issue Jun 16, 2020 · 20 comments

@Karasiq

Karasiq commented Jun 16, 2020

We have issues with performance because of this: https://github.com/java-native-access/jna/blob/master/src/com/sun/jna/Memory.java#L57-L58
Lock waiting can take up to 99% of the time under heavy load. The synchronized map is very slow in a multi-threaded environment. I suggest replacing it with a concurrent map implementation.
[Screenshot 2020-06-11 at 12:16:18]
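For reference, the tracking map those lines declare is essentially the following (a paraphrase based on the description later in this thread, not a verbatim copy; note the single lock inside the Collections.synchronizedMap wrapper that every allocation and disposal goes through):

    // Paraphrase of the map the issue links to: weak keys let unreferenced
    // Memory objects fall out of the map, but the synchronizedMap wrapper
    // serializes every put and remove on one global lock.
    private static final Map<Memory, Reference<Memory>> allocatedMemory =
        Collections.synchronizedMap(new WeakHashMap<Memory, Reference<Memory>>());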

@matthiasblaesing
Member

This is a valid request. I see two options:

a) drop disposeAll
b) find a better map implementation. Requirements: self-contained (JNA has no dependencies and it should stay that way), a compatible license (must be ALv2 + LGPL compatible, or better yet donated to the project), and faster :-)

@dbwiddis
Contributor

dbwiddis commented Jun 16, 2020

There is a backport project, jsr166e, of the java.util.concurrent classes that brings Java 8's ConcurrentHashMap implementation to Java 6 (even Java 5). It's CC0/public domain, so it could be dropped into the JNA source and used as-is, I believe.

@dbwiddis
Contributor

Another possible idea: maintain multiple maps. The number of maps can be determined from Runtime.getRuntime().availableProcessors() (an estimate of the number of threads simultaneously wanting access), with the choice of which one to use based on the peer's hash value. This distributes the locking in proportion to the potential thread count but preserves the weak references.
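A minimal sketch of that striping idea, assuming a hypothetical StripedTracker helper (not JNA API):

    import java.lang.ref.Reference;
    import java.lang.ref.WeakReference;
    import java.util.Collections;
    import java.util.Map;
    import java.util.WeakHashMap;

    // One synchronized WeakHashMap per available processor, chosen by the
    // tracked object's hash, so threads rarely contend on the same lock
    // while the weak keys are preserved.
    class StripedTracker<T> {
        private final Map<T, Reference<T>>[] stripes;

        @SuppressWarnings("unchecked")
        StripedTracker() {
            stripes = new Map[Runtime.getRuntime().availableProcessors()];
            for (int i = 0; i < stripes.length; i++) {
                stripes[i] = Collections.synchronizedMap(new WeakHashMap<T, Reference<T>>());
            }
        }

        private Map<T, Reference<T>> stripeFor(T obj) {
            // mask the sign bit so the index is always non-negative
            return stripes[(obj.hashCode() & 0x7fffffff) % stripes.length];
        }

        void track(T obj) {
            stripeFor(obj).put(obj, new WeakReference<T>(obj));
        }

        void untrack(T obj) {
            stripeFor(obj).remove(obj);
        }
    }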

@matthiasblaesing
Member

ConcurrentHashMap is already in the JDK (since 1.5). However, the naive exchange to:

   private static final Map<Long, Reference<Memory2>> allocatedMemory
        = new ConcurrentHashMap<Long, Reference<Memory2>>();

does not help. The weak-key part is important here. In my naive test, performance regressed by around 50%.

Hot methods could use Native#malloc directly and wrap the result into a Pointer. That way the memory has to be manually managed, but there is also no contended path.
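A sketch of that manually managed path (ManualBuffer is a hypothetical helper class; Native.malloc, Native.free, and the public Pointer(long) constructor are existing JNA API, while AutoCloseable needs Java 7+):

    import com.sun.jna.Native;
    import com.sun.jna.Pointer;

    // Allocate with Native#malloc, hand out a plain Pointer, free explicitly.
    // No tracking map is touched, so there is no contended lock - but every
    // allocation must be paired with exactly one close().
    class ManualBuffer implements AutoCloseable {
        private long peer;

        ManualBuffer(long size) {
            peer = Native.malloc(size);
            if (peer == 0) {
                throw new OutOfMemoryError("Cannot allocate " + size + " bytes");
            }
        }

        Pointer pointer() {
            return new Pointer(peer);
        }

        @Override
        public void close() {
            if (peer != 0) {
                Native.free(peer);
                peer = 0;
            }
        }
    }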

@dbwiddis
Contributor

The class name is the same, but JDK 8 made significant performance improvements to the map, using trees instead of linked lists for heavily populated bins.

But as you say, we'd have to modify the code to handle weak references. The triple combination of weak references, concurrency, and Java 6 doesn't leave many options.

@matthiasblaesing
Member

My measurement happened on JDK 11, so the worst case is not even hit.

@neilcsmith-net
Contributor

neilcsmith-net commented Jun 19, 2020

@matthiasblaesing what was your replacement exactly? Why do you need a weak key when you already have a weak reference as the value? You could have a WeakReference (or, without disposeAll, maybe PhantomReference) subclass, directly held by Memory, and keyed in a concurrent map by Pointer or Long. You then use the reference queue to remove the reference from the map, and ideally (longer term) replace finalization the same way. And you still have explicit memory management as well (which really is the better way to handle this).

It's the change we made in gst1-java-core in https://github.com/gstreamer-java/gst1-java-core/blob/master/src/org/freedesktop/gstreamer/glib/NativeObject.java, although the requirements may be somewhat more complex there due to native ref counting - the pointer is in a handle class that is held by both the weak reference and the main object.

EDIT: incidentally, it's loosely based on the Cleaner approach, obviously, but on Java 8, and I'm not sure Cleaner would do everything required in Memory anyway - https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/ref/Cleaner.html
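A rough sketch of that shape with illustrative names (PeerRef and MemoryTracker are not JNA classes): the WeakReference subclass keeps the raw peer address, a ConcurrentHashMap keyed by that address holds the live references, and draining the reference queue both frees the native memory and prunes the map - no finalize() involved.

    import com.sun.jna.Native;
    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    final class MemoryTracker {
        static final class PeerRef extends WeakReference<Object> {
            final long peer;   // survives even after the referent is collected
            PeerRef(Object referent, long peer, ReferenceQueue<Object> q) {
                super(referent, q);
                this.peer = peer;
            }
        }

        private final ConcurrentMap<Long, PeerRef> live =
                new ConcurrentHashMap<Long, PeerRef>();
        private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();

        void register(Object owner, long peer) {
            live.put(peer, new PeerRef(owner, peer, queue));
        }

        void unregister(long peer) {
            live.remove(peer);   // explicit dispose path: the owner frees the memory itself
        }

        // Run periodically or from a dedicated thread: free anything whose
        // owner became unreachable, and drop its map entry.
        void drain() {
            Reference<?> r;
            while ((r = queue.poll()) != null) {
                PeerRef ref = (PeerRef) r;
                if (live.remove(ref.peer, ref)) {
                    Native.free(ref.peer);
                }
            }
        }
    }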

@Karasiq
Author

Karasiq commented Jun 19, 2020

What if you simply drop disposeAll (and therefore the map)?
Is it used anywhere except the JNA finalizer? It looks like the Memory object's finalizer can do the job by itself.

@twall
Contributor

twall commented Jun 19, 2020 via email

@matthiasblaesing
Member

So I measured and here are the results:

This adds four new memory implementations: matthiasblaesing@40a3cf7

The benchmark is here:

https://github.com/matthiasblaesing/jnamemorybenchmark

The repository contains the JNA binary; if you want to modify it, replace jna-5.6.0-SNAPSHOT.jar. The benchmark is built with:

mvn package

and is run with:

java -jar target/benchmarks.jar -t max -gc true -rff result.csv

The implementations:

  • allocateWithMemory uses plain Memory (usage tracking via a WeakHashMap wrapped in Collections.synchronizedMap), as it is today in JNA
  • allocateWithMemory2 uses Memory2 (usage tracking via ConcurrentHashMap<Long, Reference<Memory2>>)
  • allocateWithMemory3 uses Memory3 (usage tracking via ThreadLocals holding ConcurrentHashMap<Long, Reference<Memory3>>)
  • allocateWithMemory4 uses Memory4 (usage tracking via 4 WeakHashMaps wrapped in Collections.synchronizedMap)
  • allocateWithMemory5 uses Memory5 and drops usage tracking

The results:

Benchmark            Mode   Threads  Samples  Score           Score Error (99.9%)  Unit
allocateWithMemory   thrpt  4        25       1231814.614716  109362.973487        ops/s
allocateWithMemory2  thrpt  4        25        926192.069017   94586.318779        ops/s
allocateWithMemory3  thrpt  4        25       1184987.524322   50791.888379        ops/s
allocateWithMemory4  thrpt  4        25       1187076.022115   91644.314653        ops/s
allocateWithMemory5  thrpt  4        25       2262093.522131   56417.696075        ops/s


The numbers are also in the benchmark repository.

@neilcsmith-net
Contributor

Thanks for sharing the code and benchmark. Interesting results. I wonder whether it shows quite a different scenario to the issue reporter's, or whether the map lock really isn't the bottleneck it appears to be.

I wonder what Memory2 using HashMap<Long, Reference> wrapped in Collections.synchronizedMap is like - does the map payload or the map implementation make the most difference?

Does the benchmark reflect real-world memory pressures enough? I'm just wondering if GC is clearing the values often enough to give a real-world idea of contention?

Will try and find some time to play with this over the next few weeks - I'm more interested in using references as values in the map and the benefits (or not) of removing finalization completely.

@matthiasblaesing
Member

> Thanks for sharing the code and benchmark. Interesting results. I wonder whether it shows quite a different scenario to the issue reporter's, or whether the map lock really isn't the bottleneck it appears to be.

Have a look at allocateWithMemory5, which scores about twice the current implementation. It could be that another limit (GC?) is now being hit.

> Does the benchmark reflect real-world memory pressures enough? I'm just wondering if GC is clearing the values often enough to give a real-world idea of contention.

The allocated memory is 1 byte, so GC pressure will start before native memory becomes an issue. The objects become eligible for GC directly after they are allocated and consumed by the blackhole. The idea is that we are interested in the memory allocation hot path. I admit that the dispose method might be another bottleneck.

> Will try and find some time to play with this over the next few weeks - I'm more interested in using references as values in the map and the benefits (or not) of removing finalization completely.

From my POV removing finalization can't be done without changing the API. This was my first try (after the measurements, I would implement it differently today):

https://github.com/matthiasblaesing/jna/commits/remove_finalizer

It changes the meaning of the dispose method. And I think it would be better to drop dispose and move to a closeable implementation. The big question: Do we need to enable users to add custom deallocators?

@neilcsmith-net
Contributor

> Have a look at allocateWithMemory5, which scores about twice the current implementation.

Well, that has no map at all. I'm not suggesting that usage tracking isn't going to add overhead. I meant a proper comparison of usage tracking and map locking with the same data, <Long, Reference>. The benchmark might suggest that there isn't much benefit to ConcurrentHashMap in this case - not really enough contention, always modifying rather than just retrieving, etc.? But it would be useful to prove whether a synchronized HashMap<Long, Reference> also performs better.

> The allocated memory is 1 byte, so GC pressure will start before native memory becomes an issue. ... I admit that the dispose method might be another bottleneck.

That's what I meant - isn't there very little GC pressure full stop, so the ratio of calls to dispose (and potentially the size of the maps) may be quite different to typical usage?

> From my POV removing finalization can't be done without changing the API. ... It changes the meaning of the dispose method. And I think it would be better to drop dispose and move to a closeable implementation. The big question: Do we need to enable users to add custom deallocators?

How does it change the meaning of dispose()? I can see it slightly changes the API if people have overridden finalize() rather than dispose(), although if they've done that they should probably expect it to break! 😄

When doing this myself I implemented AutoCloseable by delegating to dispose() - that seems a more useful terminology for deallocating memory to me. Can they not implement custom deallocation by overriding dispose()? Anyway, that's possibly a wider concern than this issue and better discussed on the mailing list? I'm interested in all this because it relates to work I'm doing at the moment.
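For illustration, that delegation can be as small as this (CloseableMemory is a hypothetical subclass; Memory#dispose is accessible to subclasses):

    import com.sun.jna.Memory;

    // Wires try-with-resources to dispose() for deterministic deallocation.
    class CloseableMemory extends Memory implements AutoCloseable {
        CloseableMemory(long size) {
            super(size);
        }

        @Override
        public void close() {
            dispose();
        }
    }

Used as try (CloseableMemory m = new CloseableMemory(1024)) { ... }, the memory is freed at the end of the block regardless of GC.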

@joerg1985
Contributor

> Well, that has no map at all. I'm not suggesting that usage tracking isn't going to add overhead. I meant a proper comparison of usage tracking and map locking with the same data, <Long, Reference>. The benchmark might suggest that there isn't much benefit to ConcurrentHashMap in this case - not really enough contention, always modifying rather than just retrieving, etc.? But it would be useful to prove whether a synchronized HashMap<Long, Reference> also performs better.

I think there is no need to maintain a map; a list could be used here. The allocatedMemory.remove(this) call is only needed if disposeAll is called and could be moved to the loop in disposeAll. If the dispose method is called from the finalizer, the Memory instance is no longer reachable through the allocatedMemory map; the only thing happening there is that the map's dead entries are removed. A call to allocatedMemory.size() would remove these entries with less overhead.

A proof-of-concept implementation using a linked list can be found here; I need to double-check that everything is still working before running the benchmarks: https://github.com/joerg1985/jna/blob/memory-map/src/com/sun/jna/Memory6.java
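Condensed, the idea looks something like this (illustrative names, loosely following the Memory6 proof of concept rather than copying it):

    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;

    // Doubly-linked list of weak references: track/untrack are O(1) under a
    // short lock, and a disposeAll-style sweep just walks the nodes.
    final class TrackingList {
        static final class Node extends WeakReference<Object> {
            Node prev, next;
            Node(Object referent) { super(referent); }
        }

        private Node head;

        synchronized Node track(Object obj) {
            Node n = new Node(obj);
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
            return n;   // the tracked object keeps its node for O(1) removal
        }

        synchronized void untrack(Node n) {
            if (n.prev != null) n.prev.next = n.next;
            if (n.next != null) n.next.prev = n.prev;
            if (head == n) head = n.next;
            n.prev = n.next = null;
        }

        // Snapshot of the still-reachable objects, e.g. for disposeAll.
        synchronized List<Object> liveObjects() {
            List<Object> out = new ArrayList<Object>();
            for (Node n = head; n != null; n = n.next) {
                Object o = n.get();
                if (o != null) out.add(o);
            }
            return out;
        }
    }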

> When doing this myself I implemented AutoCloseable by delegating to dispose() - that seems a more useful terminology for deallocating memory to me. Can they not implement custom deallocation by overriding dispose()? Anyway, that's possibly a wider concern than this issue and better discussed on the mailing list? I'm interested in all this because it relates to work I'm doing at the moment.

Explicitly calling the dispose method will not stop the finalizer from calling dispose again; this will again call allocatedMemory.remove(this) and increase the number of calls to the synchronized map.
This could be avoided by adding if (peer == 0) return; to the dispose method, see Memory6.
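In other words dispose becomes idempotent, roughly like this (a paraphrase of Memory#dispose plus the suggested guard, not the exact Memory6 code):

    protected synchronized void dispose() {
        if (peer == 0) return;   // already disposed, e.g. explicitly before the finalizer ran
        try {
            free(peer);                    // release the native allocation
        } finally {
            allocatedMemory.remove(this);  // only needed to support disposeAll
            peer = 0;
        }
    }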

@dbwiddis
Contributor

> The allocatedMemory.remove(this) call is only needed if disposeAll is called

Which is why one of the available options is removing disposeAll. Or at least permitting users to somehow disable tracking if they will never use disposeAll.

@neilcsmith-net
Contributor

> Explicitly calling the dispose method will not stop the finalizer from calling dispose again,

It does if you get rid of finalize(), which was the context of that comment!

> I think there is no need to maintain a map; a list could be used here.

Again, until you want to get rid of finalize(). Which I would prioritise over arguing over list vs map.

@joerg1985
Contributor

My intention was not to disrupt the "finalizer or not in future versions" discussion, and I messed up the context of the comments; sorry for this.

The only thing I would like to contribute is the idea of a faster implementation of the current behaviour.
As mentioned above, I gave it a try in Memory6, and now I have some numbers to share; these numbers are based on revision 72dffcb.

Benchmark Mode Cnt Score Error Units
MemoryBenchmark.allocateWithMemory thrpt 25 1432560.188 ± 105192.179 ops/s
MemoryBenchmark.allocateWithMemory2 thrpt 25 1318910.324 ± 153444.187 ops/s
MemoryBenchmark.allocateWithMemory3 thrpt 25 1412166.053 ± 157307.070 ops/s
MemoryBenchmark.allocateWithMemory4 thrpt 25 1441608.320 ± 111822.872 ops/s
MemoryBenchmark.allocateWithMemory5 thrpt 25 3244573.067 ± 154981.355 ops/s
MemoryBenchmark.allocateWithMemory6 thrpt 25 2702059.448 ± 303447.446 ops/s

I hope I did not break anything.

@matthiasblaesing
Member

My numbers are not as good as yours, but Memory6 still looks like a good improvement that might be worth merging. Would you be willing to create a PR with the suggested patch? I looked at a diff between Memory and Memory6 and noticed many unrelated changes that made it difficult to see the real changes; a cleaner patch would make review easier.

@joerg1985
Contributor

> My numbers are not as good as yours, but Memory6 still looks like a good improvement that might be worth merging. Would you be willing to create a PR with the suggested patch? I looked at a diff between Memory and Memory6 and noticed many unrelated changes that made it difficult to see the real changes; a cleaner patch would make review easier.

I will create a new branch without these unrelated changes, try to find better names for the internal class/methods, and create some unit tests.

The different numbers might be related to the Java version; my default runtime is 11. I did not consider this when running the benchmark; before opening the PR I will benchmark Java 8, 11, and 14.

@matthiasblaesing
Copy link
Member

Closing, as the improved implementation was merged with 59b052a.
