Node location cache: off-heap storage and “array” implementation that supports parallel inserts #131
Conversation
https://github.com/onthegomap/planetiler/actions/runs/2008248729
ℹ️ Base Logs 1cfcca2
ℹ️ This Branch Logs 4005b3c
Are you noticing significant gains with the parallel version? I guess that the machine used to produce the benchmarks displayed in the issue has a limited number of cores. When experimenting with your code to replace lmdb in baremaps, I tried to make the memory and the serialization of records pluggable. The resulting … As the …
Yes, the parallel version is quite a bit faster (for writes at least) - but only when running over the whole planet. I ran on a machine with 32 GB RAM and 16 CPUs, and the memory-mapped sparse array took about 11 minutes (15-17 million nodes per second, 4-6 million ways per second) but the parallel array implementation took 4m44s (40-50 million nodes per second, 6-10 million ways per second). And the data storage ends up only a little bit larger (74 GB instead of 68 GB). The RAM parallel array implementation only takes 4m10s.
I am also curious if there's any noticeable difference using …
I just tried to run a benchmark to answer the question about write performance. Most of the benchmarks I found online were quite outdated (4 years old or more). Writing data on disk with mapped buffers looks suspiciously fast on my 2020 MacBook Pro (I use the force method to ensure that the changes are written to the storage device). When it comes to byte arrays and heap buffers, an overhead is noticeable when writing less than 1 MB. Above 1 MB, the two solutions look on par. Direct buffers are slower than heap buffers, but they probably put less pressure on the garbage collector.
Do not hesitate to let me know if you spot an issue in the benchmark. For now, the results tend to suggest that using …
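For reference, a minimal sketch of the kind of comparison described above — heap buffer vs. direct buffer vs. memory-mapped file writes with a `force()` at the end. This is not the actual benchmark (no JMH, no warmup, and the 1 MB size and file name are arbitrary); it only illustrates the three write paths being compared:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BufferWriteSketch {
  private static final int SIZE = 1 << 20; // 1 MB per run (arbitrary)

  public static void main(String[] args) throws IOException {
    time("heap", ByteBuffer.allocate(SIZE));
    time("direct", ByteBuffer.allocateDirect(SIZE));

    Path file = Files.createTempFile("mmap-write", ".bin");
    try (FileChannel channel = FileChannel.open(file,
        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
      long start = System.nanoTime();
      for (int i = 0; i < SIZE; i += Long.BYTES) {
        mapped.putLong(i, i);
      }
      mapped.force(); // flush changes to the storage device, as mentioned above
      System.out.println("mmap: " + (System.nanoTime() - start) / 1_000_000 + "ms");
    } finally {
      Files.deleteIfExists(file);
    }
  }

  private static void time(String name, ByteBuffer buffer) {
    long start = System.nanoTime();
    for (int i = 0; i < SIZE; i += Long.BYTES) {
      buffer.putLong(i, i); // absolute put, no position bookkeeping needed
    }
    System.out.println(name + ": " + (System.nanoTime() - start) / 1_000_000 + "ms");
  }
}
```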
Thanks @bchapuis! I did some testing too with …
Ran some performance tests on a c5ad.16xlarge (64 CPUs, 128 GB RAM) instance.

tl;dr: the parallel array nodemap implementation shaves about 5 minutes off of OSM pass 1 (8m -> 3m) when storage=ram, or 4 minutes when storage=mmap (8m -> 4m). Node read performance in pass 2 is about the same whether RAM or MMAP storage is used, so switching to MMAP and using a smaller Xmx setting (~45g or less) only adds about 1 minute to generation time.

Configurations tested:
- Xmx=115g nodemap=array/ram
- Xmx=115g nodemap=sparsearray/ram
- Xmx=45g nodemap=array/mmap/madvise=true
MMap's read performance is impressive. Does your use of madvise significantly impact performance, or do you think something changed in the way memory-mapped files are implemented in Java 17?
Before Java 13 you needed to set the position on byte buffers and then read in separate calls; Java 13 added positional get methods so multiple threads can read from the same file without synchronization, which I think helped use cases like this. Now, as long as you have more free memory than the file, it should be almost as fast as reading from RAM. Madvise probably didn't make much of a difference here, but it should help when the amount of free RAM is less than the node cache file.
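To make the positional-read point concrete, here is a hedged sketch (not planetiler's actual code; the class and method names are made up) of reading packed node locations from a memory-mapped file with the absolute `getLong(int index)` accessor, so worker threads never touch the buffer's shared position and need no synchronization:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative reader: node ID -> packed lat/lon long stored at offset nodeId * 8. */
public class MappedNodeReader {
  private final MappedByteBuffer buffer;

  public MappedNodeReader(Path file, long size) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      // A real implementation maps the file in segments, since a single
      // MappedByteBuffer is limited to Integer.MAX_VALUE bytes; the mapping
      // stays valid after the channel is closed.
      this.buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
    }
  }

  /** Absolute get: no shared position, so many threads can call this concurrently. */
  public long getPackedLocation(long nodeId) {
    return buffer.getLong((int) (nodeId * Long.BYTES));
  }
}
```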
Thanks a lot for the clarification. I was aware of the positional get methods, but I would not have thought they would have such an impact, even when a lot of RAM is available.
… supports parallel inserts (onthegomap#131)

* Add --nodemap-type=array option for 2-3x faster osm pass 1 imports
* Add --nodemap-storage=direct option to experiment with direct (off-heap) memory usage
* Extract ResourceUsage and OsmPhaser utilities
Add more options for node location storage (aka `LongLongMap`, since lat/lon gets packed into a 64-bit long for storage) and some code improvements along the way:

- `--nodemap-type=array` option that stores node locations in an array indexed by node ID (8 bytes * max node ID) - and also supports parallel inserts. This makes OSM pass 1 2-3x faster on larger machines (a rough sketch of the idea follows this list)
- `--nodemap-storage=direct` option to store the node location cache in native off-heap "direct" memory using `ByteBuffer.allocateDirect` (you'll likely need to increase the `-XX:MaxDirectMemorySize` setting to use this)
- `ResourceUsage` class that prints a more detailed breakdown of usage
- `OsmPhaser` class that also logs more details when a node begins/ends
- `posix_madvise` instead of `madvise`
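A rough, hypothetical sketch of the idea behind the `--nodemap-type=array` option and the lat/lon packing mentioned above. The class name, the fixed-point packing format, and the use of `AtomicLongArray` are assumptions for illustration, not planetiler's implementation; real OSM node IDs also exceed `int` range, so the actual storage is segmented:

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Sketch of an "array" LongLongMap: node locations stored in a flat array
 * indexed by node ID (8 bytes per possible ID). Writes to distinct indexes
 * need no coordination, so multiple worker threads can insert in parallel.
 */
public class ArrayNodeMap {
  private final AtomicLongArray locations; // volatile semantics for cross-thread visibility

  public ArrayNodeMap(int maxNodeId) {
    this.locations = new AtomicLongArray(maxNodeId + 1);
  }

  /** Pack lat/lon into one 64-bit value as two 32-bit fixed-point (1e-7 degree) ints. */
  static long pack(double lat, double lon) {
    int latE7 = (int) Math.round(lat * 1e7);
    int lonE7 = (int) Math.round(lon * 1e7);
    return ((long) latE7 << 32) | (lonE7 & 0xFFFFFFFFL);
  }

  static double latOf(long packed) {
    return (packed >> 32) / 1e7;
  }

  static double lonOf(long packed) {
    return ((int) packed) / 1e7;
  }

  /** Safe to call from multiple threads as long as each node ID is written once. */
  public void put(int nodeId, double lat, double lon) {
    locations.set(nodeId, pack(lat, lon));
  }

  public double[] get(int nodeId) {
    long packed = locations.get(nodeId);
    return new double[]{latOf(packed), lonOf(packed)};
  }
}
```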
land checklist:
- `ByteBuffer.allocateDirect`
- `"array"` nodemap type that supports parallel inserts from multiple threads
- `osm_pass1` phase to support a sequential and parallel execution model, based on if the node map implementation supports parallel inserts
- `allocate` vs. `allocateDirect`
- `-XX:MaxDirectMemorySize` if set, otherwise `Runtime.getRuntime().maxMemory()` (see the sketch below).
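To illustrate the last checklist item, a hedged sketch of one way to pick a direct-memory limit: honor `-XX:MaxDirectMemorySize` if it was passed on the command line, otherwise fall back to `Runtime.getRuntime().maxMemory()` (the heap limit). Scanning the JVM input arguments and the helper names below are assumptions for illustration, not necessarily how planetiler detects the limit:

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper: resolve a cap for off-heap "direct" allocations. */
public class DirectMemoryLimit {
  public static long detect() {
    // Look for an explicit -XX:MaxDirectMemorySize=... JVM argument first.
    for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
      if (arg.startsWith("-XX:MaxDirectMemorySize=")) {
        return parseSize(arg.substring("-XX:MaxDirectMemorySize=".length()));
      }
    }
    // Otherwise fall back to the max heap size.
    return Runtime.getRuntime().maxMemory();
  }

  /** Parse values like "4g", "512m", "1048576" into bytes. */
  private static long parseSize(String value) {
    long multiplier = 1;
    char last = Character.toLowerCase(value.charAt(value.length() - 1));
    if (last == 'k') multiplier = 1L << 10;
    else if (last == 'm') multiplier = 1L << 20;
    else if (last == 'g') multiplier = 1L << 30;
    if (multiplier > 1) value = value.substring(0, value.length() - 1);
    return Long.parseLong(value) * multiplier;
  }
}
```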