add compact DB mode (--compact-db) to de-duplicate mbtiles output #219
Conversation
https://github.com/onthegomap/planetiler/actions/runs/2379534492: ℹ️ Base Logs 6873b98, ℹ️ This Branch Logs bb5f4c6
Would using a faster non-cryptographic hashing algorithm improve performance much? Or is performance so limited by IO that it doesn't matter?
...which splits the tiles table into tiles_shallow and tiles_data:
- tiles_shallow contains the coordinates plus a reference to the data ID
- tiles_data contains the data ID plus the actual tile data
- this allows deduplicating content, since multiple tiles can reference the same data
- in this mode, tiles is realized as a view that joins the two tables tiles_shallow and tiles_data

I got around to testing this; here are the file size differences:
When uncompressed, compacted mbtiles are a LOT smaller. There isn't much difference when compressing the files with zstandard, which is probably to be expected. I tried serving the tiles with tileserver-gl, and while I have no performance figures, the compacted tiles do feel slightly slower than the full tileset. But by the time you add some tile caching into the mix, it may not matter much.
Thanks for this change @bbilger! This looks like a huge improvement.
Have you tried sampling with VisualVM to see how much of a hot spot the hash is? I'm also looking into generating pmtiles output, and it uses the FNV hash to solve this problem. I haven't checked for collisions, but with a 64-bit hash over 200-300 million tiles it should theoretically be a 1 or 2 out of 1000 chance of a collision, and it is very simple and fast:

private static final long FNV1_64_INIT = 0xcbf29ce484222325L;
private static final long FNV1_PRIME_64 = 1099511628211L;

public static long hash(byte[] data) {
  long hash = FNV1_64_INIT;
  for (byte datum : data) {
    hash ^= (datum & 0xff);
    hash *= FNV1_PRIME_64;
  }
  return hash;
}

Maybe @bdon has a better sense of the FNV hash's suitability for planet-scale tilesets?
I guess at the scales we're talking about (millions of tiles), one would really want the chance of collisions to be so low that you don't have to worry about checking for them; otherwise I think you'd have to do a binary compare to know for certain. The last thing we would want is for a data tile to collide with an ocean tile and not get included... If you end up going with a faster hash algorithm, a 128-bit hash would probably be the safer option, and I imagine it should still be faster than SHA-1/MD5.
Spoke offline with Brandon about hash functions - he raises a good point that it's likely to be only a handful of tiles that are duplicated, and they are most likely small (oceans, parks, etc.). We could apply a cheap pre-filter before deciding to store the tile hash → contents mapping in an in-memory map. Some possibilities:
Either of these approaches would drastically reduce the number of distinct tiles we'd need to track, and likely make something like the 32-bit integer FNV hash sufficient to avoid any possibility of a collision - but in any case a cryptographic hash function is likely overkill.
@shermp @msbarry
...and add 2 layers of hashing: one based on Java hash + mod and the other based on FNV1-32 (up for discussion). The result of the first layer is just stored in a BitSet; this only allows us to tell if there might be a dupe, so collisions are expected for the first layer. The result of the second layer is a map between the hash and the data ID. The first layer, however, allows us to delay the need to create a "real" hash. In addition, some "filtering" was added to only perform all this hashing if the tile's size is below a certain threshold, because dupes only occur on simple/small tiles like ocean tiles. As part of this, the actual encoding was extracted as well. Note: the improvement is barely noticeable but it's probably the right thing to do.
Thanks for your feedback, I tried to address it in 0faa63a => please let me know what you think. Testing with Australia, I can barely notice any difference, but it seems a tiny bit faster. As I said before, I don't think the hashing is the bottleneck, and at least I cannot identify it as one in VisualVM, but it's probably the right thing to do to not use SHA-1. One reason I decided to use SHA-1 in the beginning was that I wanted to avoid any collisions. As such I share @shermp's concerns that using FNV-32 might be a bit problematic, since a collision would be rather bad. Maybe it would be better to use FNV-64, FNV-128, or Murmur3-128?
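For what it's worth, a 128-bit non-cryptographic option is easy to sketch if Guava happens to be on the classpath (an assumption - this isn't implied by the PR itself):

```java
import com.google.common.hash.Hashing;

public final class TileHash128 {
  /** Murmur3 128-bit hash of the encoded tile data, returned as 16 bytes. */
  public static byte[] murmur128(byte[] tileData) {
    return Hashing.murmur3_128().hashBytes(tileData).asBytes();
  }
}
```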
On second thought, I looked at some stats... there are 280 million tiles for the planet, but only about 50 million unique tiles, so only about a ~7/100k chance of collision. And the size threshold might not be too helpful: looking at Australia, about 85% of unique tiles are <1kb (gzipped, but still a good indicator).
For this approach we'd want a first pass that checks if we might have seen a tile before, and if so then consult the map from tile hash to tile ID. A bloom filter would work for this first pass, or a bitset where the number of bits is > the expected # of unique tiles. Looking at those ~7/100k odds of a collision, it might not even be worth the extra check.
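For a rough sense of where that ~7/100k figure comes from (assuming a 64-bit hash, as discussed above, and about 50 million unique tiles), the birthday approximation gives:

$$P(\text{collision}) \approx \frac{n^2}{2 \cdot 2^{64}} = \frac{(5 \times 10^7)^2}{2 \cdot 2^{64}} \approx 6.8 \times 10^{-5} \approx \frac{7}{100{,}000}$$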
Instead of using a size threshold, could we limit the deduplication to only tiles that have a single layer, or a single feature on a single layer? We know a priori that the deduplication is useful for ocean squares and land squares, which might have arbitrary tags that increase the total tile size. We would need the ability to introspect the PBF content instead of treating it as a blob. Are there good counterexamples for why this won't work?
It is possible that we can pass through an "is fill" bit for each feature, then easily inspect whether all features in a tile are fill polygons without deserializing - that's what I was planning for #168. So maybe to start on this PR, we could limit the deduping to just tiles with fewer than some small number of features (1? 2? 5?) and do the more robust fill identification as part of #168? There's a
also get rid of the 2-layer hashing (1: simple hash + BitSet, 2: FNV + map), again
"NumFeaturesToEmit" seems like the right thing to do and I implemented it accordingly. ...
Ran some numbers today; looks like it's possible to encode tiles up to zoom level 29 in a 64-bit int. Considering that at each zoom level each axis has 2^zoom tiles, you can encode tiles like so: bits 0-4 could be ... Or any other similar scheme.
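One possible layout along these lines (purely an illustration, not necessarily the scheme proposed above): 5 bits for the zoom level and 29 bits each for x and y cover zoom levels 0-29 in 63 bits:

```java
public final class TileIds {
  // zoom in bits 0-4, y in bits 5-33, x in bits 34-62; works for zoom levels 0..29
  public static long encode(int zoom, int x, int y) {
    return ((long) x << 34) | ((long) y << 5) | zoom;
  }

  public static int zoom(long id) {
    return (int) (id & 0x1f);
  }

  public static int y(long id) {
    return (int) ((id >>> 5) & ((1L << 29) - 1));
  }

  public static int x(long id) {
    return (int) (id >>> 34);
  }
}
```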
👍
Are you generating the x,y columns as part of the SQL view?
Well, yes but mostly no. Technically, the view remains unchanged from what it is now for the compact mode:
The key change is to have generated and, most importantly, indexed columns in
With a view-only approach like the following, we would end up with a full table scan, which would hurt read performance badly:
What I can say so far is that:
- haven't tested the impact on reads yet
For now yes although it will increase to 15 or 16 at some point in the future.
👍 I would expect some write performance increase to come from a simpler index structure, but most of it to come from being able to fit more tiles into each batch given the 999 limit on prepared statement parameters. Curious how read performance turns out. To query from the application layer it would be relatively efficient to map from x/y/z to tile ID and then look up the data for it, but I'm not sure how possible it is to express that in a sqlite view...
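Concretely, assuming SQLite's default limit of 999 bound parameters per statement (as mentioned above), four parameters per row (zoom, x, y, data) versus two (encoded tile ID, data) changes the maximum batch size from

$$\left\lfloor \frac{999}{4} \right\rfloor = 249 \quad\text{to}\quad \left\lfloor \frac{999}{2} \right\rfloor = 499$$

rows per multi-row insert.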
Some comments on the high-level benchmark and verification code. Planning to get down into the implementation details next...
I cannot see a noticeable impact on read performance (BenchmarkMbtilesRead shows something like 1% faster). The approach presented here in #219 (comment) - let's call it approach no. 1 - relies on generated x,y,z columns that need to be indexed, thus the index structure is/should be (mostly) identical. Unless you want to build a sqlite file that breaks with the mbtiles spec, approach no. 1 is the only feasible option I can see. This brings us to approach no. 3:
(bear with me if the formula is not quite correct; I haven't tested it much but it worked at least for my test coordinates)
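The formula itself isn't shown here; one standard way to map z/x/y to a single cumulative tile ID (not necessarily the exact formula referenced above) is to count all tiles at lower zoom levels and then number tiles row by row within the current level:

$$\mathrm{tile\_id}(z, x, y) = \sum_{i=0}^{z-1} 4^{i} + y \cdot 2^{z} + x = \frac{4^{z} - 1}{3} + y \cdot 2^{z} + x$$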
Cool. I would expect the fastest one to be the original, so I'm mostly interested in how much read qps drops with the more compact formats. If we can get a compact format with the same read performance as the original, that would be ideal.
OK, is there any benefit to this one then? It seems the same as x/y/z columns, but with some added complexity?
OK cool, thanks for exploring this, but if it's looking like there might not be a good way to express this through sqlite views then feel free to ignore!
Also @bbilger let me know if any of those need more clarification, or if you'd rather have me push the change for one to your branch!
@msbarry all is clear. Sorry, I just didn't have time to work on it over the weekend. I will address the majority of your comments by Friday at the latest.
The advantage is that write performance increases, since the prepared statement will have just 1 vs 3 arguments. Let me rewrite the write benchmark; then a decision can be based on that.
the feature count might also differ => don't fail, just report as diff (see discussion 214)

- mbtiles#encode: generate hash (if small) + "detect" memoized
- mbtiles#write: generate dataId + de-dupe (if hash + memoized*)

* not sure if we should de-dupe memoized only, since we could further reduce the file size by de-duping all with a hash

- bench mbtiles#write only
- report tile writes / s
- try to make the test somewhat realistic
@msbarry Other than that I think I addressed all your feedback, but please let me know if you want anything changed in addition or if there's anything else. The last "open" thing is whether to try to de-dupe memoized only, or all with a hash. Note: for some reason (in quite some cases) reading from the compact db seems to be faster now. I cannot really explain it, and cannot find a bug in the read benchmark either, tho.
Results
gen Australia in not-compact mode
gen Australia in compact mode
gen Australia in compact-tileId mode
BenchmarkMbtilesRead
BenchmarkMbtilesWrite
Sorry for the delay, this looks great! Leaving a few minor comments, but I see no major issues remaining before we merge. Thanks for putting this together and improving the code along the way!
@ParameterizedTest
@ArgumentsSource(TestArgs.class)
void testFnv32(boolean expectSame, byte[] data0, byte[] data1) {
  var hash0 = Hashing.fnv32(data0);
Minor suggestion (feel free to ignore) - if you change fnv32 to take a vararg byte... data argument then you could make these tests a bit more compact, something like:
assertEquals(fnv32(1), fnv32(1));
and
assertNotEquals(fnv32(1, 2), fnv32(2, 1));
hmm, I am a big fan of parameterized tests and the argument provider could have been re-used for other hash functions, but okay, I rewrote it now - tho it's not so compact because of the required cast.
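Roughly what the rewritten assertions look like with a byte... signature (a sketch, not the exact test code - the casts are needed because Java integer literals aren't bytes):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;

import com.onthegomap.planetiler.util.Hashing;
import org.junit.jupiter.api.Test;

class HashingTest {
  @Test
  void testFnv32() {
    assertEquals(Hashing.fnv32((byte) 1), Hashing.fnv32((byte) 1));
    assertNotEquals(Hashing.fnv32((byte) 1, (byte) 2), Hashing.fnv32((byte) 2, (byte) 1));
  }
}
```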
    getDiffJoined(features1, features0, "\n"));

if (failOnFeatureDiff) {
  throw new RuntimeException(msg);
To fix sonar's complaint here and below, this could be a more specific exception like AssertionError?
Absolutely no problem. Thank you for the review and the feedback. I tried to incorporate your feedback in commit c65c7b7. Let me know if there's anything else.
Ran some tests this morning on the planet with this branch merged with #225.
The read benchmark shows about 67.7k reads/s with regular and 64.7k reads/s with compact (I was running short on time so did a smaller test).
Previous tests before this change took about 12m +/- 10s to write mbtiles, so I don't have any concerns about
Looks good! 2 minor follow-up comments, then it should be good to merge.
public int generateContentHash() {
  int hash = Hashing.FNV1_32_INIT;
  for (var feature : entries) {
    long layerId = extractLayerIdFromKey(feature.key());
layerId starts as a byte - can we just keep it a byte and avoid Longs.toByteArray?
sure! sorry, got fooled by #hasSameContents in which the result of #extractLayerIdFromKey is assigned to a variable of type long
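Roughly what the suggested change amounts to in the loop above, assuming an fnv32 overload that folds additional bytes into a running hash (a hypothetical signature - the actual overload in the PR may differ):

```java
public int generateContentHash() {
  int hash = Hashing.FNV1_32_INIT;
  for (var feature : entries) {
    byte layerId = extractLayerIdFromKey(feature.key()); // keep it a byte, no Longs.toByteArray
    hash = Hashing.fnv32(hash, layerId);
    // ...fold the rest of the feature's bytes into the hash the same way...
  }
  return hash;
}
```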
@@ -0,0 +1,23 @@
package com.onthegomap.planetiler.util;

public final class Hashing {
Can you add a short javadoc to this class and its public members that explains what it's doing and how to use it?
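A sketch of what the documented utility might look like - the constants are the standard 32-bit FNV parameters and the loop mirrors the 64-bit example earlier in this thread, but the exact javadoc wording and overloads in the PR may differ:

```java
package com.onthegomap.planetiler.util;

/**
 * Lightweight non-cryptographic hash functions, used to detect duplicate tile
 * contents when writing mbtiles in compact mode.
 */
public final class Hashing {

  /** Initial value to use for 32-bit FNV hashes (the FNV offset basis). */
  public static final int FNV1_32_INIT = 0x811c9dc5;
  private static final int FNV1_PRIME_32 = 16777619;

  private Hashing() {
  }

  /** Returns the 32-bit FNV hash of {@code data}, starting from {@link #FNV1_32_INIT}. */
  public static int fnv32(byte... data) {
    int hash = FNV1_32_INIT;
    for (byte datum : data) {
      hash ^= (datum & 0xff);
      hash *= FNV1_PRIME_32;
    }
    return hash;
  }
}
```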
Kudos, SonarCloud Quality Gate passed!
Context
this addresses point 2 of #167 (comment)
Overview
- the compact DB mode splits the tiles table into tiles_shallow and tiles_data
- this allows de-duplicating content, since multiple tiles can reference the same data
- in this mode, tiles is realized as a view that joins the two tables tiles_shallow and tiles_data (see the schema sketch below)
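A minimal sketch of the schema this describes; the table names come from the description above, but the exact column names and constraints are assumptions and the real DDL in Mbtiles.java may differ:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class CompactSchemaSketch {
  /** Creates the two backing tables plus a spec-compatible "tiles" view on top of them. */
  static void createCompactTables(Connection connection) throws SQLException {
    try (Statement statement = connection.createStatement()) {
      statement.execute("""
        CREATE TABLE tiles_shallow (
          zoom_level INTEGER,
          tile_column INTEGER,
          tile_row INTEGER,
          tile_data_id INTEGER,
          PRIMARY KEY (zoom_level, tile_column, tile_row)
        ) WITHOUT ROWID""");
      statement.execute("""
        CREATE TABLE tiles_data (
          tile_data_id INTEGER PRIMARY KEY,
          tile_data BLOB
        )""");
      // readers that expect the mbtiles spec keep querying "tiles" as before
      statement.execute("""
        CREATE VIEW tiles AS
        SELECT zoom_level, tile_column, tile_row, tile_data
        FROM tiles_shallow
        JOIN tiles_data ON tiles_shallow.tile_data_id = tiles_data.tile_data_id""");
    }
  }
}
```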
Impact
some more insights can be found here: PR-impact.txt
Concept
The main idea is to create a hash over the tile data and build a map from that hash to a data ID (sketched below).
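A stripped-down sketch of that idea; the class and field names are illustrative, not the actual MbtilesWriter code, which also restricts hashing to small tiles and handles batched inserts:

```java
import com.onthegomap.planetiler.util.Hashing;
import java.util.HashMap;
import java.util.Map;

class TileDataDeduper {
  private final Map<Integer, Integer> tileHashToDataId = new HashMap<>();
  private int nextDataId = 1;

  /**
   * Returns the tiles_data row ID that tiles_shallow should reference, reusing an
   * existing ID when a tile with identical content was already written.
   * Relies on the 32-bit hash being collision-free in practice, as discussed above.
   */
  int dataIdFor(byte[] tileData) {
    int hash = Hashing.fnv32(tileData);
    Integer existing = tileHashToDataId.get(hash);
    if (existing != null) {
      return existing; // duplicate content: reference the already-written row
    }
    int id = nextDataId++;
    tileHashToDataId.put(hash, id);
    // the caller inserts (id, tileData) into tiles_data here, since this content is new
    return id;
  }
}
```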
Failed Approaches
I was a bit surprised that the impact on writing the mbtiles is little to none. The only explanation I have is that we have to do more inserts than before (tiles_shallow, tiles_data). I tried to offload this "dual insert" from code to the DB: in planetiler I produced just one insert, but the insert went into a view, which then either inserted only into tiles_shallow or into both tiles_shallow and tiles_data. This, however, increased the overall mbtiles generation time by almost 20%.
Next
I'd like to address point 1 in #167 (comment) either in this or a follow-up PR