Proof of concept: Higher compression mode #29
base: main
Conversation
The initial results are very impressive! Kudos! I wonder, what are the buffering requirements here? Maybe getting ahead of ourselves here, but: in theory, would it be possible to have an algorithm that makes two passes over a 128k buffer, first to estimate the symbol distribution and the second one to actually encode data? Would it make sense to have a mode that does those two passes for the first block only, to compress small images such as website favicons with a high ratio even if they don't fit the hardcoded distribution, and have the cost of the extra pass over a single block amortized over the runtime of a large image?
I'm happy to help with testing this, by the way. This looks like a great application for roundtrip fuzzing, and I can take care of writing the harness and running the fuzzer.
I'm curious whether keeping backreferences pixel-aligned helps.
Did some testing using the geometric mean rather than a simple average. Now performance and compression ratio fall close to zlib-rs level 4 or 5. Still beats miniz_oxide and zlib.
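For reference, the geometric mean here is just the exponential of the mean log ratio; a minimal sketch of how it could be computed (illustrative, not the actual benchmark harness):

```rust
/// Geometric mean of per-image compression ratios: exp(mean(ln r_i)).
/// Unlike a simple average, no single very large or very small file can
/// dominate the result.
fn geometric_mean(ratios: &[f64]) -> f64 {
    let sum_ln: f64 = ratios.iter().map(|r| r.ln()).sum();
    (sum_ln / ratios.len() as f64).exp()
}
```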
Yes, using the actual block contents to estimate the symbol distribution does (slightly) improve compression. You can actually make more than two passes: each additional pass gives you a more accurate symbol distribution, which in turn lets you make better choices about which back-references to use. This is what the "iterations" parameter in zopfli does. The downside is that this significantly adds to compression time: selecting which sequence of matches to use takes the majority of the time when compressing a block, so doing it twice cuts performance roughly in half.
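In outline, that refinement loop looks roughly like this. Every name below is a hypothetical stand-in for the real match selection and cost model, not zopfli's or this branch's actual code; it only shows the shape of the iteration.

```rust
// Hypothetical symbol type: a literal byte or a back-reference.
enum Symbol {
    Literal(u8),
    Backref { dist: u16, len: u16 },
}

/// Each round picks matches under the current cost model, then rebuilds the
/// cost model from the symbols that choice would emit.
fn refine_parse(block: &[u8], iterations: usize) -> Vec<Symbol> {
    let mut freqs = default_frequencies();
    let mut parse = Vec::new();
    for _ in 0..iterations {
        parse = choose_matches(block, &freqs); // the expensive step
        freqs = symbol_histogram(&parse);      // cheap bookkeeping
    }
    parse
}

// Placeholder starting point: uniform frequencies over the 286 literal/length codes.
fn default_frequencies() -> Vec<u32> {
    vec![1; 286]
}

// Placeholder: count how often each literal/length code appears in the parse.
fn symbol_histogram(parse: &[Symbol]) -> Vec<u32> {
    let mut freqs = vec![0u32; 286];
    for sym in parse {
        match sym {
            Symbol::Literal(b) => freqs[*b as usize] += 1,
            Symbol::Backref { .. } => freqs[257] += 1, // lump all lengths together
        }
    }
    freqs
}

// Placeholder: a real implementation would do greedy or optimal parsing here.
fn choose_matches(block: &[u8], _freqs: &[u32]) -> Vec<Symbol> {
    block.iter().map(|&b| Symbol::Literal(b)).collect()
}
```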
Tried this out and got roughly a twentieth of a percentage point of improvement. That may be partly because the encoder is currently doing greedy parsing rather than optimal parsing, so it doesn't benefit as much from knowing symbol costs.
The data structures involved don't really allow you to query only for aligned matches; they just tell you the closest match of a given length, whether it happens to be aligned or not. You could try only doing look-ups at the start of pixel boundaries, but then you'd miss cases like a 5-byte match consisting of the green and blue channels of one pixel along with all three channels of the next pixel.

Though I think a bigger problem is that the simple data structures I'm using are fast but miss way too many possible matches. Ideally you'd learn about a bunch of possible matches and then use a later step to select the best ones to use. Currently the encoder will only know about the most recent matching 4-byte sequence but not a 7-byte sequence that came just before it. (I think the other compressors mostly have the opposite issue: at a given compression level they examine the previous N matches that are 3+ bytes in length to find the longest one. At low compression levels, they frequently only find junk matches that are 3-4 bytes long. At high compression levels, they find the good matches but waste a really long time looking at shorter matches.) I may look into Morphing Match Chain or other data structures that might have better results.
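To make the "most recent match only" limitation concrete, here's a toy version of a table that keeps a single position per 4-byte prefix (illustrative only, not the code in this branch):

```rust
use std::collections::HashMap;

// Each 4-byte prefix maps to exactly one position, so inserting a newer
// occurrence forgets any earlier one, even if starting the match at the
// earlier position would have given a longer (e.g. 7-byte) match.
fn index_positions(data: &[u8]) -> HashMap<[u8; 4], usize> {
    let mut table = HashMap::new();
    for pos in 0..data.len().saturating_sub(3) {
        let key: [u8; 4] = data[pos..pos + 4].try_into().unwrap();
        table.insert(key, pos); // overwrites the previous position for this key
    }
    table
}
```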
My hypothesis is that long matches misaligned with channels are extremely rare, so you don't lose much compression, while you gain the ability to process 3-4 bytes at a time and get fewer unique backreference distance+length pairs (IIRC that also helps). Why rare? The misalignment would require the image to have colors shifted across the R/G/B channels, and that would look like color aberrations. I don't think this happens often in real images. Even the visual effects of glitch art and lens aberrations are unlikely to have channels shifted that precisely. ClearType and anaglyph images will probably adjust for the different perceived brightness of green, so they won't repeat exactly either. Additionally, opaque pixels in RGBA have 255 every four bytes, which is likely to "synchronize" matches a lot.
Actually, the above applies to filter=0. Filtering may fix that. That makes me wonder whether it would help to have a different compression strategy per filter type? |
Yeah, the data looks very different after filtering than it does beforehand. The filters do a very good job of reducing entropy. On the test corpus I've looked at, upwards of 75% of all bytes after filtering are either
This is a proof of concept that replaces the existing encoder with a slower but higher-compression one (an actual implementation would likely expose both as separate modes). On some PNG data I experimented on, it was as fast as miniz_oxide's level=2 but got a compression ratio better than miniz_oxide's level=9.
The key insight is that PNG data has a very skewed distribution of symbols, which often makes short back-references worse than simply encoding the same data with literals. This encoder doesn't even attempt to find 3-byte back-references, and 4-7 byte back-references are only used if they are likely to require fewer bits than the equivalent sequence of literals. The encoder uses a hash table of 8-byte sequences and a separate one for 4-byte sequences. This means that with only two hash look-ups per byte, the encoder has a good chance of finding matches for most look-ups (thanks to the 4-byte table) but doesn't risk missing the 8+ byte sequences that give the biggest compression wins.
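A rough sketch of how the two-table lookup could be organized; the names, table sizes, and hashing below are illustrative stand-ins, not this branch's actual code:

```rust
const BITS: usize = 15;

struct MatchFinder {
    table8: Vec<u32>, // hash of the next 8 bytes -> most recent position
    table4: Vec<u32>, // hash of the next 4 bytes -> most recent position
}

impl MatchFinder {
    fn new() -> Self {
        MatchFinder {
            table8: vec![u32::MAX; 1 << BITS],
            table4: vec![u32::MAX; 1 << BITS],
        }
    }

    // Toy multiplicative hash; real code would be more careful.
    fn hash(bytes: &[u8]) -> usize {
        let mut h = 0u64;
        for &b in bytes {
            h = h.wrapping_mul(0x9E37_79B9_7F4A_7C15).wrapping_add(b as u64);
        }
        (h >> (64 - BITS)) as usize
    }

    /// Two look-ups per position: prefer a candidate from the 8-byte table
    /// (the big wins), otherwise fall back to the 4-byte table. Returns the
    /// previous position stored under the same hash, if any, and records the
    /// current position in both tables.
    fn candidate(&mut self, data: &[u8], pos: usize) -> Option<usize> {
        let mut found = None;
        if pos + 8 <= data.len() {
            let h = Self::hash(&data[pos..pos + 8]);
            if self.table8[h] != u32::MAX {
                found = Some(self.table8[h] as usize);
            }
            self.table8[h] = pos as u32;
        }
        if pos + 4 <= data.len() {
            let h = Self::hash(&data[pos..pos + 4]);
            if found.is_none() && self.table4[h] != u32::MAX {
                found = Some(self.table4[h] as usize);
            }
            self.table4[h] = pos as u32;
        }
        // The caller would verify the actual match length at `found`, and
        // accept 4-7 byte matches only when the cost model says they need
        // fewer bits than the equivalent literals.
        found
    }
}
```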
To estimate the symbol frequencies (which determine whether literals or back-references are cheaper), the first 128 KB block assumes the hard-coded symbol frequencies used by the existing ultra-fast mode. Each subsequent 128 KB block then uses the frequencies measured in the previous block.
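In outline, that hand-off between blocks looks something like the following; every name here is an illustrative stand-in, not this branch's API:

```rust
const BLOCK_SIZE: usize = 128 * 1024;

fn compress(data: &[u8], out: &mut Vec<u8>) {
    // First block: assume the hard-coded distribution from the ultra-fast
    // mode. Later blocks: reuse whatever the previous block actually saw.
    let mut freqs = hardcoded_frequencies();
    for block in data.chunks(BLOCK_SIZE) {
        encode_block(block, &freqs, out);
        freqs = measure_frequencies(block);
    }
}

// Placeholder for the fixed distribution the ultra-fast mode already uses.
fn hardcoded_frequencies() -> Vec<u32> {
    vec![1; 286]
}

// Placeholder: literal byte histogram of the block just processed.
fn measure_frequencies(block: &[u8]) -> Vec<u32> {
    let mut freqs = vec![0u32; 286];
    for &b in block {
        freqs[b as usize] += 1;
    }
    freqs
}

// Placeholder for the real DEFLATE block encoder.
fn encode_block(block: &[u8], _freqs: &[u32], out: &mut Vec<u8>) {
    out.extend_from_slice(block);
}
```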
Results on raw PNG IDAT data produced by re-encoding the QOI benchmark suite images...
zlib and zlib-ng