
Proof of concept: Higher compression mode #29

Draft
fintelia wants to merge 6 commits into main

Conversation

@fintelia (Contributor) commented Sep 22, 2024

This is a proof of concept that replaces the existing encoder with a slower but higher-compression one (an actual implementation would likely expose both as separate modes). On some PNG data I experimented with, it was as fast as miniz_oxide's level=2 but achieved a better compression ratio than miniz_oxide's level=9.

The key insight is that PNG data has a very skewed distribution of symbols, which often makes short back-references worse than simply encoding the same data with literals. This encoder doesn't even attempt to find 3-byte back-references, and 4-7 byte back-references are only used if they are likely to require fewer bits than the equivalent sequence of literals. The encoder uses a hash table of 8-byte sequences and a separate one for 4-byte sequences. This means that with only two hash look-ups per byte, the encoder has a good chance of finding a match at most positions (thanks to the 4-byte table) but doesn't risk missing the 8+ byte matches that give the biggest compression wins.
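
As a rough illustration, the match-finding side could look something like the sketch below; the constants, hash functions, and names here are placeholders rather than the actual code in this branch:

```rust
// Illustrative sketch only: a "most recent position" table keyed by 8-byte
// hashes plus a second one keyed by 4-byte hashes. The caller verifies the
// candidate, measures the match length, and only keeps 4-7 byte matches when
// they are estimated to cost fewer bits than the equivalent literals.
const WINDOW_SIZE: usize = 1 << 15;
const HASH8_BITS: u32 = 16;
const HASH4_BITS: u32 = 16;

fn hash8(bytes: &[u8]) -> usize {
    let v = u64::from_le_bytes(bytes[..8].try_into().unwrap());
    (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - HASH8_BITS)) as usize
}

fn hash4(bytes: &[u8]) -> usize {
    let v = u32::from_le_bytes(bytes[..4].try_into().unwrap());
    (v.wrapping_mul(0x9E37_79B1) >> (32 - HASH4_BITS)) as usize
}

struct MatchFinder {
    head8: Vec<u32>, // last position of each hashed 8-byte sequence
    head4: Vec<u32>, // last position of each hashed 4-byte sequence
}

impl MatchFinder {
    fn new() -> Self {
        MatchFinder {
            head8: vec![u32::MAX; 1 << HASH8_BITS],
            head4: vec![u32::MAX; 1 << HASH4_BITS],
        }
    }

    /// Two hash look-ups per position: prefer a candidate from the 8-byte
    /// table (big wins), otherwise fall back to the 4-byte table.
    fn candidate(&mut self, data: &[u8], pos: usize) -> Option<usize> {
        if pos + 8 > data.len() {
            return None;
        }
        let (h8, h4) = (hash8(&data[pos..]), hash4(&data[pos..]));
        let (c8, c4) = (self.head8[h8], self.head4[h4]);
        self.head8[h8] = pos as u32;
        self.head4[h4] = pos as u32;
        // Empty slots hold u32::MAX, which also fails the `c < pos` check.
        [c8, c4]
            .into_iter()
            .map(|c| c as usize)
            .find(|&c| c < pos && pos - c <= WINDOW_SIZE)
    }
}
```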

To estimate the frequency of symbols (which determines whether literals or back-references are cheaper), the first 128 KB of data is encoded assuming the hard-coded symbol frequencies used by the existing ultra-fast mode. Each subsequent 128 KB block then uses the frequencies observed in the previous block.
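
In sketch form, the per-block handoff of frequencies works roughly like this (the symbol counts and the placeholder `encode_block` are illustrative, not the real structure):

```rust
// Illustrative sketch of the per-block frequency handoff described above.
// 286 is the number of deflate literal/length symbols; distance symbols would
// be handled the same way but are omitted here for brevity.
const BLOCK_SIZE: usize = 128 * 1024;

/// Convert symbol frequencies into approximate bit costs: cost ≈ -log2(p).
fn bit_costs(freqs: &[u32; 286]) -> [f32; 286] {
    let total = freqs.iter().sum::<u32>().max(1) as f32;
    let mut costs = [0.0f32; 286];
    for (cost, &f) in costs.iter_mut().zip(freqs) {
        *cost = (total / f.max(1) as f32).log2();
    }
    costs
}

fn compress(data: &[u8], hardcoded_freqs: &[u32; 286]) {
    // The first 128 KB is costed with the ultra-fast mode's fixed frequencies...
    let mut costs = bit_costs(hardcoded_freqs);
    for block in data.chunks(BLOCK_SIZE) {
        // ...and every later block is costed with the frequencies actually
        // observed while encoding the previous block.
        let observed = encode_block(block, &costs);
        costs = bit_costs(&observed);
    }
}

/// Placeholder for the real match-finding + Huffman-coding step; returns the
/// per-symbol counts of the block it just emitted.
fn encode_block(_block: &[u8], _costs: &[f32; 286]) -> [u32; 286] {
    [0; 286]
}
```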

Results on raw PNG IDAT data produced by re-encoding the QOI benchmark suite images...

Encoder[level]     Speed          Ratio
--------------     -----------    ------
fdeflate:          0.128 GiB/s    24.14%

miniz_oxide[0]:    0.123 GiB/s    100.02%
miniz_oxide[1]:    0.285 GiB/s    28.17%
miniz_oxide[2]:    0.128 GiB/s    26.12%
miniz_oxide[3]:    0.081 GiB/s    25.36%
miniz_oxide[4]:    0.069 GiB/s    25.24%
miniz_oxide[5]:    0.054 GiB/s    24.97%
miniz_oxide[6]:    0.028 GiB/s    24.55%
miniz_oxide[7]:    0.020 GiB/s    24.41%
miniz_oxide[8]:    0.015 GiB/s    24.34%
miniz_oxide[9]:    0.012 GiB/s    24.30%

zlib-rs[0]:        5.492 GiB/s    100.02%
zlib-rs[1]:        0.305 GiB/s    36.79%
zlib-rs[2]:        0.165 GiB/s    25.78%
zlib-rs[3]:        0.140 GiB/s    25.15%
zlib-rs[4]:        0.115 GiB/s    24.77%
zlib-rs[5]:        0.103 GiB/s    24.58%
zlib-rs[6]:        0.074 GiB/s    24.35%
zlib-rs[7]:        0.046 GiB/s    24.05%
zlib-rs[8]:        0.024 GiB/s    23.92%
zlib-rs[9]:        0.015 GiB/s    24.12%
zlib[0]:           2.359 GiB/s    100.02%
zlib[1]:           0.167 GiB/s    26.99%
zlib[2]:           0.150 GiB/s    26.55%
zlib[3]:           0.098 GiB/s    25.97%
zlib[4]:           0.094 GiB/s    25.45%
zlib[5]:           0.060 GiB/s    24.97%
zlib[6]:           0.033 GiB/s    24.54%
zlib[7]:           0.024 GiB/s    24.41%
zlib[8]:           0.012 GiB/s    24.23%
zlib[9]:           0.007 GiB/s    24.16%

zlib-ng[0]:        6.785 GiB/s    100.02%
zlib-ng[1]:        0.416 GiB/s    36.79%
zlib-ng[2]:        0.217 GiB/s    25.78%
zlib-ng[3]:        0.188 GiB/s    25.15%
zlib-ng[4]:        0.146 GiB/s    24.77%
zlib-ng[5]:        0.129 GiB/s    24.58%
zlib-ng[6]:        0.087 GiB/s    24.35%
zlib-ng[7]:        0.050 GiB/s    24.05%
zlib-ng[8]:        0.026 GiB/s    23.92%
zlib-ng[9]:        0.016 GiB/s    24.12%

@Shnatsel (Contributor)

The initial results are very impressive! Kudos!

I wonder, what are the buffering requirements here? Maybe getting ahead of ourselves here, but: in theory, would it be possible to have an algorithm that makes two passes over a 128k buffer, first to estimate the symbol distribution and the second one to actually encode data? Would it make sense to have a mode that does those two passes for the first block only, to compress small images such as website favicons with a high ratio even if they don't fit the hardcoded distribution, and have the cost of the extra pass over a single block amortized over the runtime of a large image?

@Shnatsel (Contributor)

I'm happy to help with testing this, by the way. This looks like a great application for roundtrip fuzzing, and I can take care of writing the harness and running the fuzzer.

@kornelski (Contributor)

I'm curious whether keeping backreferences pixel-aligned helps.

@fintelia (Contributor, Author)

Did some testing using the geometric mean rather than a simple average. Now performance and compression ratio fall close to zlib-rs level 4 or 5. It still beats miniz_oxide and zlib:

fdeflate:          110.5 MiB/s    24.05%

miniz_oxide[0]     124.8 MiB/s    100.03%
miniz_oxide[1]     270.6 MiB/s    28.34%
miniz_oxide[2]     112.3 MiB/s    25.29%
miniz_oxide[3]      80.6 MiB/s    24.38%
miniz_oxide[4]      69.8 MiB/s    24.30%
miniz_oxide[5]      59.3 MiB/s    24.00%
miniz_oxide[6]      38.6 MiB/s    23.57%
miniz_oxide[7]      30.3 MiB/s    23.43%
miniz_oxide[8]      23.7 MiB/s    23.35%
miniz_oxide[9]      20.5 MiB/s    23.31%

zlib[0]           2417.0 MiB/s    100.03%
zlib[1]            153.2 MiB/s    26.97%
zlib[2]            142.3 MiB/s    26.45%
zlib[3]            106.6 MiB/s    25.78%
zlib[4]             88.6 MiB/s    24.68%
zlib[5]             65.8 MiB/s    24.17%
zlib[6]             43.9 MiB/s    23.63%
zlib[7]             35.4 MiB/s    23.48%
zlib[8]             20.0 MiB/s    23.20%
zlib[9]             12.5 MiB/s    22.99%

zlib-rs[0]        5591.9 MiB/s    100.03%
zlib-rs[1]         292.0 MiB/s    36.06%
zlib-rs[2]         162.3 MiB/s    25.29%
zlib-rs[3]         138.0 MiB/s    24.63%
zlib-rs[4]         114.9 MiB/s    24.16%
zlib-rs[5]         103.9 MiB/s    23.74%
zlib-rs[6]          81.0 MiB/s    23.40%
zlib-rs[7]          57.3 MiB/s    23.28%
zlib-rs[8]          34.4 MiB/s    23.03%
zlib-rs[9]          22.5 MiB/s    22.85%

@fintelia (Contributor, Author)

> I wonder, what are the buffering requirements here? Maybe getting ahead of ourselves here, but: in theory, would it be possible to have an algorithm that makes two passes over a 128k buffer, first to estimate the symbol distribution and the second one to actually encode data?

Yes, using the actual block contents to estimate the symbol distribution does (slightly) improve compression. You can actually make more than two passes: each additional pass gives you a more accurate symbol distribution, which in turn lets you make better choices about which back-references to use. This is what the "iterations" parameter in zopfli does.

The downside is that this significantly adds to compression time. Selecting which sequence of matches to use takes the majority of the time when compressing a block, so doing it twice cuts performance roughly in half.
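
Roughly, each extra pass looks like the sketch below (stub functions stand in for the real match selection and cost model; this is not the zopfli or PR code):

```rust
// Illustrative only: zopfli-style iteration over a single block. Each pass
// re-parses the block using costs derived from the previous pass's output.
type Costs = [f32; 286];
type Symbol = u16;

fn refine(block: &[u8], mut costs: Costs, passes: usize) -> Vec<Symbol> {
    let mut symbols = Vec::new();
    for _ in 0..passes {
        // Pick literals vs. back-references under the current cost estimates,
        symbols = parse_with_costs(block, &costs);
        // then rebuild the cost model from the symbols that were actually chosen.
        costs = costs_from_symbols(&symbols);
    }
    symbols
}

// Stubs standing in for the real match-selection and entropy-estimation steps.
fn parse_with_costs(_block: &[u8], _costs: &Costs) -> Vec<Symbol> { Vec::new() }
fn costs_from_symbols(_symbols: &[Symbol]) -> Costs { [8.0; 286] }
```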

> Would it make sense to have a mode that does those two passes for the first block only, to compress small images such as website favicons with a high ratio even if they don't fit the hardcoded distribution, and have the cost of the extra pass over a single block amortized over the runtime of a large image?

Tried this out and got roughly a twentieth of a percentage point improvement. This may be partly because the encoder currently does greedy parsing rather than optimal parsing, so it doesn't benefit as much from knowing symbol costs.

> I'm curious whether keeping backreferences pixel-aligned helps.

The data structures involved don't really allow you to query only for aligned matches; they just tell you the closest match of a given length, whether or not it happens to be aligned. You could try only doing look-ups at pixel boundaries, but then you'd miss cases like a 5-byte match consisting of the green and blue channels of one pixel along with all three channels of the next pixel.

Though I think a bigger problem is that the simple data structures I'm using are fast but miss way too many possible matches. Ideally you'd learn about a bunch of possible matches and then use a later step to select the best ones. Currently the encoder only knows about the most recent matching 4-byte sequence, but not a 7-byte match that came just before it.

(I think the other compressors mostly have the opposite issue. At a given compression level they examine the previous N matches that are 3+ bytes long to find the longest one. At low compression levels, they frequently only find junk matches that are 3-4 bytes long. At high compression levels, they find the good matches but spend a very long time looking at shorter ones.)

I may look into the Morphing Match Chain or other data structures that might give better results.

@kornelski (Contributor)

> You could try only doing look-ups at pixel boundaries, but then you'd miss cases like a 5-byte match consisting of the green and blue channels of one pixel along with all three channels of the next pixel.

My hypothesis is that long matches misaligned with channels are extremely rare, so you don't lose much compression, and you gain the ability to process 3-4 bytes at a time and end up with fewer unique back-reference distance+length pairs (IIRC that also helps).

Why rare? The misalignment would require the image to have colors shifted across the R/G/B channels, which would look like chromatic aberration. I don't think this happens often in real images. Even the visual effects of glitch art and lens aberrations are unlikely to shift channels that precisely. ClearType and anaglyph images will probably adjust for the different perceived brightness of green, so they won't repeat exactly either. Additionally, opaque pixels in RGBA have a 255 every four bytes, which is likely to "synchronize" matches a lot.

@kornelski (Contributor)

Actually, the above applies to filter=0; filtering may change that. That makes me wonder whether it would help to have a different compression strategy per filter type.

@fintelia (Contributor, Author)

Yeah, the data looks very different after filtering than it does beforehand; the filters do a very good job of reducing entropy. On the test corpus I've looked at, upwards of 75% of all bytes after filtering are 0, 1, or -1. The net result is that back-references aren't really "the dark blue pixel followed by the lighter blue pixel" but more like "two perfectly predicted channels, then the next two channels off by one".
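
For anyone who wants to check that figure on another corpus, something like the following over the post-filter, pre-deflate bytes is enough (illustrative snippet):

```rust
/// Fraction of filtered bytes equal to 0, +1, or -1 (255 as an unsigned byte).
fn near_zero_fraction(filtered: &[u8]) -> f64 {
    let near_zero = filtered
        .iter()
        .filter(|&&b| b == 0 || b == 1 || b == 255)
        .count();
    near_zero as f64 / filtered.len().max(1) as f64
}
```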
