
Proof of concept: Higher compression mode #29

Draft
fintelia wants to merge 6 commits into main

Conversation

@fintelia (Contributor) commented Sep 22, 2024

This is a proof of concept that replaces the existing encoder with a slower but higher-compression one (an actual implementation would likely expose both as separate modes). On some PNG data I experimented with, it was as fast as miniz_oxide's level=2 but achieved a better compression ratio than miniz_oxide's level=9.

The key insight is that PNG data has a very skewed distribution of symbols, which often makes short back-references worse than simply encoding the same data with literals. This encoder doesn't even attempt to find 3-byte back-references, and 4-7 byte back-references are only used if they are likely to require fewer bits than the equivalent sequence of literals. The encoder uses a hash table of 8-byte sequences and a separate one for 4-byte sequences. This means that with only two hash look-ups per byte, the encoder has a good chance of finding a match at most positions (thanks to the 4-byte table) but doesn't risk missing the 8+ byte matches that give the biggest compression wins.
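
As a rough illustration, the match-finding side could look something like the sketch below; the constants, hash functions, and names here are placeholders rather than the actual code in this branch:

```rust
// Illustrative sketch only: a "most recent position" table keyed by 8-byte
// hashes plus a second one keyed by 4-byte hashes. The caller verifies the
// candidate, measures the match length, and only keeps 4-7 byte matches when
// they are estimated to cost fewer bits than the equivalent literals.
const WINDOW_SIZE: usize = 1 << 15;
const HASH8_BITS: u32 = 16;
const HASH4_BITS: u32 = 16;

fn hash8(bytes: &[u8]) -> usize {
    let v = u64::from_le_bytes(bytes[..8].try_into().unwrap());
    (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> (64 - HASH8_BITS)) as usize
}

fn hash4(bytes: &[u8]) -> usize {
    let v = u32::from_le_bytes(bytes[..4].try_into().unwrap());
    (v.wrapping_mul(0x9E37_79B1) >> (32 - HASH4_BITS)) as usize
}

struct MatchFinder {
    head8: Vec<u32>, // last position of each hashed 8-byte sequence
    head4: Vec<u32>, // last position of each hashed 4-byte sequence
}

impl MatchFinder {
    fn new() -> Self {
        MatchFinder {
            head8: vec![u32::MAX; 1 << HASH8_BITS],
            head4: vec![u32::MAX; 1 << HASH4_BITS],
        }
    }

    /// Two hash look-ups per position: prefer a candidate from the 8-byte
    /// table (big wins), otherwise fall back to the 4-byte table.
    fn candidate(&mut self, data: &[u8], pos: usize) -> Option<usize> {
        if pos + 8 > data.len() {
            return None;
        }
        let (h8, h4) = (hash8(&data[pos..]), hash4(&data[pos..]));
        let (c8, c4) = (self.head8[h8], self.head4[h4]);
        self.head8[h8] = pos as u32;
        self.head4[h4] = pos as u32;
        // Empty slots hold u32::MAX, which also fails the `c < pos` check.
        [c8, c4]
            .into_iter()
            .map(|c| c as usize)
            .find(|&c| c < pos && pos - c <= WINDOW_SIZE)
    }
}
```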

To estimate the frequency of symbols (which determines whether literals or back-references are cheaper), the first 128 KB of data is encoded assuming the hard-coded symbol frequencies used by the existing ultra-fast mode. Each subsequent 128 KB block then uses the frequencies observed in the previous block.
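
In sketch form, the per-block handoff of frequencies works roughly like this (the symbol counts and the placeholder `encode_block` are illustrative, not the real structure):

```rust
// Illustrative sketch of the per-block frequency handoff described above.
// 286 is the number of deflate literal/length symbols; distance symbols would
// be handled the same way but are omitted here for brevity.
const BLOCK_SIZE: usize = 128 * 1024;

/// Convert symbol frequencies into approximate bit costs: cost ≈ -log2(p).
fn bit_costs(freqs: &[u32; 286]) -> [f32; 286] {
    let total = freqs.iter().sum::<u32>().max(1) as f32;
    let mut costs = [0.0f32; 286];
    for (cost, &f) in costs.iter_mut().zip(freqs) {
        *cost = (total / f.max(1) as f32).log2();
    }
    costs
}

fn compress(data: &[u8], hardcoded_freqs: &[u32; 286]) {
    // The first 128 KB is costed with the ultra-fast mode's fixed frequencies...
    let mut costs = bit_costs(hardcoded_freqs);
    for block in data.chunks(BLOCK_SIZE) {
        // ...and every later block is costed with the frequencies actually
        // observed while encoding the previous block.
        let observed = encode_block(block, &costs);
        costs = bit_costs(&observed);
    }
}

/// Placeholder for the real match-finding + Huffman-coding step; returns the
/// per-symbol counts of the block it just emitted.
fn encode_block(_block: &[u8], _costs: &[f32; 286]) -> [u32; 286] {
    [0; 286]
}
```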

Results on raw PNG IDAT data produced by re-encoding the QOI benchmark suite images...

Encoder[level]     Speed          Ratio
--------------     -----------    ------
fdeflate:          0.128 GiB/s    24.14%

miniz_oxide[0]:    0.123 GiB/s    100.02%
miniz_oxide[1]:    0.285 GiB/s    28.17%
miniz_oxide[2]:    0.128 GiB/s    26.12%
miniz_oxide[3]:    0.081 GiB/s    25.36%
miniz_oxide[4]:    0.069 GiB/s    25.24%
miniz_oxide[5]:    0.054 GiB/s    24.97%
miniz_oxide[6]:    0.028 GiB/s    24.55%
miniz_oxide[7]:    0.020 GiB/s    24.41%
miniz_oxide[8]:    0.015 GiB/s    24.34%
miniz_oxide[9]:    0.012 GiB/s    24.30%

zlib-rs[0]:        5.492 GiB/s    100.02%
zlib-rs[1]:        0.305 GiB/s    36.79%
zlib-rs[2]:        0.165 GiB/s    25.78%
zlib-rs[3]:        0.140 GiB/s    25.15%
zlib-rs[4]:        0.115 GiB/s    24.77%
zlib-rs[5]:        0.103 GiB/s    24.58%
zlib-rs[6]:        0.074 GiB/s    24.35%
zlib-rs[7]:        0.046 GiB/s    24.05%
zlib-rs[8]:        0.024 GiB/s    23.92%
zlib-rs[9]:        0.015 GiB/s    24.12%
zlib[0]:           2.359 GiB/s    100.02%
zlib[1]:           0.167 GiB/s    26.99%
zlib[2]:           0.150 GiB/s    26.55%
zlib[3]:           0.098 GiB/s    25.97%
zlib[4]:           0.094 GiB/s    25.45%
zlib[5]:           0.060 GiB/s    24.97%
zlib[6]:           0.033 GiB/s    24.54%
zlib[7]:           0.024 GiB/s    24.41%
zlib[8]:           0.012 GiB/s    24.23%
zlib[9]:           0.007 GiB/s    24.16%

zlib-ng[0]:        6.785 GiB/s    100.02%
zlib-ng[1]:        0.416 GiB/s    36.79%
zlib-ng[2]:        0.217 GiB/s    25.78%
zlib-ng[3]:        0.188 GiB/s    25.15%
zlib-ng[4]:        0.146 GiB/s    24.77%
zlib-ng[5]:        0.129 GiB/s    24.58%
zlib-ng[6]:        0.087 GiB/s    24.35%
zlib-ng[7]:        0.050 GiB/s    24.05%
zlib-ng[8]:        0.026 GiB/s    23.92%
zlib-ng[9]:        0.016 GiB/s    24.12%

@Shnatsel (Contributor)

The initial results are very impressive! Kudos!

I wonder, what are the buffering requirements here? Maybe getting ahead of ourselves here, but: in theory, would it be possible to have an algorithm that makes two passes over a 128k buffer, first to estimate the symbol distribution and the second one to actually encode data? Would it make sense to have a mode that does those two passes for the first block only, to compress small images such as website favicons with a high ratio even if they don't fit the hardcoded distribution, and have the cost of the extra pass over a single block amortized over the runtime of a large image?

@Shnatsel (Contributor)

I'm happy to help with testing this, by the way. This looks like a great application for roundtrip fuzzing, and I can take care of writing the harness and running the fuzzer.

@kornelski (Contributor)

I'm curious whether keeping backreferences pixel-aligned helps.

@fintelia (Contributor, Author)

Did some testing using the geometric mean rather than a simple average. Now performance and compression ratio fall close to zlib-rs level 4 or 5. It still beats miniz_oxide and zlib:

fdeflate:          110.5 MiB/s    24.05%

miniz_oxide[0]     124.8 MiB/s    100.03%
miniz_oxide[1]     270.6 MiB/s    28.34%
miniz_oxide[2]     112.3 MiB/s    25.29%
miniz_oxide[3]      80.6 MiB/s    24.38%
miniz_oxide[4]      69.8 MiB/s    24.30%
miniz_oxide[5]      59.3 MiB/s    24.00%
miniz_oxide[6]      38.6 MiB/s    23.57%
miniz_oxide[7]      30.3 MiB/s    23.43%
miniz_oxide[8]      23.7 MiB/s    23.35%
miniz_oxide[9]      20.5 MiB/s    23.31%

zlib[0]           2417.0 MiB/s    100.03%
zlib[1]            153.2 MiB/s    26.97%
zlib[2]            142.3 MiB/s    26.45%
zlib[3]            106.6 MiB/s    25.78%
zlib[4]             88.6 MiB/s    24.68%
zlib[5]             65.8 MiB/s    24.17%
zlib[6]             43.9 MiB/s    23.63%
zlib[7]             35.4 MiB/s    23.48%
zlib[8]             20.0 MiB/s    23.20%
zlib[9]             12.5 MiB/s    22.99%

zlib-rs[0]        5591.9 MiB/s    100.03%
zlib-rs[1]         292.0 MiB/s    36.06%
zlib-rs[2]         162.3 MiB/s    25.29%
zlib-rs[3]         138.0 MiB/s    24.63%
zlib-rs[4]         114.9 MiB/s    24.16%
zlib-rs[5]         103.9 MiB/s    23.74%
zlib-rs[6]          81.0 MiB/s    23.40%
zlib-rs[7]          57.3 MiB/s    23.28%
zlib-rs[8]          34.4 MiB/s    23.03%
zlib-rs[9]          22.5 MiB/s    22.85%

@fintelia (Contributor, Author)

> I wonder, what are the buffering requirements here? Maybe getting ahead of ourselves here, but: in theory, would it be possible to have an algorithm that makes two passes over a 128k buffer, first to estimate the symbol distribution and the second one to actually encode data?

Yes, using the actual block contents to estimate the symbol distribution does (slightly) improve compression. You can actually make more than two passes: each additional pass gives you a more accurate symbol distribution, which in turn lets you make better choices about which back-references to use. This is what the "iterations" parameter in zopfli does.

The downside is that this significantly adds to compression time. Selecting which sequence of matches to use takes the majority of the time when compressing a block, so doing it twice cuts performance roughly in half.
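
Roughly, each extra pass looks like the sketch below (stub functions stand in for the real match selection and cost model; this is not the zopfli or PR code):

```rust
// Illustrative only: zopfli-style iteration over a single block. Each pass
// re-parses the block using costs derived from the previous pass's output.
type Costs = [f32; 286];
type Symbol = u16;

fn refine(block: &[u8], mut costs: Costs, passes: usize) -> Vec<Symbol> {
    let mut symbols = Vec::new();
    for _ in 0..passes {
        // Pick literals vs. back-references under the current cost estimates,
        symbols = parse_with_costs(block, &costs);
        // then rebuild the cost model from the symbols that were actually chosen.
        costs = costs_from_symbols(&symbols);
    }
    symbols
}

// Stubs standing in for the real match-selection and entropy-estimation steps.
fn parse_with_costs(_block: &[u8], _costs: &Costs) -> Vec<Symbol> { Vec::new() }
fn costs_from_symbols(_symbols: &[Symbol]) -> Costs { [8.0; 286] }
```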

> Would it make sense to have a mode that does those two passes for the first block only, to compress small images such as website favicons with a high ratio even if they don't fit the hardcoded distribution, and have the cost of the extra pass over a single block amortized over the runtime of a large image?

Tried this out and got roughly a twentieth of a percentage point improvement. This may be partly because the encoder currently does greedy parsing rather than optimal parsing, so it doesn't benefit as much from knowing symbol costs.

> I'm curious whether keeping backreferences pixel-aligned helps.

The data structures involved don't really allow you to query only for aligned matches; they just tell you the closest match of a given length, whether or not it happens to be aligned. You could try only doing look-ups at pixel boundaries, but then you'd miss cases like a 5-byte match consisting of the green and blue channels of one pixel along with all three channels of the next pixel.

Though I think a bigger problem is that the simple data structures I'm using are fast but miss way too many possible matches. Ideally you'd learn about a bunch of possible matches and then use a later step to select the best ones. Currently the encoder only knows about the most recent matching 4-byte sequence, but not a 7-byte match that came just before it.

(I think the other compressors mostly have the opposite issue. At a given compression level they examine the previous N matches that are 3+ bytes long to find the longest one. At low compression levels, they frequently only find junk matches that are 3-4 bytes long. At high compression levels, they find the good matches but spend a very long time looking at shorter ones.)

I may look into the Morphing Match Chain or other data structures that might give better results.

@kornelski (Contributor)

> You could try only doing look-ups at pixel boundaries, but then you'd miss cases like a 5-byte match consisting of the green and blue channels of one pixel along with all three channels of the next pixel.

My hypothesis is that long matches misaligned with channels are extremely rare, so you don't lose much compression, and you gain the ability to process 3-4 bytes at a time and end up with fewer unique back-reference distance+length pairs (IIRC that also helps).

Why rare? The misalignment would require the image to have colors shifted across the R/G/B channels, which would look like chromatic aberration. I don't think this happens often in real images. Even the visual effects of glitch art and lens aberrations are unlikely to shift channels that precisely. ClearType and anaglyph images will probably adjust for the different perceived brightness of green, so they won't repeat exactly either. Additionally, opaque pixels in RGBA have a 255 every four bytes, which is likely to "synchronize" matches a lot.

@kornelski (Contributor)

Actually, the above applies to filter=0; filtering may change that. That makes me wonder whether it would help to have a different compression strategy per filter type.

@fintelia (Contributor, Author)

Yeah, the data looks very different after filtering than it does beforehand; the filters do a very good job of reducing entropy. On the test corpus I've looked at, upwards of 75% of all bytes after filtering are 0, 1, or -1. The net result is that back-references aren't really "the dark blue pixel followed by the lighter blue pixel" but more like "two perfectly predicted channels, then the next two channels off by one".
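
For anyone who wants to check that figure on another corpus, something like the following over the post-filter, pre-deflate bytes is enough (illustrative snippet):

```rust
/// Fraction of filtered bytes equal to 0, +1, or -1 (255 as an unsigned byte).
fn near_zero_fraction(filtered: &[u8]) -> f64 {
    let near_zero = filtered
        .iter()
        .filter(|&&b| b == 0 || b == 1 || b == 255)
        .count();
    near_zero as f64 / filtered.len().max(1) as f64
}
```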
