
[opt] Fix oss-fuzz bug in optimal parser #2980

Merged
1 commit merged into facebook:dev on Jan 7, 2022

Conversation

terrelln
Contributor

terrelln commented Jan 7, 2022

oss-fuzz uncovered a scenario where we're evaluating the cost of litLength = 131072,
which can't be represented in the zstd format, so we accessed 1 element beyond the end of LL_bits.

Fix the issue by making it cost 1 bit more than litLength = 131071.
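The shape of the fix can be sketched as follows. This is a simplified stand-in, not the actual zstd code: the table, the length-code mapping, and the function name are all toy versions, but the guard mirrors the approach described above.

```c
#include <assert.h>

/* Sketch of the fix (toy stand-ins, not the real zstd tables):
 * price litLength values the format cannot represent as 1 bit more
 * than the largest representable value, instead of indexing past
 * the end of the extra-bits table. */

#define MAX_LITLEN 131071u   /* largest litLength zstd can represent */

/* toy extra-bits table, indexed by a toy length code */
static const unsigned LL_bits_demo[4] = { 0, 1, 2, 16 };

static unsigned llCode_demo(unsigned litLength)
{
    if (litLength < 16)    return 0;
    if (litLength < 256)   return 1;
    if (litLength < 65536) return 2;
    return 3;  /* covers lengths up to MAX_LITLEN */
}

unsigned litLengthPriceBits(unsigned litLength)
{
    if (litLength > MAX_LITLEN)  /* unrepresentable: guard instead of overflowing */
        return litLengthPriceBits(MAX_LITLEN) + 1;
    return LL_bits_demo[llCode_demo(litLength)];
}
```

The extra branch only fires for the single unrepresentable value, so it is trivially predictable in practice.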

There are still follow ups:

1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals
   over a match. Should we bound literals_cost[literal] > 0, unless the block truly only
   has one literal value?
2. When no matches are found, the cost model isn't updated. In this case no matches were
   found for an entire block. So the literals cost model wasn't updated at all. That made
   the optimal parser think literals_cost[0] = 0, where it is actually quite high, since
   the block was entirely random noise.

Credit to OSS-Fuzz.

@felixhandte
Contributor

An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?
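For illustration, the dummy-value idea would look roughly like this (toy table and names, not the real LL_bits):

```c
#include <assert.h>

/* Sketch of the dummy-value alternative: pad the extra-bits table
 * with one sentinel entry so an out-of-range length code reads a
 * dummy value instead of running past the end, avoiding a branch
 * in the price function. Toy values, not the real zstd table. */

enum { MaxLLCode_demo = 3 };

static const unsigned LL_bits_padded[MaxLLCode_demo + 2] = {
    0, 1, 2, 16,  /* real codes 0..MaxLLCode_demo */
    17            /* sentinel for the unrepresentable overflow code */
};

unsigned priceNoBranch(unsigned llCode)
{
    /* the overflow code (MaxLLCode_demo + 1) safely hits the sentinel */
    return LL_bits_padded[llCode];
}
```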

@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

> An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?

This would be better for performance, but:

  • it's not a place where performance matters (at least not at this micro level)
  • it transfers a specific (and possibly temporary) problem from zstd_opt into the baseline definitions in zstd_internal
  • it could be interpreted as "we have a way to encode this length", while that's actually not the case: this length should not be possible.

As with everything, it's a matter of trade-offs, and I believe that in this case, clarity and reduced cross-dependency win the round.

@terrelln
Contributor Author

terrelln commented Jan 7, 2022

> An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?

We'd also need to add a value to optPtr->litLengthFreq. And LL_bits is used outside the optimal parser. And I'd rather get an ASAN error than have a too-long litLength slip through elsewhere. And for clarity, I don't want people to see LL_bits and think that 131072 is representable.

Plus, I measured the performance, and don't see a difference. This branch should be 100% predictable, and cheap.

terrelln merged commit 0677b26 into facebook:dev on Jan 7, 2022
@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

Regarding the follow-ups:

> 1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals
>    over a match. Should we bound literals_cost[literal] > 0, unless the block truly only
>    has one literal value?

Indeed,
considering that literals are compressed with Huffman,
we could enforce a minimum of 1 bit per literal.
This would avoid situations where the cost of literals becomes incredibly low.
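A minimal sketch of such a floor, with hypothetical names and a fixed-point cost representation assumed purely for illustration (the real zstd cost model differs):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed floor (hypothetical names, not the zstd
 * API): since literals are Huffman-compressed, no literal can
 * really cost less than 1 bit, so clamp the estimated per-literal
 * cost upward. Costs are in fixed point: 1 bit = 1 << BITCOST_SHIFT. */

enum { NUM_LITERALS = 256, BITCOST_SHIFT = 8 };

void applyLiteralCostFloor(unsigned literals_cost[NUM_LITERALS])
{
    const unsigned oneBit = 1u << BITCOST_SHIFT;
    for (size_t lit = 0; lit < NUM_LITERALS; lit++) {
        if (literals_cost[lit] < oneBit)
            literals_cost[lit] = oneBit;  /* never price a literal as free */
    }
}
```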

> unless the block truly only has one literal value?

Well, currently, RLE-block detection is an "after-parsing" analysis,
where a block with a suspiciously small compressed outcome is scanned to check whether it effectively consists of a single repeated byte value.

Making RLE blocks part of the parsing logic would, in my opinion, belong to the flexible block boundaries logic, a topic that is expected to be investigated by @binhdvo .

As for the case where the literals block consists of only a single byte value, well, this can happen, but in such cases the literals block tends to be extremely small, like 1 or 2 bytes, so the error compared to "1 bit per byte" is not significant.

One could imagine a corner case where all literals have the same value, and because they do, they are almost "free";
therefore the parsing logic could take advantage of this by intentionally shortening matches that start or end with the "free" literal symbol.
But this is a very fragile corner case, which is destroyed as soon as a single different byte value appears,
so "partial" parsers, which have to make decisions before reaching the end of the block, can't take such a risk.

Consequently, a 1-bit minimum cost for literals seems an adequate rule.

> When no matches are found, the cost model isn't updated.

Well, this doesn't seem correct, indeed.
After an incompressible block, it's dubious that re-employing starting statistics from the last previously compressible block brings any value to the following block. It sounds like it would generally be more misleading than helpful.

It might be more appropriate to reset literals statistics, as if it was the beginning of a new frame.

I presume the main issue here is that sending a block in "uncompressed" mode is an "after-parsing" decision, not part of the optimal parser logic, which might be completely unaware of the decision.

@terrelln
Contributor Author

terrelln commented Jan 7, 2022

> Well, this doesn't seem correct, indeed.
> After an incompressible block, it's dubious that re-employing starting statistics from the last previously compressible block brings any value to the following block. It sounds like it would generally be more misleading than helpful.
>
> It might be more appropriate to reset literals statistics, as if it was the beginning of a new frame.
>
> I presume the main issue here is that sending a block in "uncompressed" mode is an "after-parsing" decision, not part of the optimal parser logic, which might be completely unaware of the decision.

While that could be an interesting optimization, that isn't quite the problem.

The root of the problem is this branch:

    if (!nbMatches) { ip++; continue; }

  • Imagine this branch is taken for the first 128KB - 4KB of the block, so it is incompressible
  • Then the rest of the block has enough matches to get through the full 4KB parse window
  • The parse doesn't update statistics until AFTER the first time we save our sequences. So this 4KB parse window doesn't use any of the information about the first 124KB of the block which is entirely literals.

This could make up a reasonably large error for smaller files. E.g. imagine an 8KB file which is 4KB of noise and 4KB of compressible data. We wouldn't get to use any of the information about the 4KB of literals when making cost decisions for the next chunk.

A solution would be to add a check in this branch: if we've gone too long without updating our literals cost model (e.g. 1KB or ZSTD_OPT_NUM), proactively update it, since we're guaranteed to emit those positions as literals.
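The proposed check could be sketched roughly as follows. All names here (the state struct, `onNoMatch`, the threshold constant) are hypothetical, not the actual zstd parser code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed guard: in the no-match path, if too many
 * positions have gone by since the last cost-model refresh, update
 * the literals cost model proactively, since those positions are
 * guaranteed to be emitted as literals. */

#define MAX_GAP_WITHOUT_UPDATE 1024  /* e.g. 1 KB */

typedef struct {
    size_t lastUpdatePos;  /* position of the last cost-model refresh */
    size_t nbUpdates;      /* number of proactive refreshes (for demo) */
} LitCostModel;

/* Called from the `if (!nbMatches) { ip++; continue; }` path. */
void onNoMatch(LitCostModel* model, size_t curPos)
{
    if (curPos - model->lastUpdatePos >= MAX_GAP_WITHOUT_UPDATE) {
        /* here the real code would register the literals in
         * [lastUpdatePos, curPos) into the cost model */
        model->lastUpdatePos = curPos;
        model->nbUpdates++;
    }
}
```

With this rule, a long incompressible prefix still feeds the literals statistics before the first parse window that finds matches.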

@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

> A solution would be to add a check in this branch: if we've gone too long without updating our literals cost model (e.g. 1KB or ZSTD_OPT_NUM), proactively update it, since we're guaranteed to emit those positions as literals.

Yes, sounds like a good solution.

@Cyan4973
Contributor

Cyan4973 commented Jan 8, 2022

For information, I tried @terrelln's suggestion to update literals early in the cost model, by registering all literals in front of the first match in the series. The result can be observed in the following commit:
88e9a98

Unfortunately,
the resulting compressed frame is not necessarily "better".
Even though the cost estimator has, by definition, more and fresher information to make better choices, results are all over the place. There is essentially one big winner of this strategy (silesia/x-ray), but all the others are either neutral or detrimental, with the total average being slightly negative.

So this is a bit disappointing, and probably not worth a PR yet.

(note: this situation somehow reminds me of #2781 , they might share some common woes)
