
[opt] Fix oss-fuzz bug in optimal parser #2980

Merged
1 commit merged into facebook:dev on Jan 7, 2022

Conversation

terrelln
Contributor

terrelln commented Jan 7, 2022

oss-fuzz uncovered a scenario where we're evaluating the cost of litLength = 131072,
which can't be represented in the zstd format, so we accessed 1 element beyond the end of LL_bits.

Fix the issue by making it cost 1 bit more than litLength = 131071.
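The shape of the fix can be sketched as follows. This is a simplified stand-in, not the actual zstd code: the table, the length-code mapping, and the function name are all toy versions, but the guard mirrors the approach described above.

```c
#include <assert.h>

/* Sketch of the fix (toy stand-ins, not the real zstd tables):
 * price litLength values the format cannot represent as 1 bit more
 * than the largest representable value, instead of indexing past
 * the end of the extra-bits table. */

#define MAX_LITLEN 131071u   /* largest litLength zstd can represent */

/* toy extra-bits table, indexed by a toy length code */
static const unsigned LL_bits_demo[4] = { 0, 1, 2, 16 };

static unsigned llCode_demo(unsigned litLength)
{
    if (litLength < 16)    return 0;
    if (litLength < 256)   return 1;
    if (litLength < 65536) return 2;
    return 3;  /* covers lengths up to MAX_LITLEN */
}

unsigned litLengthPriceBits(unsigned litLength)
{
    if (litLength > MAX_LITLEN)  /* unrepresentable: guard instead of overflowing */
        return litLengthPriceBits(MAX_LITLEN) + 1;
    return LL_bits_demo[llCode_demo(litLength)];
}
```

The extra branch only fires for the single unrepresentable value, so it is trivially predictable in practice.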

There are still follow ups:

1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals
   over a match. Should we bound literals_cost[literal] > 0, unless the block truly only
   has one literal value?
2. When no matches are found, the cost model isn't updated. In this case no matches were
   found for an entire block. So the literals cost model wasn't updated at all. That made
   the optimal parser think literals_cost[0] = 0, where it is actually quite high, since
   the block was entirely random noise.

Credit to OSS-Fuzz.

@felixhandte
Contributor

An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?
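For illustration, the dummy-value idea would look roughly like this (toy table and names, not the real LL_bits):

```c
#include <assert.h>

/* Sketch of the dummy-value alternative: pad the extra-bits table
 * with one sentinel entry so an out-of-range length code reads a
 * dummy value instead of running past the end, avoiding a branch
 * in the price function. Toy values, not the real zstd table. */

enum { MaxLLCode_demo = 3 };

static const unsigned LL_bits_padded[MaxLLCode_demo + 2] = {
    0, 1, 2, 16,  /* real codes 0..MaxLLCode_demo */
    17            /* sentinel for the unrepresentable overflow code */
};

unsigned priceNoBranch(unsigned llCode)
{
    /* the overflow code (MaxLLCode_demo + 1) safely hits the sentinel */
    return LL_bits_padded[llCode];
}
```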

@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

> An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?

This would be better for performance, but:

  • it's not a place where performance matters (at least not at this micro level)
  • it transfers a specific (and possibly temporary) problem from zstd_opt into the baseline definitions in zstd_internal
  • it could be interpreted as "we have a way to encode this length", while that's actually not the case: this length should not be possible.

As with everything, it's a matter of trade-offs, and I believe that in this case, clarity and reduced cross-dependency win the round.

@terrelln
Contributor Author

terrelln commented Jan 7, 2022

> An alternative would be to add a dummy value to the end of LL_bits. That would avoid adding a branch. What do you think?

We'd also need to add a value to optPtr->litLengthFreq. And LL_bits is used outside the optimal parser. And I'd rather get an ASAN error than have a too-long litLength slip through elsewhere. And for clarity, I don't want people to see LL_bits and think that 131072 is representable.

Plus, I measured the performance, and don't see a difference. This branch should be 100% predictable, and cheap.

terrelln merged commit 0677b26 into facebook:dev on Jan 7, 2022
@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

Regarding the follow-ups:

> 1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals
>    over a match. Should we bound literals_cost[literal] > 0, unless the block truly only
>    has one literal value?

Indeed,
considering that literals are compressed with Huffman,
we could enforce a minimum of 1 bit per literal.
This would avoid situations where the cost of literals becomes incredibly low.
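A minimal sketch of such a floor, with hypothetical names and a fixed-point cost representation assumed purely for illustration (the real zstd cost model differs):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed floor (hypothetical names, not the zstd
 * API): since literals are Huffman-compressed, no literal can
 * really cost less than 1 bit, so clamp the estimated per-literal
 * cost upward. Costs are in fixed point: 1 bit = 1 << BITCOST_SHIFT. */

enum { NUM_LITERALS = 256, BITCOST_SHIFT = 8 };

void applyLiteralCostFloor(unsigned literals_cost[NUM_LITERALS])
{
    const unsigned oneBit = 1u << BITCOST_SHIFT;
    for (size_t lit = 0; lit < NUM_LITERALS; lit++) {
        if (literals_cost[lit] < oneBit)
            literals_cost[lit] = oneBit;  /* never price a literal as free */
    }
}
```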

> unless the block truly only has one literal value?

Well, currently, RLE-block detection is an "after-parsing" analysis,
where a block with a suspiciously small compressed outcome is scanned to check whether it effectively consists of a single repeated byte value.

Making RLE blocks part of the parsing logic would, in my opinion, belong to the flexible block boundaries logic, a topic that is expected to be investigated by @binhdvo .

As for the case where the literals block consists of only a single byte value, well, this can happen, but in such cases the literals block tends to be extremely small, like 1 or 2 bytes, so the error compared to "1 bit per byte" is not significant.

One could imagine a corner case where all literals have the same value, and because they do, they are almost "free";
therefore the parsing logic could take advantage of this by intentionally shortening matches that start or end with the "free" literal symbol.
But this is a very fragile corner case, which is destroyed as soon as a single different byte value appears,
so "partial" parsers, which have to make decisions before reaching the end of the block, can't take such a risk.

Consequently, a 1-bit minimum cost for literals seems an adequate rule.

> When no matches are found, the cost model isn't updated.

Well, this doesn't seem correct, indeed.
After an incompressible block, it's dubious that re-employing starting statistics from the last previously compressible block brings any value to the following block. It sounds like it would generally be more misleading than helpful.

It might be more appropriate to reset literals statistics, as if it was the beginning of a new frame.

I presume the main issue here is that sending a block in "uncompressed" mode is an "after-parsing" decision, not part of the optimal parser logic, which might be completely unaware of the decision.

@terrelln
Contributor Author

terrelln commented Jan 7, 2022

> Well, this doesn't seem correct, indeed.
> After an incompressible block, it's dubious that re-employing starting statistics from the last previously compressible block brings any value to the following block. It sounds like it would generally be more misleading than helpful.
>
> It might be more appropriate to reset literals statistics, as if it was the beginning of a new frame.
>
> I presume the main issue here is that sending a block in "uncompressed" mode is an "after-parsing" decision, not part of the optimal parser logic, which might be completely unaware of the decision.

While that could be an interesting optimization, that isn't quite the problem.

The root of the problem is this branch:

    if (!nbMatches) { ip++; continue; }

  • Imagine this branch is taken for the first 128KB - 4KB of the block, so it is incompressible
  • Then the rest of the block has enough matches to get through the full 4KB parse window
  • The parse doesn't update statistics until AFTER the first time we save our sequences. So this 4KB parse window doesn't use any of the information about the first 124KB of the block which is entirely literals.

This could make up a reasonably large error for smaller files. E.g. imagine an 8KB file which is 4KB of noise and 4KB of compressible data. We wouldn't get to use any of the information about the 4KB of literals when making cost decisions for the next chunk.

A solution would be to add a check in this branch: if we've gone too long without updating our literals cost model (e.g. 1KB or ZSTD_OPT_NUM), proactively update it, since we're guaranteed to emit those positions as literals.
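The proposed check could be sketched roughly as follows. All names here (the state struct, `onNoMatch`, the threshold constant) are hypothetical, not the actual zstd parser code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed guard: in the no-match path, if too many
 * positions have gone by since the last cost-model refresh, update
 * the literals cost model proactively, since those positions are
 * guaranteed to be emitted as literals. */

#define MAX_GAP_WITHOUT_UPDATE 1024  /* e.g. 1 KB */

typedef struct {
    size_t lastUpdatePos;  /* position of the last cost-model refresh */
    size_t nbUpdates;      /* number of proactive refreshes (for demo) */
} LitCostModel;

/* Called from the `if (!nbMatches) { ip++; continue; }` path. */
void onNoMatch(LitCostModel* model, size_t curPos)
{
    if (curPos - model->lastUpdatePos >= MAX_GAP_WITHOUT_UPDATE) {
        /* here the real code would register the literals in
         * [lastUpdatePos, curPos) into the cost model */
        model->lastUpdatePos = curPos;
        model->nbUpdates++;
    }
}
```

With this rule, a long incompressible prefix still feeds the literals statistics before the first parse window that finds matches.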

@Cyan4973
Contributor

Cyan4973 commented Jan 7, 2022

> A solution would be to add a check in this branch: if we've gone too long without updating our literals cost model (e.g. 1KB or ZSTD_OPT_NUM), proactively update it, since we're guaranteed to emit those positions as literals.

Yes, sounds like a good solution.

@Cyan4973
Contributor

Cyan4973 commented Jan 8, 2022

For information, I tried @terrelln's suggestion to update literals early in the cost model, by registering all literals in front of the first match in the series. The result can be observed in the following commit:
88e9a98

Unfortunately,
the resulting compressed frame is not necessarily "better".
Even though the cost estimator has, by definition, more and fresher information to make better choices, results are all over the place. There is essentially one big winner of this strategy (silesia/x-ray), but all the others are either neutral or detrimental, with the total average being slightly negative.

So this is a bit disappointing, and probably not worth a PR yet.

(note: this situation somehow reminds me of #2781 , they might share some common woes)
