[opt] Fix oss-fuzz bug in optimal parser #2980
Conversation
force-pushed from dba1487 to 0064b70
oss-fuzz uncovered a scenario where we're evaluating the cost of litLength = 131072, which can't be represented in the zstd format, so we accessed 1 beyond LL_bits. Fix the issue by making it cost 1 bit more than litLength = 131071.

There are still follow-ups:

1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals over a match. Should we bound literals_cost[literal] > 0, unless the block truly only has one literal value?
2. When no matches are found, the cost model isn't updated. In this case no matches were found for an entire block, so the literals cost model wasn't updated at all. That made the optimal parser think literals_cost[0] = 0, where it is actually quite high, since the block was entirely random noise.

Credit to OSS-Fuzz.
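To make the fix concrete, here is a minimal self-contained sketch of the idea. The names are illustrative stand-ins, not zstd's actual internals: `llCode()`/`llExtraBits()` play the role of `ZSTD_LLcode()`/`LL_bits[]`, and `BITCOST_ONE`/`MAX_LITLEN` are assumed constants.

```c
#include <stdint.h>

#define BITCOST_ONE  256      /* fixed-point scale: 1 bit == 256 (illustrative) */
#define MAX_LITLEN   131071   /* largest litLength representable in the zstd format */

/* Toy stand-ins for zstd's ZSTD_LLcode() / LL_bits[] lookup; the real
 * tables live in the library, these just keep the sketch self-contained. */
static unsigned highbit(uint32_t v) { unsigned n = 0; while (v >>= 1) n++; return n; }
static unsigned llCode(uint32_t litLength)
    { return litLength < 16 ? litLength : 12 + highbit(litLength); }
static unsigned llExtraBits(unsigned code) { return code < 16 ? 0 : code - 12; }

/* The fix, sketched: litLength == 131072 (MAX_LITLEN + 1) has no
 * literal-length code in the format, so looking up its extra bits would
 * read one past the end of the table. Instead, price it 1 bit more than
 * the largest representable length. */
static uint32_t litLengthPrice(uint32_t litLength)
{
    if (litLength == MAX_LITLEN + 1)
        return BITCOST_ONE + litLengthPrice(MAX_LITLEN);
    return llExtraBits(llCode(litLength)) * BITCOST_ONE;
}
```

The alternative discussed below (a sentinel entry at the end of the table) would avoid the extra branch, at the cost of coupling the table layout to this one caller.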
force-pushed from 0064b70 to 4d8a213
An alternative would be to add a dummy value to the end of the LL_bits table. This would be better for performance, but: as with everything, it's a matter of trade-offs, and I believe that in this case, clarity and reduced cross-dependency win the round.
We'd also need to add a value to […]. Plus, I measured the performance and don't see a difference. This branch should be 100% predictable, and cheap.
Regarding the follow-ups:
Indeed. Currently, the RLE block is an "after-parser" analysis; making RLE blocks part of the parsing logic would, in my opinion, belong to the flexible block boundaries logic, a topic that is expected to be investigated by @binhdvo. As for the case where the literals block consists of only a single byte value: this can happen, but in such cases the literals block tends to be extremely small, like 1 or 2 bytes, so the error compared to "1 bit per byte" is not significant. One could imagine a corner case where all literals have the same value and, because they do, they are almost "free", which would bias the parser toward emitting literals over matches. Consequently, a 1-bit minimum cost for literals seems an adequate rule.
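A minimal sketch of that 1-bit floor, with hypothetical names (`rawCost()` stands in for whatever the statistics-based estimate returns; it is not zstd's API):

```c
#include <stdint.h>

#define BITCOST_ONE 256   /* fixed-point scale: 1 bit == 256 (illustrative) */

/* Stand-in for the statistics-based literal cost; returns 0 here to
 * mimic the degenerate case where one byte value dominates the block. */
static uint32_t rawCost(uint8_t literal) { (void)literal; return 0; }

/* The rule discussed above: never let a literal's estimated cost drop
 * below 1 bit, so runs of identical bytes can't look "free" to the
 * optimal parser. */
static uint32_t literalPrice(uint8_t literal)
{
    uint32_t const cost = rawCost(literal);
    return cost < BITCOST_ONE ? BITCOST_ONE : cost;
}
```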
Well, this doesn't seem correct, indeed. It might be more appropriate to reset the literals statistics, as if it were the beginning of a new frame. I presume the main issue here is that sending a block in "uncompressed" mode is an "after-parsing" decision, not part of the optimal parser logic, which might be completely unaware of the decision.
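As a hedged sketch of that idea (all names hypothetical, not zstd internals): when a block is shipped uncompressed, the accumulated literal statistics no longer describe what was actually emitted, so drop back to the frame-start cost model:

```c
#include <stdint.h>
#include <string.h>

typedef enum { PRICE_PREDEF, PRICE_DYNAMIC } priceMode_e;

typedef struct {
    priceMode_e mode;
    uint32_t litFreq[256];   /* per-literal counters feeding the cost model */
} literalsStats_t;

/* Behave as if a new frame starts: discard the accumulated counts and
 * fall back to predefined prices until fresh statistics are gathered. */
static void onUncompressedBlock(literalsStats_t* stats)
{
    memset(stats->litFreq, 0, sizeof(stats->litFreq));
    stats->mode = PRICE_PREDEF;
}
```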
While that could be an interesting optimization, it isn't quite the problem. The root of the problem is this branch: line 1086 at commit 5f2c3d9.
This could add up to a reasonably large error for smaller files. E.g., imagine an 8KB file that is 4KB of noise followed by 4KB of compressible data: we wouldn't get to use any of the information about the 4KB of literals when making cost decisions for the next chunk. A solution would be to add a check in this branch for whether we've gone too long without updating our literals cost model (e.g. 1KB or […]).
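A sketch of that staleness check, with hypothetical names and an assumed 1KB threshold (the exact limit is left open above):

```c
#include <stddef.h>
#include <stdint.h>

#define STALE_LIMIT 1024   /* e.g. 1KB; the threshold is an assumption */

/* Inside the "no matches found" branch: if too many bytes were skipped
 * since the literals cost model last saw input, fold the skipped range
 * into the statistics before moving on, instead of ignoring it. */
static void onNoMatch(const uint8_t* ip, const uint8_t** lastUpdate,
                      void (*updateLiteralsStats)(const uint8_t* start, size_t len))
{
    size_t const skipped = (size_t)(ip - *lastUpdate);
    if (skipped >= STALE_LIMIT) {
        updateLiteralsStats(*lastUpdate, skipped);
        *lastUpdate = ip;
    }
}
```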
Yes, sounds like a good solution.
For information, I tried @terrelln's suggestion to update the literals cost model early, by registering all the literals in front of the first match in the series. The result can be observed in the following commit: […]. Unfortunately, […]. So this is a bit disappointing, and probably not worth a PR yet. (Note: this situation somehow reminds me of #2781; they might share some common woes.)