Fix repcode-related OSS-fuzz issues in block splitter #2560
If I understand correctly, when a block is split, some of its "partitions" (== sub-blocks) can be emitted uncompressed, likely as a post-processing decision, once it's clear that sending them compressed is unfavorable.

The problem is that if the next partition uses a repeat code within its first 3 offset codes, it probably refers to one of the cancelled offsets of the previous partition, which is now sent uncompressed, so these offset codes are no longer valid. To avoid that problem, the code forces the previous partition to be sent compressed, so that the offset history remains unaltered and the following repeat codes remain valid.

This is fine as long as all partitions are part of the same original block (128 KB), because that condition can be controlled. However, if it's the last partition, the next one will be part of the next big block, whose statistics are not yet known. If that block uses a repeat code and the last partition was sent uncompressed, the repeat code now refers to a non-existent offset. To avoid this issue, the last partition is always forced to be sent compressed, so that the last 3 offsets are always well defined.

This raises the following questions:
True, this is something worth investigating, since it could improve compression ratio for these particular cases, and would be a more elegant solution than emitting a compressed block even when it is not advantageous to do so. I think I'll add a follow-up PR, since the recomputation doesn't seem trivial. And yeah, the crux of the issue is that we don't know the offset history at any given split point, which means we'd have to recompute it.

The other OSS-fuzz failure is also related to repcodes. I have been investigating for a bit, and realized that in general, these were not handled correctly to begin with. The block splitter PR had been using The two new commits do the following:
…ropyTables() for better function signature
08dbdb7 to 6e34128
Updated again: repcode history is now updated and tracked as we compress each individual partition. This means we no longer need to be able to force-emit compressed blocks, so it is no longer possible to emit a compressed block > 128K, and we no longer need the fallback either.
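For illustration, here is a minimal sketch of what tracking repcode history per partition can look like. It is not the code from this PR: `Seq`, `RepHistory`, `Partition`, `rep_apply`, `rep_replay` and `track_decoder_history` are hypothetical names, and the sequence layout is a simplified stand-in for zstd's internal seqStore. The update rules follow the repeat-offset rules of the Zstandard format, and the sketch assumes valid input.

```c
#include <stddef.h>
#include <stdint.h>

#define REP_NUM 3  /* the format keeps the three most recent offsets */

/* Hypothetical, simplified sequence (not zstd's internal seqDef). */
typedef struct {
    uint32_t litLength;    /* number of literals preceding the match */
    uint32_t offsetValue;  /* 1..3 = repeat code, > 3 = real offset + 3 */
} Seq;

typedef struct { uint32_t rep[REP_NUM]; } RepHistory;

/* Hypothetical partition descriptor: its sequences and how it was emitted. */
typedef struct {
    const Seq* seqs;
    size_t nbSeqs;
    int emittedCompressed;  /* 0 = emitted as a raw (uncompressed) block */
} Partition;

/* Resolve one sequence's offset against the history, then update the history. */
static uint32_t rep_apply(RepHistory* h, Seq s)
{
    uint32_t offset;
    uint32_t idx;
    if (s.offsetValue > REP_NUM) {           /* new offset: push to the front */
        offset = s.offsetValue - REP_NUM;
        h->rep[2] = h->rep[1];
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
        return offset;
    }
    /* Repeat code. When litLength == 0 the codes are shifted by one. */
    idx = s.offsetValue - 1 + (s.litLength == 0);    /* 0..3 */
    if (idx == 0) return h->rep[0];          /* most recent offset, no change */
    offset = (idx == 3) ? h->rep[0] - 1 : h->rep[idx];
    if (idx != 1) h->rep[2] = h->rep[1];
    h->rep[1] = h->rep[0];
    h->rep[0] = offset;
    return offset;
}

/* Replay all sequences of one partition on top of `h`. */
static void rep_replay(RepHistory* h, const Seq* seqs, size_t nbSeqs)
{
    size_t i;
    for (i = 0; i < nbSeqs; i++) (void)rep_apply(h, seqs[i]);
}

/* History the decoder holds after a run of partitions: only partitions
 * emitted as compressed blocks contribute, raw blocks leave it untouched. */
static RepHistory track_decoder_history(RepHistory start,
                                        const Partition* parts, size_t nbParts)
{
    size_t p;
    for (p = 0; p < nbParts; p++) {
        if (parts[p].emittedCompressed)
            rep_replay(&start, parts[p].seqs, parts[p].nbSeqs);
    }
    return start;
}
```

Starting from the history {1, 4, 8}, a partition holding a new offset of 100 followed by repeat code 2 leaves the history at {1, 100, 4} when it is emitted compressed, and at {1, 4, 8} when it is emitted raw. A repeat code at the start of the following partition then resolves differently in the two cases, which is why the history has to be tracked at each partition boundary.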
6e34128 to d1284ed
… compressed blocks
@@ -2793,11 +2791,11 @@ static int ZSTD_maybeRLE(seqStore_t const* seqStore)
    return nbSeqs < 4 && nbLits < 10;
}

static void ZSTD_confirmRepcodesAndEntropyTables(ZSTD_CCtx* zc)
static void ZSTD_blockState_confirmRepcodesAndEntropyTables(ZSTD_blockState_t* const bs)
👍
Good call! A more accurate object parameter == better clarity.
@@ -3180,6 +3178,26 @@ static void ZSTD_deriveSeqStoreChunk(seqStore_t* resultSeqStore,
    resultSeqStore->ofCode += startIdx;
}

/**
 * ZSTD_seqStore_updateRepcodes(): Starting from an array of initial repcodes and a seqStore,
👍
I like the way the logic is well encapsulated here
    /* Error checking and repcodes update */
    ZSTD_confirmRepcodesAndEntropyTables(zc);
    if (isPartition) {
        /* We manually update repcodes if we are currently compressing a partition. Otherwise,
👍
Great comment, justifying the "why"
Great PR! The updated logic is way better.
OSS-Fuzz found an edge case with the block splitter: we shouldn't emit the last partition of a split block as uncompressed in any case, since we can't determine if the next 128K block has a repcode.
We already handle this for the next partitions (i.e. don't emit uncompressed if the next partition has repcode within the first three entries), but missed the case of a repcode in the first three seqs of the next "complete" 128K block. The fuzzer found this pretty quickly, so good job oss-fuzz :)
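The rule described above can be sketched roughly as follows. This is only an illustration of the check, not the code in the PR: `Seq`, `starts_with_repcode` and `may_emit_uncompressed` are hypothetical names, and the sequence layout is a simplified stand-in for zstd's internal seqStore. Only the first three sequences are inspected because, once three new offsets have been seen, the whole offset history has been replaced by offsets defined inside that partition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified sequence (not zstd's internal seqDef). */
typedef struct {
    uint32_t litLength;    /* number of literals preceding the match */
    uint32_t offsetValue;  /* 1..3 = repeat code, > 3 = real offset + 3 */
} Seq;

/* Returns 1 if any of the first three sequences uses a repeat code,
 * i.e. may depend on offsets inherited from the previous block. */
static int starts_with_repcode(const Seq* seqs, size_t nbSeqs)
{
    size_t const limit = nbSeqs < 3 ? nbSeqs : 3;
    size_t i;
    for (i = 0; i < limit; i++) {
        if (seqs[i].offsetValue >= 1 && seqs[i].offsetValue <= 3) return 1;
    }
    return 0;
}

/* Decide whether the current partition may be emitted uncompressed.
 * For the last partition of a split block the next sequences belong to the
 * next 128K block and are not known yet, so the answer is always "no". */
static int may_emit_uncompressed(const Seq* nextSeqs, size_t nbNextSeqs,
                                 int isLastPartition)
{
    if (isLastPartition) return 0;
    return !starts_with_repcode(nextSeqs, nbNextSeqs);
}
```

The `isLastPartition` branch corresponds to the case that OSS-Fuzz found: the check on the next partition alone cannot cover sequences that have not been computed yet.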