
Fix repcode-related OSS-fuzz issues in block splitter #2560

Merged · 4 commits · Mar 31, 2021

Conversation

senhuang42 (Contributor) commented on Mar 25, 2021

OSS-Fuzz found an edge case with the block splitter: we shouldn't emit the last partition of a split block as uncompressed in any case, since we can't determine if the next 128K block has a repcode.

We already handle this for partitions within the block (i.e. don't emit a partition uncompressed if the next partition has a repcode within its first three sequences), but missed the case of a repcode in the first three seqs of the next "complete" 128K block. The fuzzer found this pretty quickly, so good job oss-fuzz :)

Cyan4973 (Contributor) commented on Mar 26, 2021

If I understand properly: when a block is split, some of its "partitions" (== sub-blocks) can be emitted uncompressed, likely as a post-processing decision, once it's clear that sending them compressed is unfavorable.

The problem is that if the next partition uses a repeat code within its first 3 offset codes, it probably refers to one of the cancelled offsets of the previous partition, now sent uncompressed, and therefore these offset codes are no longer valid.

To avoid that problem, the code forces the previous partition to be sent compressed, so that the offset history remains unaltered and the following repeat codes remain valid.

This is fine as long as all partitions are part of the same original block (128 KB), because that is a condition which can be controlled. However, for the last partition, the next partition belongs to the next big block, whose stats are not yet known. If that next block uses a repeat code, and the last partition was sent uncompressed, the repeat code now refers to a non-existent offset. To avoid this issue, the last partition is always forced to be sent compressed, so that the last 3 offsets are always well defined.
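The dependency described above can be made concrete with a small sketch of a 3-slot repeat-offset history. This is a hypothetical simplification (names and rules are illustrative, not zstd's actual `ll0`-aware implementation): repeat code `i` reuses `rep[i]` and moves it to the front, while a fresh offset shifts the whole history. If an already-emitted partition's sequences never reach the history (because the partition was re-emitted uncompressed), a following repeat code resolves against stale slots.

```c
#include <assert.h>

#define REP_NUM 3

/* Hypothetical sketch of zstd-style repeat-offset history (simplified). */
typedef struct { unsigned rep[REP_NUM]; } RepHistory;

/* A sequence carrying a fresh offset shifts the whole history. */
static void rep_push_offset(RepHistory* h, unsigned offset) {
    h->rep[2] = h->rep[1];
    h->rep[1] = h->rep[0];
    h->rep[0] = offset;
}

/* A repeat code reuses a slot and rotates it to the front. */
static unsigned rep_use_code(RepHistory* h, unsigned code) {
    unsigned offset = h->rep[code];
    if (code == 2) {
        h->rep[2] = h->rep[1];
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    } else if (code == 1) {
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    } /* code == 0: history unchanged */
    return offset;
}
```

In this model, skipping the history updates of an uncompressed partition means every later `rep_use_code()` call reads slots that the decoder will not have.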

This inspires the following questions:

  • It seems that the offset history generated by the match finder on the current full block is sent verbatim to the next block, no matter what.
  • Another possibility could have been to send the offset history of the latest compressed partition. That way, the last partition(s) could be emitted uncompressed while still providing a valid offset history to the next block.
  • However, this would require knowing the offset history at the boundaries between partitions.
    • => I presume this is the issue: the offset history is not known at the boundary between partitions.
    • I also presume it could be rebuilt, when needed, since the offsets are known and stored in the seqStore. However, if this list contains repeat codes, it becomes more complex (the translation from rep-code to actual offset codes is not tracked within the seqStore).
  • Similarly, when a partition has a repeat code within its first 3 offset codes, it relies on the existence of a repeat offset history, which would be altered if the previous partition was sent uncompressed.
  • Another solution would be to translate these repeat codes into actual offset values, so that they no longer depend on the previous offset history.
    • => I presume the issue is that the translation from repeat code into actual offset values is not known (not stored in the seqStore).
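The "translate repeat codes into actual offsets" alternative raised above can be sketched as follows. This is a hypothetical model, not zstd's real seqStore layout or offset encoding: here `offCode <= 2` is a repeat code indexing the history, and `offCode > 2` encodes a literal offset of `offCode - 2`. Resolving every sequence this way removes the dependency on the previous partition's history, at the cost of needing a correct history to start from.

```c
#include <assert.h>

/* Hypothetical 3-slot history and simplified offCode encoding. */
typedef struct { unsigned rep[3]; } Reps;

/* Resolve one offCode to a literal offset, updating the history. */
static unsigned resolve_offcode(Reps* r, unsigned offCode) {
    unsigned offset;
    if (offCode <= 2) {               /* repeat code: look up history */
        offset = r->rep[offCode];
        /* rotate the reused offset to the front */
        if (offCode == 2) r->rep[2] = r->rep[1];
        if (offCode >= 1) { r->rep[1] = r->rep[0]; r->rep[0] = offset; }
    } else {                          /* literal offset: shift history */
        offset = offCode - 2;
        r->rep[2] = r->rep[1];
        r->rep[1] = r->rep[0];
        r->rep[0] = offset;
    }
    return offset;
}
```

As the thread notes, the hard part in practice is obtaining a valid `Reps` at an arbitrary split point, since the rep-code-to-offset translation is not stored in the seqStore.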

@senhuang42 senhuang42 changed the title Do not emit final partitions of split blocks as RLE/uncompressed Fix repcode-related OSS-fuzz issues in block splitter Mar 26, 2021
senhuang42 (Contributor, Author) commented on Mar 26, 2021

I also presume it could be rebuilt, when needed, since the offsets are known and stored in the seqStore. However, if this list contains repeat codes, it becomes more complex (the translation from rep-code to actual offset codes is not tracked within the seqStore).

True, this is something worth investigating, since it could improve compression ratio for these particular cases, and would be a more elegant solution than emitting a compressed block even when it's not advantageous to do so. I think I'll add a follow-up PR, since the recomputation doesn't seem trivial. And yeah, the crux of the issue is that we don't know the offset history at any given split point, which means we'd have to recompute it.

The other OSS-fuzz failure is also related to repcodes. I have been investigating for a bit, and realized that in general, these were not handled correctly to begin with. The block splitter PR had been using nextCBlock->rep as a sort of temporary storage for the next repcodes of the current 128K block, hence the ZSTD_memcpy() call to copy repcodes from prevCBlock to nextCBlock after each block. This works because we call ZSTD_confirmRepcodesAndEntropyTables() at each partition (though we also do unnecessary copies).

But this is wrong if we have a no-compress block, since then we do not call ZSTD_confirmRepcodesAndEntropyTables() but still do the ZSTD_memcpy(). The one-liner fix was to only call ZSTD_memcpy() on compressed blocks, but in general this is a bit janky, so I've gone ahead and refactored (and fixed) this behavior for clarity.

The two new commits do the following:

  • Only update repcodes when we should: at the last partition, or if we decide not to split and compress in one go.
    • Note that we should not update repcodes if we emit a set of partitions that exceeds 128K and re-compress, since we would already have updated the repcodes in the last partition.
    • This fixes the newer OSS-Fuzz bug.
  • Change the internal function signature of ZSTD_confirmRepcodesAndEntropyTables() to take a ZSTD_blockState_t* instead of an entire ZSTD_CCtx*, which is far too wide a scope for a function that is basically just doing a swap.
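The signature-narrowing idea in the second commit can be illustrated with a minimal sketch. The struct layout and field names below are illustrative, not zstd's actual definitions; the point is that a confirm step which only swaps prev/next block state needs the block state, not the whole compression context.

```c
#include <assert.h>

/* Hypothetical stand-ins for zstd's block-state structures. */
typedef struct { unsigned rep[3]; int entropyTables; } CompressedBlockState;

typedef struct {
    CompressedBlockState* prevCBlock;
    CompressedBlockState* nextCBlock;
} BlockState;

/* Formerly this kind of helper took the whole compression context;
 * scoping it to the block state it actually touches is clearer. */
static void blockState_confirmRepcodesAndEntropyTables(BlockState* bs) {
    CompressedBlockState* tmp = bs->prevCBlock;
    bs->prevCBlock = bs->nextCBlock;
    bs->nextCBlock = tmp;
}
```

Narrowing the parameter also documents, at the call site, that nothing else in the context is read or written by the confirm step.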

@senhuang42 force-pushed the blocksplit_fuzz_fix branch 2 times, most recently from 08dbdb7 to 6e34128 on March 29, 2021 at 21:30
senhuang42 (Contributor, Author) commented

Updated again with repcode history update/tracking as we compress each individual partition. This means we no longer need to be able to force-emit compressed blocks, which means it's no longer possible to emit a compressed block > 128K, so we no longer need the fallback either.
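The per-partition tracking described here, in the spirit of the ZSTD_seqStore_updateRepcodes() helper added by the PR, can be sketched as replaying each sequence of the just-compressed partition against the incoming history, so the next partition (or the next 128K block) starts from valid repcodes. The encoding below is a hypothetical simplification: `code <= 2` is a repeat code, anything larger encodes a literal offset of `code - 2`.

```c
#include <assert.h>

/* Replay a partition's offCodes over the 3-slot repeat history,
 * carrying the history forward to the next partition (sketch). */
static void update_reps_over_partition(unsigned rep[3],
                                       const unsigned* offCodes, int n) {
    for (int i = 0; i < n; ++i) {
        unsigned c = offCodes[i];
        unsigned offset;
        if (c <= 2) {                 /* repeat code: rotate to front */
            offset = rep[c];
            if (c == 2) rep[2] = rep[1];
            if (c >= 1) { rep[1] = rep[0]; rep[0] = offset; }
        } else {                      /* literal offset: shift history */
            offset = c - 2;
            rep[2] = rep[1];
            rep[1] = rep[0];
            rep[0] = offset;
        }
    }
}
```

With the history maintained at every partition boundary, there is no need to force any partition to be emitted compressed, which matches the comment above about dropping the fallback.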

@@ -2793,11 +2791,11 @@ static int ZSTD_maybeRLE(seqStore_t const* seqStore)
return nbSeqs < 4 && nbLits < 10;
}

-static void ZSTD_confirmRepcodesAndEntropyTables(ZSTD_CCtx* zc)
+static void ZSTD_blockState_confirmRepcodesAndEntropyTables(ZSTD_blockState_t* const bs)
Cyan4973 (Contributor) commented on Mar 31, 2021:

👍 Good call! More accurate object parameter == better clarity

@@ -3180,6 +3178,26 @@ static void ZSTD_deriveSeqStoreChunk(seqStore_t* resultSeqStore,
resultSeqStore->ofCode += startIdx;
}

/**
* ZSTD_seqStore_updateRepcodes(): Starting from an array of initial repcodes and a seqStore,
Review comment (Contributor):

👍 I like the way the logic is well encapsulated here

/* Error checking and repcodes update */
ZSTD_confirmRepcodesAndEntropyTables(zc);
if (isPartition) {
/* We manually update repcodes if we are currently compressing a partition. Otherwise,
Review comment (Contributor):

👍 Great comment, justifying the "why"

Cyan4973 (Contributor) left a review comment:

Great PR! The updated logic is way better

3 participants