
Fix repcode-related OSS-fuzz issues in block splitter #2560

Merged · 4 commits · Mar 31, 2021

Conversation

senhuang42 (Contributor) commented on Mar 25, 2021

OSS-Fuzz found an edge case with the block splitter: we shouldn't emit the last partition of a split block as uncompressed in any case, since we can't determine if the next 128K block has a repcode.

We already handle this for partitions within the block (i.e. don't emit a partition uncompressed if the next partition has a repcode within its first three sequences), but missed the case of a repcode in the first three seqs of the next "complete" 128K block. The fuzzer found this pretty quickly, so good job oss-fuzz :)

Cyan4973 (Contributor) commented on Mar 26, 2021

If I understand properly: when a block is split, some of its "partitions" (== sub-blocks) can be emitted uncompressed, likely as a post-processing decision, once it's clear that sending them compressed is unfavorable.

The problem is that if the next partition uses a repeat code within its first 3 offset codes, it probably refers to one of the cancelled offsets of the previous partition, now sent uncompressed, and therefore these offset codes are no longer valid.

To avoid that problem, the code forces the previous partition to be sent compressed, so that the offset history remains unaltered and the following repeat codes remain valid.

This is fine as long as all partitions are part of the same original block (128 KB), because that is a condition which can be controlled. However, for the last partition, the next partition belongs to the next big block, whose stats are not yet known. If that next block uses a repeat code, and the last partition was sent uncompressed, the repeat code now refers to a non-existent offset. To avoid this issue, the last partition is always forced to be sent compressed, so that the last 3 offsets are always well defined.
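The dependency described above can be made concrete with a small sketch of a 3-slot repeat-offset history. This is a hypothetical simplification (names and rules are illustrative, not zstd's actual `ll0`-aware implementation): repeat code `i` reuses `rep[i]` and moves it to the front, while a fresh offset shifts the whole history. If an already-emitted partition's sequences never reach the history (because the partition was re-emitted uncompressed), a following repeat code resolves against stale slots.

```c
#include <assert.h>

#define REP_NUM 3

/* Hypothetical sketch of zstd-style repeat-offset history (simplified). */
typedef struct { unsigned rep[REP_NUM]; } RepHistory;

/* A sequence carrying a fresh offset shifts the whole history. */
static void rep_push_offset(RepHistory* h, unsigned offset) {
    h->rep[2] = h->rep[1];
    h->rep[1] = h->rep[0];
    h->rep[0] = offset;
}

/* A repeat code reuses a slot and rotates it to the front. */
static unsigned rep_use_code(RepHistory* h, unsigned code) {
    unsigned offset = h->rep[code];
    if (code == 2) {
        h->rep[2] = h->rep[1];
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    } else if (code == 1) {
        h->rep[1] = h->rep[0];
        h->rep[0] = offset;
    } /* code == 0: history unchanged */
    return offset;
}
```

In this model, skipping the history updates of an uncompressed partition means every later `rep_use_code()` call reads slots that the decoder will not have.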

This inspires the following questions:

  • It seems that the offset history generated by the match finder on the current full block is sent verbatim to the next block, no matter what.
  • Another possibility could have been to send the offset history of the latest compressed partition. That way, the last partition(s) could be emitted uncompressed while still providing a valid offset history to the next block.
  • However, this would require knowing the offset history at the boundaries between partitions.
    • => I presume this is the issue: the offset history is not known at the boundary between partitions.
    • I also presume it could be rebuilt, when needed, since the offsets are known and stored in the seqStore. However, if this list contains repeat codes, it becomes more complex (the translation from rep-code to actual offset codes is not tracked within the seqStore).
  • Similarly, when a partition has a repeat code within its first 3 offset codes, it relies on the existence of a repeat offset history, which would be altered if the previous partition was sent uncompressed.
  • Another solution would be to translate these repeat codes into actual offset values, so that they no longer depend on the previous offset history.
    • => I presume the issue is that the translation from repeat code into actual offset values is not known (not stored in the seqStore).
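The "translate repeat codes into actual offsets" alternative raised above can be sketched as follows. This is a hypothetical model, not zstd's real seqStore layout or offset encoding: here `offCode <= 2` is a repeat code indexing the history, and `offCode > 2` encodes a literal offset of `offCode - 2`. Resolving every sequence this way removes the dependency on the previous partition's history, at the cost of needing a correct history to start from.

```c
#include <assert.h>

/* Hypothetical 3-slot history and simplified offCode encoding. */
typedef struct { unsigned rep[3]; } Reps;

/* Resolve one offCode to a literal offset, updating the history. */
static unsigned resolve_offcode(Reps* r, unsigned offCode) {
    unsigned offset;
    if (offCode <= 2) {               /* repeat code: look up history */
        offset = r->rep[offCode];
        /* rotate the reused offset to the front */
        if (offCode == 2) r->rep[2] = r->rep[1];
        if (offCode >= 1) { r->rep[1] = r->rep[0]; r->rep[0] = offset; }
    } else {                          /* literal offset: shift history */
        offset = offCode - 2;
        r->rep[2] = r->rep[1];
        r->rep[1] = r->rep[0];
        r->rep[0] = offset;
    }
    return offset;
}
```

As the thread notes, the hard part in practice is obtaining a valid `Reps` at an arbitrary split point, since the rep-code-to-offset translation is not stored in the seqStore.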

@senhuang42 senhuang42 changed the title Do not emit final partitions of split blocks as RLE/uncompressed Fix repcode-related OSS-fuzz issues in block splitter Mar 26, 2021
senhuang42 (Contributor, Author) commented on Mar 26, 2021

I also presume it could be rebuilt, when needed, since the offsets are known and stored in the seqStore. However, if this list contains repeat codes, it becomes more complex (the translation from rep-code to actual offset codes is not tracked within the seqStore).

True, this is something worth investigating, since it could improve compression ratio for these particular cases, and would be a more elegant solution than emitting a compressed block even when it's not advantageous to do so. I think I'll add a follow-up PR, since the recomputation doesn't seem trivial. And yeah, the crux of the issue is that we don't know the offset history at any given split point, which means we'd have to recompute it.

The other OSS-fuzz failure is also related to repcodes. I have been investigating for a bit, and realized that in general, these were not handled correctly to begin with. The block splitter PR had been using nextCBlock->rep as a sort of temporary storage for the next repcodes of the current 128K block, hence the ZSTD_memcpy() call to copy repcodes from prevCBlock to nextCBlock after each block. This works because we call ZSTD_confirmRepcodesAndEntropyTables() at each partition (though we also do unnecessary copies).

But this is wrong if we have a no-compress block, since then we do not call ZSTD_confirmRepcodesAndEntropyTables() but still do the ZSTD_memcpy(). The one-liner fix was to only call ZSTD_memcpy() on compressed blocks, but in general this is a bit janky, so I've gone ahead and refactored (and fixed) this behavior for clarity.

The two new commits do the following:

  • Only update repcodes when we should: at the last partition, or if we decide not to split and compress in one go.
    • Note that we should not update repcodes if we emit a set of partitions that exceeds 128K and re-compress, since we would already have updated the repcodes in the last partition.
    • This fixes the newer OSS-Fuzz bug.
  • Change the internal function signature of ZSTD_confirmRepcodesAndEntropyTables() to take a ZSTD_blockState_t* instead of an entire ZSTD_CCtx*, which is far too wide a scope for a function that is basically just doing a swap.
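The signature-narrowing idea in the second commit can be illustrated with a minimal sketch. The struct layout and field names below are illustrative, not zstd's actual definitions; the point is that a confirm step which only swaps prev/next block state needs the block state, not the whole compression context.

```c
#include <assert.h>

/* Hypothetical stand-ins for zstd's block-state structures. */
typedef struct { unsigned rep[3]; int entropyTables; } CompressedBlockState;

typedef struct {
    CompressedBlockState* prevCBlock;
    CompressedBlockState* nextCBlock;
} BlockState;

/* Formerly this kind of helper took the whole compression context;
 * scoping it to the block state it actually touches is clearer. */
static void blockState_confirmRepcodesAndEntropyTables(BlockState* bs) {
    CompressedBlockState* tmp = bs->prevCBlock;
    bs->prevCBlock = bs->nextCBlock;
    bs->nextCBlock = tmp;
}
```

Narrowing the parameter also documents, at the call site, that nothing else in the context is read or written by the confirm step.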

@senhuang42 force-pushed the blocksplit_fuzz_fix branch 2 times, most recently from 08dbdb7 to 6e34128 on March 29, 2021 at 21:30
senhuang42 (Contributor, Author) commented

Updated again with repcode history update/tracking as we compress each individual partition. This means we no longer need to be able to force-emit compressed blocks, which means it's no longer possible to emit a compressed block > 128K, so we no longer need the fallback either.
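The per-partition tracking described here, in the spirit of the ZSTD_seqStore_updateRepcodes() helper added by the PR, can be sketched as replaying each sequence of the just-compressed partition against the incoming history, so the next partition (or the next 128K block) starts from valid repcodes. The encoding below is a hypothetical simplification: `code <= 2` is a repeat code, anything larger encodes a literal offset of `code - 2`.

```c
#include <assert.h>

/* Replay a partition's offCodes over the 3-slot repeat history,
 * carrying the history forward to the next partition (sketch). */
static void update_reps_over_partition(unsigned rep[3],
                                       const unsigned* offCodes, int n) {
    for (int i = 0; i < n; ++i) {
        unsigned c = offCodes[i];
        unsigned offset;
        if (c <= 2) {                 /* repeat code: rotate to front */
            offset = rep[c];
            if (c == 2) rep[2] = rep[1];
            if (c >= 1) { rep[1] = rep[0]; rep[0] = offset; }
        } else {                      /* literal offset: shift history */
            offset = c - 2;
            rep[2] = rep[1];
            rep[1] = rep[0];
            rep[0] = offset;
        }
    }
}
```

With the history maintained at every partition boundary, there is no need to force any partition to be emitted compressed, which matches the comment above about dropping the fallback.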

@@ -2793,11 +2791,11 @@ static int ZSTD_maybeRLE(seqStore_t const* seqStore)
return nbSeqs < 4 && nbLits < 10;
}

-static void ZSTD_confirmRepcodesAndEntropyTables(ZSTD_CCtx* zc)
+static void ZSTD_blockState_confirmRepcodesAndEntropyTables(ZSTD_blockState_t* const bs)
Cyan4973 (Contributor) commented on Mar 31, 2021:

👍 Good call! More accurate object parameter == better clarity

@@ -3180,6 +3178,26 @@ static void ZSTD_deriveSeqStoreChunk(seqStore_t* resultSeqStore,
resultSeqStore->ofCode += startIdx;
}

/**
* ZSTD_seqStore_updateRepcodes(): Starting from an array of initial repcodes and a seqStore,
Review comment (Contributor):

👍 I like the way the logic is well encapsulated here

/* Error checking and repcodes update */
ZSTD_confirmRepcodesAndEntropyTables(zc);
if (isPartition) {
/* We manually update repcodes if we are currently compressing a partition. Otherwise,
Review comment (Contributor):

👍 Great comment, justifying the "why"

Cyan4973 (Contributor) left a review comment:

Great PR! The updated logic is way better

3 participants