Skip most long matches in lazy hash table update #2755

Merged — 4 commits merged into facebook:dev on Sep 29, 2021

Conversation

@senhuang42 (Contributor) commented Aug 26, 2021

Origin:
Brought up in #2662. The fix was suggested by @terrelln in #2662 (comment).

Overview:
With this change, the lazy matchfinder only ever updates up to 256 positions in the hash table at a time, which improves performance on long matches. When positions are skipped, we need to stop and re-fill everything in the hash cache.
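In rough outline, the change looks like the following inside the row-based table update (a simplified sketch using the identifiers from the diff excerpts below, not the exact final code):

    /* `idx` is the next position to insert; `target` is the current position. */
    if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
        /* A long stretch (e.g. a long match) was skipped over: only insert
         * the last kMaxPositionsToUpdate positions of it... */
        idx = target - kMaxPositionsToUpdate;
        /* ...and re-fill the hash cache, whose entries no longer correspond
         * to the positions about to be processed. */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }
    /* the normal per-position update loop then runs from idx up to target */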

Current status:
I've been experimenting with different values for the skip threshold, and 256 seems reasonable. Seemingly random changes to the code can cause noticeable and consistent shifts in performance, so it's hard to say whether this change is positive or negative for the average case. We care most about level 6, since that's what the warehouse-type use cases should be using starting with 1.5.1.

Most importantly, we should understand why the performance changes in these various ways; I don't have a good explanation yet. For example, I don't think the UNLIKELY annotation should matter much, yet on level 5 it dramatically increases speed.

skip_long_matches_lazy:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  115.2 MB/s, 1007.7 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   93.0 MB/s, 1001.1 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   71.8 MB/s, 1061.9 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   89.1 MB/s,  845.7 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   68.8 MB/s,  845.8 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   53.9 MB/s,  886.0 MB/s 

// out.txt is the edge case mentioned in issue #2662
 5#out.txt           : 256 KiB -> 4.46 KiB (57.46),  790.3 MB/s, 7064.6 MB/s 
 6#out.txt           : 256 KiB -> 4.89 KiB (52.40),  606.9 MB/s, 3075.7 MB/s 
 7#out.txt           : 256 KiB -> 3.54 KiB (72.36),  582.4 MB/s, 7072.2 MB/s


dev:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.393),  113.8 MB/s, 1011.5 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.449),   94.6 MB/s, 1003.0 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.506),   70.4 MB/s, 1063.9 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   88.0 MB/s,  834.6 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.032),   70.5 MB/s,  838.3 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   52.8 MB/s,  866.1 MB/s

 5#out.txt           : 256 KiB -> 4.74 KiB (53.96),  360.7 MB/s, 7084.3 MB/s 
 6#out.txt           : 256 KiB -> 5.05 KiB (50.68),  336.8 MB/s, 3051.4 MB/s 
 7#out.txt           : 256 KiB -> 3.85 KiB (66.53),  344.7 MB/s, 7045.6 MB/s

Some variations:

skip, without UNLIKELY annotation:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  111.7 MB/s, 1010.8 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   94.4 MB/s, 1002.7 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   71.7 MB/s, 1063.4 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   85.4 MB/s,  848.2 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   70.0 MB/s,  846.2 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   54.0 MB/s,  890.2 MB/s

skip, without UNLIKELY annotation, with "p2align 5" before the update loop:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  112.1 MB/s, 1010.7 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   93.1 MB/s, 1002.8 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   72.5 MB/s, 1063.4 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   86.6 MB/s,  848.1 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   69.3 MB/s,  848.9 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   54.3 MB/s,  894.8 MB/s

// Disabling prefetching in ZSTD_row_fillHashCache() seems to be neutral/negative

@senhuang42 force-pushed the skip_long_matches_lazy branch 2 times, most recently from d21fabb to 0f222a8 on August 26, 2021 16:22
@senhuang42 marked this pull request as draft on August 26, 2021 16:35
    if (useCache) {
        /* Only skip positions when using hash cache, i.e.
         * if we are loading a dict, don't skip anything */
        if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
@Cyan4973 (Contributor) commented on Aug 26, 2021:

The UNLIKELY() statement looks appropriate to me.
We don't expect many large matches,
and when there is one, the savings from not updating each and every position should outweigh the slight additional cost of the unpredicted branch.
For all other cases (normal matches), this should put this branch out of the way.
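(For reference, UNLIKELY() is a branch-prediction hint. A typical definition, roughly along the lines of what zstd's lib/common/compiler.h provides for GCC/Clang, looks like the following; treat the exact form as an illustration rather than a quote of the source.)

    #if defined(__GNUC__)
    #  define LIKELY(x)   (__builtin_expect((x), 1))
    #  define UNLIKELY(x) (__builtin_expect((x), 0))
    #else
    #  define LIKELY(x)   (x)
    #  define UNLIKELY(x) (x)
    #endif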

        /* Only skip positions when using hash cache, i.e.
         * if we are loading a dict, don't skip anything */
        if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
            idx = target - kMaxPositionsToUpdate;
            ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
@Cyan4973 (Contributor) commented on Aug 26, 2021:

My understanding of this line is limited.
It looks to me like it is both caching hash results and prefetching match positions from the corresponding rows.
However:

  • These rows are, at this stage, not yet updated. I presume that's the purpose of ZSTD_row_update_internal(), and there are many positions still to add (kMaxPositionsToUpdate at this point). So is the prefetching helpful? Are these lines still in L1 cache after all these updates?
  • Why is there no equivalent need for smaller matches, whose length is < kMaxPositionsToUpdate?
  • (More generally: what's the saving from storing the hash values in the hashCache?)

Depending on the level of complexity, we may need to take this discussion offline.

Contributor reply:

The hash cache holds the hashes of the next 8 positions. It is used to prefetch the hash table rows a few positions in advance. ZSTD_row_update_internal() requires the hash cache to be correctly filled: it consumes hashCache[(ip - base) & 7] to get the hash of ip - base, and re-fills that slot with the hash of (ip + 8 - base).

Normally we process every position, so we only need to fill the cache at the beginning of the block. But now that we are skipping positions, we need to re-fill it whenever we skip.

(More generally: what's the saving from storing the hash values in the hashCache?)

We need to compute the hash ahead of time in order to prefetch. We've measured both re-computing the hash when we need it and keeping it cached in the hash cache; the hash cache out-performed re-computing. Now that we are skipping positions, that calculus changes a little, but skips are still rare, so I wouldn't expect a big difference.
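A minimal sketch of that consume/re-fill pattern is below; computeHash and prefetchRow are hypothetical stand-ins for the zstd internals, and the real logic lives in ZSTD_row_update_internal() / ZSTD_row_fillHashCache().

    #include <stdint.h>

    /* Hypothetical stand-ins for the zstd internals discussed above. */
    uint32_t computeHash(const unsigned char* ptr);  /* hash of the bytes at ptr */
    void     prefetchRow(uint32_t hash);             /* prefetch the row selected by hash */

    enum { kCacheSize = 8 };  /* the cache holds the hashes of the next 8 positions */

    /* Consume the cached hash of position `pos`, then re-fill that slot with the
     * hash of `pos + 8` and prefetch its row, so the row is warm by the time the
     * update loop reaches that position. */
    static uint32_t cacheConsumeAndRefill(uint32_t* hashCache,
                                          const unsigned char* base, uint32_t pos)
    {
        uint32_t const hash     = hashCache[pos & (kCacheSize - 1)];     /* hash of pos */
        uint32_t const nextHash = computeHash(base + pos + kCacheSize);  /* hash of pos + 8 */
        hashCache[pos & (kCacheSize - 1)] = nextHash;
        prefetchRow(nextHash);
        return hash;
    }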

Contributor reply:

Yeah, this makes sense.
I now realize that I have confused idx with target in my initial reading.
Prefetching the next 8 positions starting from idx is still useful, since they are going to be used in the loop just after.

@senhuang42 marked this pull request as ready for review on August 31, 2021 17:55
@senhuang42 (Contributor, Author):
Let me know if there are any other comments on this. I'll resolve the merge conflicts once this PR is accepted (since some other PRs also affect results.csv).

@Cyan4973 (Contributor) commented Sep 22, 2021

If I understand the algorithm correctly, whenever a large section of input (larger than the threshold) is skipped, normally as a consequence of a long match (though I guess a combination with LDM would trigger the same code path?), you only insert the last threshold bytes of the skipped section.

Have you considered some kind of "split" strategy, like updating the first 128 bytes and the last 128 bytes of the section?
Or any variation of this scheme (64-192, 64-64, etc.)?

Some clues about this proposal:

  • Earlier analysis of the correlation between match positions shows that they tend to cluster close to the beginning or the end of existing matches. That means that the positions right after the selected match are relatively good candidates for future matches.
  • I noticed that this PR generally has a negative consequence on compression ratio on almost all files. While the loss is small, it still feels surprisingly high; I would have expected a smaller difference.
  • 256 bytes should be overkill. Useful ranges of bytes tend to be closer to the borders of matches.

This, to me, suggests that useful byte positions (those right after the beginning of the match) tend to be missing from the table after this rule is applied.
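For concreteness, the proposed split could look roughly like this (a sketch only; kSkipThreshold, kFirst and kLast are illustrative names and values, not code from this PR):

    if (UNLIKELY(target - idx > kSkipThreshold)) {
        U32 const headEnd = idx + kFirst;   /* e.g. first 128 positions of the skipped section */
        for (; idx < headEnd; idx++) {
            /* insert position idx as in the normal update loop */
        }
        idx = target - kLast;               /* then jump to the e.g. last 128 positions */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }
    /* normal update loop handles [idx, target) */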

@senhuang42 (Contributor, Author) commented Sep 23, 2021

That means that the positions right after the selected match are relatively good candidates for future matches.

Yeah, it seems like the following benchmarks confirm this.

I've reduced the threshold to 128 bytes and tried a few strategies: first 128 positions, last 128, and various splits of 128 between first/last. It seems like updating mostly the first positions, plus some of the last positions of the long match, helps. But the degraded performance on level 6 is worrisome (since it replaces level 7, which is an internally important level).

dev (no skips):
 5#silesia.tar       : 211957760 B -> 62473471 B (3.393),  112.3 MB/s, 1010.6 MB/s 
 6#silesia.tar       : 211957760 B -> 61461316 B (3.449),   95.1 MB/s, 1001.9 MB/s 
 7#silesia.tar       : 211957760 B -> 60459424 B (3.506),   71.9 MB/s, 1062.7 MB/s

Last 128 bytes (original method):
 5#silesia.tar       : 211957760 B -> 62494173 B (3.392),  116.7 MB/s, 1010.5 MB/s 
 6#silesia.tar       : 211957760 B -> 61490724 B (3.447),   93.2 MB/s, 1001.3 MB/s 
 7#silesia.tar       : 211957760 B -> 60467579 B (3.505),   72.4 MB/s, 1061.6 MB/s 

First 128 bytes:
 5#silesia.tar       : 211957760 B -> 62487359 B (3.392),  112.5 MB/s, 1010.1 MB/s 
 6#silesia.tar       : 211957760 B -> 61484383 B (3.447),   93.5 MB/s, 1000.5 MB/s 
 7#silesia.tar       : 211957760 B -> 60463389 B (3.506),   70.0 MB/s, 1061.3 MB/s 

64 first / 64 last:
 5#silesia.tar       : 211957760 B -> 62482045 B (3.392),  111.8 MB/s, 1010.6 MB/s 
 6#silesia.tar       : 211957760 B -> 61476578 B (3.448),   92.1 MB/s, 1001.4 MB/s 
 7#silesia.tar       : 211957760 B -> 60459395 B (3.506),   71.6 MB/s, 1063.2 MB/s 

96 first / 32 last:
 5#silesia.tar       : 211957760 B -> 62478591 B (3.392),  112.3 MB/s, 1010.5 MB/s 
 6#silesia.tar       : 211957760 B -> 61472760 B (3.448),   91.6 MB/s, 1000.2 MB/s 
 7#silesia.tar       : 211957760 B -> 60458728 B (3.506),   72.3 MB/s, 1061.4 MB/s

32 first / 96 last:
 5#silesia.tar       : 211957760 B -> 62484231 B (3.392),  111.7 MB/s, 1006.7 MB/s 
 6#silesia.tar       : 211957760 B -> 61478567 B (3.448),   91.0 MB/s, 1002.0 MB/s 
 7#silesia.tar       : 211957760 B -> 60461927 B (3.506),   72.4 MB/s, 1063.4 MB/s

@Cyan4973 (Contributor) commented Sep 23, 2021

Another strategy (which used to be employed in btlazy2) is to make the threshold that triggers a "skip" event different from the number of positions actually updated.

So, as an example, one can decide to only update 128 positions, but trigger this scenario only when the distance is larger than 256.

This makes it possible to preserve the "normal" loop more often.
It may have a (hopefully positive) impact on performance.
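In code, decoupling the two constants would look roughly like this (names and example values are illustrative, not code from this PR):

    /* Trigger the skip only for gaps larger than kSkipTrigger, but when it
     * triggers, still insert the last kPositionsToUpdate positions. */
    if (UNLIKELY(target - idx > kSkipTrigger)) {          /* e.g. 256 */
        idx = target - kPositionsToUpdate;                /* e.g. 128 */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }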

@senhuang42 (Contributor, Author) commented Sep 28, 2021

With a threshold of 384 bytes to trigger the skip, and 96 bytes updated at the beginning of the match plus 32 bytes at the end, we have the following results:

The rightmost column includes a test where we add __asm__(".p2align 5") prior to the hot update loop. It doesn't seem to help except on gcc-11. Generally speaking, compiler and alignment effects have a pretty big impact on speed at levels 5/6; this PR definitely perturbs that a bit, but overall seems mostly neutral.
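(For reference, the alignment experiment amounts to emitting an assembler alignment directive immediately before the loop, roughly as below; ".p2align 5" aligns the next emitted instruction to a 2^5 = 32-byte boundary. This is GCC/Clang inline asm shown only to illustrate the experiment referenced in the table; whether the directive ends up exactly at the loop head depends on code generation.)

    __asm__(".p2align 5");        /* align the hot update loop to 32 bytes */
    for (; idx < target; idx++) {
        /* hot hash-table update loop */
    }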

level 5, silesia.tar, MB/s

            dev     skip    skip (with alignment to 32 bytes)
gcc-11      114.3   110.9   114.3
gcc-10      113.9   114.7   111.7
gcc-9       111.8   113.1   114.0
gcc-8       111.8   109.0   109.6
clang-12    113.3   113.8   113.1

level 6, silesia.tar, MB/s

            dev     skip    skip (with alignment to 32 bytes)
gcc-11      94.8    91.9    93.1
gcc-10      92.5    94.2    92.0
gcc-9       89.7    89.2    90.2
gcc-8       88.9    93.2    92.8
clang-12    91.2    91.3    90.5

And of course, on data that's basically just long matches we still see the big speed improvement:

dev:
 5#out.txt           :    262144 ->      4858 (53.96),  366.9 MB/s, 7015.4 MB/s 
 6#out.txt           :    262144 ->      5173 (50.68),  346.0 MB/s, 3068.1 MB/s 
 7#out.txt           :    262144 ->      3940 (66.53),  339.7 MB/s, 7061.5 MB/s

skip:
 5#out.txt           :    262144 ->      5101 (51.39),  883.2 MB/s, 5300.4 MB/s 
 6#out.txt           :    262144 ->      5008 (52.35),  646.7 MB/s, 3089.1 MB/s 
 7#out.txt           :    262144 ->      3765 (69.63),  621.4 MB/s, 5908.2 MB/s

@senhuang42 (Contributor, Author) commented Sep 29, 2021

Compression ratio comparison, since the old one is no longer valid due to the recent increase in skip threshold:

dev:
 5#silesia.tar       : 211950592 ->  62458865 (3.393)
 6#silesia.tar       : 211950592 ->  61446239 (3.449)
 7#silesia.tar       : 211950592 ->  60445279 (3.506)

 5#enwik7            :  10000000 ->   3363547 (2.973)
 6#enwik7            :  10000000 ->   3298642 (3.032)
 7#enwik7            :  10000000 ->   3218460 (3.107)

skip:
 5#silesia.tar       : 211950592 ->  62461337 (3.393)
 6#silesia.tar       : 211950592 ->  61452566 (3.449)
 7#silesia.tar       : 211950592 ->  60446110 (3.506)

 5#enwik7            :  10000000 ->   3363689 (2.973)
 6#enwik7            :  10000000 ->   3298903 (3.031)
 7#enwik7            :  10000000 ->   3218417 (3.107)

The regression in compressed size is quite small now. It's curious, though, that the regression test is showing tiny improvements to compression ratio instead.

@Cyan4973 (Contributor) commented Sep 29, 2021

it's curious that the regression test is showing tiny improvements to compression ratio instead.

This method is skipping "unpromising" positions in the middle of long matches, primarily for the sake of speed.
As a consequence, these positions no longer occupy space in the fixed-size rows.
That space can be used instead by other positions, which might end up being slightly more "promising" (although their efficiency is harmed by the increased distance). This is how you could end up, in some circumstances, with a (slightly) better compression ratio: the rows now contain slightly more "promising" match candidates.

This effect could be improved further by:

  • only skipping positions when the row is full, and continuing to fill it when it is not
  • skipping positions in the middle of long literal sections (this is partially achieved already thanks to the increased sampling distance, but only partially).

However, both propositions above introduce complexity right in the middle of a hot update loop, while the impact on compression ratio is expected to be small, if not minimal. So these investigations have a fairly low bang-for-buck ratio.

@senhuang42 merged commit 358f177 into facebook:dev on Sep 29, 2021