Skip most long matches in lazy hash table update #2755

Merged — 4 commits merged into facebook:dev on Sep 29, 2021

Conversation

@senhuang42 (Contributor) commented Aug 26, 2021

Origin:
Brought up in #2662. The fix was suggested by @terrelln in #2662 (comment).

Overview:
With this change, the lazy matchfinder only ever updates up to 256 positions in the hash table at a time, which improves performance on long matches. When positions are skipped, we need to stop and re-fill everything in the hash cache.
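In rough outline, the change looks like the following inside the row-based table update (a simplified sketch using the identifiers from the diff excerpts below, not the exact final code):

    /* `idx` is the next position to insert; `target` is the current position. */
    if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
        /* A long stretch (e.g. a long match) was skipped over: only insert
         * the last kMaxPositionsToUpdate positions of it... */
        idx = target - kMaxPositionsToUpdate;
        /* ...and re-fill the hash cache, whose entries no longer correspond
         * to the positions about to be processed. */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }
    /* the normal per-position update loop then runs from idx up to target */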

Current status:
I've been experimenting with different values for the skip threshold, and 256 seems reasonable. Seemingly random changes to the code can cause noticeable and consistent shifts in performance, so it's hard to say whether this change is positive or negative for the average case. We care most about level 6, since that's what the warehouse-type use cases should be using starting with 1.5.1.

Most importantly, we should understand why the performance changes in these various ways; I don't have a good explanation yet. For example, I don't think the UNLIKELY annotation should matter much, yet on level 5 it dramatically increases speed.

skip_long_matches_lazy:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  115.2 MB/s, 1007.7 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   93.0 MB/s, 1001.1 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   71.8 MB/s, 1061.9 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   89.1 MB/s,  845.7 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   68.8 MB/s,  845.8 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   53.9 MB/s,  886.0 MB/s 

// out.txt is the edge case mentioned in issue #2662
 5#out.txt           : 256 KiB -> 4.46 KiB (57.46),  790.3 MB/s, 7064.6 MB/s 
 6#out.txt           : 256 KiB -> 4.89 KiB (52.40),  606.9 MB/s, 3075.7 MB/s 
 7#out.txt           : 256 KiB -> 3.54 KiB (72.36),  582.4 MB/s, 7072.2 MB/s


dev:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.393),  113.8 MB/s, 1011.5 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.449),   94.6 MB/s, 1003.0 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.506),   70.4 MB/s, 1063.9 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   88.0 MB/s,  834.6 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.032),   70.5 MB/s,  838.3 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   52.8 MB/s,  866.1 MB/s

 5#out.txt           : 256 KiB -> 4.74 KiB (53.96),  360.7 MB/s, 7084.3 MB/s 
 6#out.txt           : 256 KiB -> 5.05 KiB (50.68),  336.8 MB/s, 3051.4 MB/s 
 7#out.txt           : 256 KiB -> 3.85 KiB (66.53),  344.7 MB/s, 7045.6 MB/s

Some variations:

skip, without UNLIKELY annotation:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  111.7 MB/s, 1010.8 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   94.4 MB/s, 1002.7 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   71.7 MB/s, 1063.4 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   85.4 MB/s,  848.2 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   70.0 MB/s,  846.2 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   54.0 MB/s,  890.2 MB/s

skip, without UNLIKELY annotation, with "p2align 5" before the update loop:
 5#silesia.tar       : 202 MiB -> 59.6 MiB (3.392),  112.1 MB/s, 1010.7 MB/s 
 6#silesia.tar       : 202 MiB -> 58.6 MiB (3.448),   93.1 MB/s, 1002.8 MB/s 
 7#silesia.tar       : 202 MiB -> 57.7 MiB (3.505),   72.5 MB/s, 1063.4 MB/s

 5#enwik7            : 9.54 MiB -> 3.21 MiB (2.973),   86.6 MB/s,  848.1 MB/s 
 6#enwik7            : 9.54 MiB -> 3.15 MiB (3.031),   69.3 MB/s,  848.9 MB/s 
 7#enwik7            : 9.54 MiB -> 3.07 MiB (3.107),   54.3 MB/s,  894.8 MB/s

// Disabling prefetching in ZSTD_row_fillHashCache() seems to be neutral/negative

@senhuang42 force-pushed the skip_long_matches_lazy branch 2 times, most recently from d21fabb to 0f222a8 on August 26, 2021 16:22
@senhuang42 marked this pull request as draft on August 26, 2021 16:35
    if (useCache) {
        /* Only skip positions when using hash cache, i.e.
         * if we are loading a dict, don't skip anything */
        if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
@Cyan4973 (Contributor) commented on Aug 26, 2021:

The UNLIKELY() statement looks appropriate to me.
We don't expect many large matches,
and when there is one, the savings from not updating each and every position should outweigh the slight additional cost of the unpredicted branch.
For all other cases (normal matches), this should put this branch out of the way.
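(For reference, UNLIKELY() is a branch-prediction hint. A typical definition, roughly along the lines of what zstd's lib/common/compiler.h provides for GCC/Clang, looks like the following; treat the exact form as an illustration rather than a quote of the source.)

    #if defined(__GNUC__)
    #  define LIKELY(x)   (__builtin_expect((x), 1))
    #  define UNLIKELY(x) (__builtin_expect((x), 0))
    #else
    #  define LIKELY(x)   (x)
    #  define UNLIKELY(x) (x)
    #endif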

        /* Only skip positions when using hash cache, i.e.
         * if we are loading a dict, don't skip anything */
        if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
            idx = target - kMaxPositionsToUpdate;
            ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
@Cyan4973 (Contributor) commented on Aug 26, 2021:

My understanding of this line is limited.
It looks to me like it is both caching hash results and prefetching match positions from the corresponding rows.
However:

  • These rows are, at this stage, not yet updated. I presume that's the purpose of ZSTD_row_update_internal(), and there are many positions still to add (kMaxPositionsToUpdate at this point). So is the prefetching helpful? Are these lines still in L1 cache after all these updates?
  • Why is there no equivalent need for smaller matches, whose length is < kMaxPositionsToUpdate?
  • (More generally: what's the saving from storing the hash values in the hashCache?)

Depending on the level of complexity, we may need to take this discussion offline.

Contributor reply:

The hash cache holds the hashes of the next 8 positions. It is used to prefetch the hash table rows a few positions in advance. ZSTD_row_update_internal() requires the hash cache to be correctly filled: it consumes hashCache[(ip - base) & 7] to get the hash of ip - base, and re-fills that slot with the hash of (ip + 8 - base).

Normally we process every position, so we only need to fill the cache at the beginning of the block. But now that we are skipping positions, we need to re-fill it whenever we skip.

(More generally: what's the saving from storing the hash values in the hashCache?)

We need to compute the hash ahead of time in order to prefetch. We've measured both re-computing the hash when we need it and keeping it cached in the hash cache; the hash cache out-performed re-computing. Now that we are skipping positions, that calculus changes a little, but skips are still rare, so I wouldn't expect a big difference.
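A minimal sketch of that consume/re-fill pattern is below; computeHash and prefetchRow are hypothetical stand-ins for the zstd internals, and the real logic lives in ZSTD_row_update_internal() / ZSTD_row_fillHashCache().

    #include <stdint.h>

    /* Hypothetical stand-ins for the zstd internals discussed above. */
    uint32_t computeHash(const unsigned char* ptr);  /* hash of the bytes at ptr */
    void     prefetchRow(uint32_t hash);             /* prefetch the row selected by hash */

    enum { kCacheSize = 8 };  /* the cache holds the hashes of the next 8 positions */

    /* Consume the cached hash of position `pos`, then re-fill that slot with the
     * hash of `pos + 8` and prefetch its row, so the row is warm by the time the
     * update loop reaches that position. */
    static uint32_t cacheConsumeAndRefill(uint32_t* hashCache,
                                          const unsigned char* base, uint32_t pos)
    {
        uint32_t const hash     = hashCache[pos & (kCacheSize - 1)];     /* hash of pos */
        uint32_t const nextHash = computeHash(base + pos + kCacheSize);  /* hash of pos + 8 */
        hashCache[pos & (kCacheSize - 1)] = nextHash;
        prefetchRow(nextHash);
        return hash;
    }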

Contributor reply:

Yeah, this makes sense.
I now realize that I have confused idx with target in my initial reading.
Prefetching the next 8 positions starting from idx is still useful, since they are going to be used in the loop just after.

@senhuang42 marked this pull request as ready for review on August 31, 2021 17:55
@senhuang42 (Contributor, Author):
Let me know if there are any other comments on this. I'll resolve the merge conflicts once this PR is accepted (since some other PRs also affect results.csv).

@Cyan4973 (Contributor) commented Sep 22, 2021

If I understand the algorithm correctly, whenever a large section of input (larger than the threshold) is skipped, normally as a consequence of a long match (though I guess a combination with LDM would trigger the same code path?), you only insert the last threshold bytes of the skipped section.

Have you considered some kind of "split" strategy, like updating the first 128 bytes and the last 128 bytes of the section?
Or any variation of this scheme (64-192, 64-64, etc.)?

Some clues about this proposal:

  • Earlier analysis of the correlation between match positions shows that they tend to cluster close to the beginning or the end of existing matches. That means that the positions right after the selected match are relatively good candidates for future matches.
  • I noticed that this PR generally has a negative consequence on compression ratio on almost all files. While the loss is small, it still feels surprisingly high; I would have expected a smaller difference.
  • 256 bytes should be overkill. Useful ranges of bytes tend to be closer to the borders of matches.

This, to me, suggests that useful byte positions (those right after the beginning of the match) tend to be missing from the table after this rule is applied.
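For concreteness, the proposed split could look roughly like this (a sketch only; kSkipThreshold, kFirst and kLast are illustrative names and values, not code from this PR):

    if (UNLIKELY(target - idx > kSkipThreshold)) {
        U32 const headEnd = idx + kFirst;   /* e.g. first 128 positions of the skipped section */
        for (; idx < headEnd; idx++) {
            /* insert position idx as in the normal update loop */
        }
        idx = target - kLast;               /* then jump to the e.g. last 128 positions */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }
    /* normal update loop handles [idx, target) */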

@senhuang42 (Contributor, Author) commented Sep 23, 2021

That means that the positions right after the selected match are relatively good candidates for future matches.

Yeah, it seems like the following benchmarks confirm this.

I've reduced the threshold to 128 bytes and tried a few strategies: first 128 positions, last 128, and various splits of 128 between first/last. It seems like updating mostly the first positions, plus some of the last positions of the long match, helps. But the degraded performance on level 6 is worrisome (since it replaces level 7, which is an internally important level).

dev (no skips):
 5#silesia.tar       : 211957760 B -> 62473471 B (3.393),  112.3 MB/s, 1010.6 MB/s 
 6#silesia.tar       : 211957760 B -> 61461316 B (3.449),   95.1 MB/s, 1001.9 MB/s 
 7#silesia.tar       : 211957760 B -> 60459424 B (3.506),   71.9 MB/s, 1062.7 MB/s

Last 128 bytes (original method):
 5#silesia.tar       : 211957760 B -> 62494173 B (3.392),  116.7 MB/s, 1010.5 MB/s 
 6#silesia.tar       : 211957760 B -> 61490724 B (3.447),   93.2 MB/s, 1001.3 MB/s 
 7#silesia.tar       : 211957760 B -> 60467579 B (3.505),   72.4 MB/s, 1061.6 MB/s 

First 128 bytes:
 5#silesia.tar       : 211957760 B -> 62487359 B (3.392),  112.5 MB/s, 1010.1 MB/s 
 6#silesia.tar       : 211957760 B -> 61484383 B (3.447),   93.5 MB/s, 1000.5 MB/s 
 7#silesia.tar       : 211957760 B -> 60463389 B (3.506),   70.0 MB/s, 1061.3 MB/s 

64 first / 64 last:
 5#silesia.tar       : 211957760 B -> 62482045 B (3.392),  111.8 MB/s, 1010.6 MB/s 
 6#silesia.tar       : 211957760 B -> 61476578 B (3.448),   92.1 MB/s, 1001.4 MB/s 
 7#silesia.tar       : 211957760 B -> 60459395 B (3.506),   71.6 MB/s, 1063.2 MB/s 

96 first / 32 last:
 5#silesia.tar       : 211957760 B -> 62478591 B (3.392),  112.3 MB/s, 1010.5 MB/s 
 6#silesia.tar       : 211957760 B -> 61472760 B (3.448),   91.6 MB/s, 1000.2 MB/s 
 7#silesia.tar       : 211957760 B -> 60458728 B (3.506),   72.3 MB/s, 1061.4 MB/s

32 first / 96 last:
 5#silesia.tar       : 211957760 B -> 62484231 B (3.392),  111.7 MB/s, 1006.7 MB/s 
 6#silesia.tar       : 211957760 B -> 61478567 B (3.448),   91.0 MB/s, 1002.0 MB/s 
 7#silesia.tar       : 211957760 B -> 60461927 B (3.506),   72.4 MB/s, 1063.4 MB/s

@Cyan4973 (Contributor) commented Sep 23, 2021

Another strategy (which used to be employed in btlazy2) is to make the threshold that triggers a "skip" event different from the number of positions actually updated.

So, as an example, one can decide to only update 128 positions, but trigger this scenario only when the distance is larger than 256.

This makes it possible to preserve the "normal" loop more often.
It may have a (hopefully positive) impact on performance.
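In code, decoupling the two constants would look roughly like this (names and example values are illustrative, not code from this PR):

    /* Trigger the skip only for gaps larger than kSkipTrigger, but when it
     * triggers, still insert the last kPositionsToUpdate positions. */
    if (UNLIKELY(target - idx > kSkipTrigger)) {          /* e.g. 256 */
        idx = target - kPositionsToUpdate;                /* e.g. 128 */
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
    }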

@senhuang42 (Contributor, Author) commented Sep 28, 2021

With a threshold of 384 bytes to trigger the skip, and 96 bytes updated at the beginning of the match plus 32 bytes at the end, we have the following results:

The rightmost column includes a test where we add __asm__(".p2align 5") prior to the hot update loop. It doesn't seem to help except on gcc-11. Generally speaking, compiler and alignment effects have a pretty big impact on speed at levels 5/6; this PR definitely perturbs that a bit, but overall seems mostly neutral.
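(For reference, the alignment experiment amounts to emitting an assembler alignment directive immediately before the loop, roughly as below; ".p2align 5" aligns the next emitted instruction to a 2^5 = 32-byte boundary. This is GCC/Clang inline asm shown only to illustrate the experiment referenced in the table; whether the directive ends up exactly at the loop head depends on code generation.)

    __asm__(".p2align 5");        /* align the hot update loop to 32 bytes */
    for (; idx < target; idx++) {
        /* hot hash-table update loop */
    }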

level 5, silesia.tar, MB/s

            dev     skip    skip (with alignment to 32 bytes)
gcc-11      114.3   110.9   114.3
gcc-10      113.9   114.7   111.7
gcc-9       111.8   113.1   114.0
gcc-8       111.8   109.0   109.6
clang-12    113.3   113.8   113.1

level 6, silesia.tar, MB/s

            dev     skip    skip (with alignment to 32 bytes)
gcc-11      94.8    91.9    93.1
gcc-10      92.5    94.2    92.0
gcc-9       89.7    89.2    90.2
gcc-8       88.9    93.2    92.8
clang-12    91.2    91.3    90.5

And of course, on data that's basically just long matches we still see the big speed improvement:

dev:
 5#out.txt           :    262144 ->      4858 (53.96),  366.9 MB/s, 7015.4 MB/s 
 6#out.txt           :    262144 ->      5173 (50.68),  346.0 MB/s, 3068.1 MB/s 
 7#out.txt           :    262144 ->      3940 (66.53),  339.7 MB/s, 7061.5 MB/s

skip:
 5#out.txt           :    262144 ->      5101 (51.39),  883.2 MB/s, 5300.4 MB/s 
 6#out.txt           :    262144 ->      5008 (52.35),  646.7 MB/s, 3089.1 MB/s 
 7#out.txt           :    262144 ->      3765 (69.63),  621.4 MB/s, 5908.2 MB/s

@senhuang42 (Contributor, Author) commented Sep 29, 2021

Compression ratio comparison, since the old one is no longer valid due to the recent increase in skip threshold:

dev:
 5#silesia.tar       : 211950592 ->  62458865 (3.393)
 6#silesia.tar       : 211950592 ->  61446239 (3.449)
 7#silesia.tar       : 211950592 ->  60445279 (3.506)

 5#enwik7            :  10000000 ->   3363547 (2.973)
 6#enwik7            :  10000000 ->   3298642 (3.032)
 7#enwik7            :  10000000 ->   3218460 (3.107)

skip:
 5#silesia.tar       : 211950592 ->  62461337 (3.393)
 6#silesia.tar       : 211950592 ->  61452566 (3.449)
 7#silesia.tar       : 211950592 ->  60446110 (3.506)

 5#enwik7            :  10000000 ->   3363689 (2.973)
 6#enwik7            :  10000000 ->   3298903 (3.031)
 7#enwik7            :  10000000 ->   3218417 (3.107)

The regression in compressed size is quite small now. It's curious, though, that the regression test is showing tiny improvements to compression ratio instead.

@Cyan4973 (Contributor) commented Sep 29, 2021

it's curious that the regression test is showing tiny improvements to compression ratio instead.

This method is skipping "unpromising" positions in the middle of long matches, primarily for the sake of speed.
As a consequence, these positions no longer occupy space in the fixed-size rows.
That space can be used instead by other positions, which might end up being slightly more "promising" (although their efficiency is harmed by the increased distance). This is how you could end up, in some circumstances, with a (slightly) better compression ratio: the rows now contain slightly more "promising" match candidates.

This effect could be improved further by:

  • only skipping positions when the row is full, and continuing to fill it when it is not
  • skipping positions in the middle of long literal sections (this is partially achieved already thanks to the increased sampling distance, but only partially).

However, both propositions above introduce complexity right in the middle of a hot update loop, while the impact on compression ratio is expected to be small, if not minimal. So these investigations have a fairly low bang-for-buck ratio.

@senhuang42 merged commit 358f177 into facebook:dev on Sep 29, 2021