Skip most long matches in lazy hash table update #2755
Conversation
lib/compress/zstd_lazy.c
```c
if (useCache) {
    /* Only skip positions when using hash cache, i.e.
     * if we are loading a dict, don't skip anything */
    if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
```
The `UNLIKELY()` statement looks appropriate to me. We don't expect many large matches, and when there is one, the savings from not updating each and every position should outweigh the slight additional cost of the unpredicted branch. For all other cases (normal matches), this should put this branch out of the way.
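For reference, `UNLIKELY()` is a branch-prediction hint. A minimal sketch of such a macro, assuming a GCC/Clang-style compiler (zstd's actual definition in `lib/common/compiler.h` covers more cases):

```c
/* Branch-prediction hint macros as commonly defined for GCC/Clang.
 * This is a sketch; a real implementation handles more compilers. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   (__builtin_expect((x) != 0, 1))
#  define UNLIKELY(x) (__builtin_expect((x) != 0, 0))
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif
```

The hint lets the compiler move the skip path off the fall-through path, so the common case (no skip) pays only for a well-predicted branch.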
```c
    /* Only skip positions when using hash cache, i.e.
     * if we are loading a dict, don't skip anything */
    if (UNLIKELY(target - idx > kMaxPositionsToUpdate)) {
        idx = target - kMaxPositionsToUpdate;
        ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip+1);
```
My understanding of this line is limited. It looks to me like it's both caching hash results and prefetching match positions from the corresponding rows. However:

- These rows are, at this stage, not yet updated. I presume that it's the purpose of `ZSTD_row_update_internal()` to do so. And there are many positions yet to add (`kMaxPositionsToUpdate` at this point). So is the pre-fetching helpful? Are these lines still in L1 cache after all these updates?
- Why is there no equivalent need for smaller matches, whose length is < `kMaxPositionsToUpdate`?
- (More generally: what's the saving from storing the hash values into the `hashCache`?)

Depending on the level of complexity, we may need to take this discussion offline.
The hash cache holds the hashes of the next 8 positions. It is used to prefetch the hash table rows a few positions in advance. `ZSTD_row_update_internal()` requires the hash cache to be correctly filled: it consumes `hashCache[(ip - base) & 7]` to get the hash of `ip - base`, and re-fills that slot with the hash of `(ip + 8 - base)`.

Normally, we process every position, so we only need to fill the cache at the beginning of the block. But, now that we are skipping positions, we need to re-fill it when we skip.

> (More generally: what's the saving from storing the hash values into the hashCache?)

We need to compute the hash ahead of time in order to prefetch. We've measured both re-computing the hash when we need it and keeping it cached in the hash cache; the hash cache out-performed re-computing the hash.

However, now that we are skipping positions, that calculus changes a little. But it is still rare, so I wouldn't expect a big difference.
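For illustration, a minimal sketch of the consume-and-refill pattern described above. The 8-entry cache and the `& 7` indexing come straight from the comment; `computeHash` and the function name are hypothetical stand-ins, not zstd's actual API:

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t U32;
typedef unsigned char BYTE;

#define ROW_HASH_CACHE_SIZE 8   /* hashes of the next 8 positions */

/* Toy stand-in for the real row hash: multiplicative hash of 4 bytes. */
static U32 computeHash(const BYTE* ip)
{
    U32 v;
    memcpy(&v, ip, sizeof(v));
    return v * 2654435761u;
}

/* Consume the cached hash for position (ip - base), then refill that
 * slot with the hash of the position 8 bytes ahead. The cache thereby
 * stays one "window" in front of the update loop, so the matching
 * hash-table row can be prefetched before it is needed. */
static U32 nextCachedHash(U32* hashCache, const BYTE* base, const BYTE* ip)
{
    U32 const slot  = (U32)(ip - base) & (ROW_HASH_CACHE_SIZE - 1);
    U32 const hash  = hashCache[slot];
    hashCache[slot] = computeHash(ip + ROW_HASH_CACHE_SIZE);
    return hash;
}
```

When the skip triggers, the loop jumps forward by far more than 8 positions, so every slot becomes stale; that is why `ZSTD_row_fillHashCache()` has to rebuild all 8 entries starting from the new `idx`.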
Yeah, this makes sense. I now realize that I had confused `idx` with `target` in my initial reading. Prefetching the next 8 positions starting from `idx` is still useful, since they are going to be used in the loop just after.
Let me know if there are any other comments on this. I'll resolve the merge conflicts once this PR is accepted (since some other PRs also affect …).
If I understand the algorithm correctly, whenever a large section of input (larger than …) is skipped, only part of its positions get inserted into the hash table. Have you considered some kind of "split" strategy, like updating the first 128 bytes and the last 128 bytes of the section? Some clues about this proposal: …

This, to me, suggests that useful byte positions (those after the beginning of the match) tend to miss in the table after this rule is applied.
Yeah, benchmarks seem to confirm this. I've reduced the threshold to 128 bytes and tried a few strategies: the first 128 positions, the last 128, and various splits of 128 between first and last. It seems like keeping mostly first positions, plus some last positions, of the long match helps. But the degraded performance on level 6 is worrisome (since it replaces level 7, which is an internally important level).
Another strategy (that used to be employed in …) is to decouple the distance that triggers the skip from the number of positions updated. So, as an example, one can decide to only update 128 positions, but this scenario is triggered only when the distance is larger than 256. This makes it possible to preserve the "normal" loop more often.
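A small self-contained sketch of that decoupling; the constants and the helper name are hypothetical, chosen to match the example values above:

```c
#include <stdint.h>
typedef uint32_t U32;

/* Hypothetical constants: the skip triggers only past kSkipTrigger,
 * but then only kPositionsToUpdate positions are inserted. */
#define kSkipTrigger       256
#define kPositionsToUpdate 128

/* Returns the first position that should actually be inserted.
 * Gaps of (kPositionsToUpdate, kSkipTrigger] bytes still take the
 * normal full-update path, preserving the common-case loop. */
static U32 firstPositionToUpdate(U32 idx, U32 target)
{
    if (target - idx > kSkipTrigger)
        return target - kPositionsToUpdate;  /* skip the head of the gap */
    return idx;                              /* normal: update everything */
}
```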
With a threshold of 384 bytes to trigger the skip, 96 bytes updated at the beginning of the match, and 32 bytes at the end, we have the following results (the rightmost column includes a test where we add …):

level 5, silesia.tar, MB/s: (table not preserved)

level 6, silesia.tar, MB/s: (table not preserved)

And of course, on data that's basically just long matches, we still see the big speed improvement.
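For illustration, a sketch of that first/last split using the 384/96/32 parameters quoted above; `insertPositions` is a hypothetical stand-in for the real row-update loop:

```c
#include <stdint.h>
typedef uint32_t U32;

#define kSkipTrigger 384  /* gap size that triggers skipping */
#define kUpdateFirst  96  /* positions inserted at the start of the gap */
#define kUpdateLast   32  /* positions inserted at the end of the gap */

/* Hypothetical helper: insert positions [begin, end) into the table. */
static void insertPositions(U32 begin, U32 end)
{
    (void)begin; (void)end;  /* stand-in for the real update loop */
}

/* Update positions [idx, target), skipping the middle of long gaps. */
static void updateWithSplit(U32 idx, U32 target)
{
    if (target - idx > kSkipTrigger) {
        insertPositions(idx, idx + kUpdateFirst);  /* head of the gap */
        idx = target - kUpdateLast;                /* jump to the tail */
    }
    insertPositions(idx, target);  /* normal path, or the gap's tail */
}
```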
Compression ratio comparison, since the old one is no longer valid due to the recent increase in skip threshold: (table not preserved)

The regression in compressed size is quite small now, although it's curious that the regression test is showing tiny improvements to compression ratio instead.
This method skips "unpromising" positions in the middle of long matches, primarily for the sake of speed. This effect could be improved further by: …

However, both propositions above introduce complexity right in the middle of a hot update loop, while the impact on compression ratio is expected to be small, if not minimal. So these investigations have a fairly low bang-for-buck ratio.
Origin:
Brought up in #2662. Fix suggested by @terrelln in #2662 (comment).
Overview:
With this change, for the lazy matchfinder, we only ever update up to 256 positions in the hash table, which improves performance on long matches. When this happens, we need to stop and re-fill the entire hash cache.
Current status:
I've been playing around with different settings for the skip threshold, and 256 seems reasonable. It seems like random changes to the code can cause noticeable and consistent shifts in performance, and it's hard to say whether this is generally positive or negative for the average case. We care most about level 6, since that's what the warehouse-type use cases should be using starting with 1.5.1.

Most importantly, I want to understand why the performance is changing in these various ways, since I don't have a good explanation for that yet. For example, I don't think the `UNLIKELY` annotation is actually that helpful, yet on level 5 it dramatically increases speed.

Some variations: …