[libzstd] Speed up single segment zstd_fast by 5% #1562
Conversation
Force-pushed from ec48873 to 485dca7.
I'm going to separate out the dict match state variant into a separate function, in a separate PR. It is hard to work with the two functions combined. Then I will come back to this.
This PR is ready for review. However, it will be easier to either just review the second commit, or wait until PR #1563 has been merged, since those changes are included in this PR. There's still a question of how this behaves on ARM. @Cyan4973 would you be able to test on your phone, since you already have it set up? I'll optimize the dictMatchState and extDict variants in separate PRs.
Sure, I can test it on a Qualcomm.
Force-pushed from 4b3b3e3 to 58c8f38.
Some benchmark results, on Qualcomm. There are wild swings of performance during benchmarking, depending on whether the cpu triggers "turbo mode" or not, which cannot be controlled and makes results unstable. This can be detected by looking at the decompression speed results, which are supposed to be equivalent, since that part has not changed.
All good.
Summary: results are mixed for speed, though generally positive. Impact on compression ratio is visible, and ranges from negligible (most of the time) to quite measurable; ratio is noticeably and negatively impacted in a few cases. The speed impact is expected to be platform-dependent, while the compression ratio losses are expected to be universal.
```c
size_t const h1 = ZSTD_hashPtr(ip1, hlog, mls);
U32 const val1 = MEM_read32(ip1);
U32 const current0 = (U32)(ip0-base);
U32 const current1 = (U32)(ip1-base);
```
Minor comment: this could also be `current1 = current0 + 1;`. Simpler, but a longer dependency chain. It could help if the execution units are working at full capacity and would welcome a slight operation relief. Likely insignificant.
This is a ~1% loss with gcc-8 on an Intel i9.
lib/compress/zstd_fast.c (Outdated)
```c
U32 const current1 = (U32)(ip1-base);
U32 const matchIndex0 = hashTable[h0];
U32 const matchIndex1 = hashTable[h1];
const BYTE* const repMatch = ip1-offset_1;
```
So that's the difference. An "equivalent" algorithm (with the current `ZSTD_fast`) would also check a second repMatch, at `ip1 + 1 - offset_1`. That's more work, but also more parsing capability. More work might kill the cpu advantage though.

An alternative could be to check at `ip1 + 1 - offset_1` instead of `ip1 - offset_1`, though I suspect it does not combine well with the current strategy of always blindly trusting the repeat code.

Other variant: check `ip1 + 1 - offset_1` only, and try to expand backward in case of a match. Necessarily better than the previous proposal in ratio, but also a slight speed hit.
I have an updated variant that fixes this.

The speed gains could be explained just by the fact that there are fewer repeat code checks (once every 2 positions, instead of once per position). So it's not completely clear whether this is entirely related to wider out-of-order execution.
Force-pushed from 4fb1ec5 to 3366b90.
```diff
@@ -51,10 +51,12 @@ size_t ZSTD_compressBlock_fast_generic(
     U32* const hashTable = ms->hashTable;
     U32 const hlog = cParams->hashLog;
     /* support stepSize of 0 */
-    U32 const stepSize = cParams->targetLength + !(cParams->targetLength);
+    size_t const stepSize = cParams->targetLength + !(cParams->targetLength) + 1;
```
I had this as `2 * (targetLength + !targetLength)`, but I think that `+1` is a better choice. This will make negative compression levels slightly stronger and slower, but the user can set any negative level, so that should be fine.
This PR is based on top of PR facebook#1563. The optimization is to process two input pointers per loop. It is based on ideas from [igzip] level 1, and talking to @gbtucker.

| Platform | Silesia | Enwik8 |
|-------------------------|-------|-------|
| OSX clang-10 | +5.3% | +5.4% |
| i9 5 GHz gcc-8 | +6.6% | +6.6% |
| i9 5 GHz clang-7 | +8.0% | +8.0% |
| Skylake 2.4 GHz gcc-4.8 | +6.3% | +7.9% |
| Skylake 2.4 GHz clang-7 | +6.2% | +7.5% |

Testing on all Silesia files on my Intel i9-9900k with gcc-8:

| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar | +0.17% | +6.6% |
| dickens | +0.25% | +7.0% |
| mozilla | +0.02% | +6.8% |
| mr | -0.30% | +10.9% |
| nci | +1.28% | +4.5% |
| ooffice | -0.35% | +10.7% |
| osdb | +0.75% | +9.8% |
| reymont | +0.65% | +4.6% |
| samba | +0.70% | +5.9% |
| sao | -0.01% | +14.0% |
| webster | +0.30% | +5.5% |
| xml | +0.92% | +5.3% |
| x-ray | -0.00% | +1.4% |

Same tests on Calgary. For brevity, I've only included files where compression ratio regressed or was much better.

| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar | +0.30% | +7.1% |
| geo | -0.14% | +25.0% |
| obj1 | -0.46% | +15.2% |
| obj2 | -0.18% | +6.0% |
| pic | +1.80% | +9.3% |
| trans | -0.35% | +5.5% |

We gain 0.1% of compression ratio on Silesia. We gain 0.3% of compression ratio on enwik8. I also tested on the GitHub and hg-commands datasets without a dictionary, and we gain a small amount of compression ratio on each, as well as speed.

I tested the negative compression levels on Silesia on my Intel i9-9900k with gcc-8:

| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1 | +0.13% | +6.4% |
| -2 | +4.6% | -1.5% |
| -3 | +7.5% | -4.8% |
| -4 | +8.5% | -6.9% |
| -5 | +9.1% | -9.1% |

Roughly, the negative levels now scale half as quickly. E.g. the new level 16 is roughly equivalent to the old level 8, but a bit quicker and smaller. If you don't think this is the right trade off, we can change it to multiply the step size by 2, instead of adding 1. I think this makes sense, because it gives a bit slower ratio decay.

[igzip]: https://github.com/01org/isa-l/tree/master/igzip
The new version looks much better. Results on the same Qualcomm device: the situation on compression ratio is much improved. The situation for compression speed is a bit more subtle, and also more difficult to assess, due to the target cpu's volatility. So this version feels like an overall win.
I see small (1-2%) gains on my AMD ThreadRipper with gcc-7.
Testing on all Silesia files on my Intel i9-9900k with gcc-8
Same tests on Calgary. For brevity, I've only included files
where compression ratio regressed or was much better.
We gain 0.1% of compression ratio on Silesia.
We gain 0.3% of compression ratio on enwik8.
I also tested on the GitHub and hg-commands datasets without a dictionary,
and we gain a small amount of compression ratio on each, as well as speed.
I tested the negative compression levels on Silesia on my
Intel i9-9900k with gcc-8:
Roughly, the negative levels now scale half as quickly. E.g. the new
level 16 is roughly equivalent to the old level 8, but a bit quicker
and smaller. If you don't think this is the right trade off, we can
change it to multiply the step size by 2, instead of adding 1. I think
this makes sense, because it gives a bit slower ratio decay.