[libzstd] Speed up single segment zstd_fast by 5% #1562
Conversation
Force-pushed from ec48873 to 485dca7.
I'm going to separate out the dict match state variant into a separate function, in a separate PR. It is hard to work with the two functions combined. Then I will come back to this.
This PR is ready for review. However, it will be easier to either just review the second commit, or wait until PR #1563 has been merged, since those changes are included in this PR. There's still a question of how this behaves on ARM. @Cyan4973 would you be able to test on your phone, since you already have it set up? I'll optimize the dictMatchState and extDict variants in separate PRs.
Sure, I can test it on a Qualcomm.
Force-pushed from 4b3b3e3 to 58c8f38.
Some benchmark results, on Qualcomm. There are wild swings of performance during benchmarking, depending on whether the cpu triggers "turbo mode" or not, which cannot be controlled and makes results unstable. This can be detected by looking at the decompression speed results, which are supposed to be equivalent, since that part has not changed.
All good.
Summary: results are mixed for speed, though generally positive. Impact on compression ratio is visible, and ranges from negligible (most of the time) to quite measurable; ratio is noticeably and negatively impacted in a few cases. The speed impact is expected to be platform-dependent, while the compression ratio losses are expected to be universal.
```c
size_t const h1 = ZSTD_hashPtr(ip1, hlog, mls);
U32 const val1 = MEM_read32(ip1);
U32 const current0 = (U32)(ip0-base);
U32 const current1 = (U32)(ip1-base);
```
Minor comment: this could also be `current1 = current0 + 1;`. Simpler, but a longer dependency chain. It could help if the execution units are working at full capacity and would welcome a slight operation relief. Likely insignificant.
This is a ~1% loss with gcc-8 on an Intel i9.
lib/compress/zstd_fast.c (Outdated)
```c
U32 const current1 = (U32)(ip1-base);
U32 const matchIndex0 = hashTable[h0];
U32 const matchIndex1 = hashTable[h1];
const BYTE* const repMatch = ip1-offset_1;
```
So that's the difference. An "equivalent" algorithm (with the current `ZSTD_fast`) would also check a second repMatch, at `ip1 + 1 - offset_1`. That's more work, but also more parsing capability. More work might kill the cpu advantage though.

An alternative could be to check at `ip1 + 1 - offset_1` instead of `ip1 - offset_1`, though I suspect it does not combine well with the current strategy of always blindly trusting the repeat code.

Other variant: check `ip1 + 1 - offset_1` only, and try to expand backward in case of a match. Necessarily better than the previous proposal in ratio, but also a slight speed hit.
I have an updated variant that fixes this.

The speed gains could be explained just by the fact that there are fewer repeat code checks (once every 2 positions, instead of once per position). So it's not completely clear whether this is entirely related to wider out-of-order execution.
Force-pushed from 4fb1ec5 to 3366b90.
```diff
@@ -51,10 +51,12 @@ size_t ZSTD_compressBlock_fast_generic(
     U32* const hashTable = ms->hashTable;
     U32 const hlog = cParams->hashLog;
     /* support stepSize of 0 */
-    U32 const stepSize = cParams->targetLength + !(cParams->targetLength);
+    size_t const stepSize = cParams->targetLength + !(cParams->targetLength) + 1;
```
I had this as `2 * (targetLength + !targetLength)`, but I think that `+1` is a better choice. This will make negative compression levels slightly stronger and slower, but the user can set any negative level, so that should be fine.
This PR is based on top of PR facebook#1563. The optimization is to process two input pointers per loop. It is based on ideas from [igzip] level 1, and talking to @gbtucker.

| Platform | Silesia | Enwik8 |
|-------------------------|-------|-------|
| OSX clang-10 | +5.3% | +5.4% |
| i9 5 GHz gcc-8 | +6.6% | +6.6% |
| i9 5 GHz clang-7 | +8.0% | +8.0% |
| Skylake 2.4 GHz gcc-4.8 | +6.3% | +7.9% |
| Skylake 2.4 GHz clang-7 | +6.2% | +7.5% |

Testing on all Silesia files on my Intel i9-9900k with gcc-8:

| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar | +0.17% | +6.6% |
| dickens | +0.25% | +7.0% |
| mozilla | +0.02% | +6.8% |
| mr | -0.30% | +10.9% |
| nci | +1.28% | +4.5% |
| ooffice | -0.35% | +10.7% |
| osdb | +0.75% | +9.8% |
| reymont | +0.65% | +4.6% |
| samba | +0.70% | +5.9% |
| sao | -0.01% | +14.0% |
| webster | +0.30% | +5.5% |
| xml | +0.92% | +5.3% |
| x-ray | -0.00% | +1.4% |

Same tests on Calgary. For brevity, I've only included files where compression ratio regressed or was much better.

| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar | +0.30% | +7.1% |
| geo | -0.14% | +25.0% |
| obj1 | -0.46% | +15.2% |
| obj2 | -0.18% | +6.0% |
| pic | +1.80% | +9.3% |
| trans | -0.35% | +5.5% |

We gain 0.1% of compression ratio on Silesia. We gain 0.3% of compression ratio on enwik8. I also tested on the GitHub and hg-commands datasets without a dictionary, and we gain a small amount of compression ratio on each, as well as speed.

I tested the negative compression levels on Silesia on my Intel i9-9900k with gcc-8:

| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1 | +0.13% | +6.4% |
| -2 | +4.6% | -1.5% |
| -3 | +7.5% | -4.8% |
| -4 | +8.5% | -6.9% |
| -5 | +9.1% | -9.1% |

Roughly, the negative levels now scale half as quickly. E.g. the new level 16 is roughly equivalent to the old level 8, but a bit quicker and smaller. If you don't think this is the right trade off, we can change it to multiply the step size by 2, instead of adding 1. I think this makes sense, because it gives a bit slower ratio decay.

[igzip]: https://github.com/01org/isa-l/tree/master/igzip
The new version looks much better. Results on the same Qualcomm device: the situation on compression ratio is much improved. The situation for compression speed is a bit more subtle, and also more difficult to assess, due to the target cpu's volatility. So this version feels like an overall win.
I see small (1-2%) gains on my AMD ThreadRipper with gcc-7.
Testing on all Silesia files on my Intel i9-9900k with gcc-8
Same tests on Calgary. For brevity, I've only included files
where compression ratio regressed or was much better.
We gain 0.1% of compression ratio on Silesia.
We gain 0.3% of compression ratio on enwik8.
I also tested on the GitHub and hg-commands datasets without a dictionary,
and we gain a small amount of compression ratio on each, as well as speed.
I tested the negative compression levels on Silesia on my
Intel i9-9900k with gcc-8:
Roughly, the negative levels now scale half as quickly. E.g. the new
level 16 is roughly equivalent to the old level 8, but a bit quicker
and smaller. If you don't think this is the right trade off, we can
change it to multiply the step size by 2, instead of adding 1. I think
this makes sense, because it gives a bit slower ratio decay.