
[libzstd] Speed up single segment zstd_fast by 5% #1562

Merged: 1 commit merged into dev from terrelln/2fast on Apr 4, 2019

Conversation

@terrelln (Contributor) commented Mar 29, 2019

This PR is based on top of PR #1563.

The optimization is to process two input pointers per loop.
It is based on ideas from igzip level 1 (https://github.com/01org/isa-l/tree/master/igzip), and on talking to @gbtucker.

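As a rough, self-contained illustration of the idea (`read32`, `hash4`, and the loop shape are simplified stand-ins for illustration, not the actual `ZSTD_compressBlock_fast_generic` code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Native-endian 4-byte load; fine for a sketch. */
static uint32_t read32(const uint8_t *p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* Multiplicative hash of 4 bytes, as fast matchers commonly use. hlog in [1,32]. */
static size_t hash4(const uint8_t *p, unsigned hlog) {
    return (size_t)((read32(p) * 2654435761U) >> (32 - hlog));
}

/* Each iteration hashes and probes two adjacent positions, ip0 and ip1 = ip0 + 1.
 * hashTable is assumed zero-initialized and base points at the buffer start. */
static void scan_two_wide(const uint8_t *base, const uint8_t *istart,
                          const uint8_t *iend, uint32_t *hashTable, unsigned hlog)
{
    const uint8_t *ip0 = istart;
    const uint8_t *ip1 = istart + 1;
    while (ip1 + 4 <= iend) {
        size_t const h0 = hash4(ip0, hlog);
        size_t const h1 = hash4(ip1, hlog);
        uint32_t const matchIndex0 = hashTable[h0];  /* probe both positions early */
        uint32_t const matchIndex1 = hashTable[h1];
        hashTable[h0] = (uint32_t)(ip0 - base);      /* then record both positions */
        hashTable[h1] = (uint32_t)(ip1 - base);
        if (read32(base + matchIndex0) == read32(ip0)) {
            /* ... emit a match found at ip0 ... */
        } else if (read32(base + matchIndex1) == read32(ip1)) {
            /* ... emit a match found at ip1 ... */
        }
        ip0 += 2;  /* two positions consumed per iteration */
        ip1 += 2;
    }
}
```

The point is that both hash-table loads are issued before either result is consumed, so an out-of-order core can overlap their latency.
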
| Platform                | Silesia Speed | Enwik8 Speed |
|-------------------------|---------------|--------------|
| OSX clang-10            | +5.3%         | +5.4%        |
| i9 5 GHz gcc-8          | +6.6%         | +6.6%        |
| i9 5 GHz clang-7        | +8.0%         | +8.0%        |
| Skylake 2.4 GHz gcc-4.8 | +6.3%         | +7.9%        |
| Skylake 2.4 GHz clang-7 | +6.2%         | +7.5%        |

I see small (1-2%) gains on my AMD ThreadRipper with gcc-7.

Testing on all Silesia files on my Intel i9-9900k with gcc-8:

| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar  | +0.17%       | +6.6%        |
| dickens      | +0.25%       | +7.0%        |
| mozilla      | +0.02%       | +6.8%        |
| mr           | -0.30%       | +10.9%       |
| nci          | +1.28%       | +4.5%        |
| ooffice      | -0.35%       | +10.7%       |
| osdb         | +0.75%       | +9.8%        |
| reymont      | +0.65%       | +4.6%        |
| samba        | +0.70%       | +5.9%        |
| sao          | -0.01%       | +14.0%       |
| webster      | +0.30%       | +5.5%        |
| xml          | +0.92%       | +5.3%        |
| x-ray        | -0.00%       | +1.4%        |

Same tests on Calgary. For brevity, I've only included files
where compression ratio regressed or was much better.

| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar  | +0.30%       | +7.1%        |
| geo          | -0.14%       | +25.0%       |
| obj1         | -0.46%       | +15.2%       |
| obj2         | -0.18%       | +6.0%        |
| pic          | +1.80%       | +9.3%        |
| trans        | -0.35%       | +5.5%        |

We gain 0.1% of compression ratio on Silesia.
We gain 0.3% of compression ratio on enwik8.
I also tested on the GitHub and hg-commands datasets without a dictionary,
and we gain a small amount of compression ratio on each, as well as speed.

I tested the negative compression levels on Silesia on my
Intel i9-9900k with gcc-8:

| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1    | +0.13%       | +6.4%        |
| -2    | +4.6%        | -1.5%        |
| -3    | +7.5%        | -4.8%        |
| -4    | +8.5%        | -6.9%        |
| -5    | +9.1%        | -9.1%        |

Roughly, the negative levels now scale half as quickly. E.g. the new
level -16 is roughly equivalent to the old level -8, but a bit quicker
and smaller. If you don't think this is the right trade-off, we can
change it to multiply the step size by 2 instead of adding 1. I think
adding 1 makes sense, because it gives a bit slower ratio decay.

@terrelln changed the title from "[RFC] 2-6% speed boost for ZSTD_fast" to "[RFC] 2-7% speed boost for ZSTD_fast" on Mar 29, 2019
@terrelln force-pushed the 2fast branch 2 times, most recently from ec48873 to 485dca7, on March 29, 2019 16:23
@terrelln (Author):

I'm going to split the dictMatchState variant out into its own function, in a separate PR. It is hard to work with the two functions combined. Then I will come back to this.

@terrelln changed the title from "[RFC] 2-7% speed boost for ZSTD_fast" to "[libzstd] Speed up single segment zstd_fast by 5%" on Mar 29, 2019
@terrelln (Author):

This PR is ready for review. However, it will be easier either to review just the second commit, or to wait until PR #1563 has been merged, since those changes are included in this PR.

There's still a question of how this behaves on ARM. @Cyan4973, would you be able to test on your phone, since you already have it set up?

I'll optimize the dictMatchState and extDict variants in separate PRs.

@Cyan4973 (Contributor):

Sure, I can test it on a Qualcomm aarch64 chip.

@terrelln force-pushed the 2fast branch 2 times, most recently from 4b3b3e3 to 58c8f38, on March 29, 2019 23:42
@Cyan4973 (Contributor) commented Apr 2, 2019

Some benchmark results, on a Qualcomm aarch64 Kryo 2, using the clang 7.0.1 compiler.

There are wild swings of performance during benchmarking, depending on whether the cpu triggers "turbo mode" or not; this cannot be controlled and makes results unstable. It can be detected by looking at the decompression speed results, which should be equivalent since that part has not changed.
I tried to keep all results comparable, generally falling back to "economy" mode, which is about 20% slower than max.

  1. Simple test, on calgary.tar:

facebook/dev :

./zstd -b1i9 calgary.tar
1#calgary.tar       :   3265536 ->   1187418 (2.750),  71.0 MB/s , 247.8 MB/s

terrelln/2fast :

./zstd -b1i9 calgary.tar
1#calgary.tar       :   3265536 ->   1186760 (2.752),  74.1 MB/s , 237.8 MB/s

All good.

  2. Now more complex: detailed results per file on the Silesia corpus:

facebook/dev :

./zstd -b1i9 -r -S dirSilesia
 1#dickens           :  10192446 ->   4279054 (2.382),  56.9 MB/s , 220.0 MB/s
 1#mozilla           :  51220480 ->  20120459 (2.546), 103.9 MB/s , 299.6 MB/s
 1#mr                :   9970564 ->   3829245 (2.604),  89.8 MB/s , 360.7 MB/s
 1#nci               :  33553445 ->   2884095 (11.63), 176.0 MB/s , 367.0 MB/s
 1#ooffice           :   6152192 ->   3579899 (1.719),  79.6 MB/s , 267.6 MB/s
 1#osdb              :  10085684 ->   3767584 (2.677),  94.7 MB/s , 340.7 MB/s
 1#reymont           :   6627202 ->   2167027 (3.058),  73.9 MB/s , 259.1 MB/s
 1#samba             :  21606400 ->   5550630 (3.893), 122.1 MB/s , 376.8 MB/s
 1#sao               :   7251944 ->   6254282 (1.160),  67.7 MB/s , 292.5 MB/s
 1#webster           :  41458703 ->  13737048 (3.018),  84.2 MB/s , 290.9 MB/s
 1#xml               :   5345280 ->    703093 (7.603), 144.4 MB/s , 385.1 MB/s
 1#x-ray             :   8474240 ->   6772289 (1.251), 144.0 MB/s , 328.9 MB/s
 1#silesia.tar       : 211948032 ->  73659442 (2.877), 101.4 MB/s , 322.3 MB/s

terrelln/2fast :

./zstd -b1i9 -r -S dirSilesia
 1#dickens           :  10192446 ->   4268428 (2.388),  60.4 MB/s , 223.4 MB/s   ++
 1#mozilla           :  51220480 ->  20433450 (2.507), 104.7 MB/s , 295.9 MB/s   -
 1#mr                :   9970564 ->   3838568 (2.597),  99.0 MB/s , 361.2 MB/s   ++
 1#nci               :  33553445 ->   2932904 (11.44), 179.2 MB/s , 367.5 MB/s   -
 1#ooffice           :   6152192 ->   3595743 (1.711),  87.3 MB/s , 264.9 MB/s   +
 1#osdb              :  10085684 ->   3738934 (2.697), 103.4 MB/s , 350.2 MB/s   ++
 1#reymont           :   6627202 ->   2159795 (3.068),  76.9 MB/s , 256.3 MB/s   ++
 1#samba             :  21606400 ->   5540418 (3.900), 126.5 MB/s , 377.6 MB/s   -
 1#sao               :   7251944 ->   6250760 (1.160),  76.1 MB/s , 295.3 MB/s   +
 1#webster           :  41458703 ->  13691279 (3.028),  85.1 MB/s , 290.6 MB/s   +
 1#xml               :   5345280 ->    706919 (7.561), 149.1 MB/s , 382.5 MB/s   +-
 1#x-ray             :   8474240 ->   6767479 (1.252), 109.7 MB/s , 315.2 MB/s   --
 1#silesia.tar       : 211948032 ->  73944642 (2.866), 104.4 MB/s , 321.2 MB/s   -

Summary

Results are mixed for speed, though generally positive. The impact mostly lies in the +3-5% range. There are bad cases too: a particularly bad case is x-ray, which loses a lot of speed (about -24%, 144.0 -> 109.7 MB/s). Nonetheless, on average it's a net gain.

Impact on compression ratio is visible, and ranges from negligible (most of the time) to quite measurable. Ratio is noticeably and negatively impacted on mozilla, nci and xml. There are wins too, but none is particularly large, with osdb seeing the biggest gain. So overall, on this front, it's a loss.

The speed impact is expected to be platform-dependent, while the compression ratio losses are expected to be universal.

size_t const h1 = ZSTD_hashPtr(ip1, hlog, mls);
U32 const val1 = MEM_read32(ip1);
U32 const current0 = (U32)(ip0-base);
U32 const current1 = (U32)(ip1-base);
Review comment (Contributor):

Minor comment: this could also be current1 = current0 + 1;.
Simpler, but a longer dependency chain.
It could be beneficial if the execution units are working at full capacity and would welcome a slight relief.
Likely insignificant.

Reply (Author):

This is a ~1% loss with gcc-8 on an Intel i9.
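
For concreteness, a minimal sketch of the two forms being compared (the pointer names mirror the snippet above; this is illustrative, not the actual libzstd code):

```c
#include <stdint.h>

typedef uint32_t U32;  /* stand-in for zstd's U32 */

static void compute_indices(const uint8_t *base, const uint8_t *ip0, const uint8_t *ip1)
{
    /* What the PR does: two independent subtractions that can execute in parallel. */
    U32 const current0 = (U32)(ip0 - base);
    U32 const current1 = (U32)(ip1 - base);

    /* Suggested alternative: one subtraction fewer, but the add cannot start
     * until current0 is ready, so the dependency chain is one step longer. */
    U32 const current1_alt = current0 + 1;

    (void)current1; (void)current1_alt;
}
```

The measured ~1% regression suggests the loop is latency-bound enough that the shorter dependency chain wins over the saved subtraction.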

U32 const current1 = (U32)(ip1-base);
U32 const matchIndex0 = hashTable[h0];
U32 const matchIndex1 = hashTable[h1];
const BYTE* const repMatch = ip1-offset_1;
Review comment (Contributor):

So that's the difference.
An "equivalent" algorithm (with the current ZSTD_fast) would also check a second repMatch, at ip1 + 1 - offset_1.
That's more work, but also more parsing capability.
The extra work might kill the cpu advantage though.

An alternative could be to check at ip1 + 1 - offset_1 instead of ip1 - offset_1,
though I suspect it does not combine well with the current strategy of always blind-trusting the repeat code.

Another variant: check ip1 + 1 - offset_1 only, and try to expand backward in case of a match. Necessarily better than the previous proposal in ratio, but also a small speed hit.

Reply (Author):

I have an updated variant that fixes this.
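
For reference, a rough sketch of the last variant discussed (check the repeat offset at ip1 + 1 only, expanding backward on a hit); the helper name and signature are hypothetical, not the actual libzstd code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Native-endian 4-byte load; fine for a sketch. */
static uint32_t read32(const uint8_t *p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* Probe the repeat offset at ip1 + 1; on a hit, expand backward, which can
 * only make the match longer. Returns the match start, or NULL on a miss.
 * The caller must guarantee offset_1 is valid at this position and that
 * 4 bytes are readable at ip1 + 1. */
static const uint8_t *rep_check_plus_one(const uint8_t *ip1, const uint8_t *anchor,
                                         uint32_t offset_1)
{
    const uint8_t *probe = ip1 + 1;
    const uint8_t *match = probe - offset_1;
    if (read32(probe) != read32(match))
        return NULL;                         /* no repeat match at ip1 + 1 */
    while (probe > anchor && probe[-1] == match[-1]) {
        probe--; match--;                    /* expand backward while bytes match */
    }
    return probe;                            /* start of the recovered match */
}
```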

@Cyan4973 (Contributor) commented Apr 2, 2019

The speed gains could be explained just by the fact that there are fewer repcode checks (once every 2 positions, instead of once per position). So it's not completely clear whether this is entirely related to wider out-of-order (OoO) execution.

@terrelln force-pushed the 2fast branch 3 times, most recently from 4fb1ec5 to 3366b90, on April 3, 2019 01:57
@@ -51,10 +51,12 @@ size_t ZSTD_compressBlock_fast_generic(
     U32* const hashTable = ms->hashTable;
     U32 const hlog = cParams->hashLog;
     /* support stepSize of 0 */
-    U32 const stepSize = cParams->targetLength + !(cParams->targetLength);
+    size_t const stepSize = cParams->targetLength + !(cParams->targetLength) + 1;
Comment (Author):

I had this as 2 * (targetLength + !targetLength), but I think that +1 is a better choice.

This will make negative compression levels slightly stronger and slower, but the user can set any negative level, so that should be fine.
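
As a small illustration of the three step-size formulas under discussion (the mapping from a negative level -N to targetLength = N is an assumption here, for illustration only):

```c
#include <stddef.h>

/* The "+ !targetLength" term maps targetLength == 0 to a step of 1,
 * matching the "support stepSize of 0" comment in the diff above. */
static size_t stepOld(unsigned targetLength)    /* previous code    */
{ return (size_t)targetLength + !targetLength; }

static size_t stepNew(unsigned targetLength)    /* this PR: add 1   */
{ return (size_t)targetLength + !targetLength + 1; }

static size_t stepAlt(unsigned targetLength)    /* rejected: double */
{ return 2 * ((size_t)targetLength + !targetLength); }

/* e.g. with targetLength = 4 (roughly what level -4 would set, assumed):
 *   stepOld(4) == 4,  stepNew(4) == 5,  stepAlt(4) == 8
 * so the "+1" variant loses ratio more slowly per negative level. */
```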

@Cyan4973 (Contributor) commented Apr 3, 2019

The new version looks much better.

Results on the Qualcomm aarch64 Kryo 2, using the clang 7.0.1 compiler:

 1#dickens           :  10192446 ->   4268865 (2.388),  58.2 MB/s , 221.3 MB/s
 1#mozilla           :  51220480 ->  20117517 (2.546), 108.0 MB/s , 295.1 MB/s
 1#mr                :   9970564 ->   3840242 (2.596),  95.6 MB/s , 356.1 MB/s
 1#nci               :  33553445 ->   2849306 (11.78), 187.5 MB/s , 369.2 MB/s
 1#ooffice           :   6152192 ->   3590954 (1.713),  85.6 MB/s , 260.9 MB/s
 1#osdb              :  10085684 ->   3739042 (2.697), 101.1 MB/s , 338.7 MB/s
 1#reymont           :   6627202 ->   2152771 (3.078),  74.4 MB/s , 257.2 MB/s
 1#samba             :  21606400 ->   5510994 (3.921), 128.0 MB/s , 378.0 MB/s
 1#sao               :   7251944 ->   6256401 (1.159),  72.6 MB/s , 292.0 MB/s
 1#webster           :  41458703 ->  13692222 (3.028),  82.0 MB/s , 286.0 MB/s
 1#xml               :   5345280 ->    696652 (7.673), 150.6 MB/s , 381.1 MB/s
 1#x-ray             :   8474240 ->   6772557 (1.251), 144.2 MB/s , 321.4 MB/s
 1#silesia.tar       : 211948032 ->  73513096 (2.883), 107.7 MB/s , 325.1 MB/s

The situation on compression ratio is much improved.
There are no more noticeable losses (a few occasional losses remain, but they are very tiny), and some outliers actually achieve noticeable gains instead. The most impacted file is xml, which reverses course from a noticeable loss to a noticeable gain (7.603 -> 7.561 -> 7.673). The end result is now slightly positive (as opposed to slightly negative in the previous version).

The situation for compression speed is a bit more subtle. It's also more difficult to assess, due to the target cpu's volatility.
In many cases, speed gains are slightly reduced compared to the previous version; it remains a gain, just a smaller one. However, there are no more observed losses (x-ray notably), and occasional outliers improve their speed instead (like nci).
As a consequence, on average, the speed improvement seems slightly better than in the previous version. It's now closer to +5%.

So this version feels like an overall win.

@Cyan4973 (Contributor) commented Apr 3, 2019

Also:

- I'm fine with the new scale for negative compression levels. I just hope we don't have too many users depending on the current scale who would be surprised after an upgrade. I believe it's manageable; just be prepared to see some comments on this point.
- I confirmed speed gains on x64 similar to your measurements. Speed gains feel more noticeable on this platform, ending in the average range of ~+6-7%:
 1#dickens           :  +4%
 1#mozilla           :  +5%
 1#mr                :  +9%
 1#nci               :  +4%
 1#ooffice           : +10%
 1#osdb              : +10%
 1#reymont           :  +0%
 1#samba             :  +4%
 1#sao               : +16%
 1#webster           :  +4%
 1#x-ray             :  +2%
 1#xml               :  +4%
