Optimize zstd decompression by another x% #2689

Merged: 5 commits into facebook:dev on Jun 29, 2021

Conversation


@danlark1 danlark1 commented May 29, 2021

Hi, it's me again, this time with more significant numbers, graphs, and explanations.

TL;DR: We see a 3% improvement for silesia.tar with clang-12 on server processors, and around 6% on laptops. Other clang versions also benefit a lot (see below). GCC is 2-3% faster on laptops and 1-2% slower on servers. At Google we see a significant improvement on our internal data such as text logs and protobufs, ranging from 5% to 9%, which for us translates into thousands of cores. This also eliminates most of the performance gap between gcc and clang, and we are more than happy to share the improvements and the methodology behind them.

After the 1.5.0 update we saw significant improvements and wanted more, so we precompiled zstd with GCC and found that compression speed was more or less on par with the clang build. Decompression, however, differed by 4-7% fairly consistently, and we saw an opportunity there. Shipping precompiled assembly or switching to GCC is not sufficiently stable for us, so we set out to understand where the gap comes from. Even on silesia.tar you can see it: the gap at compression level 8 (our default) is around 5.6% on a laptop processor and around 4% on servers.

[charts: silesia.tar decompression speed gap between gcc and clang, laptop and server]

Idea 1. Mark more conditions as likely and unlikely

In ZSTD_decodeSequence (part of the most expensive decompression routine) there are 3 main if statements:

if (ofBits > 1) {
// ...
if (mlBits > 0) {
// ...
if (llBits > 0) {

The GCC annotated objdump for zstd_decompress_block is here; the clang one is here. Look for the ZSTD_decompressSequences_bmi2 function, starting at ZSTD_seqSymbol const ofDInfo = seqState->stateOffb.table.

Some key observations we found:

GCC already does a great job on the mlBits and llBits conditions, jumping far away when they are non-zero:

 if (mlBits > 0)
    21ca: 45 84 db              test   %r11b,%r11b
JUMP FAR AWAY >>>>>>>>>>>    21cd: 0f 85 4d 02 00 00     jne    2420 <ZSTD_decompressSequences_bmi2.constprop.0+0x5a0>
    if (MEM_64bits() && UNLIKELY(totalBits >= STREAM_ACCUMULATOR_MIN_64-(LLFSELog+MLFSELog+OffFSELog)))
    21d3: 41 80 fd 1e           cmp    $0x1e,%r13b
    21d7: 0f 87 6f 02 00 00     ja     244c <ZSTD_decompressSequences_bmi2.constprop.0+0x5cc>
    21dd: 48 8b 8c 24 80 00 00  mov    0x80(%rsp),%rcx
    21e4: 00 
    if (llBits > 0)
    21e5: 45 84 d2              test   %r10b,%r10b
JUMP FAR AWAY >>>>>>>>>>    21e8: 0f 85 da 02 00 00     jne    24c8 <ZSTD_decompressSequences_bmi2.constprop.0+0x648>
    U32 const nbBits = DInfo.nbBits;
    21ee: 45 0f b6 50 03        movzbl 0x3(%r8),%r10d
    return (bitContainer >> (start & regMask)) & BIT_mask[nbBits];
    21f3: 4c 8d 1d 00 00 00 00  lea    0x0(%rip),%r11        # 21fa <ZSTD_decompressSequences_bmi2.constprop.0+0x37a>
    DStatePtr->state = DInfo.nextState + lowBits;
    21fa: 45 0f b7 00           movzwl (%r8),%r8d
    size_t const sequenceLength = sequence.litLength + sequence.matchLength;
    21fe: 4e 8d 34 22           lea    (%rdx,%r12,1),%r14
    BYTE* const oMatchEnd = op + sequenceLength;   /* risk : address space overflow (32-bits) */
    2202: 4e 8d 2c 33           lea    (%rbx,%r14,1),%r13
    return BIT_getMiddleBits(bitD->bitContainer, (sizeof(bitD->bitContainer)*8) - bitD->bitsConsumed - nbBits, nbBits);
    2206: 44 01 d0              add    %r10d,%eax
    return (bitContainer >> (start & regMask)) & BIT_mask[nbBits];

Separately, when ofBits > 1, GCC figures out for some reason that this condition holds more often and jumps on the negation of the condition:

        if (ofBits > 1) {
  217d: 80 f9 01              cmp    $0x1,%cl
JUMP VERY FAR AWAY >>>>>>    2180: 0f 86 62 02 00 00     jbe    23e8 <ZSTD_decompressSequences_bmi2.constprop.0+0x568>
  2186: 41 89 cf              mov    %ecx,%r15d
  2189: c4 62 f9 f7 b4 24 80  shlx   %rax,0x80(%rsp),%r14
  2190: 00 00 00 
  bitD->bitsConsumed += nbBits;
  2193: 01 c8                 add    %ecx,%eax
          seqState->prevOffset[2] = seqState->prevOffset[1];
  2195: 48 8b 8c 24 e0 00 00  mov    0xe0(%rsp),%rcx
  219c: 00 
  return (bitD->bitContainer << (bitD->bitsConsumed & regMask)) >> (((regMask+1)-nbBits) & regMask);
  219d: 41 f7 df              neg    %r15d
  bitD->bitsConsumed += nbBits;
  21a0: 89 84 24 88 00 00 00  mov    %eax,0x88(%rsp)

GCC, whether by accident or by design, correctly decides which paths to keep close together, and it turns out this is worth a few percent.

Maybe GCC analyzed the tables: LL_bits and ML_bits contain lots of zeros, while OF_bits is more likely to hold values greater than 1. However, I doubt GCC is that smart, because these tables are located far away from the actual code.

static UNUSED_ATTR const U32 LL_bits[MaxLL+1] = {
     0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 2, 2, 3, 3,
     4, 6, 7, 8, 9,10,11,12,
    13,14,15,16
};

static UNUSED_ATTR const U32 ML_bits[MaxML+1] = {
     0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0,
     1, 1, 1, 1, 2, 2, 3, 3,
     4, 4, 5, 7, 8, 9,10,11,
    12,13,14,15,16
};

static UNUSED_ATTR const U32 OF_bits[MaxOff+1] = {
                     0,  1,  2,  3,  4,  5,  6,  7,
                     8,  9, 10, 11, 12, 13, 14, 15,
                    16, 17, 18, 19, 20, 21, 22, 23,
                    24, 25, 26, 27, 28, 29, 30, 31 };

We gathered some profiles, and clang's decisions about block layout did not line up with GCC's, which can waste time loading instructions.

if (ofBits > 1) {
...
0.66         add      %esi,%eax
0.55         add      %r10,%r8
0.66         movdqu   -0x50(%rbp),%xmm0
1.05         movdqu   %xmm0,-0x48(%rbp)
0.28         mov      %r8,-0x50(%rbp)
1.16         mov      -0x38(%rbp),%r13
0.55         jmp      52c
             xchg     %ax,%ax
0.06  510:   test     %esi,%esi
0.06         mov      -0x38(%rbp),%r13
0.22         jne      8e3
VERY RARE BRANCH IN THE BENCHMARKS >>>>>   else {
             mov      %ebx,%r11d
             test     %rdx,%rdx
             U32 const ll0 = (llBase == 0);
             if (LIKELY((ofBits == 0))) {
                 if (LIKELY(!ll0))
                     offset = seqState->prevOffset[0];
             je       92c
             mov      -0x50(%rbp),%r8
1.05  52c:   mov      %r14d,%r10d
0.83         mov      -0x30(%rbp),%ebx
0.83         add      %sil,%r9b
0.33         test     %ecx,%ecx

For llBits and mlBits the gaps look a bit scary and avoidable.

1.05         movdqu   %xmm0,-0x48(%rbp)
0.28         mov      %r8,-0x50(%rbp)
1.16         mov      -0x38(%rbp),%r13
0.55         jmp      52c
             xchg     %ax,%ax
if (mlBits > 0)
0.06  510:   test     %esi,%esi
0.06         mov      -0x38(%rbp),%r13
0.22         jne      8e3
             mov      %ebx,%r11d
             test     %rdx,%rdx
             je       92c
             mov      -0x50(%rbp),%r8
1.05  52c:   mov      %r14d,%r10d
0.83         mov      -0x30(%rbp),%ebx
0.83         add      %sil,%r9b
if (llBits > 0)
0.33         test     %ecx,%ecx
             je       551
      539:   shlx     %rax,-0xa8(%rbp),%rsi
0.06         mov      %ecx,%edi
             neg      %dil
             shrx     %rdi,%rsi,%rsi
             add      %ecx,%eax
             add      %rsi,%r12
0.83  551:   cmp      $0x1f,%r9b
             jae      7e7
1.05  55b:   mov      -0xb0(%rbp),%r9
0.88         mov      -0xa8(%rbp),%rcx
0.50         test     %r15d,%r15d
             je       584
0.17         shlx     %rax,%rcx,%rsi
0.06         mov      %r15d,%edi
0.11         neg      %dil
0.11         shrx     %rdi,%rsi,%rsi
0.17         add      %r15d,%eax
             add      %rsi,%rdx

Of course, we gathered data on how often these conditions hit; see the dump. Some terminology: OF0 counts ofBits == 0, OF1 counts ofBits == 1, OFMore1 counts ofBits > 1, and Ratio OF is OFMore1/(OF0 + OF1). LL0 counts llBits == 0, LL1 counts llBits > 0, and Ratio LL is LL0/LL1; same for ML. LLD corresponds to another branch in the code:

if (LIKELY(!ll0)) {

When that branch hits, LLD1 is incremented, otherwise LLD0; the ratio is LLD1/LLD0.
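
Purely as an illustration of how such counters could be gathered (the count_sequence/dump_ratios helpers below are hypothetical and not part of the patch; the +1 terms only guard against division by zero in this sketch):

#include <stdio.h>

/* Hypothetical counters mirroring the terminology above. */
static unsigned long long OF0, OF1, OFMore1, LL0, LL1, ML0, ML1, LLD0, LLD1;

/* not_ll0: whether the LIKELY(!ll0) branch was taken for this sequence. */
static void count_sequence(unsigned ofBits, unsigned mlBits, unsigned llBits, int not_ll0)
{
    if (ofBits == 0)      OF0++;
    else if (ofBits == 1) OF1++;
    else                  OFMore1++;
    if (llBits > 0) LL1++; else LL0++;
    if (mlBits > 0) ML1++; else ML0++;
    if (not_ll0) LLD1++; else LLD0++;
}

static void dump_ratios(const char* dataset, int level)
{
    printf("%s %d  OF %.1f  LL %.1f  ML %.1f  LLD %.1f\n", dataset, level,
           (double)OFMore1 / (double)(OF0 + OF1 + 1),
           (double)LL0 / (double)(LL1 + 1),
           (double)ML0 / (double)(ML1 + 1),
           (double)LLD1 / (double)(LLD0 + 1));
}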

So, all ratios correspond to the GCC decisions (likeliness for ofBits > 1, unlikeliness for {ml,ll}Bits > 0).

The dump lists the dataset name first, then the compression level, then the ratios. We checked silesia and other datasets, for example the ones snappy uses. As you can see, ratios below 2 are very rare for all data. 32 of the 41 values below 1 correspond to LLD, which is already marked as LIKELY; OF drops below 1 only for gaviota, and values below 2 are also quite rare. Overall, the average ratio is around 205: 297 for LL, 137 for ML, 90 for OF, 294 for LLD. The likeliness also does not seem to depend on the compression level. So it is very reasonable to attach likeliness hints to specific branches. It is good to know that OF1 was always zero; that case does not seem to be produced by the current compressor, but fuzzing does hit that line.

With that change, clang improved by 2% on silesia and we already saw gains. Unfortunately, GCC decided to move blocks around and performance degraded, so we apply the likeliness hints only for clang.
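
As a rough, self-contained sketch of the idea (zstd has its own LIKELY/UNLIKELY macros; the clang-only HINT_* guard and the toy decode_one function below are illustrative, not the actual patch):

#include <stddef.h>

/* Branch hints in the spirit of zstd's LIKELY/UNLIKELY, gated to clang here
 * because GCC reshuffled blocks and regressed when given the same hints. */
#if defined(__clang__)
#  define HINT_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HINT_LIKELY(x)   (x)
#  define HINT_UNLIKELY(x) (x)
#endif

/* Toy stand-in for the three ZSTD_decodeSequence branches, hinted the way the
 * measured ratios suggest: ofBits > 1 is common, extra ml/ll bits are rare. */
static size_t decode_one(unsigned ofBits, unsigned mlBits, unsigned llBits)
{
    size_t work = 0;
    if (HINT_LIKELY(ofBits > 1))   work += ofBits;   /* hot path */
    if (HINT_UNLIKELY(mlBits > 0)) work += mlBits;   /* rarely taken */
    if (HINT_UNLIKELY(llBits > 0)) work += llBits;   /* rarely taken */
    return work;
}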

Idea 2. BMI2 codegen

Since BMI2 instructions keep evolving and zstd has dynamic dispatch for them, we noticed that the BMI2 path accesses memory a little more often than it needs to: the static BIT_mask table, which stores (1 << i) - 1 for every i from 0 to 31.

/*=====    Local Constants   =====*/
static const unsigned BIT_mask[] = {
    0,          1,         3,         7,         0xF,       0x1F,
    0x3F,       0x7F,      0xFF,      0x1FF,     0x3FF,     0x7FF,
    0xFFF,      0x1FFF,    0x3FFF,    0x7FFF,    0xFFFF,    0x1FFFF,
    0x3FFFF,    0x7FFFF,   0xFFFFF,   0x1FFFFF,  0x3FFFFF,  0x7FFFFF,
    0xFFFFFF,   0x1FFFFFF, 0x3FFFFFF, 0x7FFFFFF, 0xFFFFFFF, 0x1FFFFFFF,
    0x3FFFFFFF, 0x7FFFFFFF}; /* up to 31 bits */
#define BIT_MASK_SIZE (sizeof(BIT_mask) / sizeof(BIT_mask[0]))

And the generated code was something like:

  2206: 44 01 d0              add    %r10d,%eax
    return (bitContainer >> (start & regMask)) & BIT_mask[nbBits];
    2209: 41 89 c1              mov    %eax,%r9d
    220c: 41 f7 d9              neg    %r9d
    220f: c4 62 b3 f7 c9        shrx   %r9,%rcx,%r9
    2214: 47 23 0c 93           and    (%r11,%r10,4),%r9d
 222b: 44 01 c8              add    %r9d,%eax
    return (bitContainer >> (start & regMask)) & BIT_mask[nbBits];
    222e: 41 89 c0              mov    %eax,%r8d
    2231: 41 f7 d8              neg    %r8d
    2234: c4 62 bb f7 c1        shrx   %r8,%rcx,%r8
    2239: 47 23 04 8b           and    (%r11,%r9,4),%r8d
    223d: 4c 01 c7              add    %r8,%rdi
    2240: 48 89 bc 24 c8 00 00  mov    %rdi,0xc8(%rsp)

The and, add, mov sequence looked suspicious to me, as did using the r11 register for the static memory. I remembered the bzhi instruction (BZHI — Zero High Bits Starting with Specified Bit Position), which simply drops the high bits. The gut feeling was right: the codegen is much better with bmi2, see https://gcc.godbolt.org/z/cac5nr5hs

Clang produces optimal code

        movl    8(%rdi), %eax
        addl    %esi, %eax
        negb    %al
        shrxq   %rax, (%rdi), %rax
        bzhiq   %rsi, %rax, %rax
        retq

llvm-mca clearly shows it is better than the mask approach:

Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      5     0.50    *                   movl	8(%rdi), %eax
 1      1     0.25                        addl	%esi, %eax
 1      1     0.25                        negb	%al
 2      6     0.50    *                   shrxq	%rax, (%rdi), %rax # likely cached
 1      1     0.50                        bzhiq	%rsi, %rax, %rax
 3      7     1.00                  U     retq
 
 1      1     0.25                        movl	%esi, %ecx
 2      6     0.50    *                   addl	8(%rdi), %esi
 1      1     0.25                        negb	%sil
 2      6     0.50    *                   shrxq	%rsi, (%rdi), %rax # likely cached
 2      6     0.50    *                   andl	_ZL8BIT_mask(,%rcx,4), %eax #likely cached
 3      7     1.00                  U     retq

Even if we assume all memory accesses hit L1, the bzhi version is still better in instruction count and worst-case latency.

We moved away from the mask approach, and the _bmi2 functions improved their codegen. When bmi2 is unavailable, which can only happen on old, pre-Haswell hardware, we consider that hardware not important enough for performance to justify the complexity of dispatching BIT_getMiddleBits to a BIT_getMiddleBits_bmi2 variant and so on. I don't recall equivalent instructions on ARM and other architectures, which is why I put the ifdef on x86_64 platforms only.
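
Roughly, the change swaps the table lookup for an arithmetic mask that the compiler can lower to bzhi when BMI2 is available; a sketch along these lines, not necessarily the exact committed code:

#include <stddef.h>
#include <stdint.h>

/* Sketch: compute the mask arithmetically instead of loading BIT_mask[nbBits]
 * from memory. On x86-64 with BMI2, "& ((1 << nbBits) - 1)" lowers to a single
 * bzhi; on other targets the table lookup is kept. nbBits is at most 31 here. */
static size_t getMiddleBits(size_t bitContainer, uint32_t start, uint32_t nbBits)
{
    uint32_t const regMask = sizeof(bitContainer) * 8 - 1;
#if defined(__x86_64__) || defined(_M_X64)
    return (bitContainer >> (start & regMask)) & (((uint64_t)1 << nbBits) - 1);
#else
    static const uint32_t mask[32] = {
        0, 1, 3, 7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF,
        0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF, 0x1FFFF, 0x3FFFF, 0x7FFFF,
        0xFFFFF, 0x1FFFFF, 0x3FFFFF, 0x7FFFFF, 0xFFFFFF, 0x1FFFFFF,
        0x3FFFFFF, 0x7FFFFFF, 0xFFFFFFF, 0x1FFFFFFF, 0x3FFFFFFF, 0x7FFFFFFF };
    return (bitContainer >> (start & regMask)) & mask[nbBits];
#endif
}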

This gained another 0.3-0.4%; it is hard to estimate precisely, but it certainly does not make things worse.

Idea 3. Preamble

After the changes above, the stack memory used for defining variables in the preamble started living a life of its own in GCC, and a little bit in clang as well.

ZSTD_seqSymbol const llDInfo = seqState->stateLL.table[seqState->stateLL.state];
    5ec2: 41 0f b7 44 f5 00     movzwl 0x0(%r13,%rsi,8),%eax // next state
    5ec8: 48 89 84 24 98 00 00  mov    %rax,0x98(%rsp)
    5ecf: 00
    5ed0: 41 0f b6 44 f5 02     movzbl 0x2(%r13,%rsi,8),%eax // nbAdditionalBits
    5ed6: 45 0f b6 44 f5 03     movzbl 0x3(%r13,%rsi,8),%r8d // nbBits

Storing nextState in eax and then moving it back onto the stack is not the best policy. After the changes, even though some other variables were still pushed to the stack, the code became slightly faster, and it was possible to remove the gcc/clang ifdef around WithDInfo. I bet alignment played a role, or the stack address gets into the cache a little earlier. Either way, changing where variables are placed is a good idea, because compilers already unpack all the structures at their definitions and use them for updating the FSE state. After that, performance stabilized for almost all compilers.

I also moved

U32 const llBase = llDInfo.baseValue;
U32 const mlBase = mlDInfo.baseValue;

to

seq.matchLength = mlDInfo->baseValue;
seq.litLength = llDInfo->baseValue;

It saved a couple of instructions on stack spills, and registers were reused after decodeSequence in execSequence, which is a good thing to have.
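
In simplified, self-contained form the shape of that move is roughly the following (struct layout and names are abbreviated stand-ins, not the real ZSTD_seqSymbol and seq_t definitions):

#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t nextState; uint8_t nbAdditionalBits; uint8_t nbBits; uint32_t baseValue; } symInfo;
typedef struct { size_t litLength, matchLength, offset; } seq_t;

/* Before (illustrative): copy baseValue into named locals first. */
static void fill_seq_before(seq_t* seq, const symInfo* llDInfo, const symInfo* mlDInfo)
{
    uint32_t const llBase = llDInfo->baseValue;
    uint32_t const mlBase = mlDInfo->baseValue;
    seq->litLength   = llBase;
    seq->matchLength = mlBase;
}

/* After (illustrative): write straight from the table entries, letting the
 * compiler avoid a stack spill and reuse the registers later in execSequence. */
static void fill_seq_after(seq_t* seq, const symInfo* llDInfo, const symInfo* mlDInfo)
{
    seq->matchLength = mlDInfo->baseValue;
    seq->litLength   = llDInfo->baseValue;
}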

Idea 4. Previous patch

I moved

if (LIKELY(!ll0))
    offset = seqState->prevOffset[0];
else {
    offset = seqState->prevOffset[1];

to

offset = seqState->prevOffset[ll0];
if (UNLIKELY(ll0)) {

Because gcc9 had bad performance on lazy2-compressed files. It also looked like a slight performance win for clang, and the compiler reasoned a bit better about fusing blocks:

    6056:	48 8b 44 24 78       	mov    0x78(%rsp),%rax
    605b:	48 89 84 24 80 00 00 	mov    %rax,0x80(%rsp)
    6062:	00
    6063:	4c 89 64 24 78       	mov    %r12,0x78(%rsp)
    6068:	44 00 c3             	add    %r8b,%bl
    
    ...
        62e0:	0f 94 c0             	sete   %al
                offset = seqState->prevOffset[ll0];
    62e3:	4c 8b 64 c4 78       	mov    0x78(%rsp,%rax,8),%r12
    62e8:	0f 85 7a fd ff ff    	jne    6068 <ZSTD_decompressSequences_bmi2+0x508>
    62ee:	e9 63 fd ff ff       	jmpq   6056 <ZSTD_decompressSequences_bmi2+0x4f6> // jump to 6056
    ...
                        if (offset != 1) seqState->prevOffset[2] = seqState->prevOffset[1];
    664f:	48 83 f8 01          	cmp    $0x1,%rax
    6653:	0f 85 ed f9 ff ff    	jne    6046 <ZSTD_decompressSequences_bmi2+0x4e6>
    6659:	e9 f8 f9 ff ff       	jmpq   6056 <ZSTD_decompressSequences_bmi2+0x4f6> // jump to 6056
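
For reference, a self-contained sketch of the two shapes of this repeat-offset update, pulled out of ZSTD_decodeSequence and simplified (LIKELY/UNLIKELY stand in for zstd's macros):

#include <stddef.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Old shape: branch first, then read from the repeat-offset history. */
static size_t repeat_offset_old(size_t prevOffset[3], unsigned ll0)
{
    size_t offset;
    if (LIKELY(!ll0)) {
        offset = prevOffset[0];
    } else {
        offset = prevOffset[1];
        prevOffset[1] = prevOffset[0];
        prevOffset[0] = offset;
    }
    return offset;
}

/* New shape: read unconditionally by index, branch only for the rare swap. */
static size_t repeat_offset_new(size_t prevOffset[3], unsigned ll0)
{
    size_t const offset = prevOffset[ll0];
    if (UNLIKELY(ll0)) {
        prevOffset[1] = prevOffset[0];
        prevOffset[0] = offset;
    }
    return offset;
}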


Results

I was manually testing silesia.tar with all compression levels from 0 to 22.

#!/bin/bash

echo start >results

for compiler in gcc_8 gcc_9 gcc_10 clang_9 clang_10 clang_11 clang_12
do
 echo $compiler >../out
 cd build_optimized_$compiler && make >/dev/null -j 7
 sudo perf stat -r 50 "$HOME/zstd_clean/build_clean_$compiler/programs/zstd" -t $1 2>>../out
 sudo perf stat -r 50 "$HOME/zstd/build/cmake/build_optimized_$compiler/programs/zstd" -t $1 2>>../out
 cd ..
 grep -E "optimized|clean|task|cycles" out && rm out
done

And then these results were aggregated. Perf data for laptop, perf data for server.

Benchmark charts for:

clang 12: laptop, production server
gcc 10: laptop, production server (note the server became slightly worse)
clang 11: laptop, server
clang 10: laptop, server
clang 9: laptop, server
gcc 8: laptop, server
gcc 9: laptop, server
GCC vs Clang without the patch: laptop, xeon
GCC vs Clang with the patch: laptop, xeon

Snappy benchmark (level 8, window log 20, our defaults at Google):

name                           old cpu/op  new cpu/op  delta
BM_DecompressSink/0 [html   ]    56.7µs ± 6%  52.1µs ± 5%  -8.12%  (p=0.000 n=116+119)
BM_DecompressSink/1 [urls   ]     699µs ± 5%   680µs ± 7%  -2.60%    (p=0.000 n=88+81)
BM_DecompressSink/2 [jpg    ]    4.48µs ± 5%  4.51µs ± 6%  +0.62%  (p=0.034 n=116+118)
BM_DecompressSink/4 [pdf    ]    14.2µs ± 6%  13.9µs ± 6%  -2.09%  (p=0.000 n=107+110)
BM_DecompressSink/5 [html4  ]    83.9µs ± 5%  80.2µs ± 9%  -4.42%  (p=0.000 n=117+105)
BM_DecompressSink/6 [txt1   ]     213µs ± 4%   205µs ± 5%  -3.86%    (p=0.000 n=87+81)
BM_DecompressSink/7 [txt2   ]     185µs ± 5%   177µs ± 5%  -4.57%   (p=0.000 n=100+99)
BM_DecompressSink/8 [txt3   ]     511µs ± 4%   494µs ± 7%  -3.44%    (p=0.000 n=87+78)
BM_DecompressSink/9 [txt4   ]     682µs ± 4%   664µs ± 7%  -2.68%    (p=0.000 n=85+79)
BM_DecompressSink/10 [pb     ]    51.2µs ± 4%  48.4µs ± 5%  -5.49%  (p=0.000 n=107+116)
BM_DecompressSink/11 [gaviota]     244µs ± 4%   232µs ± 5%  -4.77%    (p=0.000 n=79+83)


name                           old speed               new speed               delta
BM_DecompressSink/0 [html   ]  1.81GB/s ± 5%           1.97GB/s ± 5%  +8.85%      (p=0.000 n=116+119)
BM_DecompressSink/1 [urls   ]  1.01GB/s ± 4%           1.03GB/s ± 7%  +2.68%        (p=0.000 n=88+81)
BM_DecompressSink/2 [jpg    ]  27.4GB/s ± 5%           27.3GB/s ± 5%  -0.55%      (p=0.046 n=117+118)
BM_DecompressSink/4 [pdf    ]  7.22GB/s ± 6%           7.37GB/s ± 5%  +2.14%      (p=0.000 n=107+110)
BM_DecompressSink/5 [html4  ]  4.88GB/s ± 5%           5.11GB/s ± 8%  +4.66%      (p=0.000 n=117+105)
BM_DecompressSink/6 [txt1   ]   713MB/s ± 4%            742MB/s ± 5%  +4.02%        (p=0.000 n=87+81)
BM_DecompressSink/7 [txt2   ]   676MB/s ± 5%            708MB/s ± 4%  +4.79%       (p=0.000 n=100+99)
BM_DecompressSink/8 [txt3   ]   835MB/s ± 4%            864MB/s ± 7%  +3.50%        (p=0.000 n=87+79)
BM_DecompressSink/9 [txt4   ]   707MB/s ± 4%            726MB/s ± 9%  +2.68%        (p=0.000 n=85+80)
BM_DecompressSink/10 [pb     ]  2.32GB/s ± 4%           2.45GB/s ± 5%  +5.83%      (p=0.000 n=107+116)
BM_DecompressSink/11 [gaviota]   757MB/s ± 4%            795MB/s ± 5%  +5.02%        (p=0.000 n=79+83)

And a couple of internal workloads that we have been running for quite some time showed a 5-6% improvement.

@terrelln (Contributor) left a comment

Awesome PR & analysis!

This is looking good to me! I've done some initial measurements and they look good. I will spend a bit more time measuring, then we should be good to merge this, with the change to make it C89 compliant.

Note that on my machine (i9-9900K) gcc-11 has some alignment issues that cause a ~10% regression. But, changing the alignment to .p2align 4 fixes that. This is a known issue, see:

/* Align the decompression loop to 32 + 16 bytes.
*
* zstd compiled with gcc-9 on an Intel i9-9900k shows 10% decompression
* speed swings based on the alignment of the decompression loop. This
* performance swing is caused by parts of the decompression loop falling
* out of the DSB. The entire decompression loop should fit in the DSB,
* when it can't we get much worse performance. You can measure if you've
* hit the good case or the bad case with this perf command for some
* compressed file test.zst:
*
* perf stat -e cycles -e instructions -e idq.all_dsb_cycles_any_uops \
* -e idq.all_mite_cycles_any_uops -- ./zstd -tq test.zst
*
* If you see most cycles served out of the MITE you've hit the bad case.
* If you see most cycles served out of the DSB you've hit the good case.
* If it is pretty even then you may be in an okay case.
*
* This issue has been reproduced on the following CPUs:
* - Kabylake: Macbook Pro (15-inch, 2019) 2.4 GHz Intel Core i9
* Use Instruments->Counters to get DSB/MITE cycles.
* I never got performance swings, but I was able to
* go from the good case of mostly DSB to half of the
* cycles served from MITE.
* - Coffeelake: Intel i9-9900k
* - Coffeelake: Intel i7-9700k
*
* I haven't been able to reproduce the instability or DSB misses on any
* of the following CPUS:
* - Haswell
* - Broadwell: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GH
* - Skylake
*
* If you are seeing performance stability this script can help test.
* It tests on 4 commits in zstd where I saw performance change.
*
* https://gist.github.com/terrelln/9889fc06a423fd5ca6e99351564473f4
*/
__asm__(".p2align 6");
__asm__("nop");
__asm__(".p2align 5");
__asm__("nop");
# if __GNUC__ >= 9
/* better for gcc-9 and gcc-10, worse for clang and gcc-8 */
__asm__(".p2align 3");
# else
__asm__(".p2align 4");
# endif
#endif

(resolved review comment on lib/decompress/zstd_decompress_block.c)
@danlark1 (Contributor Author) commented Jun 2, 2021

Note that on my machine (i9-9900K) gcc-11 has some alignment issues that cause a ~10% regression. But, changing the alignment to .p2align 4 fixes that. This is a known issue (see the alignment comment quoted above).

Thanks, let me try out alignment for other gcc versions

@terrelln (Contributor) commented Jun 2, 2021

Thanks, let me try out alignment for other gcc versions

Not all Intel CPUs have this problem; it seems to affect only the newer consumer CPUs. I haven't seen it on Skylake server CPUs, but it may be present in some of the newer ones. You can use the perf commands referenced in the comments to see if you're running into the DSB issues.

@danlark1 (Contributor Author) commented Jun 8, 2021

Not all Intel CPUs have this problem, it seems to only be the newer consumer CPUs. I haven't seen the same on Skylake server CPUs, but it may be present in some of the newer ones. You can use the perf commands referenced in the comments to see if you're running into the DSB issues.

After several rounds of evaluation, I agree gcc11 is worse; however, gcc9 and gcc10 are better with .p2align 3. I adjusted the code accordingly.

@terrelln (Contributor) commented Jun 8, 2021

After several rounds of evaluation, I agree gcc11 is worse; however, gcc9 and gcc10 are better with .p2align 3. I adjusted the code accordingly.

Thanks for doing that! The alignment issues are super annoying, and make it very hard to get consistent measurements, since any small change to the decoder can cause a 10% drop in performance. The .p2align directives help stabilize it a bit, so at least changes outside the decoding loop generally don't affect its performance. But, it doesn't help as much when changing the core decoder logic.

I'll dedicate some time this week to measure performance across the various machines I have, then we can get this merged.

@ghost commented Jun 11, 2021

Just FYI, Intel released an Optimization Reference Manual yesterday:

https://www.phoronix.com/scan.php?page=news_item&px=Intel-ORM-Code-Samples

@danlark1 (Contributor Author)

Hi, @terrelln, were you able to take a look?

@terrelln (Contributor)

Yeah, I ran some benchmarks on Friday & today.

I still see a consistent gcc-11 regression of ~4% on my i9-9900k (down from ~10%). But, I have confirmed the clang speedup.

It looks like gcc-11 is hitting the bad case of falling out of the DSB both before & after this patch. Comparing v148 and v150 on gcc-11 there is a 5% regression, and I can't mess with the .p2align directives to fix it. I believe the change in PR #2625 is the culprit. But I'm trying to figure out whether this performance loss is real, whether it affects just my i9-9900k or other machines too, and whether we can avoid a gcc performance loss. So far, I haven't been able to re-write the loop to regain the speed of v148 without reverting PR #2625.

I still have to measure on two more machines, both of which don't have alignment issues. I want to accept this PR because the changes make sense. And, performance parity with clang would be great. But, the fact that it isn't a pure win, and that there are already gcc regressions makes it tricky.

I'm collecting the results of my benchmarks, and I will share them here when I'm done. In the meantime I'm going to continue trying to fiddle with the code to make gcc happy.

@danlark1 (Contributor Author) commented Jun 15, 2021

I confirm that only gcc 11 is affected, by 2-4% for compression levels 2-11; all other gcc versions keep the same performance, as noted in the description.

[charts: gcc 11 laptop, gcc 11 server]

@danlark1 (Contributor Author)

More graphs!

[chart: gcc 10 vs gcc 11 comparison on the server]

gcc10 vs gcc11 comparison: it seems gcc11 found new ways to optimize the code even further. With this patch, gcc11's codegen is on par with the old gcc10 codegen.

@danlark1 (Contributor Author)

I was also not able to remove the 3-4% performance difference for gcc-11.

I have several options:

  1. Leave it as it is; as I showed, the performance is the same as gcc-10, and we can try to optimize the regressions from [lib] Fix UBSAN warning in ZSTD_decompressSequences() #2625 separately
  2. Move the old code under a __GNUC__ == 11 macro
  3. Investigate more; however, a first analysis showed it is much harder to reason about GCC's behavior, and I believe the problem is somewhere in stack allocation

The patch is extremely valuable for other configurations and for us specifically; it would be unfortunate to abandon it.

@danlark1 (Contributor Author)

Hi, @terrelln

I would like to know how we should proceed with that PR. Were you able to test on other hardware?

@terrelln (Contributor)

Sorry for the radio silence. I've been a bit busy, and kept kicking this task down the road because it's a bit of a puzzler. But I think I may have found a solution to our gcc-11 problems:

diff --git a/lib/decompress/zstd_decompress_block.c b/lib/decompress/zstd_decompress_block.c
index e5391d66..9acf2f74 100644
--- a/lib/decompress/zstd_decompress_block.c
+++ b/lib/decompress/zstd_decompress_block.c
@@ -977,10 +977,8 @@ ZSTD_decodeSequence(seqState_t* seqState, const ZSTD_longOffset_e longOffsets)
                 U32 const ll0 = (llDInfo->baseValue == 0);
                 if (LIKELY((ofBits == 0))) {
                     offset = seqState->prevOffset[ll0];
-                    if (UNLIKELY(ll0)) {
-                        seqState->prevOffset[1] = seqState->prevOffset[0];
-                        seqState->prevOffset[0] = offset;
-                    }
+                    seqState->prevOffset[1] = seqState->prevOffset[!ll0];
+                    seqState->prevOffset[0] = offset;
                 } else {
                     offset = ofBase + ll0 + BIT_readBitsFast(&seqState->DStream, 1);
                     {   size_t temp = (offset==3) ? seqState->prevOffset[0] - 1 : seqState->prevOffset[offset];

I've found that gcc-11 and clang-12 both do well with this patch. Please try it out and let me know if it works well in your tests.

If not, I think we go ahead and merge this PR as-is. Our current internal gcc-9 and clang-9 are both neutral with this PR.

@danlark1 (Contributor Author) commented Jun 29, 2021

I tried the patch; it does slightly better for clang and a little better for gcc-11, yet I can still reproduce some regression for server gcc-11. Now it is 2% instead of 4%.

Anyway, I am good with this. Thanks a lot, I appreciate the effort.

I do remember trying something similar, but instead of seqState->prevOffset[1] = seqState->prevOffset[!ll0]; I probably tried seqState->prevOffset[ll0] = seqState->prevOffset[0]; and it was worse.

@terrelln (Contributor)

I tried the patch; it does slightly better for clang and a little better for gcc-11, yet I can still reproduce some regression for server gcc-11. Now it is 2% instead of 4%.

Glad to hear that you've found a slight improvement as well. I think we should merge this PR with the small gcc-11 regression, since it brings large benefits for clang. And hopefully we'll find a way to improve gcc down the line.

I tried seqState->prevOffset[ll0] = seqState->prevOffset[0]; and it was worse

Yeah, that is what I tried first and found it to be worse as well. I didn't dig into why one works and the other doesn't, but my guess is that something (either the CPU or the compiler) doesn't like the potentially aliased stores, or it is just noise, and there is no good reason why it is worse.

@terrelln terrelln merged commit 094b260 into facebook:dev Jun 29, 2021
@ghost commented Jul 7, 2021

The decompression speed at negative compression levels is reduced a bit.
AMD 3600X, MSVC 2019.

I only tested one particular dataset, so the result may not be universal.
Please also test the negative compression levels in the future.

The X axis is the compression level, the Y axis is the speed (MiB/s).

[chart: decompression speed (MiB/s) vs compression level]

@terrelln (Contributor) commented Jul 7, 2021

Hi @animalize, thanks for the benchmark!

We'd accept patches that improve Visual Studio's performance without hurting clang/gcc performance or code legibility. However, VS is not a primary target, and we do not actively measure or optimize VS performance.

We already have more than enough trouble trying to keep clang/gcc performance high. Adding another compiler to that list makes the problem much harder. If anything, we'd prefer to drop a "supported performance" compiler rather than add one.

However, we do fully support VS for correctness.
