Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lazy] Speed up compilation times #2828

Merged
merged 1 commit into from
Oct 25, 2021
Merged

Conversation

terrelln
Copy link
Contributor

Speed up compilation times by moving each specialized search function
into its own function. This is faster because compilers can handle many
smaller functions much faster than one gigantic function. The previous
approach generated one giant function with switch statements and
inlining to select the implementation.

Compiler Flags Dev Time (s) PR Time (s) Delta
gcc -O3 16.5 5.6 -66%
gcc -O3 -g -fsanitize=address,undefined 158.9 38.2 -75%
clang -O3 36.5 5.5 -85%
clang -O3 -g -fsanitize=address,undefined 27.8 17.5 -37%

This also reduces the binary size because the search functions are no
longer inlined into the main body.

Compiler Dev libzstd.a Size (B) PR libzstd.a Size (B) Delta
gcc 1563868 1308844 -16%
clang 1924372 1376020 -28%

Finally, the performance is not impacted significantly by this change,
in fact we generally see a small speed boost.

Compiler Level Dev Speed (MB/s) PR Speed (MB/s) Delta
gcc 5 110.6 110.0 -0.5%
gcc 7 70.4 72.2 +2.5%
gcc 9 53.2 53.5 +0.5%
gcc 13 12.7 12.9 +1.5%
clang 5 113.9 110.4 -3.0%
clang 7 67.7 70.6 +4.2%
clang 9 51.9 52.2 +0.5%
clang 13 12.4 13.3 +7.2%

The compression strategy is unmodified in this PR, so the compressed size
should be exactly the same. I may have a follow up PR to slightly improve
the compression ratio, if it doesn't cost too much speed.

@Cyan4973
Copy link
Contributor

Excellent ! Great work @terrelln !

The only downside I can think of is that it's generally preferred to not use macro when an alternative is possible, and this code is heavy on template-by-macros,
but well, in this case, measurements seem to prove your point, making compilation considerably faster as well as improving binary size by a sizable amount. So that's a small price to pay.

@terrelln terrelln force-pushed the lazy-compile branch 2 times, most recently from a194e31 to 13cad3a Compare October 22, 2021 20:38
@terrelln
Copy link
Contributor Author

The only downside I can think of is that it's generally preferred to not use macro when an alternative is possible, and this code is heavy on template-by-macros,
but well, in this case, measurements seem to prove your point, making compilation considerably faster as well as improving binary size by a sizable amount. So that's a small price to pay.

Yeah, I think it is a reasonable tradeoff. Inlining + functions for all the logic, and macros generate the functions that have the compile time constant "template parameters". That keeps all the logic in functions free of macros, and the macro magic is limited to selecting which function to call.

Speed up compilation times by moving each specialized search function
into its own function. This is faster because compilers can handle many
smaller functions much faster than one gigantic function. The previous
approach generated one giant function with `switch` statements and
inlining to select the implementation.

| Compiler | Flags                               | Dev Time (s) | PR Time (s) | Delta |
|----------|-------------------------------------|--------------|-------------|-------|
| gcc      | -O3                                 |         16.5 |         5.6 |  -66% |
| gcc      | -O3 -g -fsanitize=address,undefined |        158.9 |        38.2 |  -75% |
| clang    | -O3                                 |         36.5 |         5.5 |  -85% |
| clang    | -O3 -g -fsanitize=address,undefined |         27.8 |        17.5 |  -37% |

This also reduces the binary size because the search functions are no
longer inlined into the main body.

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      |                1563868 |               1308844 |  -16% |
| clang    |                1924372 |               1376020 |  -28% |

Finally, the performance is not impacted significantly by this change,
in fact we generally see a small speed boost.

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc      |     5 |            110.6 |           110.0 | -0.5% |
| gcc      |     7 |             70.4 |            72.2 | +2.5% |
| gcc      |     9 |             53.2 |            53.5 | +0.5% |
| gcc      |    13 |             12.7 |            12.9 | +1.5% |
| clang    |     5 |            113.9 |           110.4 | -3.0% |
| clang    |     7 |             67.7 |            70.6 | +4.2% |
| clang    |     9 |             51.9 |            52.2 | +0.5% |
| clang    |    13 |             12.4 |            13.3 | +7.2% |

The compression strategy is unmodified in this PR, so the compressed size
should be exactly the same. I may have a follow up PR to slightly improve
the compression ratio, if it doesn't cost too much speed.
@terrelln terrelln merged commit ad739e5 into facebook:dev Oct 25, 2021
terrelln added a commit to terrelln/zstd that referenced this pull request Nov 16, 2021
This PR reduces binary size and maximum stack usage by splitting up
large functions into multiple smaller functions. Compilers handle many
small functions much better than one large function.

See PR facebook#2828 [0] for details.

[0] facebook#2828
terrelln added a commit to terrelln/zstd that referenced this pull request Nov 16, 2021
Following the same idea as PR facebook#2828 [0] we break up large inlined
functions into many smaller outlined functions to help the compiler
optimize better, and not use excess stack space. This saves binary
size on all architectures, and reduces stack usage on parisc.

[0] facebook#2828
terrelln added a commit to terrelln/zstd that referenced this pull request Nov 16, 2021
Take the same approach as in PR facebook#2828 [0] to remove functions that force
inline many function bodies and `switch`. Instead, create one function per
"template" combination, and then switch between these functions. This
allows the compiler to break the large function into many small
functions, which generally helps codegen.

Also, in the `extDict` modes when there is no ext-dict, call the top
level function instead of the force inlined one, to save on code size.

I'm specifically doing this because gcc on the parisc architecture doesn't
handle the large function body well, and ends up using a lot of excess
stack space. Outlining these functions fixes it.
terrelln added a commit to terrelln/zstd that referenced this pull request Nov 16, 2021
Take the same approach as in PR facebook#2828 [0] to remove functions that force
inline many function bodies and `switch`. Instead, create one function per
"template" combination, and then switch between these functions. This
allows the compiler to break the large function into many small
functions, which generally helps codegen.

Also, in the `extDict` modes when there is no ext-dict, call the top
level function instead of the force inlined one, to save on code size.

I'm specifically doing this because gcc on the parisc architecture doesn't
handle the large function body well, and ends up using a lot of excess
stack space. Outlining these functions fixes it.
terrelln added a commit to terrelln/linux that referenced this pull request Nov 16, 2021
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
terrelln added a commit to terrelln/zstd that referenced this pull request Nov 16, 2021
This PR reduces binary size and maximum stack usage by splitting up
large functions into multiple smaller functions. Compilers handle many
small functions much better than one large function.

See PR facebook#2828 [0] for details.

[0] facebook#2828
buzzcut-s pushed a commit to buzzcut-s/kernel-x86-5.15 that referenced this pull request Nov 18, 2021
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
terrelln added a commit to terrelln/zstd that referenced this pull request Dec 2, 2021
Use the same trick as we did for zstd_lazy in PR facebook#2828:
* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into
every function, since `dictMode` is no longer used as a template. Create
two specializations, for opt levels 0 and 2, and call one of the two
specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the
Linux Kernel, as we've gotten most of the benefit already.

Compilation speed sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              |         10.1 |         2.3 |  -77% |
| gcc      | -O3 -fsanitize=address,undefined |         61.1 |        10.2 |  -83% |
| clang    | -O3                              |          9.0 |         2.1 |  -76% |
| clang    | -O3 -fsanitize=address,undefined |         33.5 |         5.1 |  -84% |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      |                1327476 |               1177108 |  -11% |
| clang    |                1378324 |               1167780 |  -15% |

There is a <2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |    16 |             4.78 |            4.72 | -1.25% |
| gcc      |    17 |             3.49 |            3.46 | -0.85% |
| gcc      |    18 |             2.92 |            2.86 | -2.04% |
| gcc      |    19 |             2.61 |            2.61 |  0.00% |
| clang    |    16 |             4.69 |            4.80 |  2.34% |
| clang    |    17 |             3.53 |            3.49 | -1.13% |
| clang    |    18 |             2.86 |            2.85 | -0.34% |
| clang    |    19 |             2.61 |            2.61 |  0.00% |
terrelln added a commit to terrelln/zstd that referenced this pull request Dec 2, 2021
Use the same trick as we did for zstd_lazy in PR facebook#2828:
* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into
every function, since `dictMode` is no longer used as a template. Create
two specializations, for opt levels 0 and 2, and call one of the two
specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the
Linux Kernel, as we've gotten most of the benefit already.

Compilation speed sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              |         10.1 |         2.3 |  -77% |
| gcc      | -O3 -fsanitize=address,undefined |         61.1 |        10.2 |  -83% |
| clang    | -O3                              |          9.0 |         2.1 |  -76% |
| clang    | -O3 -fsanitize=address,undefined |         33.5 |         5.1 |  -84% |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      |                1327476 |               1177108 |  -11% |
| clang    |                1378324 |               1167780 |  -15% |

There is a <2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |    16 |             4.78 |            4.72 | -1.25% |
| gcc      |    17 |             3.49 |            3.46 | -0.85% |
| gcc      |    18 |             2.92 |            2.86 | -2.04% |
| gcc      |    19 |             2.61 |            2.61 |  0.00% |
| clang    |    16 |             4.69 |            4.80 |  2.34% |
| clang    |    17 |             3.53 |            3.49 | -1.13% |
| clang    |    18 |             2.86 |            2.85 | -0.34% |
| clang    |    19 |             2.61 |            2.61 |  0.00% |

Fixes Issue facebook#2862.
terrelln added a commit to terrelln/zstd that referenced this pull request Dec 2, 2021
Use the same trick as we did for zstd_lazy in PR facebook#2828:
* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into
every function, since `dictMode` is no longer used as a template. Create
two specializations, for opt levels 0 and 2, and call one of the two
specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the
Linux Kernel, as we've gotten most of the benefit already.

Compilation time sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              |         10.1 |         2.3 |  -77% |
| gcc      | -O3 -fsanitize=address,undefined |         61.1 |        10.2 |  -83% |
| clang    | -O3                              |          9.0 |         2.1 |  -76% |
| clang    | -O3 -fsanitize=address,undefined |         33.5 |         5.1 |  -84% |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      |                1327476 |               1177108 |  -11% |
| clang    |                1378324 |               1167780 |  -15% |

There is a <2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |    16 |             4.78 |            4.72 | -1.25% |
| gcc      |    17 |             3.49 |            3.46 | -0.85% |
| gcc      |    18 |             2.92 |            2.86 | -2.04% |
| gcc      |    19 |             2.61 |            2.61 |  0.00% |
| clang    |    16 |             4.69 |            4.80 |  2.34% |
| clang    |    17 |             3.53 |            3.49 | -1.13% |
| clang    |    18 |             2.86 |            2.85 | -0.34% |
| clang    |    19 |             2.61 |            2.61 |  0.00% |

Fixes Issue facebook#2862.
terrelln added a commit to terrelln/zstd that referenced this pull request Dec 2, 2021
Use the same trick as we did for zstd_lazy in PR facebook#2828:
* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into
every function, since `dictMode` is no longer used as a template. Create
two specializations, for opt levels 0 and 2, and call one of the two
specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the
Linux Kernel, as we've gotten most of the benefit already.

Compilation time sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              |         10.1 |         2.3 |  -77% |
| gcc      | -O3 -fsanitize=address,undefined |         61.1 |        10.2 |  -83% |
| clang    | -O3                              |          9.0 |         2.1 |  -76% |
| clang    | -O3 -fsanitize=address,undefined |         33.5 |         5.1 |  -84% |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      |                1327476 |               1177108 |  -11% |
| clang    |                1378324 |               1167780 |  -15% |

There is a <2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |    16 |             4.78 |            4.72 | -1.25% |
| gcc      |    17 |             3.49 |            3.46 | -0.85% |
| gcc      |    18 |             2.92 |            2.86 | -2.04% |
| gcc      |    19 |             2.61 |            2.61 |  0.00% |
| clang    |    16 |             4.69 |            4.80 |  2.34% |
| clang    |    17 |             3.53 |            3.49 | -1.13% |
| clang    |    18 |             2.86 |            2.85 | -0.34% |
| clang    |    19 |             2.61 |            2.61 |  0.00% |

Fixes Issue facebook#2862.
buzzcut-s pushed a commit to buzzcut-s/kernel-x86-5.17 that referenced this pull request Mar 26, 2022
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
ptr1337 pushed a commit to CachyOS/linux that referenced this pull request Jun 4, 2022
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
Dark-Matter7232 pushed a commit to Dark-Matter7232/kernel_x86_laptop that referenced this pull request Oct 2, 2022
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
Dark-Matter7232 pushed a commit to Dark-Matter7232/kernel_x86_laptop that referenced this pull request Oct 2, 2022
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for
compilers, and can sometimes take excess stack space because the
compiler isn't able to fully analyze the function. This commit splits
functions that have multiple copies of the same body into multiple
smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are
in zstd_fast.c and zstd_double_fast.c. But, these functions are using
the same pattern, so they could also be problematic. And, we already had
the fix sitting in our dev branch for our next release, so I figured I'd
add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB
of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
terrelln added a commit to terrelln/zstd that referenced this pull request Oct 20, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit to terrelln/zstd that referenced this pull request Oct 20, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit to terrelln/zstd that referenced this pull request Oct 21, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit that referenced this pull request Oct 21, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR #2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR #2828 was merged.

This PR is necessary for Issue #3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
terrelln added a commit that referenced this pull request Oct 22, 2022
Use a switch statement to select the search function instead of an
indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR #2828.
When I measured performance for that commit, it was neutral.
However, I now see a performance regression on gcc, but still
neutral on clang. I'm measuring on the same platform, but with
newer compilers. The new approach beats both the current dev
branch and the baseline before PR #2828 was merged.

This PR is necessary for Issue #3275, to update zstd in the kernel.
Without this PR there is a large regression in greedy - btlazy2
compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      |     5 |            102.6 |           113.7 | +10.8% |
| gcc      |     7 |             66.6 |            74.8 | +12.3% |
| gcc      |     9 |             51.5 |            58.9 | +14.3% |
| gcc      |    13 |             14.3 |            14.3 |  +0.0% |
| clang    |     5 |            108.1 |           114.8 |  +6.2% |
| clang    |     7 |             68.5 |            72.3 |  +5.5% |
| clang    |     9 |             53.2 |            56.2 |  +5.6% |
| clang    |    13 |             14.3 |            14.7 |  +2.8% |

The binary size stays just about the same for clang and gcc, measured
using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants