[lazy] Speed up compilation times #2828

terrelln · 2021-10-22T20:15:50Z

Speed up compilation times by moving each specialized search function
into its own function. This is faster because compilers can handle many
smaller functions much faster than one gigantic function. The previous
approach generated one giant function with switch statements and
inlining to select the implementation.

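| Compiler | Flags                               | Dev Time (s) | PR Time (s) | Delta |
|----------|-------------------------------------|--------------|-------------|-------|
| gcc      | -O3                                 | 16.5         | 5.6         | -66%  |
| gcc      | -O3 -g -fsanitize=address,undefined | 158.9        | 38.2        | -75%  |
| clang    | -O3                                 | 36.5         | 5.5         | -85%  |
| clang    | -O3 -g -fsanitize=address,undefined | 27.8         | 17.5        | -37%  |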

This also reduces the binary size because the search functions are no
longer inlined into the main body.

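| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      | 1563868                | 1308844               | -16%  |
| clang    | 1924372                | 1376020               | -28%  |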

Finally, performance is not significantly impacted by this change;
in fact, we generally see a small speed boost.

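| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc      | 5     | 110.6            | 110.0           | -0.5% |
| gcc      | 7     | 70.4             | 72.2            | +2.5% |
| gcc      | 9     | 53.2             | 53.5            | +0.5% |
| gcc      | 13    | 12.7             | 12.9            | +1.5% |
| clang    | 5     | 113.9            | 110.4           | -3.0% |
| clang    | 7     | 67.7             | 70.6            | +4.2% |
| clang    | 9     | 51.9             | 52.2            | +0.5% |
| clang    | 13    | 12.4             | 13.3            | +7.2% |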

The compression strategy is unmodified in this PR, so the compressed size
should be exactly the same. I may have a follow up PR to slightly improve
the compression ratio, if it doesn't cost too much speed.

Cyan4973 · 2021-10-22T20:24:22Z

Excellent ! Great work @terrelln !

The only downside I can think of is that it's generally preferred not to use macros when an alternative is possible, and this code is heavy on template-by-macros,
but well, in this case, measurements seem to prove your point, making compilation considerably faster as well as improving binary size by a sizable amount. So that's a small price to pay.

terrelln · 2021-10-22T20:42:12Z

> The only downside I can think of is that it's generally preferred not to use macros when an alternative is possible, and this code is heavy on template-by-macros,
> but well, in this case, measurements seem to prove your point, making compilation considerably faster as well as improving binary size by a sizable amount. So that's a small price to pay.

Yeah, I think it is a reasonable tradeoff. All the logic lives in ordinary (inlined) functions, and macros only generate the functions that carry the compile-time-constant "template parameters". That keeps the logic itself free of macros, and the macro magic is limited to selecting which function to call.
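To illustrate that split, here is a minimal sketch on a toy routine. It is not the code from this PR: `count_runs_generic`, `GEN_COUNT_RUNS`, `select_count_runs` and the `FORCE_INLINE` definition are hypothetical names chosen for the example, and the "search" just counts runs of identical bytes that are at least `mls` long.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical force-inline helper for this sketch. */
#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

/* All of the logic lives in a plain function. `mls` is an ordinary
 * argument, but every caller below passes a literal constant, so the
 * compiler can specialize the body once it is inlined. */
FORCE_INLINE size_t count_runs_generic(const unsigned char* src, size_t n, unsigned mls)
{
    size_t runs = 0, len = 1, i;
    for (i = 1; i <= n; ++i) {
        if (i < n && src[i] == src[i - 1]) { ++len; continue; }
        if (len >= mls) ++runs;   /* a run just ended: count it if long enough */
        len = 1;
    }
    return runs;
}

/* The macro only stamps out one small, separate function per constant. */
#define GEN_COUNT_RUNS(mls)                                               \
    static size_t count_runs_mls##mls(const unsigned char* src, size_t n) \
    { return count_runs_generic(src, n, mls); }

GEN_COUNT_RUNS(4)
GEN_COUNT_RUNS(5)
GEN_COUNT_RUNS(6)

typedef size_t (*count_runs_fn)(const unsigned char*, size_t);

/* Select which specialization to call; no logic is hidden in the macro. */
static count_runs_fn select_count_runs(unsigned mls)
{
    static const count_runs_fn table[3] = {
        count_runs_mls4, count_runs_mls5, count_runs_mls6
    };
    assert(mls >= 4 && mls <= 6);
    return table[mls - 4];
}
```

A caller would do something like `select_count_runs(5)(buf, size)`. Each `count_runs_mlsN` is compiled as its own small function rather than as one branch of a single large body.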

This PR reduces binary size and maximum stack usage by splitting up large functions into multiple smaller functions. Compilers handle many small functions much better than one large function. See PR facebook#2828 [0] for details. [0] facebook#2828

Following the same idea as PR facebook#2828 [0] we break up large inlined functions into many smaller outlined functions to help the compiler optimize better, and not use excess stack space. This saves binary size on all architectures, and reduces stack usage on parisc. [0] facebook#2828

Take the same approach as in PR facebook#2828 [0] to remove functions that force inline many function bodies and `switch`. Instead, create one function per "template" combination, and then switch between these functions. This allows the compiler to break the large function into many small functions, which generally helps codegen. Also, in the `extDict` modes when there is no ext-dict, call the top level function instead of the force inlined one, to save on code size. I'm specifically doing this because gcc on the parisc architecture doesn't handle the large function body well, and ends up using a lot of excess stack space. Outlining these functions fixes it.

Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for compilers, and can sometimes take excess stack space because the compiler isn't able to fully analyze the function. This commit splits functions that have multiple copies of the same body into multiple smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are in zstd_fast.c and zstd_double_fast.c. But, these functions are using the same pattern, so they could also be problematic. And, we already had the fix sitting in our dev branch for our next release, so I figured I'd add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
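The figures in the bloat-o-meter summary above can be cross-checked with a few lines of arithmetic. The helpers below are hypothetical and only restate how the net change and the percentage follow from the quoted numbers; they are not part of bloat-o-meter.

```c
#include <assert.h>

/* Net size change in bytes from the "up/down" pair (down is negative). */
static long net_change(long up, long down)
{
    return up + down;
}

/* Relative change in percent between two totals. */
static double pct_change(long before, long after)
{
    assert(before != 0);
    return 100.0 * (double)(after - before) / (double)before;
}
```

With the values from the summary, `net_change(28810, -30369)` gives -1559 bytes (about 1.5 KB), `6418562 - 1559` gives the reported `After=6417003`, and `pct_change(6418562, 6417003)` is roughly -0.024%, which rounds to the reported -0.02%.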

Use the same trick as we did for zstd_lazy in PR facebook#2828:

* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into every function, since `dictMode` is no longer used as a template. Create two specializations, for opt levels 0 and 2, and call one of the two specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the Linux Kernel, as we've gotten most of the benefit already.

Compilation time sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              | 10.1         | 2.3         | -77%  |
| gcc      | -O3 -fsanitize=address,undefined | 61.1         | 10.2        | -83%  |
| clang    | -O3                              | 9.0          | 2.1         | -76%  |
| clang    | -O3 -fsanitize=address,undefined | 33.5         | 5.1         | -84%  |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      | 1327476                | 1177108               | -11%  |
| clang    | 1378324                | 1167780               | -15%  |

There is at most a ~2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      | 16    | 4.78             | 4.72            | -1.25% |
| gcc      | 17    | 3.49             | 3.46            | -0.85% |
| gcc      | 18    | 2.92             | 2.86            | -2.04% |
| gcc      | 19    | 2.61             | 2.61            | 0.00%  |
| clang    | 16    | 4.69             | 4.80            | 2.34%  |
| clang    | 17    | 3.53             | 3.49            | -1.13% |
| clang    | 18    | 2.86             | 2.85            | -0.34% |
| clang    | 19    | 2.61             | 2.61            | 0.00%  |

Fixes Issue facebook#2862.
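The two bullet points above (one specialization per (dictMode, mls), with the function pointer selected once at the top of the match finder) could look roughly like the sketch below. This is a toy under stated assumptions, not zstd_opt's code: `input_t`, `search_generic`, `GEN_SEARCH`, `select_search`, `count_matches` and the naive `memcmp`-based search are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

typedef enum { MODE_NO_DICT = 0, MODE_EXT_DICT = 1 } dict_mode_e;

typedef struct {
    const unsigned char* src;  size_t srcSize;
    const unsigned char* dict; size_t dictSize;
} input_t;

/* Generic logic: does the mls-byte sequence at `pos` occur earlier in src
 * (non-overlapping) or, in ext-dict mode, anywhere in the dictionary?
 * The caller guarantees pos + mls <= srcSize. */
FORCE_INLINE int search_generic(const input_t* in, size_t pos,
                                dict_mode_e dictMode, unsigned mls)
{
    size_t i;
    for (i = 0; i + mls <= pos; ++i)
        if (memcmp(in->src + i, in->src + pos, mls) == 0) return 1;
    if (dictMode == MODE_EXT_DICT) {
        for (i = 0; i + mls <= in->dictSize; ++i)
            if (memcmp(in->dict + i, in->src + pos, mls) == 0) return 1;
    }
    return 0;
}

/* One small outlined function per (dictMode, mls) combination. */
#define GEN_SEARCH(suffix, dictMode, mls)                            \
    static int search_##suffix##_##mls(const input_t* in, size_t pos) \
    { return search_generic(in, pos, dictMode, mls); }

GEN_SEARCH(noDict,  MODE_NO_DICT,  4)
GEN_SEARCH(noDict,  MODE_NO_DICT,  5)
GEN_SEARCH(noDict,  MODE_NO_DICT,  6)
GEN_SEARCH(extDict, MODE_EXT_DICT, 4)
GEN_SEARCH(extDict, MODE_EXT_DICT, 5)
GEN_SEARCH(extDict, MODE_EXT_DICT, 6)

typedef int (*search_fn)(const input_t*, size_t);

static search_fn select_search(dict_mode_e dictMode, unsigned mls)
{
    static const search_fn table[2][3] = {
        { search_noDict_4,  search_noDict_5,  search_noDict_6  },
        { search_extDict_4, search_extDict_5, search_extDict_6 },
    };
    assert(mls >= 4 && mls <= 6);
    return table[dictMode][mls - 4];
}

/* Toy "match finder": the specialization is chosen once, before the loop. */
static size_t count_matches(const input_t* in, dict_mode_e dictMode, unsigned mls)
{
    search_fn const search = select_search(dictMode, mls);
    size_t pos, found = 0;
    for (pos = 0; pos + mls <= in->srcSize; ++pos)
        found += (size_t)search(in, pos);
    return found;
}
```

Because the selection happens outside the loop, the outer function no longer needs a copy of the search body for every parameter combination, which is the property the commit message relies on for build time and size.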

Use a switch statement to select the search function instead of an indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828. When I measured performance for that commit, it was neutral. However, I now see a performance regression on gcc, but still neutral on clang. I'm measuring on the same platform, but with newer compilers.

The new approach beats both the current dev branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel. Without this PR there is a large regression in greedy - btlazy2 compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      | 5     | 102.6            | 113.7           | +10.8% |
| gcc      | 7     | 66.6             | 74.8            | +12.3% |
| gcc      | 9     | 51.5             | 58.9            | +14.3% |
| gcc      | 13    | 14.3             | 14.3            | +0.0%  |
| clang    | 5     | 108.1            | 114.8           | +6.2%  |
| clang    | 7     | 68.5             | 72.3            | +5.5%  |
| clang    | 9     | 53.2             | 56.2            | +5.6%  |
| clang    | 13    | 14.3             | 14.7            | +2.8%  |

The binary size stays just about the same for clang and gcc, measured using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
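For readers unfamiliar with the two dispatch styles being compared, the sketch below puts them side by side on a toy search. It is an illustration only, not the code from the PR: `search_generic`, `GEN_SEARCH`, `search_indirect` and `search_switch` are hypothetical names, and the toy search merely counts windows of `mls` identical bytes.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

/* Toy logic: count positions that start a window of `mls` identical bytes. */
FORCE_INLINE size_t search_generic(const unsigned char* src, size_t n, unsigned mls)
{
    size_t hits = 0, i, j;
    for (i = 0; i + mls <= n; ++i) {
        for (j = 1; j < mls && src[i + j] == src[i]; ++j) {}
        hits += (j == mls);
    }
    return hits;
}

#define GEN_SEARCH(mls)                                               \
    static size_t search_mls##mls(const unsigned char* src, size_t n) \
    { return search_generic(src, n, mls); }

GEN_SEARCH(4)
GEN_SEARCH(5)
GEN_SEARCH(6)

typedef size_t (*search_fn)(const unsigned char*, size_t);

/* Style 1: indirect call through a function pointer table. */
static size_t search_indirect(const unsigned char* src, size_t n, unsigned mls)
{
    static const search_fn table[3] = { search_mls4, search_mls5, search_mls6 };
    assert(mls >= 4 && mls <= 6);
    return table[mls - 4](src, n);
}

/* Style 2: a switch whose cases are direct calls to the specializations. */
static size_t search_switch(const unsigned char* src, size_t n, unsigned mls)
{
    switch (mls) {
    case 4: return search_mls4(src, n);
    case 5: return search_mls5(src, n);
    case 6: return search_mls6(src, n);
    default: assert(0); return 0;
    }
}
```

Both dispatchers return the same results. The difference is only that the switch exposes each call target to the compiler as a direct call, whereas the table hides it behind a pointer. Whether that matters for speed depends on the compiler and platform, as the measurements above show.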

facebook-github-bot added the CLA Signed label Oct 22, 2021

terrelln force-pushed the lazy-compile branch 2 times, most recently from a194e31 to 13cad3a Compare October 22, 2021 20:38

Cyan4973 approved these changes Oct 22, 2021

terrelln merged commit ad739e5 into facebook:dev Oct 25, 2021

terrelln mentioned this pull request Nov 16, 2021

Reduce function size in fast & dfast #2863

Merged

terrelln mentioned this pull request Nov 16, 2021

[umbrella task] Track and improve code size #2867

Open

terrelln mentioned this pull request Dec 2, 2021

Improve zstd_opt build speed and size #2898

Merged

terrelln mentioned this pull request Oct 20, 2022

[lazy] Use switch instead of indirect function calls. #3295

Merged
