[lazy] Speed up compilation times #2828
Excellent! Great work @terrelln! The only downside I can think of is that it's generally preferred not to use macros when an alternative is possible, and this code is heavy on template-by-macros.
Yeah, I think it is a reasonable tradeoff. Inlined functions hold all the logic, and macros generate the functions that have the compile-time constant "template parameters". That keeps all the logic in functions free of macros, and the macro magic is limited to selecting which function to call.
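The split described above can be illustrated with a minimal sketch. It is not zstd's actual code: `count_runs_generic`, `GEN_COUNT_RUNS`, and `count_runs` are hypothetical names, and the "search" is a toy that counts runs of identical bytes. The point is only the shape: the logic sits in one force-inlined function, and a macro stamps out one outlined function per constant value of `mls`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macros; GCC/Clang attributes are assumed here. */
#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#  define NO_INLINE __attribute__((noinline))
#else
#  define FORCE_INLINE static inline
#  define NO_INLINE
#endif

/* All logic lives in a plain function. `mls` plays the role of a
 * "template parameter": once inlined with a constant argument, the
 * compiler can specialize the inner loop for that value. */
FORCE_INLINE size_t count_runs_generic(const unsigned char* src, size_t n, unsigned mls)
{
    size_t count = 0, i;
    assert(mls >= 1);
    for (i = 0; i + mls <= n; ++i) {
        unsigned k, is_run = 1;
        for (k = 1; k < mls; ++k) {
            if (src[i + k] != src[i]) { is_run = 0; break; }
        }
        count += is_run;
    }
    return count;
}

/* The macro only generates one small, outlined function per constant. */
#define GEN_COUNT_RUNS(mls)                                                  \
    static NO_INLINE size_t count_runs_##mls(const unsigned char* src, size_t n) \
    { return count_runs_generic(src, n, mls); }

GEN_COUNT_RUNS(4)
GEN_COUNT_RUNS(5)
GEN_COUNT_RUNS(6)

/* Selection is the only place that needs to know about the variants. */
static size_t count_runs(const unsigned char* src, size_t n, unsigned mls)
{
    if (mls <= 4) return count_runs_4(src, n);
    if (mls == 5) return count_runs_5(src, n);
    return count_runs_6(src, n);
}
```

Calling `count_runs(buf, len, 5)` dispatches to `count_runs_5`, so the compiler sees three small function bodies instead of one large body containing all three inlined copies.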
Speed up compilation times by moving each specialized search function into its own function. This is faster because compilers can handle many smaller functions much faster than one gigantic function. The previous approach generated one giant function with `switch` statements and inlining to select the implementation.

| Compiler | Flags                               | Dev Time (s) | PR Time (s) | Delta |
|----------|-------------------------------------|--------------|-------------|-------|
| gcc      | -O3                                 | 16.5         | 5.6         | -66%  |
| gcc      | -O3 -g -fsanitize=address,undefined | 158.9        | 38.2        | -75%  |
| clang    | -O3                                 | 36.5         | 5.5         | -85%  |
| clang    | -O3 -g -fsanitize=address,undefined | 27.8         | 17.5        | -37%  |

This also reduces the binary size because the search functions are no longer inlined into the main body.

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      | 1563868                | 1308844               | -16%  |
| clang    | 1924372                | 1376020               | -28%  |

Finally, the performance is not impacted significantly by this change; in fact, we generally see a small speed boost.

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc      | 5     | 110.6            | 110.0           | -0.5% |
| gcc      | 7     | 70.4             | 72.2            | +2.5% |
| gcc      | 9     | 53.2             | 53.5            | +0.5% |
| gcc      | 13    | 12.7             | 12.9            | +1.5% |
| clang    | 5     | 113.9            | 110.4           | -3.0% |
| clang    | 7     | 67.7             | 70.6            | +4.2% |
| clang    | 9     | 51.9             | 52.2            | +0.5% |
| clang    | 13    | 12.4             | 13.3            | +7.2% |

The compression strategy is unmodified in this PR, so the compressed size should be exactly the same. I may have a follow up PR to slightly improve the compression ratio, if it doesn't cost too much speed.
This PR reduces binary size and maximum stack usage by splitting up large functions into multiple smaller functions. Compilers handle many small functions much better than one large function. See PR facebook#2828 [0] for details. [0] facebook#2828
Following the same idea as PR facebook#2828 [0], we break up large inlined functions into many smaller outlined functions to help the compiler optimize better and avoid using excess stack space. This saves binary size on all architectures, and reduces stack usage on parisc. [0] facebook#2828
Take the same approach as in PR facebook#2828 [0] to remove functions that force inline many function bodies and `switch`. Instead, create one function per "template" combination, and then switch between these functions. This allows the compiler to break the large function into many small functions, which generally helps codegen. Also, in the `extDict` modes when there is no ext-dict, call the top level function instead of the force inlined one, to save on code size. I'm specifically doing this because gcc on the parisc architecture doesn't handle the large function body well, and ends up using a lot of excess stack space. Outlining these functions fixes it.
Backport of upstream PR #2828 [0].

Large functions with excessive force inlining can cause trouble for compilers, and can sometimes take excess stack space because the compiler isn't able to fully analyze the function. This commit splits functions that have multiple copies of the same body into multiple smaller functions, which can help the compiler.

This commit isn't strictly necessary, as the reported problems [1] are in zstd_fast.c and zstd_double_fast.c. But, these functions are using the same pattern, so they could also be problematic. And, we already had the fix sitting in our dev branch for our next release, so I figured I'd add it in for consistency.

Bloat-o-meter output summary on x86-64 shows we also save 1.5 KB of code size:

```
> ../scripts/bloat-o-meter vmlinux.old vmlinux
add/remove: 50/5 grow/shrink: 10/6 up/down: 28810/-30369 (-1559)
Total: Before=6418562, After=6417003, chg -0.02%
```

[0] facebook/zstd#2828
[1] https://lkml.org/lkml/2021/11/15/710

Reported-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Nick Terrell <[email protected]>
Use the same trick as we did for zstd_lazy in PR facebook#2828:

* Create one search function specialization for each (dictMode, mls).
* Select the search function pointer at the top of the match finder.

Additionally, we no longer inline `ZSTD_compressBlock_opt_generic` into every function, since `dictMode` is no longer used as a template. Create two specializations, for opt levels 0 and 2, and call one of the two specializations.

Lastly, remove the hack that disabled inlining for zstd_opt for the Linux Kernel, as we've gotten most of the benefit already.

Compilation time sees a ~4x reduction:

| Compiler | Flags                            | Dev Time (s) | PR Time (s) | Delta |
|----------|----------------------------------|--------------|-------------|-------|
| gcc      | -O3                              | 10.1         | 2.3         | -77%  |
| gcc      | -O3 -fsanitize=address,undefined | 61.1         | 10.2        | -83%  |
| clang    | -O3                              | 9.0          | 2.1         | -76%  |
| clang    | -O3 -fsanitize=address,undefined | 33.5         | 5.1         | -84%  |

Build size is reduced by 150KB - 200KB:

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc      | 1327476                | 1177108               | -11%  |
| clang    | 1378324                | 1167780               | -15%  |

There is a <2% speed loss in all cases:

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      | 16    | 4.78             | 4.72            | -1.25% |
| gcc      | 17    | 3.49             | 3.46            | -0.85% |
| gcc      | 18    | 2.92             | 2.86            | -2.04% |
| gcc      | 19    | 2.61             | 2.61            | 0.00%  |
| clang    | 16    | 4.69             | 4.80            | 2.34%  |
| clang    | 17    | 3.53             | 3.49            | -1.13% |
| clang    | 18    | 2.86             | 2.85            | -0.34% |
| clang    | 19    | 2.61             | 2.61            | 0.00%  |

Fixes Issue facebook#2862.
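As a rough sketch of the two bullet points in this commit message, the following shows one specialization per (dictMode, mls) pair and a function pointer chosen once at the top of the match finder. Every identifier here (`dict_mode_e`, `search_generic`, `GEN_SEARCH`, `select_search`, `match_finder`) is hypothetical, and the "search" merely counts candidate positions; zstd's real match finders are far more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real "template parameters". */
typedef enum { DICT_NONE = 0, DICT_EXT = 1 } dict_mode_e;
typedef size_t (*search_fn)(const unsigned char* src, size_t n);

/* Toy search: counts windows of length `mls`, weighted by dictionary mode. */
static inline size_t search_generic(const unsigned char* src, size_t n,
                                    dict_mode_e mode, unsigned mls)
{
    size_t i, score = 0;
    (void)src;
    for (i = 0; i + mls <= n; ++i)
        score += (mode == DICT_EXT) ? 2 : 1;
    return score;
}

/* One small function per (mode, mls) combination. */
#define GEN_SEARCH(name, mode, mls)                                          \
    static size_t search_##name##_##mls(const unsigned char* src, size_t n)  \
    { return search_generic(src, n, mode, mls); }

GEN_SEARCH(none, DICT_NONE, 4)
GEN_SEARCH(none, DICT_NONE, 5)
GEN_SEARCH(ext, DICT_EXT, 4)
GEN_SEARCH(ext, DICT_EXT, 5)

static search_fn select_search(dict_mode_e mode, unsigned mls)
{
    static const search_fn table[2][2] = {
        { search_none_4, search_none_5 },
        { search_ext_4,  search_ext_5  },
    };
    assert(mls == 4 || mls == 5);
    return table[mode][mls - 4];
}

static size_t match_finder(const unsigned char* src, size_t n,
                           dict_mode_e mode, unsigned mls)
{
    /* The pointer is selected once, before the main loop. */
    search_fn const search = select_search(mode, mls);
    size_t total = 0, pos;
    for (pos = 0; pos < n; pos += 4)
        total += search(src + pos, n - pos);
    return total;
}
```

The main loop body is compiled once and calls through `search`, so it no longer contains an inlined copy of every specialization.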
Use a switch statement to select the search function instead of an indirect function call. This results in a sizable performance win.

This PR is a modification of the approach taken in PR facebook#2828. When I measured performance for that commit, it was neutral. However, I now see a performance regression on gcc, but still neutral on clang. I'm measuring on the same platform, but with newer compilers.

The new approach beats both the current dev branch and the baseline before PR facebook#2828 was merged.

This PR is necessary for Issue facebook#3275, to update zstd in the kernel. Without this PR there is a large regression in greedy - btlazy2 compression speed. With this PR it is about neutral.

gcc version: 12.2.0
clang version: 14.0.6
dataset: silesia.tar

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta  |
|----------|-------|------------------|-----------------|--------|
| gcc      | 5     | 102.6            | 113.7           | +10.8% |
| gcc      | 7     | 66.6             | 74.8            | +12.3% |
| gcc      | 9     | 51.5             | 58.9            | +14.3% |
| gcc      | 13    | 14.3             | 14.3            | +0.0%  |
| clang    | 5     | 108.1            | 114.8           | +6.2%  |
| clang    | 7     | 68.5             | 72.3            | +5.5%  |
| clang    | 9     | 53.2             | 56.2            | +5.6%  |
| clang    | 13    | 14.3             | 14.7            | +2.8%  |

The binary size stays just about the same for clang and gcc, measured using the `size` command:

| Compiler | Branch | Text    | Data | BSS | Total   |
|----------|--------|---------|------|-----|---------|
| gcc      | dev    | 1127950 | 3312 | 280 | 1131542 |
| gcc      | PR     | 1123422 | 2512 | 280 | 1126214 |
| clang    | dev    | 1046254 | 3256 | 216 | 1049726 |
| clang    | PR     | 1048198 | 2296 | 216 | 1050710 |
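To make the difference from an indirect call concrete, here is a minimal sketch of switch-based selection. The names `window_count_generic`, `GEN_WINDOW_COUNT`, and `window_count_select` are hypothetical, and the toy "search" only counts windows of length `mls`. Each `case` is a direct call to a known function, which the compiler is free to inline or otherwise optimize, unlike a call through a function pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a search routine: number of windows of length `mls`. */
static inline size_t window_count_generic(const unsigned char* src, size_t n, unsigned mls)
{
    (void)src;
    return (n >= mls) ? (n - mls + 1) : 0;
}

/* One specialization per constant `mls`, as in the earlier approach. */
#define GEN_WINDOW_COUNT(mls)                                              \
    static size_t window_count_##mls(const unsigned char* src, size_t n)   \
    { return window_count_generic(src, n, mls); }

GEN_WINDOW_COUNT(4)
GEN_WINDOW_COUNT(5)
GEN_WINDOW_COUNT(6)

/* Selection via switch: every branch is a direct call. */
static inline size_t window_count_select(const unsigned char* src, size_t n, unsigned mls)
{
    switch (mls) {
    case 4: return window_count_4(src, n);
    case 5: return window_count_5(src, n);
    case 6: return window_count_6(src, n);
    default:
        assert(0 && "unsupported mls in this sketch");
        return 0;
    }
}
```

Whether this beats a function pointer depends on the compiler and target; the measurements in the tables above are the evidence for zstd, not this sketch.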