From 04b914cbdf97324764c6d3aad82775e258f7ad24 Mon Sep 17 00:00:00 2001 From: Denis Bakhvalov Date: Mon, 23 Sep 2024 12:57:37 -0400 Subject: [PATCH] [Chapter10] Final touches --- biblio.bib | 7 ------- .../10-0 Optimizing bad speculation.md | 10 +++++----- .../10-2 Replace branches with arithmetic.md | 20 ++++++++++---------- .../10-3 Replace branches with predication.md | 6 +++--- .../10-4 Multiple Compares Single Branch.md | 10 +++++----- .../10-6 Chapter Summary.md | 5 +++-- 6 files changed, 26 insertions(+), 32 deletions(-) diff --git a/biblio.bib b/biblio.bib index 15824aa440..3b21b76e09 100644 --- a/biblio.bib +++ b/biblio.bib @@ -110,13 +110,6 @@ @Article{LemireBranchless url = {https://www.infoq.com/articles/making-code-faster-taming-branches/}, } -@Article{IntelAvoidingBrMisp, - author = {Rajiv Kapoor}, - title = {Avoiding the Cost of Branch Misprediction}, - year = {2009}, - url = {https://software.intel.com/en-us/articles/avoiding-the-cost-of-branch-misprediction}, -} - @inproceedings{Nowak2014TheOO, author={Andrzej Nowak and Georgios Bitzes}, title={The overhead of profiling using PMU hardware counters}, diff --git a/chapters/10-Optimizing-Branch-Prediction/10-0 Optimizing bad speculation.md b/chapters/10-Optimizing-Branch-Prediction/10-0 Optimizing bad speculation.md index c96410177e..12d5cd3938 100644 --- a/chapters/10-Optimizing-Branch-Prediction/10-0 Optimizing bad speculation.md +++ b/chapters/10-Optimizing-Branch-Prediction/10-0 Optimizing bad speculation.md @@ -6,11 +6,11 @@ So far we've been talking about optimizing memory accesses and computations. How In general, modern processors are very good at predicting branch outcomes. They not only follow static prediction rules but also detect dynamic patterns. Usually, branch predictors save the history of previous outcomes for the branches and try to guess what the next result will be. However, when the pattern becomes hard for the CPU branch predictor to follow, it may hurt performance. -Mispredicting a branch can add a significant speed penalty when it happens regularly. When such an event occurs, a CPU is required to clear all the speculative work that was done ahead of time and later was proven to be wrong. It also needs to flush the pipeline and start filling it with instructions from the correct path. Typically, modern CPUs experience 10 to 20-cycle penalties as a result of a branch misprediction. The exact number of cycles depends on the microarchitecture design, namely, on the depth of the pipeline and the mechanism used to recover from the mispredicts. +Mispredicting a branch can add a significant speed penalty when it happens regularly. When such an event occurs, a CPU is required to clear all the speculative work that was done ahead of time and later was proven to be wrong. It also needs to flush the pipeline and start filling it with instructions from the correct path. Typically, modern CPUs experience 10 to 25-cycle penalties as a result of a branch misprediction. The exact number of cycles depends on the microarchitecture design, namely, on the depth of the pipeline and the mechanism used to recover from a mispredict. -Branch predictors use caches and history registers and therefore are susceptible to the issues related to caches, namely three C's: +Perhaps the most frequent reason for a branch mispredict is simply that the branch has a complicated outcome pattern (e.g., it exhibits pseudorandom behavior), which is unpredictable for a processor.
For completeness, let's cover the other less frequent reasons behind branch mispredicts. Branch predictors use caches and history registers and therefore are susceptible to the issues related to caches, namely: -- **Compulsory misses**: mispredictions may happen on the first dynamic occurrence of the branch when static prediction is employed and no dynamic history is available. +- **Cold misses**: mispredictions may happen on the first dynamic occurrence of the branch when static prediction is employed and no dynamic history is available. - **Capacity misses**: mispredictions arising from the loss of dynamic history due to a very high number of branches in the program or exceedingly long dynamic pattern. - **Conflict misses**: branches are mapped into cache buckets (associative sets) using a combination of their virtual and/or physical addresses. If too many active branches are mapped to the same set, the loss of history can occur. Another instance of a conflict miss is false sharing when two independent branches are mapped to the same cache entry and interfere with each other potentially degrading the prediction history. @@ -18,8 +18,8 @@ A program will always experience a non-zero number of branch mispredictions. You In the past, developers had an option of providing a prediction hint to an x86 processor in the form of an encoding prefix to the branch instruction (`0x2E: Branch Not Taken`, `0x3E: Branch Taken`). This could potentially improve performance on older microarchitectures, like Pentium 4. However, modern x86 processors used to ignore those hints until Intel's RedwoodCove started using it again. Its branch predictor is still good at finding dynamic patterns, but now it will use the encoded prediction hint for branches that have never been seen before (i.e. when there is no stored information about a branch). [@IntelOptimizationManual, Section 2.1.1.1 Branch Hint] -There are indirect ways to reduce the branch misprediction rate by reducing the dynamic number of branch instructions. This approach helps because it alleviates the pressure on branch predictor structures. Compiler transformations such as loop unrolling and vectorization help in reducing the dynamic branch count, though they don't specifically aim at improving the prediction rate of any given conditional statement. Profile-Guided Optimizations (PGO) and post-link optimizers (e.g., BOLT) are also effective at reducing branch mispredictions thanks to improving the fallthrough rate (straightening the code). We will discuss those techniques in the next chapter.[^1] +There are indirect ways to reduce the branch misprediction rate by reducing the dynamic number of branch instructions. This approach helps because it alleviates the pressure on branch predictor structures. When a program executes fewer branch instructions, it may indirectly improve prediction of branches that previously suffered from capacity and conflict misses. Compiler transformations such as loop unrolling and vectorization help in reducing the dynamic branch count, though they don't specifically aim at improving the prediction rate of any given conditional statement. Profile-Guided Optimizations (PGO) and post-link optimizers (e.g., BOLT) are also effective at reducing branch mispredictions thanks to improving the fallthrough rate (straightening the code). We will discuss those techniques in the next chapter.[^1] -So perhaps the only direct way to get rid of branch mispredictions is to get rid of the branch itself.
In subsequent sections, we will take a look at how branches can be replaced with lookup tables, arithmetic, and selection. +The only direct way to get rid of branch mispredictions is to get rid of the branch instruction itself. In subsequent sections, we will take a look at both direct and indirect ways to improve branch prediction. In particular, we will explore the following techniques: replacing branches with lookup tables, arithmetic, bitwise operations, selection, and SIMD instructions. [^1]: There is a conventional wisdom that never-taken branches are transparent to the branch prediction and can't affect performance, and therefore it doesn't make much sense to remove them, at least from a prediction perspective. However, contrary to the wisdom, an experiment conducted by authors of BOLT optimizer demonstrated that replacing never-taken branches with equal-sized no-ops in a large code footprint application, such as Clang C++ compiler, leads to approximately 5\% speedup on modern Intel CPUs. So it still pays to try to eliminate all branches. diff --git a/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with arithmetic.md b/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with arithmetic.md index a2c1305a0b..f39be157a3 100644 --- a/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with arithmetic.md +++ b/chapters/10-Optimizing-Branch-Prediction/10-2 Replace branches with arithmetic.md @@ -20,22 +20,22 @@ Another common way is to replace conditional branches with a combination of bitw Listing: Replacing branches in LFSR. ~~~~ {#lst:BrancheslessLFSR .cpp} -int lfsr(int x) { int lfsr(int x) { - if (x < 0) x = (x << 1) ^ ((x >> 31) & CONSTANT); - x = (x << 1) ^ CONSTANT; => return x; - else } +int lfsr(int x) { int lfsr(int x) { + if (x < 0) x = (x << 1) ^ ((x >> 31) & CONSTANT); + x = (x << 1) ^ CONSTANT; => return x; + else } x = (x << 1); return x; } -; x86 machine code ; x86 machine code -lea ecx, [rdi + rdi] lea eax, [rdi + rdi] -mov eax, ecx sar edi, 31 -xor eax, #CONSTANT and edi, #CONSTANT -test edi, edi xor eax, edi +; x86 machine code ; x86 machine code +lea ecx, [rdi + rdi] lea eax, [rdi + rdi] +mov eax, ecx sar edi, 31 +xor eax, #CONSTANT and edi, #CONSTANT +test edi, edi xor eax, edi cmovns eax, ecx ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In our example, we shift left the input value regardless if it is positive or negative. In addition, if the input value is negative, we also XOR it with a constant (the exact value is irrelevant for this scenario). In the modified version, we leverage the fact that arithmetic right shift (`>>`) turns the sign of `x` (the high order bit) into a mask of all zeros or all ones. The subsequent AND (`&`) operation produces either zero or the desired constant. The original version of the function takes ~4 cycles, while the modified version takes only 3 cycles. It's worth mentioning that the Clang 17 compiler replaced the branch with a conditional select (CMOVNS) instruction, which we will cover in the next section. Nevertheless, with some smart bit manipulation, we were able to improve it even further. +In our example, we shift the input value left regardless of whether it is positive or negative. If the input value is negative, we XOR the result of the shift operation with a constant (the exact value is irrelevant for this scenario). In the modified version, we leverage the fact that arithmetic right shift (`>>`) turns the sign of `x` (the high order bit) into a mask of all zeros or all ones.
The subsequent AND (`&`) operation produces either zero or the desired constant. The original version of the function takes ~4 cycles, while the modified version takes only 3 cycles, which was confirmed by running the code on Intel Core i7-1260P (12th Gen, Alderlake). It's worth mentioning that the Clang 17 compiler replaced the branch with a conditional select (CMOVNS) instruction, which we will cover in the next section. Nevertheless, with some smart bit manipulation, we were able to improve it even further. As of the year 2024, compilers are usually unable to find these shortcuts on their own, so it is up to the programmer to do it manually. If you can find a way to replace a frequently mispredicted branch with arithmetic, you will likely see a performance improvement. You can find more examples of bit manipulation tricks in other books, for example [@HackersDelight]. diff --git a/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md b/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md index d50ec92c0d..a6e274dbd7 100644 --- a/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md +++ b/chapters/10-Optimizing-Branch-Prediction/10-3 Replace branches with predication.md @@ -14,7 +14,7 @@ if (cond) { /* frequently mispredicted */ => int y = computeY(); foo(a); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -For the code on the right, the compiler can replace the branch that comes from the ternary operator, and generate a `CMOV` x86 instruction instead. A `CMOVcc` instruction checks the state of one or more of the status flags in the `EFLAGS` register (`CF, OF, PF, SF` and `ZF`) and performs a move operation if the flags are in a specified state or condition. A similar transformation can be done for floating-point numbers with `FCMOVcc,VMAXSS/VMINSS` instructions. In the ARM ISA, there is `CSEL` (conditional selection) instruction, but also `CSINC` (select and increment), `CSNEG` (select and negate), and a few other instructions. +For the code on the right, the compiler can replace the branch that comes from the ternary operator, and generate a `CMOV` x86 instruction instead. A `CMOVcc` instruction checks the state of one or more of the status flags in the `EFLAGS` register (`CF, OF, PF, SF` and `ZF`) and performs a move operation if the flags are in a specified state or condition. A similar transformation can be done for floating-point numbers with `FCMOVcc,VMAXSS/VMINSS` instructions. In the ARM ISA, there is `CSEL` (conditional selection) instruction, but also `CSINC` (select and increment), `CSNEG` (select and negate), and a few other conditional instructions. Listing: Replacing Branches with Selection - x86 assembly code. @@ -33,9 +33,9 @@ Listing: Replacing Branches with Selection - x86 assembly code. [@lst:ReplaceBranchesWithSelectionAsm] shows assembly listings for the original and the branchless version. In contrast with the original version, the branchless version doesn't have jump instructions. However, the branchless version calculates both `x` and `y` independently, and then selects one of the values and discards the other. While this transformation eliminates the penalty of a branch misprediction, it is doing more work than the original code. -We already know that the branch in the original version on the left is hard to predict. This is what motivates us to try a branchless version in the first place. 
In this example, the performance gain of this change depends on the characteristics of `computeX` and `computeY` functions. If the functions are small[^1] and the compiler can inline them, then selection might bring noticeable performance benefits. If the functions are big[^2], it might be cheaper to take the cost of a branch mispredict than to execute both `computeX` and `computeY` functions. +We already know that the branch in the original version on the left is hard to predict. This is what motivates us to try a branchless version in the first place. In this example, the performance gain of this change depends on the characteristics of `computeX` and `computeY` functions. If the functions are small[^1] and the compiler can inline them, then selection might bring noticeable performance benefits. If the functions are big[^2], it might be cheaper to take the cost of a branch mispredict than to execute both `computeX` and `computeY` functions. Ultimately, performance measurements always decide which version is better. -Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. On the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual result. By the time we start resolving the branch, we could be already halfway through the `foo` function, despite it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`, it all must be thrown away. If the mispredictions occur too often, the recovering penalty outweighs the gains from speculative execution. +Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. On the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual outcome of the branch. By the time we start resolving the branch, we could already be halfway through the `foo` function, even though it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`; it all must be thrown away. If the mispredictions occur too often, the recovery penalty outweighs the gains from speculative execution. With conditional selection, it is different. There are no branches, so the processor doesn't have to speculate. It can execute `computeX` and `computeY` functions in parallel. However, it cannot start running the code from `foo` until it computes the result of the `CMOVNE` instruction since `foo` uses it as an argument (data dependency). When you use conditional select instructions, you convert a control flow dependency into a data flow dependency.
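To make the small-function case above more tangible, here is a minimal, self-contained sketch; the `computeX`/`computeY` bodies are hypothetical stand-ins rather than the benchmark code from the book, and the exact machine code depends on the compiler and optimization level.

~~~~ {.cpp}
// Hypothetical cheap helpers standing in for computeX/computeY.
inline int computeX(int a) { return a * 3 + 1; }
inline int computeY(int a) { return a * 5 - 7; }

// Branchy version: likely compiled into a conditional jump; only one helper
// runs, but a hard-to-predict `cond` causes frequent pipeline flushes.
int selectBranchy(bool cond, int a) {
  if (cond)
    return computeX(a);
  return computeY(a);
}

// Branchless version: both values are computed, then one is selected.
// With small, inlinable helpers, compilers often lower the ternary operator
// to CMOVcc (x86) or CSEL (AArch64), trading extra work for no branch.
int selectBranchless(bool cond, int a) {
  int x = computeX(a);
  int y = computeY(a);
  return cond ? x : y;
}
~~~~

Compiling both versions with optimizations enabled and inspecting the generated assembly is a quick way to check whether your compiler emitted a jump or a conditional select for a particular piece of code.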
diff --git a/chapters/10-Optimizing-Branch-Prediction/10-4 Multiple Compares Single Branch.md b/chapters/10-Optimizing-Branch-Prediction/10-4 Multiple Compares Single Branch.md index 8449f8ec08..9962152e01 100644 --- a/chapters/10-Optimizing-Branch-Prediction/10-4 Multiple Compares Single Branch.md +++ b/chapters/10-Optimizing-Branch-Prediction/10-4 Multiple Compares Single Branch.md @@ -1,8 +1,8 @@ ## Multiple Tests Single Branch {#sec:MultipleCmpSingleBranch} -The last technique that we discuss in this chapter aims at minimizing the dynamic number of branch instructions by combining multiple tests. The main idea here is to avoid executing a branch for every element of a large array. Instead, the goal is to perform multiple tests simultaneously, which primarily involves using SIMD instructions. The result of this is a vector mask that can be converted into a byte mask, which enables us to eliminate many unnecessary branches as you will see shortly. You may encounter this technique being used in SIMD implementations of various algorithms such as JSON/HTML parsing, media codecs, and others. +The last technique that we discuss in this chapter aims at minimizing the dynamic number of branch instructions by combining multiple tests. The main idea here is to avoid executing a branch for every element of a large array. Instead, the goal is to perform multiple tests simultaneously, which primarily involves using SIMD instructions. The result of multiple tests is a vector mask that can be converted into a byte mask, which often can be processed with a single branch instruction. This enables us to eliminate many branch instructions as you will see shortly. You may encounter this technique being used in SIMD implementations of various algorithms such as JSON/HTML parsing, media codecs, and others. -[@lst:LongestLineNaive] shows a function that finds the longest line in an input string by testing one character at a time. We go through the input string and search for end-of-line (`eol`) characters (`\n`, 0x0a in ASCII). For every found `eol` character we check if the current line is the longest, and reset the length of the current line to zero. This code will execute one branch instruction for every character.[^1] +[@lst:LongestLineNaive] shows a function that finds the longest line in an input string by testing one character at a time. We go through the input string and search for end-of-line (`eol`) characters (`\n`, 0x0A in ASCII). For every found `eol` character we check if the current line is the longest, and reset the length of the current line to zero. This code will execute one branch instruction for every character.[^1] Listing: Find the longest line (one character at a time). @@ -66,11 +66,11 @@ uint8_t tzcnt(uint8_t mask) { } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We start by preparing an 8-byte mask filled with `eol` symbols. The inner loop loads eight characters of the input string and performs a byte-wise comparison of these characters with the `eol` mask. Vectors in modern processors contain 16/32/64 bytes, so we can process even more characters simultaneously. The result of those eight comparisons is an 8-bit mask with either 0 or 1 in the corresponding position (see `compareBytes`). For example, when comparing `0x00FF0a00FF0aFF00` and `0x0a0a0a0a0a0a0a0a`, we will get `0b00100100` as a result. With x86 and ARM ISAs, the function `compareBytes` can be implemented using two vector instructions.[^4] +We start by preparing an 8-byte mask filled with `eol` symbols. 
The inner loop loads eight characters of the input string and performs a byte-wise comparison of these characters with the `eol` mask. Vectors in modern processors contain 16/32/64 bytes, so we can process even more characters simultaneously. The result of those eight comparisons is an 8-bit mask with either 0 or 1 in the corresponding position (see `compareBytes`). For example, when comparing `0x00FF0A000AFFFF00` and `0x0A0A0A0A0A0A0A0A`, we will get `0b00101000` as a result. With x86 and ARM ISAs, the function `compareBytes` can be implemented using two vector instructions.[^4] -If the mask is zero, that means there are no `eol` characters in the current chunk and we can skip it (see line 11). This is a critical optimization that provides large speedups for input strings with long lines. If a mask is not zero, that means there are `eol` characters and we need to find their positions. To do so, we use the `tzcnt` function, which counts the number of trailing zero bits in an 8-bit mask. For example, for the mask `0b00100100`, it will return 2. We use the position of the rightmost set bit in the mask to calculate the length of the current line. We repeat until there are no set bits in the mask and then start processing the next chunk. Most ISAs support implementing the `tzcnt` function with a single instruction.[^3] +If the mask is zero, that means there are no `eol` characters in the current chunk and we can skip it (see line 11). This is a critical optimization that provides large speedups for input strings with long lines. If the mask is not zero, that means there are `eol` characters and we need to find their positions. To do so, we use the `tzcnt` function, which counts the number of trailing zero bits in an 8-bit mask (the position of the rightmost set bit). For example, for the mask `0b00101000`, it will return 3. Most ISAs support implementing the `tzcnt` function with a single instruction.[^3] Line 14 calculates the length of the current line using the result of the `tzcnt` function. We shift right the mask and repeat until there are no set bits in the mask. -We tested this technique using AVX2 implementation on several different inputs, including textbooks, and source code files. The result was 5--6 times fewer branch instructions and more than 4x better performance when running on Intel Core i7-1260P (12th Gen, Alderlake). +For an input string with a single very long line (best case scenario), the SIMD version will execute eight times fewer branch instructions. However, in the worst case scenario with zero-length lines (i.e., only `eol` characters in the input string), the original approach is faster. We benchmarked this technique using an AVX2 implementation (with chunks of 16 characters) on several different inputs, including textbooks and source code files. The result was 5--6 times fewer branch instructions and more than 4x better performance when running on Intel Core i7-1260P (12th Gen, Alderlake). [^1]: Assuming that compiler will avoid generating branch instructions for `std::max`. [^2]: Performance Ninja: compiler intrinsics 2 - [https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/compiler_intrinsics_2](https://github.com/dendibakh/perf-ninja/tree/main/labs/core_bound/compiler_intrinsics_2).
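If you are curious what the two helper functions might look like, below is a hedged sketch of one possible x86 implementation that matches the signatures used in the listing. It follows the description above (a byte-wise vector compare plus a move-mask, and a trailing-zero count), but it is an illustration under those assumptions, not necessarily the exact code used in the benchmark.

~~~~ {.cpp}
#include <stdint.h>
#include <emmintrin.h> // SSE2 intrinsics (x86-64)

// One possible compareBytes: compare 8 input bytes against the eol mask and
// return an 8-bit match mask. Two vector instructions do the actual work:
// a byte-wise compare (PCMPEQB) and a sign-bit extraction (PMOVMSKB).
uint8_t compareBytes(uint64_t chunk, uint64_t eolMask) {
  __m128i a  = _mm_cvtsi64_si128((long long)chunk);
  __m128i b  = _mm_cvtsi64_si128((long long)eolMask);
  __m128i eq = _mm_cmpeq_epi8(a, b);       // 0xFF where the bytes are equal
  return (uint8_t)_mm_movemask_epi8(eq);   // keep only the low 8 mask bits
}

// tzcnt maps to a single instruction on most ISAs (e.g., TZCNT/BSF on x86);
// here we rely on a GCC/Clang builtin. The result is undefined for mask == 0,
// which is fine because the caller skips chunks with a zero mask.
uint8_t tzcnt(uint8_t mask) {
  return (uint8_t)__builtin_ctz(mask);
}
~~~~

With these definitions, `compareBytes(0x00FF0A000AFFFF00, 0x0A0A0A0A0A0A0A0A)` evaluates to `0b00101000`, and `tzcnt(0b00101000)` returns 3, matching the example above.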
diff --git a/chapters/10-Optimizing-Branch-Prediction/10-6 Chapter Summary.md b/chapters/10-Optimizing-Branch-Prediction/10-6 Chapter Summary.md index a6baeed2ba..6ce3f086f7 100644 --- a/chapters/10-Optimizing-Branch-Prediction/10-6 Chapter Summary.md +++ b/chapters/10-Optimizing-Branch-Prediction/10-6 Chapter Summary.md @@ -2,8 +2,9 @@ \markright{Summary} -* Modern processors are very good at predicting branch outcomes. So, we recommend starting the work on fixing branch mispredictions only when the TMA report points to a high `Bad Speculation` metric. -* When branch outcome patterns become hard for the CPU branch predictor to follow, the performance of the application may suffer. In this case, the branchless version of an algorithm can be more performant. In this chapter, we showed how branches could be replaced with lookup tables, arithmetic, and selection. In some situations, it is also possible to use compiler intrinsics to eliminate branches, as shown in [@IntelAvoidingBrMisp]. +* Modern processors are very good at predicting branch outcomes. So, we recommend starting the work on fixing branch mispredictions only when the TMA points to a high `Bad Speculation` metric. +* When branch outcome patterns become hard for the CPU branch predictor to follow, the performance of the application may suffer. In this case, the branchless version of an algorithm can be more performant. In this chapter, we showed how branches could be replaced with lookup tables, arithmetic, and selection. * Branchless algorithms are not universally beneficial. Always measure to find out what works better in your specific case. +* There are indirect ways to reduce the branch misprediction rate by reducing the dynamic number of branch instructions in a program. This approach helps because it alleviates the pressure on branch predictor structures. Examples of such techniques include loop unrolling/vectorization, replacing branches with bitwise operations, and using SIMD instructions. \sectionbreak