[Chapter10] Final touches
dendibakh committed Sep 23, 2024
1 parent e9906ac commit 04b914c
Showing 6 changed files with 26 additions and 32 deletions.
7 changes: 0 additions & 7 deletions biblio.bib
@@ -110,13 +110,6 @@ @Article{LemireBranchless
url = {https://www.infoq.com/articles/making-code-faster-taming-branches/},
}

@Article{IntelAvoidingBrMisp,
author = {Rajiv Kapoor},
title = {Avoiding the Cost of Branch Misprediction},
year = {2009},
url = {https://software.intel.com/en-us/articles/avoiding-the-cost-of-branch-misprediction},
}

@inproceedings{Nowak2014TheOO,
author={Andrzej Nowak and Georgios Bitzes},
title={The overhead of profiling using PMU hardware counters},
@@ -6,20 +6,20 @@ So far we've been talking about optimizing memory accesses and computations. How

In general, modern processors are very good at predicting branch outcomes. They not only follow static prediction rules but also detect dynamic patterns. Usually, branch predictors save the history of previous outcomes for the branches and try to guess what the next result will be. However, when the pattern becomes hard for the CPU branch predictor to follow, it may hurt performance.
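Here is a small sketch of such a hard-to-predict branch (our own illustration; the function name is made up). Its outcome depends only on the input data: when the data is sorted, the predictor sees long runs of the same outcome and does well, but with pseudorandom values no history-based predictor can do much better than a coin flip.

~~~~ {.cpp}
#include <cstdint>
#include <vector>

// Sum only the "large" elements. The branch below is easy to predict when
// `data` is sorted (long runs of the same outcome), but with pseudorandom
// values its outcome pattern is unpredictable and mispredictions pile up.
uint64_t sumLargeElements(const std::vector<uint8_t>& data) {
  uint64_t sum = 0;
  for (uint8_t v : data) {
    if (v >= 128)   // outcome depends purely on the input data
      sum += v;
  }
  return sum;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~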

Mispredicting a branch can add a significant speed penalty when it happens regularly. When such an event occurs, a CPU is required to clear all the speculative work that was done ahead of time and later was proven to be wrong. It also needs to flush the pipeline and start filling it with instructions from the correct path. Typically, modern CPUs experience 10 to 25-cycle penalties as a result of a branch misprediction. The exact number of cycles depends on the microarchitecture design, namely, on the depth of the pipeline and the mechanism used to recover from a mispredict.

Perhaps the most frequent reason for a branch mispredict is simply that it has a complicated outcome pattern (e.g., it exhibits pseudorandom behavior), which is unpredictable for a processor. For completeness, let's cover the other, less frequent reasons behind branch mispredicts. Branch predictors use caches and history registers and therefore are susceptible to the issues related to caches, namely:

- **Cold misses**: mispredictions may happen on the first dynamic occurrence of the branch when static prediction is employed and no dynamic history is available.
- **Capacity misses**: mispredictions arising from the loss of dynamic history due to a very high number of branches in the program or an exceedingly long dynamic pattern.
- **Conflict misses**: branches are mapped into cache buckets (associative sets) using a combination of their virtual and/or physical addresses. If too many active branches are mapped to the same set, a loss of history can occur. Another instance of a conflict miss is false sharing, when two independent branches are mapped to the same cache entry and interfere with each other, potentially degrading the prediction history.

A program will always experience a non-zero number of branch mispredictions. You can find out how much a program suffers from branch mispredictions by looking at the TMA `Bad Speculation` metric. It is normal for a general-purpose application to have a `Bad Speculation` metric in the range of 5-10\%. Our recommendation is to pay close attention once this metric goes higher than 10\%.

In the past, developers had an option of providing a prediction hint to an x86 processor in the form of an encoding prefix to the branch instruction (`0x2E: Branch Not Taken`, `0x3E: Branch Taken`). This could potentially improve performance on older microarchitectures, like Pentium 4. However, x86 processors ignored those hints for many generations until Intel's RedwoodCove started using them again. Its branch predictor is still good at finding dynamic patterns, but now it will use the encoded prediction hint for branches that have never been seen before (i.e., when there is no stored information about a branch). [@IntelOptimizationManual, Section 2.1.1.1 Branch Hint]
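At the source level, the closest portable knobs are the C++20 `[[likely]]`/`[[unlikely]]` attributes and the GCC/Clang `__builtin_expect` builtin. Below is a minimal sketch (the function name is illustrative); compilers primarily use such hints to lay out hot and cold basic blocks, and we make no assumption here about whether a given compiler also emits the encoded branch-hint prefix.

~~~~ {.cpp}
// Hint that the error path is rarely taken. Compilers typically use this to
// keep the hot path fall-through and move the cold block out of line; on
// pre-C++20 toolchains, __builtin_expect(cond, 0) serves the same purpose.
int processRequest(int size) {
  if (size <= 0) [[unlikely]] {
    return -1;      // cold error-handling path
  }
  return size * 2;  // hot path
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~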

There are indirect ways to reduce the branch misprediction rate by reducing the dynamic number of branch instructions. This approach helps because it alleviates the pressure on branch predictor structures. When a program executes fewer branch instructions, it may indirectly improve prediction of branches that previously suffered from capacity and conflict misses. Compiler transformations such as loop unrolling and vectorization help in reducing the dynamic branch count, though they don't specifically aim at improving the prediction rate of any given conditional statement. Profile-Guided Optimizations (PGO) and post-link optimizers (e.g., BOLT) are also effective at reducing branch mispredictions thanks to improving the fallthrough rate (straightening the code). We will discuss those techniques in the next chapter.[^1]
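For instance, here is a sketch (our own illustration) of how unrolling a loop by a factor of four cuts the number of executed loop-back branches roughly by four; compilers often perform this transformation automatically at higher optimization levels.

~~~~ {.cpp}
// Original loop: one conditional backward branch per element.
//   for (int i = 0; i < n; ++i) sum += a[i];

// Unrolled by 4: one conditional backward branch per four elements,
// which reduces the dynamic branch count and the pressure on
// branch predictor structures.
int sumUnrolled(const int* a, int n) {
  int sum = 0;
  int i = 0;
  for (; i + 4 <= n; i += 4)
    sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
  for (; i < n; ++i)   // remainder loop
    sum += a[i];
  return sum;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~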

The only direct way to get rid of branch mispredictions is to get rid of the branch instruction itself. In subsequent sections, we will take a look at both direct and indirect ways to improve branch prediction. In particular, we will explore the following techniques: replacing branches with lookup tables, arithmetic, bitwise operations, selection, and SIMD instructions.
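As a quick preview of the lookup-table technique, here is a small sketch (the functions and table are ours, for illustration only; for such a simple predicate a compiler may already generate good code, but it conveys the idea):

~~~~ {.cpp}
#include <array>

// Instead of a chain of comparisons, each a potential mispredict ...
inline bool isVowelBranchy(char c) {
  return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

// ... precompute the answers once and replace the branches with a load.
constexpr std::array<bool, 256> makeVowelTable() {
  std::array<bool, 256> t{};
  for (char c : {'a', 'e', 'i', 'o', 'u'})
    t[static_cast<unsigned char>(c)] = true;
  return t;
}

inline bool isVowelLUT(unsigned char c) {
  static constexpr auto kIsVowel = makeVowelTable();
  return kIsVowel[c];   // single memory access, no conditional branch
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~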

[^1]: There is conventional wisdom that never-taken branches are transparent to the branch predictor and cannot affect performance, and therefore it doesn't make much sense to remove them, at least from a prediction perspective. However, contrary to this wisdom, an experiment conducted by the authors of the BOLT optimizer demonstrated that replacing never-taken branches with equally sized no-ops in an application with a large code footprint, such as the Clang C++ compiler, leads to approximately a 5\% speedup on modern Intel CPUs. So it still pays to try to eliminate all branches.
@@ -20,22 +20,22 @@ Another common way is to replace conditional branches with a combination of bitw
Listing: Replacing branches in LFSR.

~~~~ {#lst:BrancheslessLFSR .cpp}
int lfsr(int x) {                    int lfsr(int x) {
  if (x < 0)                           x = (x << 1) ^ ((x >> 31) & CONSTANT);
    x = (x << 1) ^ CONSTANT;    =>     return x;
  else                               }
    x = (x << 1);
  return x;
}
; x86 machine code                   ; x86 machine code
lea ecx, [rdi + rdi]                 lea eax, [rdi + rdi]
mov eax, ecx                         sar edi, 31
xor eax, #CONSTANT                   and edi, #CONSTANT
test edi, edi                        xor eax, edi
cmovns eax, ecx
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In our example, we shift left the input value regardless of whether it is positive or negative. If the input value is negative, we XOR the result of the shift operation with a constant (the exact value is irrelevant for this scenario). In the modified version, we leverage the fact that an arithmetic right shift (`>>`) turns the sign of `x` (the high-order bit) into a mask of all zeros or all ones. The subsequent AND (`&`) operation produces either zero or the desired constant. The original version of the function takes ~4 cycles, while the modified version takes only 3 cycles, which was confirmed by running the code on Intel Core i7-1260P (12th Gen, Alderlake). It's worth mentioning that the Clang 17 compiler replaced the branch with a conditional select (CMOVNS) instruction, which we will cover in the next section. Nevertheless, with some smart bit manipulation, we were able to improve it even further.

As of the year 2024, compilers are usually unable to find these shortcuts on their own, so it is up to the programmer to do it manually. If you can find a way to replace a frequently mispredicted branch with arithmetic, you will likely see a performance improvement. You can find more examples of bit manipulation tricks in other books, for example [@HackersDelight].
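Here is one more classic trick in the same spirit, sketched as our own illustration (on modern compilers a plain ternary often compiles to `CMOV` anyway): computing the minimum of two integers without a conditional branch.

~~~~ {.cpp}
#include <cstdint>

// Branchless min of two signed 32-bit integers.
// (y ^ x) & -(x < y) evaluates to (y ^ x) when x < y and to 0 otherwise,
// so the final XOR yields x in the first case and y in the second.
int32_t branchlessMin(int32_t x, int32_t y) {
  return y ^ ((y ^ x) & -static_cast<int32_t>(x < y));
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~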
@@ -14,7 +14,7 @@ if (cond) { /* frequently mispredicted */ => int y = computeY();
foo(a);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the code on the right, the compiler can replace the branch that comes from the ternary operator and generate a `CMOV` x86 instruction instead. A `CMOVcc` instruction checks the state of one or more of the status flags in the `EFLAGS` register (`CF`, `OF`, `PF`, `SF`, and `ZF`) and performs a move operation if the flags are in a specified state or condition. A similar transformation can be done for floating-point numbers with the `FCMOVcc` and `VMAXSS`/`VMINSS` instructions. In the ARM ISA, there is the `CSEL` (conditional select) instruction, but also `CSINC` (select and increment), `CSNEG` (select and negate), and a few other conditional instructions.

Listing: Replacing Branches with Selection - x86 assembly code.

@@ -33,9 +33,9 @@

[@lst:ReplaceBranchesWithSelectionAsm] shows assembly listings for the original and the branchless version. In contrast with the original version, the branchless version doesn't have jump instructions. However, the branchless version calculates both `x` and `y` independently, and then selects one of the values and discards the other. While this transformation eliminates the penalty of a branch misprediction, it is doing more work than the original code.

We already know that the branch in the original version on the left is hard to predict. This is what motivates us to try a branchless version in the first place. In this example, the performance gain of this change depends on the characteristics of the `computeX` and `computeY` functions. If the functions are small[^1] and the compiler can inline them, then selection might bring noticeable performance benefits. If the functions are big[^2], it might be cheaper to take the cost of a branch mispredict than to execute both `computeX` and `computeY` functions. Ultimately, performance measurements always decide which version is better.

Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. On the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual outcome of the branch. By the time we start resolving the branch, we could already be halfway through the `foo` function, even though it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`; it all must be thrown away. If mispredictions occur too often, the recovery penalty outweighs the gains from speculative execution.

With conditional selection, it is different. There are no branches, so the processor doesn't have to speculate. It can execute the `computeX` and `computeY` functions in parallel. However, it cannot start running the code from `foo` until it computes the result of the `CMOVNE` instruction since `foo` uses it as an argument (data dependency). When you use conditional select instructions, you convert a control flow dependency into a data flow dependency.
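To make the trade-off concrete, below is a small sketch (our own illustration; actual code generation depends on the compiler and target) showing when a compiler can and cannot apply this if-conversion: selection requires that both sides are safe to evaluate unconditionally.

~~~~ {.cpp}
// The compiler is free to compute both operands here, so it will often emit
// a conditional select (CMOV/CSEL) instead of a branch.
int clampToZero(int x) {
  return x < 0 ? 0 : x;
}

// Here it generally cannot: loading through `p` on the wrong path could
// fault, so a real branch (and a potential misprediction) usually remains.
int loadOrDefault(const int* p) {
  return p != nullptr ? *p : -1;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~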
