[Grammar] Update 10-3 Replace branches with predication.md
dendibakh authored Sep 22, 2024
1 parent 92f53b5 commit 1cb1a77
Showing 1 changed file with 4 additions and 4 deletions.
@@ -35,13 +35,13 @@ Listing: Replacing Branches with Selection - x86 assembly code.

We already know that the branch in the original version on the left is hard to predict. This is what motivates us to try a branchless version in the first place. In this example, the performance gain of this change depends on the characteristics of `computeX` and `computeY` functions. If the functions are small[^1] and the compiler can inline them, then selection might bring noticeable performance benefits. If the functions are big[^2], it might be cheaper to take the cost of a branch mispredict than to execute both `computeX` and `computeY` functions.

- Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. One the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual result. By the time we start resolving the branch, we could be already halfway through `foo` function, despite it is still speculative. If we were correct, we've saved a lot of cycles. If we were wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`, it all must be thrown away. If the mispredictions occur too often, the recovering penalty outweighs the gains from speculative execution.
+ Take a look at [@lst:ReplaceBranchesWithSelectionAsm] one more time. On the left, a processor can predict, for example, that the `je 400514` branch will be taken, speculatively call `computeY`, and start running code from the function `foo`. Remember, branch prediction usually happens many cycles before we know the actual result. By the time we start resolving the branch, we could be already halfway through the `foo` function, despite it is still speculative. If we are correct, we've saved a lot of cycles. If we are wrong, we have to take the penalty and start over from the correct path. In the latter case, we don't gain anything from the fact that we have already completed a portion of `foo`, it all must be thrown away. If the mispredictions occur too often, the recovering penalty outweighs the gains from speculative execution.

With conditional selection, it is different. There are no branches, so the processor doesn't have to speculate. It can execute `computeX` and `computeY` functions in parallel. However, it cannot start running the code from `foo` until it computes the result of the `CMOVNE` instruction since `foo` uses it as an argument (data dependency). When you use conditional select instructions, you convert a control flow dependency into a data flow dependency.
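
To make the tradeoff concrete, here is a rough source-level sketch of the two forms being compared. It is not the book's exact listing: `computeX`, `computeY`, `foo`, and `cond` are stand-ins taken from the surrounding discussion.

```cpp
// Hypothetical stand-ins for the book's computeX/computeY/foo.
int computeX() { return 1; }
int computeY() { return 2; }
void foo(int) {}

void branchy(bool cond) {
  int a;
  if (cond)               // hard-to-predict branch: the CPU must speculate
    a = computeX();       // only one of the two functions runs
  else
    a = computeY();
  foo(a);
}

void branchless(bool cond) {
  int x = computeX();     // both functions always run...
  int y = computeY();
  int a = cond ? x : y;   // ...and a select (e.g., CMOVNE on x86) picks one value
  foo(a);                 // foo's argument now carries a data dependency on the select
}
```

Note that the compiler is not obliged to keep either form as written: it may turn the ternary back into a branch, or the `if ... else` into a select, which is exactly the decision the following paragraphs discuss.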

- To sum it up, for small `if ... else` statements that perform simple operations, conditional selects can be more efficient than branches, but only if the branch is hard to predict. So don't force compiler to generate conditional selects for every conditional statement. For conditional statements that are always correctly predicted, having a branch instruction is likely an optimal choice, because you allow the processor to speculate (correctly) and run ahead of the actual execution. And don't forget to measure the impact of your changes.
+ To sum it up, for small `if ... else` statements that perform simple operations, conditional selects can be more efficient than branches, but only if the branch is hard to predict. So don't force the compiler to generate conditional selects for every conditional statement. For conditional statements that are always correctly predicted, having a branch instruction is likely an optimal choice, because you allow the processor to speculate (correctly) and run ahead of the actual execution. And don't forget to measure the impact of your changes.

- Without profiling data, compilers don't have visibility into the misprediction rates. As a result, they usually prefer to generate branch instructions by default. Compilers are conservative at using selection and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data.[^4] Starting from Clang-17, the compiler now honors a `__builtin_unpredictable` hint for the x86 target, which indicates to the compiler that a branch condition is unpredictable. It can help influencing the compiler's decision but does not guarantee that the `CMOV` instruction will be generated. For example:
+ Without profiling data, compilers don't have visibility into the misprediction rates. As a result, they usually prefer to generate branch instructions by default. Compilers are conservative at using selection and may resist generating `CMOV` instructions even in simple cases. Again, the tradeoffs are complicated, and it is hard to make the right decision without the runtime data.[^4] Starting from Clang-17, the compiler now honors a `__builtin_unpredictable` hint for the x86 target, which indicates to the compiler that a branch condition is unpredictable. It can help influence the compiler's decision but does not guarantee that the `CMOV` instruction will be generated. For example:

```cpp
int a;
@@ -52,6 +52,6 @@ if (__builtin_unpredictable(cond)) {
}
```
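
The diff collapses the middle of this example. Based on the surrounding context, a plausible complete form of the hint looks like the sketch below; the `computeX`/`computeY` calls are assumed from the earlier discussion, not copied from the hidden lines.

```cpp
int a;
// Tells Clang 17+ that cond is hard to predict, nudging it toward a CMOV.
// It is only a hint; the compiler may still emit a branch.
if (__builtin_unpredictable(cond)) {
  a = computeX();
} else {
  a = computeY();
}
```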

- [^1]: Just a handfull instructions that can be completed in a few cycles.
+ [^1]: Just a handful of instructions that can be completed in a few cycles.
[^2]: More than twenty instructions that take more than twenty cycles.
[^4]: Hardware-based PGO (see [@sec:secPGO]) will be a huge step forward here.
