## Noise in Modern Systems {#sec:secFairExperiments}

There are many features in hardware and software that are designed to increase performance, but not all of them have deterministic behavior. Let's consider Dynamic Frequency Scaling (DFS), a feature that allows a CPU to increase its frequency far above the base frequency, letting it run significantly faster. DFS is also frequently referred to as *turbo* mode. Unfortunately, a CPU cannot stay in turbo mode for a long time, otherwise it risks overheating, so later it decreases its frequency to stay within its thermal limits. DFS usually depends a lot on the current system load and external factors, such as core temperature, which makes it hard to predict its impact on performance measurements.

Figure @fig:FreqScaling shows a typical example where DFS can cause variance in performance. In our scenario, we started two runs of a benchmark, one right after another, on a "cold" processor.[^1] During the first second, the first iteration of the benchmark ran at the maximum turbo frequency of 4.4 GHz, but later the CPU had to decrease its frequency below 4 GHz. The second run did not have the advantage of boosting the CPU frequency and did not enter turbo mode. Even though we ran the exact same version of the benchmark two times, the environment in which they ran was not the same. As you can see, the first run is 200 milliseconds faster than the second because it was running at a higher CPU frequency in the beginning. Such a scenario can frequently happen when you benchmark software on a laptop since laptops have limited heat dissipation.

![Variance in performance caused by dynamic frequency scaling: the first run is 200 milliseconds faster than the second.](../../img/measurements/FreqScaling.jpg){#fig:FreqScaling width=90%}

Remember that even running the Windows Task Manager or the Linux `top` program can affect measurements, since an additional CPU core will be activated and assigned to it. This might affect the frequency of the core that is running the actual benchmark.
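
To see this effect on your own machine, you can watch the reported core frequency while the benchmark is running. Below is a minimal sketch for Linux; it assumes the sysfs cpufreq interface is available and simply polls the frequency of core 0 (a tool like `turbostat` provides more detailed per-core data).

```bash
# A minimal sketch (Linux-only): poll the current frequency of core 0 once per
# second while your benchmark runs in another terminal. It assumes the sysfs
# cpufreq interface is present; the exact behavior depends on the frequency driver.
while true; do
  # scaling_cur_freq reports the core's current frequency in kHz
  awk '{ printf "core0: %.2f GHz\n", $1 / 1e6 }' \
      /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  sleep 1
done
```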

Frequency Scaling is an example of how a hardware feature can cause variations in our measurements; however, variations can also come from software. Let's consider benchmarking a `git status` command, which accesses many files on the disk. The filesystem plays a big role in performance in this scenario, in particular the filesystem cache. On the first run, the required entries in the filesystem cache are missing, the cache is not effective, and our `git status` command runs very slowly. The second time, however, the filesystem cache will be warmed up, making the run much faster than the first one.
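
To make the difference between a cold and a warm filesystem cache visible, you can explicitly drop the caches before the first run. The following is a minimal sketch for Linux; it assumes you run it from inside a Git repository and have root privileges for dropping the caches.

```bash
# A minimal sketch (Linux-only) of the cold-cache vs. warm-cache effect described
# above. Dropping the page cache requires root and slows down the whole system
# for a while, so only do this on a test machine.
sync                                         # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries, and inodes
time git status                              # cold run: filesystem cache is empty
time git status                              # warm run: cache is populated, much faster
```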

You're probably thinking about including a dry run before taking measurements. That certainly helps; unfortunately, measurement bias can persist through the runs as well. The [@Mytkowicz09] paper demonstrates that the UNIX environment size (i.e., the total number of bytes required to store the environment variables) or the link order (the order of object files that are given to the linker) can affect performance in unpredictable ways. There are numerous other ways in which memory layout may affect performance measurements.[^2]
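
If you want to convince yourself that the environment size alone can matter, a crude experiment is to run the same binary with environments of different sizes. Here is a minimal sketch; the binary name `./my_benchmark` and the `PADDING` variable are placeholders.

```bash
# A minimal sketch that probes sensitivity to the UNIX environment size, in the
# spirit of the experiment described above. ./my_benchmark and PADDING are
# placeholders. Changing the total size of the environment shifts the initial
# stack placement, which may (or may not) change the measured running time.
env -i ./my_benchmark                    # run with an (almost) empty environment
env -i PADDING="$(head -c 4096 /dev/zero | tr '\0' 'x')" ./my_benchmark   # ~4 KB larger environment
```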

Having consistent performance requires running all iterations of the benchmark under the same conditions. It is impossible to achieve 100% consistent results on every run of a benchmark, but perhaps you can get close by carefully controlling the environment. Eliminating non-determinism in a system is helpful for well-defined, stable performance tests, e.g., microbenchmarks. For instance, when you implement a code change and want to know the relative speedup by benchmarking two different versions of the same program. This is a scenario in which you can control most of the variability in the system, including the hardware configuration, OS settings, background processes, etc. In this situation, eliminating non-determinism helps you get a more consistent and accurate comparison. You can find some examples of features that can bring noise into performance measurements, and how to disable them, in Appendix A. Also, there are tools that can set up the environment to ensure benchmarking results with low variance; one such tool is [temci](https://github.com/parttimenerd/temci)[^14].
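
As a concrete illustration of this kind of environment control, the sketch below shows a few Linux commands that are commonly used to reduce run-to-run variance, in the spirit of what Appendix A describes. The benchmark binary name is a placeholder, and the exact sysfs path depends on the CPU frequency driver.

```bash
# A minimal sketch of environment pinning for low-variance microbenchmarking.
# All commands are Linux-specific and require root; the no_turbo knob applies
# to the intel_pstate driver. ./my_benchmark is a placeholder binary.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo   # disable turbo mode
sudo cpupower frequency-set --governor performance                # fix the frequency governor
taskset -c 3 ./my_benchmark                                       # pin the benchmark to core 3
```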

However, it is not possible to replicate the exact same environment and eliminate bias completely: there could be different temperature conditions, power delivery spikes, unexpected system interrupts, etc. Chasing all potential sources of noise and variation in a system can be a never-ending story. Sometimes it cannot be achieved, for example, when you're benchmarking a large distributed cloud service.


You should not eliminate system non-deterministic behavior when you want to measure the real-world performance impact of your change. Users of your application are likely to have all these features enabled, since they provide better performance. Yes, such features may contribute to performance instabilities, but they are designed to improve the overall performance of the system.[^3] In fact, your customers probably do not care about non-deterministic performance as long as their applications run as fast as possible. So, when you analyze the performance of a production application, you should try to replicate the target system configuration that you are optimizing for. Introducing any artificial tuning to the system will make the results diverge from what users of your service will see in practice.

[^1]: By a cold processor, we mean a CPU that has stayed in idle mode for a while, allowing its temperature to cool down.
[^2]: One approach to enable statistically sound performance analysis was presented in [@Curtsinger13]. This work showed that it's possible to eliminate measurement bias that comes from memory layout by repeatedly randomizing the placement of code, stack, and heap objects at runtime. Sadly, these ideas didn't go much further, and right now, this project is almost abandoned.
[^3]: Another downside of disabling non-deterministic performance features is that it makes a benchmark run longer. This is especially important for CI/CD performance testing when there are time limits for how long it should take to run the whole benchmark suite.
[^14]: Temci - [https://github.com/parttimenerd/temci](https://github.com/parttimenerd/temci).
