[Chapter2] Rewrite. part5
dendibakh committed Aug 30, 2024
chapters/2-Measuring-Performance/2-3 Performance Regressions.md

Performance regressions are defects that make software run slower than previous versions. Catching performance regressions (or improvements) requires detecting the commit(s) that changed the performance of the program. From database systems to search engines to compilers, performance regressions are commonly experienced by almost all large-scale software systems during their continuous evolution and deployment life cycle. It may be impossible to entirely avoid performance regressions during software development, but with proper testing and diagnostic tools, the likelihood of such defects silently leaking into production code can be reduced significantly.

It is useful to track the performance of your application with charts, like the one shown in Figure @fig:PerfRegress. Using such a chart, you can see historical trends and find moments where performance improved or dropped. Typically, you will have a separate line for each performance test you're tracking. Do not include too many benchmarks on a single chart, as it will become very noisy.
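
As an illustration (not part of the original text), here is a minimal sketch of how such a chart could be produced with Python and matplotlib; the CSV file name, column names, and test names are assumptions.

```python
# Sketch: plot one trend line per performance test from historical results.
# Assumes a hypothetical CSV with columns: date, test_name, score (higher is better).
import csv
from collections import defaultdict

import matplotlib.pyplot as plt

history = defaultdict(list)  # test_name -> list of (date, score) points
with open("perf_history.csv", newline="") as f:
    for row in csv.DictReader(f):
        history[row["test_name"]].append((row["date"], float(row["score"])))

for test_name, points in history.items():
    points.sort()                      # chronological order (assumes ISO-formatted dates)
    dates, scores = zip(*points)
    plt.plot(dates, scores, marker="o", label=test_name)

plt.ylabel("Score (higher is better)")
plt.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("perf_trend.png")
```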

Let's consider some potential solutions for detecting performance regressions. The first option that comes to mind is having humans look at the graphs. For the chart in Figure @fig:PerfRegress, humans will likely catch the performance regression that happened on August 7th, but it's not obvious that they will detect the later, smaller regressions. People tend to lose focus quickly and can miss regressions, especially on a busy chart. On top of that, it is a time-consuming and boring job that must be performed daily. It shouldn't be surprising that we want to move away from this option very quickly.

There is another interesting performance drop, on August 3rd. A developer would likely catch it as well; however, most of us would be tempted to dismiss it since performance recovered the next day. But are we sure that it was merely a glitch in the measurements? What if this was a real regression that was compensated for by an optimization on August 4th? If we could fix the regression *and* keep the optimization, we would have a performance score of around 4500. Do not dismiss such cases. One way to proceed here would be to repeat the measurements for the dates Aug 02 to Aug 04 and inspect the code changes made during that period.

![Performance graph (higher is better) for an application, showing a big drop in performance on August 7th and smaller ones later.](../../img/measurements/PerfRegressions.png){#fig:PerfRegress width=100%}

The second option is to have a threshold, say, 2%: every code modification whose performance impact is within that threshold is considered noise, and everything above the threshold is considered a regression. This is somewhat better than the first option but still has its own drawbacks. Fluctuations in performance tests are inevitable: sometimes, even a harmless code change can trigger performance variation in a benchmark.[^3] Choosing the right value for the threshold is extremely hard and does not guarantee a low rate of either false-positive or false-negative alarms. Setting the threshold too low might lead to analyzing a bunch of small regressions that were not caused by changes in the source code but by random noise. Setting the threshold too high might lead to filtering out real performance regressions.

Small regressions can slowly pile up into a bigger regression, which can go unnoticed. Going back to Figure @fig:PerfRegress, notice the downward trend that lasted from Aug 11 to Aug 21. The period started with a score of 3000 and ended at 2600. That is roughly a 13% regression over 10 days, or about 1.3% per day on average. If we set a 2% threshold, all of these daily regressions will be filtered out, yet the accumulated regression is much bigger than the threshold.
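
To make this failure mode concrete, here is a small sketch with hypothetical daily scores (not taken from the figure): every day-over-day drop stays below the 2% threshold, so a naive check never fires, even though the series loses about 13% overall.

```python
# Sketch: naive day-over-day threshold check (higher score is better).
THRESHOLD = 0.02  # 2%

# Hypothetical daily scores drifting from 3000 down to 2600 over 10 days.
scores = [3000, 2960, 2915, 2880, 2840, 2800, 2760, 2715, 2680, 2640, 2600]

for prev, curr in zip(scores, scores[1:]):
    daily_drop = (prev - curr) / prev
    if daily_drop > THRESHOLD:
        print(f"ALERT: {daily_drop:.1%} regression ({prev} -> {curr})")

# No alert is ever printed: each daily drop is below 2%, yet the
# accumulated regression is (3000 - 2600) / 3000, roughly 13%.
total_drop = (scores[0] - scores[-1]) / scores[0]
print(f"Accumulated regression: {total_drop:.1%}")
```

Comparing against a longer-term baseline instead of only the previous point mitigates this particular problem, but it does not remove the need to pick a threshold.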

Nevertheless, this option works reasonably well for some projects, especially if the level of noise in the benchmarks is very low. Also, you can adjust the threshold for each test. An example of a CI system where each test requires an explicit threshold value for alerting on a regression is [LUCI](https://chromium.googlesource.com/chromium/src.git/+/master/docs/tour_of_luci_ui.md),[^2] which is a part of the Chromium project.

The option that we recommend is to use a statistical approach. An algorithm for identifying performance regressions that has recently become popular is called "Change Point Detection" (see [@ChangePointAnalysis]). It utilizes historical data and identifies points in time where performance has changed. A CI system that uses such an approach can automatically display change points on the chart and open a new ticket in the bug-tracking system. Many performance monitoring systems have embraced change point detection algorithms, including several open-source projects. Search the web to find the one that best suits your needs.
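
To give a flavor of what this can look like in practice, below is a minimal sketch using the open-source `ruptures` Python package (one of many change-point detection libraries; the text does not prescribe a specific one). The score series is hypothetical.

```python
# Sketch: detect change points in a series of daily benchmark scores.
# Requires: pip install numpy ruptures
import numpy as np
import ruptures as rpt

# Hypothetical daily scores: stable around 4000, a regression to ~3600,
# then a recovery back to ~4000.
scores = np.array([4000, 4010, 3995, 4005, 3990,
                   3600, 3590, 3610, 3605, 3595,
                   3980, 4000, 3990, 4010, 4005], dtype=float)

# PELT search with an RBF cost; jump=1 considers every index as a candidate.
algo = rpt.Pelt(model="rbf", jump=1).fit(scores)

# 'pen' controls sensitivity: a higher penalty reports fewer change points.
# The last returned index is always the end of the series, so drop it.
change_points = algo.predict(pen=3)[:-1]
print("Distribution changed at indexes:", change_points)  # expected: [5, 10]
```

In a CI setting, the detected indexes would be mapped back to the corresponding commit ranges so that a ticket can be opened automatically.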

A typical CI performance tracking system should automate the following actions (a minimal driver sketch follows the list):

1. Set up a system under test.
2. Run a benchmark suite.
5. Alert on unexpected changes in performance.
6. Visualize the results for a human to analyze.
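
Below is a minimal sketch of a nightly driver that strings these steps together; all script names and the alerting hook are hypothetical placeholders, not references to real tools.

```python
# Sketch of a nightly CI driver for the steps listed above.
# Every external script referenced here is a hypothetical placeholder.
import subprocess
import sys

def run(cmd, **kwargs):
    """Run a command, echo it, and stop the pipeline if it fails."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, text=True, **kwargs)

def main() -> int:
    run(["./provision_sut.sh"])                            # set up the system under test
    run(["./run_benchmarks.sh", "--out", "results.json"])  # run the benchmark suite
    # Compare today's results against historical data (e.g., a change-point check).
    report = run(["./detect_changes.py", "results.json"],
                 capture_output=True).stdout
    if report.strip():
        run(["./open_ticket.sh", report])                  # alert on unexpected changes
    run(["./update_dashboard.sh", "results.json"])         # visualize for humans to analyze
    return 0

if __name__ == "__main__":
    sys.exit(main())
```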

CI systems should support both automated and manual benchmarking, yield repeatable results, and open tickets for the performance regressions they find. It is very important to detect regressions promptly. First, when a regression is detected quickly, fewer changes have been merged since it happened, which makes it easier to find the offending commit and lets the person responsible look into the problem before they move on to another task. Also, it is a lot easier for a developer to approach the regression while all the details are still fresh in their head, as opposed to several weeks later.

Lastly, the CI system should alert not just on performance regressions, but also on unexpected performance improvements. For example, someone may check in a seemingly innocuous commit which, nonetheless, improves performance by 10% in the automated tracking harness. Your initial instinct may be to celebrate this fortuitous performance boost and proceed with your day. However, while this commit may have passed all functional tests in your CI pipeline, chances are that this unexpected improvement uncovered a gap in functional testing that only manifested itself in the performance regression results. For instance, the change may have caused the application to skip part of its work that was not covered by functional tests. This scenario occurs often enough that it warrants explicit mention: treat the automated performance regression harness as part of a holistic software testing framework.

To wrap it up, we highly recommend setting up an automated statistical performance tracking system. Try using different algorithms and see which works best for your application. It will certainly take time, but it will be a solid investment in the future performance health of your project.

[^2]: LUCI - [https://chromium.googlesource.com/chromium/src.git/+/master/docs/tour_of_luci_ui.md](https://chromium.googlesource.com/chromium/src.git/+/master/docs/tour_of_luci_ui.md)
[^3]: The following article shows that changing the order of the functions or removing dead functions can cause variations in performance: [https://easyperf.net/blog/2018/01/18/Code_alignment_issues](https://easyperf.net/blog/2018/01/18/Code_alignment_issues)