Chapter 4 edits (#71)
* 4-0

* 4-3: fix inversion of CPI/IPC definitions, fix number problems

* 4-4: comma splices

* 4-9: normalize a lot of entries

* 4.10: normalize some things

* chapter 4: en dashes, number agreement, word ordering
dankamongmen committed Sep 14, 2024
1 parent 07bd550 commit 1ac2b43
Showing 9 changed files with 80 additions and 80 deletions.
@@ -2,7 +2,7 @@

# Terminology and Metrics in Performance Analysis {#sec:secMetrics}

- Like many engineering disciplines, Performance Analysis is quite heavy on using peculiar terms and metrics. For a beginner, it can be a very hard time looking into a profile generated by an analysis tool like Linux `perf` or Intel VTune Profiler. Those tools juggle many complex terms and metrics, however, it is a "must-know" if you're set to do any serious performance engineering work.
+ Like many engineering disciplines, Performance Analysis is quite heavy on using peculiar terms and metrics. For a beginner, it can be a very hard time looking into a profile generated by an analysis tool like Linux `perf` or Intel VTune Profiler. Those tools juggle many complex terms and metrics, however, they are "must-knows" if you're set to do any serious performance engineering work.

Since we have mentioned Linux `perf`, let us briefly introduce the tool as we have many examples of using it in this and later chapters. Linux `perf` is a performance profiler that you can use to find hotspots in a program, collect various low-level CPU performance events, analyze call stacks, and many other things. We will use Linux `perf` extensively throughout the book as it is one of the most popular performance analysis tools. Another reason why we prefer showcasing Linux `perf` is because it is open-source software, which enables enthusiastic readers to explore the mechanics of what's going on inside a modern profiling tool. This is especially useful for learning concepts presented in this book because GUI-based tools, like Intel® VTune™ Profiler, tend to hide all the complexity. We will have a more detailed overview of Linux `perf` in [@sec:secOverviewPerfTools].

@@ -1,12 +1,12 @@
## Memory Latency and Bandwidth {#sec:MemLatBw}

- Inefficient memory accesses are often a dominant performance bottleneck in modern environments. Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios, we will look at a few examples later. In this section, we will focus on measuring the peak performance of the memory subsystem components.
+ Inefficient memory accesses are often a dominant performance bottleneck in modern environments. Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios; we will look at a few examples later. In this section, we will focus on measuring the peak performance of the memory subsystem components.

One of the tools that can become helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool, however, users can download and build memory latency and bandwidth benchmarks from sources. Examples of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM).[^3]

We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing). A measuring thread allocates a very large buffer and initializes it so that each (64-byte) cache line within the buffer contains a pointer to another, but non-adjacent, cache line within the buffer. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting a certain level of the cache or in the main memory.

- Our system under test is an Intel Alderlake box with Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48 KB of L1 data cache and 1.25 MB of L2 cache. Every E-core has 32 KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2 MB L2 cache. All cores in the system are backed by an 18 MB L3 cache. If we use a 10 MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command:
+ Our system under test is an Intel Alderlake box with a Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48 KB of L1 data cache and 1.25 MB of L2 cache. Every E-core has 32 KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2 MB L2 cache. All cores in the system are backed by an 18 MB L3 cache. If we use a 10 MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command:

```bash
$ ./mlc --idle_latency -c0 -L -b10m
Each iteration took 31.1 base frequency clocks ( 12.5 ns)
```

@@ -21,9 +21,9 @@

The option `--idle_latency` measures read latency without loading the system. MLC has the `--loaded_latency` option to measure latency when there is memory traffic generated by other threads. The option `-c0` pins the measurement thread to logical CPU 0, which is on a P-core. The option `-L` enables large pages to limit TLB effects in our measurements. The option `-b10m` tells MLC to use a 10MB buffer, which will fit in the L3 cache on our system.

- Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1 KB to 48 KB buffer size corresponds to the L1d cache, which is private to each physical core. We can observe 0.9 ns latency for the E-core and a slightly higher 1.1 ns for the P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after a buffer size goes above 32 KB but E-core latency stays constant up to 48KB. That confirms that the L1d cache size in E-core is 32 KB, and in P-core it is 48 KB.
+ Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1 KB to 48 KB buffer size corresponds to the L1 D-cache, which is private to each physical core. We can observe 0.9 ns latency for the E-core and a slightly higher 1.1 ns for the P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after a buffer size goes above 32 KB but E-core latency stays constant up to 48KB. That confirms that the L1 D-cache size in E-core is 32 KB, and in P-core it is 48 KB.

- ![L1/L2/L3 cache read latencies (lower better) on Intel Core i7-1260P, measured with the mlc tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% }
+ ![L1/L2/L3 cache read latencies (lower better) on Intel Core i7-1260P, measured with the MLC tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% }

The second region shows the L2 cache latencies, which for E-core is almost twice higher than for P-core (5.9 ns vs. 3.2 ns). For P-core, the latency increases after we cross the 1.25 MB buffer size, which is expected. But we expect E-core latency to stay the same until 2 MB, which is not happening in our measurements.

@@ -52,11 +52,11 @@ There are a couple of new options here. The `-k` option specifies a list of CPU
Cores can draw much higher bandwidth from lower-level caches like L1 and L2 than from shared L3 cache or main memory. Shared caches such as L3 and E-core L2, scale reasonably well to serve requests from multiple cores at the same time. For example, a single E-core L2 bandwidth is 100GB/s. With two E-cores from the same cluster, I measured 140 GB/s, three E-cores - 165 GB/s, and all four E-cores can draw 175 GB/s from the shared L2. The same goes for L3 cache, which allows for 60 GB/s for a single P-core and only 25 GB/s for a single E-core. But when all the cores are used, the L3 cache can sustain a bandwidth of 300 GB/s.
- Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3Ghz.
+ Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3 GHz.
Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources. We will return to this topic in [@sec:roofline] when discussing the Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize the latencies and bandwidth of various components of the memory hierarchy or have them handy. It helps to establish the mental model for a system under test which will aid your further performance analysis as you will see next.
[^1]: Intel MLC tool - [https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html](https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html)
[^2]: lmbench - [https://sourceforge.net/projects/lmbench](https://sourceforge.net/projects/lmbench)
[^3]: Stream - [https://github.com/jeffhammond/STREAM](https://github.com/jeffhammond/STREAM)
[^4]: Memory bandwidth benchmark by Zack Smith - [https://zsmith.co/bandwidth.php](https://zsmith.co/bandwidth.php)
@@ -15,15 +15,15 @@ For this exercise, we run all four benchmarks on the machine with the following
* 64-bit Ubuntu 22.04.1 LTS (Jammy Jellyfish)
* Clang-15 C++ compiler with the following options: `-O3 -march=core-avx2`

- To collect performance metrics, we use `toplev.py` script that is a part of [pmu-tools](https://github.com/andikleen/pmu-tools)[^1] written by Andi Kleen:
+ To collect performance metrics, we use the `toplev.py` script from Andi Kleen's [pmu-tools](https://github.com/andikleen/pmu-tools):[^1]

```bash
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- <app with args>
```

Table {@tbl:perf_metrics_case_study} provides a side-by-side comparison of performance metrics for our four benchmarks. There is a lot we can learn about the nature of those workloads just by looking at the metrics. Here are the hypotheses we can make about the benchmarks before collecting performance profiles and diving deeper into the code of those applications.

- * __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low, it's only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute.
+ * __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low (only 1.58 GB/s), far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute.

* __Stockfish__. We ran it using only one hardware thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. The branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequent. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth is not an issue for Stockfish. Going further, we see that there are almost no FP operations in the workload. Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions.

@@ -104,7 +104,7 @@ Table: Performance Metrics of Four Benchmarks. {#tbl:perf_metrics_case_study}

\normalsize

- As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect take place in the four benchmarks that we have analyzed.
+ As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect to exist in the four benchmarks that we have analyzed.

Keep in mind that the summary of performance metrics in Table {@tbl:perf_metrics_case_study} only tells you about the *average* behavior of a program. For example, we might be looking at CloverLeaf's IPC of `0.2`, while in reality, it may never run with such an IPC, instead, it may have 2 phases of equal duration, one running with an IPC of `0.1`, and the second with IPC of `0.3`. Performance tools tackle this by reporting statistical data for each metric along with the average value. Usually, having min, max, 95th percentile, and variation (stdev/avg) is enough to understand the distribution. Also, some tools allow plotting the data, so you can see how the value for a certain metric changed during the program running time. As an example, Figure @fig:CloverMetricCharts shows the dynamics of IPC, L*MPKI, DRAM BW, and average frequency for the CloverLeaf benchmark. The `pmu-tools` package can automatically build those charts once you add the `--xlsx` and `--xchart` options. The `-I 10000` option aggregates collected samples with 10-second intervals.

@@ -116,7 +116,7 @@ Even though the deviation from the average values reported in the summary is not

![Performance metrics charts for the CloverLeaf benchmark with 10 second intervals.](../../img/terms-and-metrics/CloverMetricCharts2.png){#fig:CloverMetricCharts width=100% }

- In summary, performance metrics help you build the right mental model about what is and what is *not* happening in a program. Going further into analysis, this data will serve you well.
+ In summary, performance metrics help you build the right mental model about what is and what is *not* happening in a program. Going further into analysis, these data will serve you well.

[^1]: pmu-tools - [https://github.com/andikleen/pmu-tools](https://github.com/andikleen/pmu-tools)
[^2]: A possible explanation for that is because CloverLeaf is very memory-bandwidth bound. All P- and E-cores are equally stalled waiting on memory. Because P-cores have a higher frequency, they waste more CPU clocks than E-cores.
