From 1ac2b43cc6b0ac870bb53ad1a65da86bffbef843 Mon Sep 17 00:00:00 2001 From: nick black Date: Sat, 14 Sep 2024 16:52:10 -0400 Subject: [PATCH] Chapter 4 edits (#71) * 4-0 * 4-3: fix inversion of CPI/IPC definitions, fix number problems * 4-4: comma splices * 4-9: normalize a lot of entries * 4.10: normalize some things * chapter 4: en dashes, number agreement, word ordering --- ...ogy and metrics in performance analysis.md | 2 +- .../4-10 Memory Latency and Bandwidth.md | 12 +-- .../4-11 Case Study of 4 Benchmarks.md | 10 +- .../4-15 Questions-Exercises.md | 6 +- .../4-16 Chapter summary.md | 8 +- .../4-3 CPI and IPC.md | 16 ++-- chapters/4-Terminology-And-Metrics/4-4 UOP.md | 10 +- .../4-7 Cache miss.md | 4 +- .../4-9 Performance Metrics.md | 92 +++++++++---------- 9 files changed, 80 insertions(+), 80 deletions(-) diff --git a/chapters/4-Terminology-And-Metrics/4-0 Terminology and metrics in performance analysis.md b/chapters/4-Terminology-And-Metrics/4-0 Terminology and metrics in performance analysis.md index 232b8c6042..fbe0862848 100644 --- a/chapters/4-Terminology-And-Metrics/4-0 Terminology and metrics in performance analysis.md +++ b/chapters/4-Terminology-And-Metrics/4-0 Terminology and metrics in performance analysis.md @@ -2,7 +2,7 @@ # Terminology and Metrics in Performance Analysis {#sec:secMetrics} -Like many engineering disciplines, Performance Analysis is quite heavy on using peculiar terms and metrics. For a beginner, it can be a very hard time looking into a profile generated by an analysis tool like Linux `perf` or Intel VTune Profiler. Those tools juggle many complex terms and metrics, however, it is a "must-know" if you're set to do any serious performance engineering work. +Like many engineering disciplines, Performance Analysis is quite heavy on using peculiar terms and metrics. For a beginner, it can be a very hard time looking into a profile generated by an analysis tool like Linux `perf` or Intel VTune Profiler. Those tools juggle many complex terms and metrics, however, they are "must-knows" if you're set to do any serious performance engineering work. Since we have mentioned Linux `perf`, let us briefly introduce the tool as we have many examples of using it in this and later chapters. Linux `perf` is a performance profiler that you can use to find hotspots in a program, collect various low-level CPU performance events, analyze call stacks, and many other things. We will use Linux `perf` extensively throughout the book as it is one of the most popular performance analysis tools. Another reason why we prefer showcasing Linux `perf` is because it is open-source software, which enables enthusiastic readers to explore the mechanics of what's going on inside a modern profiling tool. This is especially useful for learning concepts presented in this book because GUI-based tools, like Intel® VTune™ Profiler, tend to hide all the complexity. We will have a more detailed overview of Linux `perf` in [@sec:secOverviewPerfTools]. diff --git a/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md b/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md index 5f01f4faf9..e2e41630f7 100644 --- a/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md +++ b/chapters/4-Terminology-And-Metrics/4-10 Memory Latency and Bandwidth.md @@ -1,12 +1,12 @@ ## Memory Latency and Bandwidth {#sec:MemLatBw} -Inefficient memory accesses are often a dominant performance bottleneck in modern environments. 
Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios, we will look at a few examples later. In this section, we will focus on measuring the peak performance of the memory subsystem components. +Inefficient memory accesses are often a dominant performance bottleneck in modern environments. Thus, how quickly a processor can fetch data from the memory subsystem is a critical factor in determining application performance. There are two aspects of memory performance: 1) how fast a CPU can fetch a single byte from memory (latency), and 2) how many bytes it can fetch per second (bandwidth). Both are important in various scenarios; we will look at a few examples later. In this section, we will focus on measuring the peak performance of the memory subsystem components. One of the tools that can become helpful on x86 platforms is Intel Memory Latency Checker (MLC),[^1] which is available for free on Windows and Linux. MLC can measure cache and memory latency and bandwidth using different access patterns and under load. On ARM-based systems there is no similar tool, however, users can download and build memory latency and bandwidth benchmarks from sources. Examples of such projects are [lmbench](https://sourceforge.net/projects/lmbench/)[^2], [bandwidth](https://zsmith.co/bandwidth.php)[^4] and [Stream](https://github.com/jeffhammond/STREAM).[^3] We will only focus on a subset of metrics, namely idle read latency and read bandwidth. Let's start with the read latency. Idle means that while we do the measurements, the system is idle. This will give us the minimum time required to fetch data from memory system components, but when the system is loaded by other "memory-hungry" applications, this latency increases as there may be more queueing for resources at various points. MLC measures idle latency by doing dependent loads (also known as pointer chasing). A measuring thread allocates a very large buffer and initializes it so that each (64-byte) cache line within the buffer contains a pointer to another, but non-adjacent, cache line within the buffer. By appropriately sizing the buffer, we can ensure that almost all the loads are hitting a certain level of the cache or in the main memory. -Our system under test is an Intel Alderlake box with Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48 KB of L1 data cache and 1.25 MB of L2 cache. Every E-core has 32 KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2 MB L2 cache. All cores in the system are backed by an 18 MB L3 cache. If we use a 10 MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command: +Our system under test is an Intel Alderlake box with a Core i7-1260P CPU and 16GB DDR4 @ 2400 MT/s dual-channel memory. The processor has 4P (Performance) hyperthreaded and 8E (Efficient) cores. Every P-core has 48 KB of L1 data cache and 1.25 MB of L2 cache. Every E-core has 32 KB of L1 data cache, and four E-cores form a cluster that has access to a shared 2 MB L2 cache. All cores in the system are backed by an 18 MB L3 cache. 
If we use a 10 MB buffer, we can be certain that repeated accesses to that buffer would miss in L2 but hit in L3. Here is the example `mlc` command:

```bash
$ ./mlc --idle_latency -c0 -L -b10m
@@ -21,9 +21,9 @@ Each iteration took 31.1 base frequency clocks ( 12.5 ns)

The option `--idle_latency` measures read latency without loading the system. MLC has the `--loaded_latency` option to measure latency when there is memory traffic generated by other threads. The option `-c0` pins the measurement thread to logical CPU 0, which is on a P-core. The option `-L` enables large pages to limit TLB effects in our measurements. The option `-b10m` tells MLC to use a 10MB buffer, which will fit in the L3 cache on our system.

-Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1 KB to 48 KB buffer size corresponds to the L1d cache, which is private to each physical core. We can observe 0.9 ns latency for the E-core and a slightly higher 1.1 ns for the P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after a buffer size goes above 32 KB but E-core latency stays constant up to 48KB. That confirms that the L1d cache size in E-core is 32 KB, and in P-core it is 48 KB.
+Figure @fig:MemoryLatenciesCharts shows the read latencies of L1, L2, and L3 caches. There are four different regions on the chart. The first region on the left from 1 KB to 48 KB buffer size corresponds to the L1 D-cache, which is private to each physical core. We can observe 0.9 ns latency for the E-core and a slightly higher 1.1 ns for the P-core. Also, we can use this chart to confirm the cache sizes. Notice how E-core latency starts climbing after the buffer size goes above 32 KB, while P-core latency stays constant up to 48 KB. That confirms that the L1 D-cache size in E-core is 32 KB, and in P-core it is 48 KB.

-![L1/L2/L3 cache read latencies (lower better) on Intel Core i7-1260P, measured with the mlc tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% }
+![L1/L2/L3 cache read latencies (lower better) on Intel Core i7-1260P, measured with the MLC tool, large pages enabled.](../../img/terms-and-metrics/MemLatencies.png){#fig:MemoryLatenciesCharts width=100% }

The second region shows the L2 cache latencies, which for E-core is almost twice higher than for P-core (5.9 ns vs. 3.2 ns). For P-core, the latency increases after we cross the 1.25 MB buffer size, which is expected. But we expect E-core latency to stay the same until 2 MB, which is not happening in our measurements.

@@ -52,11 +52,11 @@ There are a couple of new options here. The `-k` option specifies a list of CPU

Cores can draw much higher bandwidth from lower-level caches like L1 and L2 than from shared L3 cache or main memory. Shared caches such as L3 and E-core L2, scale reasonably well to serve requests from multiple cores at the same time. For example, a single E-core L2 bandwidth is 100GB/s. With two E-cores from the same cluster, I measured 140 GB/s, three E-cores - 165 GB/s, and all four E-cores can draw 175 GB/s from the shared L2. The same goes for L3 cache, which allows for 60 GB/s for a single P-core and only 25 GB/s for a single E-core. But when all the cores are used, the L3 cache can sustain a bandwidth of 300 GB/s.
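The same command can also be pointed at main memory instead of the L3 cache simply by choosing a buffer that is much larger than the 18 MB L3. The sketch below reuses only the options introduced above; the 500 MB buffer size is an arbitrary choice, and the numbers you get will depend on your own machine:

```bash
# Idle read latency of DRAM: a 500 MB buffer cannot fit in any cache level,
# so almost every dependent load in the pointer-chasing chain is served by memory.
$ ./mlc --idle_latency -c0 -L -b500m
```

On this system we would expect the reported latency to be roughly an order of magnitude higher than the 12.5 ns measured for L3, landing in the 70--110 ns range that is typical for modern platforms.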
-Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3Ghz. +Notice, that we measure latency in nanoseconds and bandwidth in GB/s, thus they also depend on the frequency at which cores are running. In various circumstances, the observed numbers may be different. For example, let's assume that when running solely on the system at full turbo frequency, a P-core has L1 latency `X` and L1 bandwidth `Y`. When the system is fully loaded, we may observe these metrics change to `1.25X` and `0.75Y` respectively. To mitigate the frequency effects, instead of nanoseconds, latencies and metrics can be represented using core cycles, normalized to some sample frequency, say 3 GHz. Knowledge of the primary characteristics of a machine is fundamental to assessing how well a program utilizes available resources. We will return to this topic in [@sec:roofline] when discussing the Roofline performance model. If you constantly analyze performance on a single platform, it is a good idea to memorize the latencies and bandwidth of various components of the memory hierarchy or have them handy. It helps to establish the mental model for a system under test which will aid your further performance analysis as you will see next. [^1]: Intel MLC tool - [https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html](https://www.intel.com/content/www/us/en/download/736633/intel-memory-latency-checker-intel-mlc.html) [^2]: lmbench - [https://sourceforge.net/projects/lmbench](https://sourceforge.net/projects/lmbench) [^3]: Stream - [https://github.com/jeffhammond/STREAM](https://github.com/jeffhammond/STREAM) -[^4]: Memory bandwidth benchmark by Zack Smith - [https://zsmith.co/bandwidth.php](https://zsmith.co/bandwidth.php) \ No newline at end of file +[^4]: Memory bandwidth benchmark by Zack Smith - [https://zsmith.co/bandwidth.php](https://zsmith.co/bandwidth.php) diff --git a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md index e4c668e0e9..609bfe01c0 100644 --- a/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md +++ b/chapters/4-Terminology-And-Metrics/4-11 Case Study of 4 Benchmarks.md @@ -15,7 +15,7 @@ For this exercise, we run all four benchmarks on the machine with the following * 64-bit Ubuntu 22.04.1 LTS (Jammy Jellyfish) * Clang-15 C++ compiler with the following options: `-O3 -march=core-avx2` -To collect performance metrics, we use `toplev.py` script that is a part of [pmu-tools](https://github.com/andikleen/pmu-tools)[^1] written by Andi Kleen: +To collect performance metrics, we use the `toplev.py` script from Andi Kleen's [pmu-tools](https://github.com/andikleen/pmu-tools):[^1] ```bash $ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- @@ -23,7 +23,7 @@ $ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v -- Table {@tbl:perf_metrics_case_study} provides a side-by-side comparison of performance 
metrics for our four benchmarks. There is a lot we can learn about the nature of those workloads just by looking at the metrics. Here are the hypotheses we can make about the benchmarks before collecting performance profiles and diving deeper into the code of those applications. -* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low, it's only 1.58 GB/s, far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute. +* __Blender__. The work is split fairly equally between P-cores and E-cores, with a decent IPC on both core types. The number of cache misses per kilo instructions is pretty low (see `L*MPKI`). Branch misprediction presents a minor bottleneck: the `Br. Misp. Ratio` metric is at `2%`; we get 1 misprediction every `610` instructions (see `IpMispredict` metric), which is quite good. TLB is not a bottleneck as we very rarely miss in STLB. We ignore the `Load Miss Latency` metric since the number of cache misses is very low. The ILP is reasonably high. Goldencove is a 6-wide architecture; an ILP of `3.67` means that the algorithm utilizes almost `2/3` of the core resources every cycle. Memory bandwidth demand is low (only 1.58 GB/s), far from the theoretical maximum for this machine. Looking at the `Ip*` metrics we can tell that Blender is a floating-point algorithm (see `IpFLOP` metric), a large portion of which is vectorized FP operations (see `IpArith AVX128`). But also, some portions of the algorithm are non-vectorized scalar FP single precision instructions (`IpArith Scal SP`). Also, notice that every 90th instruction is an explicit software memory prefetch (`IpSWPF`); we expect to see those hints in Blender's source code. Conclusion: Blender's performance is bound by FP compute. * __Stockfish__. We ran it using only one hardware thread, so there is zero work on E-cores, as expected. The number of L1 misses is relatively high, but then most of them are contained in L2 and L3 caches. The branch misprediction ratio is high; we pay the misprediction penalty every `215` instructions. We can estimate that we get one mispredict every `215 (instructions) / 1.80 (IPC) = 120` cycles, which is very frequent. Similar to the Blender reasoning, we can say that TLB and DRAM bandwidth is not an issue for Stockfish. Going further, we see that there are almost no FP operations in the workload. 
Conclusion: Stockfish is an integer compute workload, which is heavily affected by branch mispredictions. @@ -104,7 +104,7 @@ Table: Performance Metrics of Four Benchmarks. {#tbl:perf_metrics_case_study} \normalsize -As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect take place in the four benchmarks that we have analyzed. +As you can see from this study, there is a lot one can learn about the behavior of a program just by looking at the metrics. It answers the "what?" question, but doesn't tell you the "why?". For that, you will need to collect a performance profile, which we will introduce in later chapters. In Part 2 of this book, we will discuss how to mitigate the performance issues we suspect to exist in the four benchmarks that we have analyzed. Keep in mind that the summary of performance metrics in Table {@tbl:perf_metrics_case_study} only tells you about the *average* behavior of a program. For example, we might be looking at CloverLeaf's IPC of `0.2`, while in reality, it may never run with such an IPC, instead, it may have 2 phases of equal duration, one running with an IPC of `0.1`, and the second with IPC of `0.3`. Performance tools tackle this by reporting statistical data for each metric along with the average value. Usually, having min, max, 95th percentile, and variation (stdev/avg) is enough to understand the distribution. Also, some tools allow plotting the data, so you can see how the value for a certain metric changed during the program running time. As an example, Figure @fig:CloverMetricCharts shows the dynamics of IPC, L*MPKI, DRAM BW, and average frequency for the CloverLeaf benchmark. The `pmu-tools` package can automatically build those charts once you add the `--xlsx` and `--xchart` options. The `-I 10000` option aggregates collected samples with 10-second intervals. @@ -116,7 +116,7 @@ Even though the deviation from the average values reported in the summary is not ![Performance metrics charts for the CloverLeaf benchmark with 10 second intervals.](../../img/terms-and-metrics/CloverMetricCharts2.png){#fig:CloverMetricCharts width=100% } -In summary, performance metrics help you build the right mental model about what is and what is *not* happening in a program. Going further into analysis, this data will serve you well. +In summary, performance metrics help you build the right mental model about what is and what is *not* happening in a program. Going further into analysis, these data will serve you well. [^1]: pmu-tools - [https://github.com/andikleen/pmu-tools](https://github.com/andikleen/pmu-tools) -[^2]: A possible explanation for that is because CloverLeaf is very memory-bandwidth bound. All P- and E-cores are equally stalled waiting on memory. Because P-cores have a higher frequency, they waste more CPU clocks than E-cores. \ No newline at end of file +[^2]: A possible explanation for that is because CloverLeaf is very memory-bandwidth bound. All P- and E-cores are equally stalled waiting on memory. Because P-cores have a higher frequency, they waste more CPU clocks than E-cores. 
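As a reference for the charting workflow described just before Figure @fig:CloverMetricCharts, here is a sketch of the full command line. It simply combines the `toplev.py` invocation shown earlier with the `-I 10000`, `--xlsx`, and `--xchart` options mentioned in the text; the output file name `clover.xlsx` and the `./clover_leaf` binary path are placeholders, not commands taken from the original study:

```bash
# Collect TMA metrics in 10-second intervals and emit per-metric charts
# (clover.xlsx and ./clover_leaf are placeholders for your own paths).
$ ~/workspace/pmu-tools/toplev.py -m --global --no-desc -v \
    -I 10000 --xlsx clover.xlsx --xchart -- ./clover_leaf
```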
diff --git a/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md b/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md index 392e1ea007..34a368687c 100644 --- a/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md +++ b/chapters/4-Terminology-And-Metrics/4-15 Questions-Exercises.md @@ -3,10 +3,10 @@ \markright{Questions and Exercises} 1. What is the difference between the CPU core clock and the reference clock? -2. What is the difference between retired and executed instruction? +2. What is the difference between retired and executed instructions? 3. When you increase the frequency, does IPC go up, down, or stay the same? 4. Take a look at the `DRAM BW Use` formula in Table {@tbl:perf_metrics}. Why do you think there is a constant `64`? -5. Measure the bandwidth and latency of the cache hierarchy and memory on the machine you use for development/benchmarking using MLC, stream or other tools. +5. Measure the bandwidth and latency of the cache hierarchy and memory on the machine you use for development/benchmarking using MLC, Stream or other tools. 6. Run the application that you're working with on a daily basis. Collect performance metrics. Does anything surprise you? -**Capacity Planning Exercise**: Imagine you are the owner of four applications we benchmarked in the case study. The management of your company has asked you to build a small computing farm for each of those applications with the primary goal being to maximize performance (throughput). The spending budget you were given is tight but enough to buy 1 mid-level server system (Mac Studio, Supermicro/Dell/HPE server rack, etc.) or 1 high-end desktop (with overclocked CPU, liquid cooling, top GPU, fast DRAM) to run each workload, so 4 machines in total. Those could be all four different systems. Also, you can use the money to buy 3-4 low-end systems, the choice is yours. The management wants to keep it under $10,000 per application, but they are flexible (10-20%) if you can justify the expense. Assume that Stockfish remains single-threaded. Look at the performance characteristics for the four applications once again and write down which computer parts (CPU, memory, discrete GPU if needed) you would buy for each of those workloads. Which specification parameters you will prioritize? Where you will go with the most expensive part and where you can save money? Try to describe it in as much detail as possible, and search the web for exact components and their prices. Account for all the components of the system: motherboard, disk drive, cooling solution, power delivery unit, rack/case/tower, etc. What additional performance experiments you would run to guide your decision? \ No newline at end of file +**Capacity Planning Exercise**: Imagine you are the owner of four applications we benchmarked in the case study. The management of your company has asked you to build a small computing farm for each of those applications with the primary goal being maximum performance (throughput). The spending budget you were given is tight but enough to buy 1 mid-level server system (Mac Studio, Supermicro/Dell/HPE server rack, etc.) or 1 high-end desktop (with overclocked CPU, liquid cooling, top GPU, fast DRAM) to run each workload, so 4 machines in total. Those could be all four different systems. Also, you can use the money to buy 3-4 low-end systems; the choice is yours. The management wants to keep it under $10,000 per application, but they are flexible (10--20%) if you can justify the expense. 
Assume that Stockfish remains single-threaded. Look at the performance characteristics for the four applications once again and write down which computer parts (CPU, memory, discrete GPU if needed) you would buy for each of those workloads. Which parameters will you prioritize? Where will you go with the most expensive part? Where can you save money? Try to describe it in as much detail as possible, and search the web for exact components and their prices. Account for all the components of the system: motherboard, disk drive, cooling solution, power delivery unit, rack/case/tower, etc. What additional performance experiments would you run to guide your decision?
diff --git a/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md b/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md
index 8f711a5bad..de1431db15 100644
--- a/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md
+++ b/chapters/4-Terminology-And-Metrics/4-16 Chapter summary.md
@@ -2,10 +2,10 @@

\markright{Summary}

-* In this chapter, we introduced the basic metrics in performance analysis such as retired/executed instructions, CPU utilization, IPC/CPI, $\mu$ops, pipeline slots, core/reference clocks, cache misses and branch mispredictions. We showed how each of these metrics can be collected with Linux perf.
-* For more advanced performance analysis, there are many derivative metrics that you can collect. For instance, cache misses per kilo instructions (MPKI), instructions per function call, branch, load, etc (Ip*), ILP, MLP, and others. The case studies in this chapter show how you can get actionable insights from analyzing these metrics.
-* Be careful about drawing conclusions just by looking at the aggregate numbers. Don't fall into the trap of "Excel performance engineering", i.e., only collect performance metrics and never look at the code. Always seek a second source of data (e.g., performance profiles, discussed later) to verify your ideas.
-* Memory bandwidth and latency are crucial factors in the performance of many production software packages nowadays, including AI, HPC, databases, and many general-purpose applications. Memory bandwidth depends on the DRAM speed (in MT/s) and the number of memory channels. Modern high-end server platforms have 8-12 memory channels and can reach up to 500 GB/s for the whole system and up to 50 GB/s in single-threaded mode. Memory latency nowadays doesn't change a lot, in fact, it is getting slightly worse with new DDR4 and DDR5 generations. The majority of modern systems fall in the range of 70--110 ns latency per memory access.
+* In this chapter, we introduced the basic metrics in performance analysis such as retired/executed instructions, CPU utilization, IPC/CPI, $\mu$ops, pipeline slots, core/reference clocks, cache misses and branch mispredictions. We showed how each of these metrics can be collected with Linux `perf`.
+* For more advanced performance analysis, there are many derivative metrics that you can collect. For instance, cache misses per kilo instructions (MPKI), instructions per function call, branch, load, etc. (Ip*), ILP, MLP, and others. The case studies in this chapter show how you can get actionable insights from analyzing these metrics.
+* Be careful about drawing conclusions just by looking at the aggregate numbers. Don't fall into the trap of "Excel performance engineering", i.e., only collecting performance metrics and never looking at the code.
Always seek a second source of data (e.g., performance profiles, discussed later) to verify your ideas. +* Memory bandwidth and latency are crucial factors in the performance of many production software packages nowadays, including AI, HPC, databases, and many general-purpose applications. Memory bandwidth depends on the DRAM speed (in MT/s) and the number of memory channels. Modern high-end server platforms have 8--12 memory channels and can reach up to 500 GB/s for the whole system and up to 50 GB/s in single-threaded mode. Memory latency nowadays doesn't change a lot, in fact, it is getting slightly worse with new DDR4 and DDR5 generations. The majority of modern systems fall in the range of 70--110 ns latency per memory access. \sectionbreak diff --git a/chapters/4-Terminology-And-Metrics/4-3 CPI and IPC.md b/chapters/4-Terminology-And-Metrics/4-3 CPI and IPC.md index d944cc1166..9c3e191d06 100644 --- a/chapters/4-Terminology-And-Metrics/4-3 CPI and IPC.md +++ b/chapters/4-Terminology-And-Metrics/4-3 CPI and IPC.md @@ -4,21 +4,21 @@ Those are two fundamental metrics that stand for: -* Cycles Per Instruction (CPI) - how many cycles it took to retire one instruction on average. +* Instructions Per Cycle (IPC) - how many instructions were retired per cycle on average. $$ IPC = \frac{INST\_RETIRED.ANY}{CPU\_CLK\_UNHALTED.THREAD}, $$ - where `INST_RETIRED.ANY` counts the number of retired instructions, and `CPU_CLK_UNHALTED.THREAD` counts the number of core cycles while the thread is not in a halt state. +where `INST_RETIRED.ANY` counts the number of retired instructions, and `CPU_CLK_UNHALTED.THREAD` counts the number of core cycles while the thread is not in a halt state. -* Instructions Per Cycle (IPC) - how many instructions were retired per cycle on average. +* Cycles Per Instruction (CPI) - how many cycles it took to retire one instruction on average. $$ CPI = \frac{1}{IPC} $$ -Using one or another is a matter of preference. The main author of the book prefers to use `IPC` as it is easier to compare. With IPC, we want as many instructions per cycle as possible, so the higher the IPC, the better. With `CPI`, it's the opposite: we want as few cycles per instruction as possible, so the lower the CPI the better. The comparison that uses "the higher the better" metric is simpler since you don't have to do the mental inversion every time. In the rest of the book, we will mostly use IPC, but again, there is nothing wrong with using CPI either. +Using one or another is a matter of preference. The main author of the book prefers to use IPC as it is easier to compare. With IPC, we want as many instructions per cycle as possible, so the higher the IPC, the better. With `CPI`, it's the opposite: we want as few cycles per instruction as possible, so the lower the CPI the better. The comparison that uses "the higher the better" metric is simpler since you don't have to do the mental inversion every time. In the rest of the book, we will mostly use IPC, but again, there is nothing wrong with using CPI either. The relationship between IPC and CPU clock frequency is very interesting. In the broad sense, `performance = work / time`, where we can express work as the number of instructions and time as seconds. The number of seconds a program was running can be calculated as `total cycles / frequency`: @@ -28,13 +28,13 @@ $$ As we can see, performance is proportional to IPC and frequency. If we increase any of the two metrics, the performance of a program will grow. 
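To make the relationship concrete, here is a small worked example with hypothetical round numbers (they are not measurements from any system discussed in this book). Suppose a program retires $12 \times 10^9$ instructions in $5 \times 10^9$ core cycles on a core running at 4 GHz:

$$
IPC = \frac{12 \times 10^9}{5 \times 10^9} = 2.4, \qquad Time = \frac{5 \times 10^9~cycles}{4 \times 10^9~Hz} = 1.25~s.
$$

Doubling either the IPC (to 4.8) or the frequency (to 8 GHz), while holding the other constant, halves the running time.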
-From the perspective of benchmarking, IPC and frequency are two independent metrics. We've seen many engineers mistakenly mixing them up and thinking that if you increase the frequency, the IPC will also go up. But it's not, the IPC will stay the same. If you clock a processor at 1 GHz instead of 5 GHz, for many applications you will still get the same IPC.[^1] It may sound very confusing, especially since IPC has all to do with CPU clocks. However, frequency only tells how fast a single clock cycle is, whereas IPC doesn't account for the speed at which clocks change, it counts how much work is done every cycle. So, from the benchmarking perspective, IPC solely depends on the design of the processor regardless of the frequency. Out-of-order cores typically have a much higher IPC than in-order cores. When you increase the size of CPU caches or improve branch prediction, the IPC usually goes up. +From the perspective of benchmarking, IPC and frequency are two independent metrics. We've seen many engineers mistakenly mixing them up and thinking that if you increase the frequency, the IPC will also go up. But that's not the case: the IPC will stay the same. If you clock a processor at 1 GHz instead of 5 GHz, for many applications you will still get the same IPC.[^1] It may sound very confusing, especially since IPC has everything to do with CPU clocks. However, frequency only tells us how fast a single clock cycle is, whereas IPC counts how much work is done every cycle. So, from the benchmarking perspective, IPC solely depends on the design of the processor regardless of the frequency. Out-of-order cores typically have a much higher IPC than in-order cores. When you increase the size of CPU caches or improve branch prediction, the IPC usually goes up. -Now, if you ask a hardware architect, they will certainly tell you there is a dependency between IPC and frequency. From the CPU design perspective, you can deliberately downclock the processor, which will make every cycle longer and make it possible to squeeze more work into each cycle. In the end, you will get a higher IPC but a lower frequency. Hardware vendors approach this performance equation in different ways. For example, Intel and AMD chips usually have very high frequencies, with the recent Intel 13900KS processor providing a 6Ghz turbo frequency out of the box with no overclocking required. On the other hand, Apple M1/M2 chips have lower frequency but compensate with a higher IPC. Lower frequency facilitates lower power consumption. Higher IPC, on the other hand, usually requires a more complicated design, more transistors and a larger die size. We will not go into all the design tradeoffs here as it is a topic for a different book. +Now, if you ask a hardware architect, they will certainly tell you there is a dependency between IPC and frequency. From the CPU design perspective, you can deliberately downclock the processor, which will make every cycle longer and make it possible to squeeze more work into each cycle. In the end, you will get a higher IPC but a lower frequency. Hardware vendors approach this performance equation in different ways. For example, Intel and AMD chips usually have very high frequencies, with the recent Intel 13900KS processor providing a 6 GHz turbo frequency out of the box with no overclocking required. On the other hand, Apple M1/M2 chips have lower frequency but compensate with a higher IPC. Lower frequency facilitates lower power consumption. 
Higher IPC, on the other hand, usually requires a more complicated design, more transistors and a larger die size. We will not go into all the design tradeoffs here as they are topics for a different book. IPC is useful for evaluating both hardware and software efficiency. Hardware engineers use this metric to compare CPU generations and CPUs from different vendors. Since IPC is the measure of the performance of a CPU microarchitecture, engineers and media use it to express gains over the previous generation. However, to make a fair comparison, you need to run both systems on the same frequency. -IPC is also a useful metric for evaluating software. It gives you an intuition for how quickly instructions in your application progress through the CPU pipeline. Later in this chapter, you will see several production applications with varying IPCs. Memory-intensive applications usually have a low IPC (0-1), while computationally intensive workloads tend to a have high IPC (4-6). +IPC is also a useful metric for evaluating software. It gives you an intuition for how quickly instructions in your application progress through the CPU pipeline. Later in this chapter, you will see several production applications with varying IPCs. Memory-intensive applications usually have a low IPC (0--1), while computationally intensive workloads tend to a have high IPC (4--6). Linux `perf` users can measure the IPC for their workload by running: @@ -46,4 +46,4 @@ $ perf stat -e cycles,instructions -- a.exe $ perf stat ./a.exe ``` -[^1]: When you lower CPU frequency, memory speed becomes faster relative to the CPU. This may hide actual memory bottlenecks and artificially increase IPC. \ No newline at end of file +[^1]: When you lower CPU frequency, memory speed becomes faster relative to the CPU. This may hide actual memory bottlenecks and artificially increase IPC. diff --git a/chapters/4-Terminology-And-Metrics/4-4 UOP.md b/chapters/4-Terminology-And-Metrics/4-4 UOP.md index 5d712f1360..9212905641 100644 --- a/chapters/4-Terminology-And-Metrics/4-4 UOP.md +++ b/chapters/4-Terminology-And-Metrics/4-4 UOP.md @@ -2,7 +2,7 @@ ## Micro-operations {#sec:sec_UOP} -Microprocessors with the x86 architecture translate complex CISC-like instructions into simple RISC-like microoperations, abbreviated as $\mu$ops. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for loading from the `mem` memory location into a temporary (un-named) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for loading from memory, one for adding, and one for storing the result back to memory. Even though x86 ISA is a register-memory architecture, after $\mu$ops conversion, it becomes a load-store architecture since memory is only accessed via load/store $\mu$ops. +Microprocessors with the x86 architecture translate complex CISC instructions into simple RISC microoperations, abbreviated as $\mu$ops. A simple addition instruction such as `ADD rax, rbx` generates only one $\mu$op, while a more complex instruction like `ADD rax, [mem]` may generate two: one for loading from the `mem` memory location into a temporary (unnamed) register, and one for adding it to the `rax` register. The instruction `ADD [mem], rax` generates three $\mu$ops: one for loading from memory, one for adding, and one for storing the result back to memory. 
Even though the x86 ISA is a register-memory architecture, after $\mu$ops conversion, it becomes a load-store architecture since memory is only accessed via load/store $\mu$ops. The main advantage of splitting instructions into micro-operations is that $\mu$ops can be executed: @@ -15,14 +15,14 @@ The main advantage of splitting instructions into micro-operations is that $\mu$ ``` Often, a function prologue saves multiple registers by using multiple `PUSH` instructions. In our case, the next `PUSH` instruction can start executing after the `SUB` $\mu$op of the previous `PUSH` instruction finishes, and doesn't have to wait for the `STORE` $\mu$op, which can now execute asynchronously. -* **In parallel**: consider `HADDPD xmm1, xmm2` instruction, which will sum up (reduce) two double-precision floating-point values in both `xmm1` and `xmm2` and store two results in `xmm1` as follows: +* **In parallel**: consider `HADDPD xmm1, xmm2` instruction, which will sum up (reduce) two double-precision floating-point values from `xmm1` and `xmm2` and store two results in `xmm1` as follows: ``` xmm1[63:0] = xmm2[127:64] + xmm2[63:0] xmm1[127:64] = xmm1[127:64] + xmm1[63:0] ``` One way to microcode this instruction would be to do the following: 1) reduce `xmm2` and store the result in `xmm_tmp1[63:0]`, 2) reduce `xmm1` and store the result in `xmm_tmp2[63:0]`, 3) merge `xmm_tmp1` and `xmm_tmp2` into `xmm1`. Three $\mu$ops in total. Notice that steps 1) and 2) are independent and thus can be done in parallel. -Even though we were just talking about how instructions are split into smaller pieces, sometimes, $\mu$ops can also be fused together. There are two types of fusion in modern CPUs: +Even though we were just talking about how instructions are split into smaller pieces, sometimes, $\mu$ops can also be fused together. There are two types of fusion in modern x86 CPUs: * **Microfusion**: fuse $\mu$ops from the same machine instruction. Microfusion can only be applied to two types of combinations: memory write operations and read-modify operations. For example: @@ -31,7 +31,7 @@ Even though we were just talking about how instructions are split into smaller p ``` There are two $\mu$ops in this instruction: 1) read the memory location `mem`, and 2) add it to `eax`. With microfusion, two $\mu$ops are fused into one at the decoding step. -* **Macrofusion**: fuse $\mu$ops from different machine instructions. The decoders can fuse arithmetic or logic instruction with a subsequent conditional jump instruction into a single compute-and-branch $\mu$op in certain cases. For example: +* **Macrofusion**: fuse $\mu$ops from different machine instructions. The decoders can fuse arithmetic or logic instructions with a subsequent conditional jump instruction into a single compute-and-branch $\mu$op in certain cases. For example: ```bash .loop: @@ -42,7 +42,7 @@ Even though we were just talking about how instructions are split into smaller p \lstset{linewidth=\textwidth} -Both micro- and macrofusion save bandwidth in all stages of the pipeline from decoding to retirement. The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such a fused ROB entry is later dispatched to two different execution ports but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture]. +Both micro- and macrofusion save bandwidth in all stages of the pipeline, from decoding to retirement. 
The fused operations share a single entry in the reorder buffer (ROB). The capacity of the ROB is utilized better when a fused $\mu$op uses only one entry. Such a fused ROB entry is later dispatched to two different execution ports, but is retired again as a single unit. Readers can learn more about $\mu$op fusion in [@fogMicroarchitecture]. To collect the number of issued, executed, and retired $\mu$ops for an application, you can use Linux `perf` as follows: diff --git a/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md b/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md index 5b228305b1..05811cdd64 100644 --- a/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md +++ b/chapters/4-Terminology-And-Metrics/4-7 Cache miss.md @@ -21,9 +21,9 @@ Memory Table: Typical latency of a memory subsystem in x86-based platforms. {#tbl:mem_latency} -A cache miss might happen both for instructions and data. According to Top-down Microarchitecture Analysis (see [@sec:TMA]), an instruction cache (I-cache) miss is characterized as a Front-End stall, while a data cache (D-cache) miss is characterized as a Back-End stall. Instruction cache miss happens very early in the CPU pipeline during instruction fetch. Data cache miss happens much later during the instruction execution phase. +Both instruction and data fetches can miss in cache. According to Top-down Microarchitecture Analysis (see [@sec:TMA]), an instruction cache (I-cache) miss is characterized as a Front-End stall, while a data cache (D-cache) miss is characterized as a Back-End stall. Instruction cache miss happens very early in the CPU pipeline during instruction fetch. Data cache miss happens much later during the instruction execution phase. -Linux `perf` users can collect the number of L1-cache misses by running: +Linux `perf` users can collect the number of L1 cache misses by running: ```bash $ perf stat -e mem_load_retired.fb_hit,mem_load_retired.l1_miss, diff --git a/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md b/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md index e8788f7909..0b33b633c3 100644 --- a/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md +++ b/chapters/4-Terminology-And-Metrics/4-9 Performance Metrics.md @@ -8,98 +8,98 @@ That's why in addition to the hardware performance events, performance engineers -------------------------------------------------------------------------- Metric Description Formula -Name +Name ------- -------------------------- --------------------------------------- L1MPKI L1 cache true misses 1000 * MEM_LOAD_RETIRED.L1_MISS_PS / per kilo instruction for INST_RETIRED.ANY - retired demand loads. + retired demand loads. L2MPKI L2 cache true misses 1000 * MEM_LOAD_RETIRED.L2_MISS_PS / per kilo instruction for INST_RETIRED.ANY - retired demand loads. + retired demand loads. L3MPKI L3 cache true misses 1000 * MEM_LOAD_RETIRED.L3_MISS_PS / per kilo instruction for INST_RETIRED.ANY - retired demand loads. + retired demand loads. -Branch Ratio of all branches BR_MISP_RETIRED.ALL_BRANCHES / +Branch Ratio of all branches BR_MISP_RETIRED.ALL_BRANCHES / Mispr. 
which mispredict BR_INST_RETIRED.ALL_BRANCHES -Ratio +Ratio -Code STLB (2nd level TLB) code 1000 * ITLB_MISSES.WALK_COMPLETED +Code STLB (2nd level TLB) code 1000 * ITLB_MISSES.WALK_COMPLETED STLB speculative misses per / INST_RETIRED.ANY -MPKI kilo instruction (misses - of any page size that +MPKI kilo instruction (misses + of any page size that complete the page walk) -Load STLB data load 1000 * DTLB_LD_MISSES.WALK_COMPLETED +Load STLB data load 1000 * DTLB_LD_MISSES.WALK_COMPLETED STLB speculative misses / INST_RETIRED.ANY MPKI per kilo instruction -Store STLB data store 1000 * DTLB_ST_MISSES.WALK_COMPLETED +Store STLB data store 1000 * DTLB_ST_MISSES.WALK_COMPLETED STLB speculative misses / INST_RETIRED.ANY MPKI per kilo instruction -Load Actual Average Latency for L1D_PEND_MISS.PENDING / -Miss L1 data-cache miss demand MEM_LD_COMPLETED.L1_MISS_ANY -Real load operations +Load Average latency for L1D_PEND_MISS.PENDING / +Miss L1 D-cache miss demand MEM_LD_COMPLETED.L1_MISS_ANY +Real load operations Latency (in core cycles) -ILP Instr.-Level-Parallelism UOPS_EXECUTED.THREAD / - per-core (average number UOPS_EXECUTED.CORE_CYCLES_GE1, +ILP Instr. level parallelism UOPS_EXECUTED.THREAD / + per core (average number UOPS_EXECUTED.CORE_CYCLES_GE1, of $\mu$ops executed when divide by 2 if SMT is enabled - there is execution) + there is execution) -MLP Memory-Level-Parallelism L1D_PEND_MISS.PENDING / +MLP Memory level parallelism L1D_PEND_MISS.PENDING / per-thread (average number L1D_PEND_MISS.PENDING_CYCLES - of L1 miss demand load + of L1 miss demand loads when there is at least one such miss.) -DRAM Average external Memory ( 64 * ( UNC_M_CAS_COUNT.RD + -BW Use Bandwidth Use for reads UNC_M_CAS_COUNT.WR ) +DRAM Average external memory ( 64 * ( UNC_M_CAS_COUNT.RD + +BW Use bandwidth use for reads UNC_M_CAS_COUNT.WR ) GB/sec and writes / 1GB ) / Time -IpCall Instructions per near call INST_RETIRED.ANY / +IpCall Instructions per near call INST_RETIRED.ANY / (lower number means higher BR_INST_RETIRED.NEAR_CALL occurrence rate) -Ip Instructions per Branch INST_RETIRED.ANY / +Ip Instructions per branch INST_RETIRED.ANY / Branch BR_INST_RETIRED.ALL_BRANCHES -IpLoad Instructions per Load INST_RETIRED.ANY / +IpLoad Instructions per load INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS_PS -IpStore Instructions per Store INST_RETIRED.ANY / +IpStore Instructions per store INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES_PS -IpMisp Number of Instructions per INST_RETIRED.ANY / -redict non-speculative Branch BR_MISP_RETIRED.ALL_BRANCHES - Misprediction +IpMisp Number of instructions per INST_RETIRED.ANY / +redict non-speculative branch BR_MISP_RETIRED.ALL_BRANCHES + misprediction IpFLOP Instructions per FP See TMA_metrics.xlsx - (Floating Point) operation + (floating point) operation IpArith Instructions per FP See TMA_metrics.xlsx - Arithmetic instruction - -IpArith Instructions per FP Arith. INST_RETIRED.ANY / -Scalar Scalar Single-Precision FP_ARITH_INST.SCALAR_SINGLE -SP instruction - -IpArith Instructions per FP Arith. INST_RETIRED.ANY / -Scalar Scalar Double-Precision FP_ARITH_INST.SCALAR_DOUBLE -DP instruction - -Ip Instructions per FP INST_RETIRED.ANY / ( -Arith Arithmetic AVX/SSE FP_ARITH_INST.128B_PACKED_DOUBLE+ + arithmetic instruction + +IpArith Instructions per FP arith. INST_RETIRED.ANY / +Scalar scalar single-precision FP_ARITH_INST.SCALAR_SINGLE +SP instruction + +IpArith Instructions per FP arith. 
INST_RETIRED.ANY / +Scalar scalar double-precision FP_ARITH_INST.SCALAR_DOUBLE +DP instruction + +Ip Instructions per INST_RETIRED.ANY / ( +Arith arithmetic AVX/SSE FP_ARITH_INST.128B_PACKED_DOUBLE+ AVX128 128-bit instruction FP_ARITH_INST.128B_PACKED_SINGLE) -Ip Instructions per FP INST_RETIRED.ANY / ( -Arith Arithmetic AVX* FP_ARITH_INST.256B_PACKED_DOUBLE+ +Ip Instructions per INST_RETIRED.ANY / ( +Arith arithmetic AVX* FP_ARITH_INST.256B_PACKED_DOUBLE+ AVX256 256-bit instruction FP_ARITH_INST.256B_PACKED_SINGLE) -Ip Instructions per software INST_RETIRED.ANY / +Ip Instructions per software INST_RETIRED.ANY / SWPF prefetch instruction SW_PREFETCH_ACCESS.T0:u0xF (of any type) -------------------------------------------------------------------------- @@ -108,8 +108,8 @@ Table: A list (not exhaustive) of performance metrics along with descriptions an \normalsize -A few notes on those metrics. First, the ILP and MLP metrics do not represent theoretical maximums for an application; rather they measure the actual ILP and MLP of an application on a given machine. On an ideal machine with infinite resources, these numbers would be higher. Second, all metrics besides "DRAM BW Use" and "Load Miss Real Latency" are fractions; we can apply fairly straightforward reasoning to each of them to tell whether a specific metric is high or low. But to make sense of "DRAM BW Use" and "Load Miss Real Latency" metrics, we need to put it in context. For the former, we would like to know if a program saturates the memory bandwidth or not. The latter gives you an idea of the average cost of a cache miss, which is useless by itself unless you know the latencies of each component in the cache hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth in the next section. +A few notes on those metrics. First, the ILP and MLP metrics do not represent theoretical maximums for an application; rather they measure the actual ILP and MLP of an application on a given machine. On an ideal machine with infinite resources, these numbers would be higher. Second, all metrics besides "DRAM BW Use" and "Load Miss Real Latency" are fractions; we can apply fairly straightforward reasoning to each of them to tell whether a specific metric is high or low. But to make sense of "DRAM BW Use" and "Load Miss Real Latency" metrics, we need to put them in context. For the former, we would like to know if a program saturates the memory bandwidth or not. The latter gives you an idea of the average cost of a cache miss, which is useless by itself unless you know the latencies of each component in the cache hierarchy. We will discuss how to find out cache latencies and peak memory bandwidth in the next section. Some tools can report performance metrics automatically. If not, you can always calculate those metrics manually since you know the formulas and corresponding performance events that must be collected. Table {@tbl:perf_metrics} provides formulas for the Intel Goldencove architecture, but you can build similar metrics on another platform as long as underlying performance events are available. -[^1]: TMA metrics - [https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx). \ No newline at end of file +[^1]: TMA metrics - [https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx](https://github.com/intel/perfmon/blob/main/TMA_Metrics.xlsx).
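To make the last point concrete, the raw counts behind any of the formulas above can be gathered directly with Linux `perf` and combined by hand. The sketch below does this for `IpBranch` and `Br. Misp. Ratio`; the lowercase event names are the usual `perf` spellings of `BR_INST_RETIRED.ALL_BRANCHES` and `BR_MISP_RETIRED.ALL_BRANCHES` on recent Intel cores and may be named differently (or be absent) on other microarchitectures, and `a.exe` is a placeholder workload as in the earlier examples:

```bash
# Collect retired instructions, retired branches, and mispredicted branches,
# then plug the counts into the formulas from the table:
#   IpBranch        = instructions / br_inst_retired.all_branches
#   Br. Misp. Ratio = br_misp_retired.all_branches / br_inst_retired.all_branches
$ perf stat -e instructions,br_inst_retired.all_branches,br_misp_retired.all_branches -- a.exe
```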