[Chapter6] Updated AMD topdown section
In the output, numbers in brackets indicate the percentage of runtime duration attributed to each metric.

A description of the pipeline utilization metrics shown above can be found in [@AMDUprofManual, Chapter 2.8 Pipeline Utilization]. Looking at the metrics, we can see that branch mispredictions are not happening in SHA256 (`bad_speculation` is 0%). Only 26.3% of the available dispatch slots were used (`retiring`), which means the remaining 73.7% were wasted due to frontend and backend stalls.
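
These four top-level categories account for every available dispatch slot, so they must sum to 100% (assuming the standard Level 1 breakdown):

$$\underbrace{26.3\%}_{\text{retiring}} + \underbrace{0\%}_{\text{bad\_speculation}} + \underbrace{73.7\%}_{\text{frontend\_bound}\,+\,\text{backend\_bound}} = 100\%.$$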

Advanced crypto instructions are not trivial, so internally they are broken into smaller pieces ($\mu$ops). The regular instruction decoders can only handle instructions that consist of a few $\mu$ops; once a processor encounters a more complex instruction, it retrieves the $\mu$ops for it from the microcode. Microoperations are fetched from the microcode sequencer at a lower bandwidth than from the regular decoders, making it a potential source of performance bottlenecks. The Crypto++ SHA256 implementation heavily uses instructions such as `SHA256MSG2`, `SHA256RNDS2`, and others, which consist of multiple $\mu$ops according to the [uops.info](https://uops.info/table.html)[^2] website.
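
To make the discussion concrete, below is a minimal sketch (not Crypto++'s actual code) of how these instructions appear in source form via compiler intrinsics; the input values are placeholders chosen purely for illustration:

```cpp
// Minimal sketch using the x86 SHA intrinsics that compile down to
// SHA256MSG1, SHA256MSG2, and SHA256RNDS2.
// Build with: g++ -msha sha_sketch.cpp (a CPU with SHA extensions is
// required to run the binary).
#include <immintrin.h>
#include <cstdio>

int main() {
  // Placeholder message-schedule and state values, for illustration only.
  __m128i msg0   = _mm_set1_epi32(0x61626380);
  __m128i msg1   = _mm_set1_epi32(0x00000018);
  __m128i state0 = _mm_set_epi32(0x6a09e667, 0xbb67ae85, 0x510e527f, 0x9b05688c);
  __m128i state1 = _mm_set_epi32(0x3c6ef372, 0xa54ff53a, 0x1f83d9ab, 0x5be0cd19);

  // SHA256MSG1/SHA256MSG2 compute intermediate parts of the message
  // schedule; SHA256RNDS2 performs two rounds of the compression function.
  __m128i tmp = _mm_sha256msg1_epu32(msg0, msg1);
  tmp         = _mm_sha256msg2_epu32(tmp, msg1);
  state1      = _mm_sha256rnds2_epu32(state1, state0, tmp);
  state0      = _mm_sha256rnds2_epu32(state0, state1, tmp);

  unsigned out[4];
  _mm_storeu_si128((__m128i*)out, state0);
  printf("%08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
  return 0;
}
```

Instructions like these are the ones that show up under the `retiring_microcode` metric discussed next.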

The `retiring_microcode` metric indicates that 6.1% of dispatch slots were used by microcode operations that eventually retired. Comparing it with its sibling metric `retiring_fastpath`, we can say that roughly every fourth retired instruction was a microcode operation. If we now look at the `frontend_bound_bandwidth` metric, we will see that 6.1% of dispatch slots were unused due to a bandwidth bottleneck in the CPU frontend. This suggests that those 6.1% of dispatch slots were wasted because the microcode sequencer was not providing $\mu$ops while the backend could have consumed them. In this example, the `retiring_microcode` and `frontend_bound_bandwidth` metrics are tightly connected; however, the fact that they are equal is merely a coincidence.
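
Spelling out the arithmetic behind the "every fourth instruction" estimate (assuming `retiring_fastpath` and `retiring_microcode` sum up to the total `retiring`):

$$\text{retiring\_fastpath} = 26.3\% - 6.1\% = 20.2\%, \qquad \frac{\text{retiring\_microcode}}{\text{retiring}} = \frac{6.1\%}{26.3\%} \approx \frac{1}{4.3}.$$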

The majority of cycles are stalled in the CPU backend (`backend_bound`), but only 1.7% of cycles are stalled waiting for memory accesses (`backend_bound_memory`). So, we know that the benchmark is mostly limited by the computing capabilities of the machine. As you will learn in Part 2 of this book, this could be related to either data flow dependencies or the execution throughput of certain cryptographic operations. Such operations are less frequent than traditional `ADD`, `SUB`, `CMP`, and other instructions, and thus often can be executed only on a single execution unit. A large number of such operations may saturate the execution throughput of this particular unit. Further analysis should involve a closer look at the source code and generated assembly, checking execution port utilization, finding data dependencies, etc.; we will stop at this point.
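
As one example of such a follow-up step, a static analyzer like `llvm-mca` can estimate execution port pressure for a loop body. The invocation below is a sketch: the file name `sha_loop.s` is hypothetical (it assumes the disassembled hot loop was saved to a file), and the `znver4` target requires a recent LLVM with a Zen 4 scheduling model:

```bash
# Estimate per-port resource pressure and a cycle-by-cycle timeline
# for the instructions in sha_loop.s.
llvm-mca -mcpu=znver4 -timeline sha_loop.s
```
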
In summary, the Crypto++ implementation of SHA-256 on AMD Ryzen 9 7950X utilizes only 26.3% of the available dispatch slots. 6.1% of the dispatch slots were wasted due to the microcode sequencer bandwidth, 1.7% were stalled on memory accesses, and the remaining 65.9% were stalled due to a lack of computing resources on the machine. The code certainly hits a few hardware limitations, so it is unclear whether its performance can be improved.

When it comes to Windows, at the time of writing, the TMA methodology is only supported on server platforms (codename Genoa) and not on client systems (codename Raphael). TMA support was added in AMD uProf version 4.1, but only in the `AMDuProfPcm` command-line tool, which is part of the AMD uProf installation. You can consult [@AMDUprofManual, Chapter 2.8 Pipeline Utilization] for more details on how to run the analysis. The graphical version of AMD uProf doesn't support TMA analysis yet.
