[Chapter6] Fixed a few TODOs
dendibakh committed Feb 21, 2024
1 parent 5ddfadc commit ae5dca2
Showing 2 changed files with 2 additions and 10 deletions.
@@ -1,8 +1,6 @@
### TMA on AMD Platforms {#sec:secTMA_AMD}

-[TODO]: Starting from which kernel version TMA is supported in Linux perf?
-
-Starting from Zen4, AMD processors support Level-1 and Level-2 TMA analysis. According to AMD documentation, it is called "Pipeline Utilization" analysis but the idea remains the same. The L1 and L2 buckets are also very similar to Intel's. Linux users can utilize the `perf` tool to collect the pipeline utilization data.
+Starting from Zen4, AMD processors support Level-1 and Level-2 TMA analysis. According to AMD documentation, it is called "Pipeline Utilization" analysis but the idea remains the same. The L1 and L2 buckets are also very similar to Intel's. Since kernel 6.2, Linux users can utilize the `perf` tool to collect the pipeline utilization data.
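Because the changed sentence ties `perf` support to kernel 6.2, a script that collects these metrics may want to check the running kernel first. The following is a minimal sketch, not part of the commit: `kernel_at_least` is a hypothetical helper name, and the metric group names mentioned in the comment (`PipelineL1`, `PipelineL2`) are an assumption about how `perf` exposes the Zen4 data, so verify them with `perf list metricgroup` on your system.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if kernel release $1 is at least $2.$3.
kernel_at_least() {
  local release=$1 want_major=$2 want_minor=$3
  local major minor
  major=${release%%.*}        # text before the first dot, e.g. "6"
  minor=${release#*.}         # drop the "MAJOR." prefix
  minor=${minor%%[!0-9]*}     # keep only the leading digits, e.g. "2"
  (( major > want_major || (major == want_major && minor >= want_minor) ))
}

if kernel_at_least "$(uname -r)" 6 2; then
  # Assumed invocation: perf stat -M PipelineL1,PipelineL2 -- ./your_app
  echo "kernel is 6.2 or newer: pipeline utilization metrics may be available"
else
  echo "kernel is older than 6.2: upgrade before collecting these metrics"
fi
```

The check only compares the major and minor numbers of the release string; distribution kernels that backport features would need a different test.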

Next, we will examine [Crypto++](https://github.com/weidai11/cryptopp)[^1] implementation of SHA-256 (Secure Hash Algorithm 256), the fundamental cryptographic algorithm in Bitcoin mining. Crypto++ is an open-source C++ class library of cryptographic algorithms and contains an implementation of many algorithms, not just SHA-256. However, for our example, we disabled benchmarking all other algorithms by commenting out the corresponding line in the `BenchmarkUnkeyedAlgorithms` function in `bench1.cpp`.

@@ -34,11 +32,9 @@ Crypto instructions are not trivial, so internally they are broken into smaller

[TODO]: Why do we have 6.1% for both `frontend_bound_bandwidth` AND `retiring_microcode`? Is there a specific relationship between those metrics? Did I describe it correctly in the text?

-The majority of cycles are stalled in the CPU backend (`backend_bound`), but only 1,7% of cycles are stalled waiting for memory accesses (`backend_bound_memory`). So, we know that the benchmark is mostly limited by the computing capabilities of the machine. As you will know from Part 2 of this book, it could be related to either data flow dependencies or execution throughput of certain cryptographic operations. They are less frequent than traditional `ADD`, `SUB`, `CMP`, and other instructions and thus can be often executed only on a single execution unit. A large number of such operations may saturate the execution throughput of this particular unit. Further analysis should involve a closer look at the source code and generated assembly, checking execution port utilization, finding data dependencies, etc.; we will stop at this point.
+The majority of cycles are stalled in the CPU backend (`backend_bound`), but only 1.7% of cycles are stalled waiting for memory accesses (`backend_bound_memory`). So, we know that the benchmark is mostly limited by the computing capabilities of the machine. As you will know from Part 2 of this book, it could be related to either data flow dependencies or execution throughput of certain cryptographic operations. They are less frequent than traditional `ADD`, `SUB`, `CMP`, and other instructions and thus can be often executed only on a single execution unit. A large number of such operations may saturate the execution throughput of this particular unit. Further analysis should involve a closer look at the source code and generated assembly, checking execution port utilization, finding data dependencies, etc.; we will stop at this point.
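The reasoning in this paragraph is simple arithmetic: whatever part of `backend_bound` is not attributed to memory is left for the compute side. A minimal sketch of that subtraction follows. `backend_cpu_share` is a hypothetical helper, the `60.0` argument is a made-up placeholder because the diff does not show the measured `backend_bound` value, and the sketch assumes the Level-2 backend buckets split only into memory and CPU parts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the share of cycles that are backend bound
# but not waiting on memory, given both percentages as arguments.
backend_cpu_share() {
  awk -v total="$1" -v mem="$2" 'BEGIN { printf "%.1f\n", total - mem }'
}

# 60.0 is a placeholder for backend_bound; 1.7 is backend_bound_memory
# as quoted in the text.
backend_cpu_share 60.0 1.7
```

A result close to the full `backend_bound` value is what supports the conclusion that the benchmark is limited by compute rather than by memory.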

When it comes to Windows, at the time of writing, TMA methodology is only supported on server platforms (codename Genoa), and not on client systems (codename Raphael). TMA support was added in AMD uProf version 4.1, but only in the command-line tool `AMDuProfPcm`, which is part of the AMD uProf installation. You can consult [@AMDUprofManual, Chapter 2.8 Pipeline Utilization] for more details on how to run the analysis. The graphical version of AMD uProf doesn't have the TMA analysis yet.

-[TODO]: does the 4.2 version of AMDuProfPcm work on client parts?
-
[^1]: Crypto++ - [https://github.com/weidai11/cryptopp](https://github.com/weidai11/cryptopp)
[^2]: uops.info - [https://uops.info/table.html](https://uops.info/table.html)
@@ -159,8 +159,6 @@ It's important to mention that we cannot necessarily drive conclusions about fun

Branch recording also enables us to know what were the most frequently taken branches. It is supported on Intel and AMD. According to ARM's BRBE specification, it can be supported, but due to unavailability of processors that implement this extension, it is not possible to verify. Here is an example:

-[TODO]: Check: "Adding `-F +srcline_from,srcline_to` slows down building report. Hopefully, in newer versions of perf, decoding time will be improved".
-
```bash
$ perf record -e cycles -b -- ./a.exe
[ perf record: Woken up 3 times to write data ]
@@ -237,8 +235,6 @@ This information can be used for a fine-grained tuning of this basic block. This

Before proper support from profiling tools was in place, building probability density graphs similar to Figure @fig:LBR_timing_BB required manual parsing of raw LBR dumps. An example of how to do this can be found on the [easyperf blog](https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf)[^9]. Luckily, in newer versions of Linux perf, getting this information is much easier. The example below demonstrates this method directly using Linux perf on the same 7-zip benchmark from the LLVM test-suite we introduced earlier:

-[TODO]: Check: "Adding `-F +srcline_from,srcline_to` slows down building report. Hopefully, in newer versions of perf, decoding time will be improved".
-
```bash
$ perf record -e cycles -b -- ./7zip.exe b
$ perf report -n --sort symbol_from,symbol_to -F +cycles,srcline_from,srcline_to --stdio