Chapter 6 edits #73

Open
wants to merge 10 commits into base: main

Conversation

dankamongmen
Contributor

Some notes:

  • Listing 6-2: please don't cast the return value of malloc(). this is a comp.lang.c FAQ: https://c-faq.com/malloc/mallocnocast.html.
    If the code is C++, it shouldn't be using malloc(). (a short sketch of the uncast idiom follows this list.)

  • 6.1.2: the reason there are no branch mispredictions in the SHA code is that cryptographic code must carefully guard against data-dependent branching and cache behavior to defend against timing attacks. fascinating stuff. check out DJB's papers.

  • 6.1.4 (p. 108): is section 4.11 really the one you want to reference here? i'm not sure...?

  • recommendations to check things with dmesg are pretty bad imho. dmesg dumps a ring buffer into which the kernel prints over its lifetime, and different logging settings can keep messages from ever landing there. what you almost certainly want is cpuid, cat /proc/cpuinfo, or rdmsr (see the cpuid sketch after this list). furthermore, dmesg output is in no way a stable api, and depending on sysctls it isn't even available to regular users.
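
For reference, the uncast idiom the FAQ recommends looks like this. A minimal standalone sketch, not taken from Listing 6-2; the element type and count are purely illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

/* No cast on malloc(): with <stdlib.h> included, the returned void *
 * converts implicitly in C, and a cast can hide a missing prototype
 * on pre-C99 compilers. The `sizeof *buf` form keeps the size
 * expression correct if the element type ever changes. */
int main(void) {
    size_t n = 1024;                      /* illustrative element count */
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    buf[0] = 42.0;
    printf("first element: %f\n", buf[0]);
    free(buf);
    return 0;
}
```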
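
And here is a minimal sketch of doing the feature check in code rather than via dmesg, using GCC/Clang's `<cpuid.h>` intrinsic on x86. The AVX bit is only an example feature, not one the chapter specifically checks:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 returns the basic feature flags; ECX bit 28 reports AVX. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 28)))
        puts("AVX supported");
    else
        puts("AVX not reported by CPUID");
    return 0;
}
```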

@@ -4,15 +4,15 @@ Major CPU vendors provide a set of additional features to enhance performance an

### PEBS on Intel Platforms {#sec:secPEBS}

Similar to the Last Branch Record feature, PEBS is used while profiling the program to capture additional data with every collected sample. When a performance counter is configured for PEBS, the processor saves the set of additional data, which has a defined format and is called the PEBS record. The format of a PEBS record for the Intel Skylake CPU is shown in Figure @fig:PEBS_record. The record contains the state of general-purpose registers (`EAX`, `EBX`, `ESP`, etc.), `EventingIP`, `Data Linear Address`, and `Latency value`, which we'll discuss later. The content layout of a PEBS record varies across different microarchitectures; see [@IntelOptimizationManual, Volume 3B, Chapter 20 Performance Monitoring].
Contributor Author

sneaky will-vs-we'll! boom
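
As a side note on the quoted paragraph: on Linux, a counter ends up configured for precise (PEBS-backed) sampling through the `precise_ip` field of `perf_event_open`. A minimal sketch, assuming a cycles event and an arbitrary sample period (neither is taken from the book's example):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100003;                  /* one sample per ~100k cycles */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr.precise_ip = 2;   /* non-zero requests precise (PEBS) samples; 0 allows skid */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Profile the calling thread on any CPU, no event group, no flags. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    /* A real profiler would mmap the ring buffer and parse the samples here. */
    close(fd);
    return 0;
}
```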

@@ -1,14 +1,14 @@
### TMA on ARM Platforms

ARM CPU architects also have developed a TMA performance analysis methodology for their processors, which we will discuss next. ARM calls it "Topdown" in their documentation [@ARMNeoverseV1TopDown], so we will use their naming. At the time of writing this chapter (late 2023), Topdown is only supported on cores designed by ARM, e.g. Neoverse N1 and Neoverse V1, and their derivatives, e.g. Ampere Altra and AWS Graviton3. Refer to the list of major CPU microarchitectures at the end of this book if you need to refresh your memory on ARM chip families. Processors designed by Apple don't support the ARM Topdown performance analysis methodology yet.
Contributor Author

what you have here works, but it violates a very common idiom and sounds weird

@@ -67,7 +67,7 @@ Stage 2 (uarch metrics)

In the command above, the option `-n BackendBound` collects all the metrics associated with the `Backend Bound` category as well as its descendants. The description for every metric in the output is given in [@ARMNeoverseV1TopDown]. Note that they are quite similar to what we have discussed in [@sec:PerfMetricsCaseStudy], so you may want to revisit it as well.

- We don't have a goal of optimizing the benchmark, rather we want to characterize performance bottlenecks. However, if given such a task, here is how our analysis could continue. There is a substantial number of `L1 Data TLB` misses (3.8 MPKI), but then 90% of those misses hit in L2 TLB (see `L2 Unified TLB Miss Ratio`). All in all, only 0.1% of all TLB misses result in a page table walk (see `DTLB Walk Ratio`), which suggests that it is not our primary concern, although a quick experiment that utilizes huge pages is still worth it.
+ We don't have a goal of optimizing the benchmark, rather we want to characterize performance bottlenecks. However, if given such a task, here is how our analysis could continue. There are a substantial number of `L1 Data TLB` misses (3.8 MPKI), but then 90% of those misses hit in L2 TLB (see `L2 Unified TLB Miss Ratio`). All in all, only 0.1% of all TLB misses result in a page table walk (see `DTLB Walk Ratio`), which suggests that it is not our primary concern, although a quick experiment that utilizes huge pages is still worthwhile.
Contributor Author

definite number disagreement
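
As an aside on the "quick experiment that utilizes huge pages" mentioned in the hunk above, one low-effort way to run it on Linux is a transparent-huge-page hint on the hot buffer. A minimal sketch, assuming THP is available in madvise mode; the buffer size and 2 MiB alignment are illustrative:

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

enum { BUF_BYTES = 64 << 20 };   /* 64 MiB working set, purely illustrative */

int main(void) {
    /* Align to 2 MiB so the kernel can back the region with huge pages. */
    void *buf = aligned_alloc(2 * 1024 * 1024, BUF_BYTES);
    if (buf == NULL)
        return 1;
    /* Ask for transparent huge pages on this range; fewer pages means
       fewer DTLB entries and fewer page-table walks. */
    madvise(buf, BUF_BYTES, MADV_HUGEPAGE);
    /* ... run the benchmark's hot loop over buf and re-check the DTLB metrics ... */
    free(buf);
    return 0;
}
```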
