Chapter 6 edits #73

Open
wants to merge 10 commits into base: main

Conversation

dankamongmen
Contributor

Some notes:

  • Listing 6-2: please don't cast the return value of malloc(). this is a comp.lang.c FAQ: https://c-faq.com/malloc/mallocnocast.html.
    If the code is C++, it shouldn't be using malloc(). (a short sketch of the uncast idiom follows this list.)

  • 6.1.2: the reason there are no branch mispredictions in the SHA code is that cryptographic code must carefully guard against data-dependent branching and cache behavior to defend against timing attacks. fascinating stuff. check out DJB's papers.

  • 6.1.4 (p. 108): is section 4.11 really the one you want to reference here? i'm not sure...?

  • recommendations to check things with dmesg are pretty bad imho. dmesg dumps a ring buffer into which the kernel prints over its lifetime, and different logging settings can keep messages from ever landing there. what you almost certainly want is cpuid, cat /proc/cpuinfo, or rdmsr (see the cpuid sketch after this list). furthermore, dmesg output is in no way a stable api, and depending on sysctls it isn't even available to regular users.
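
For reference, the uncast idiom the FAQ recommends looks like this. A minimal standalone sketch, not taken from Listing 6-2; the element type and count are purely illustrative:

```c
#include <stdio.h>
#include <stdlib.h>

/* No cast on malloc(): with <stdlib.h> included, the returned void *
 * converts implicitly in C, and a cast can hide a missing prototype
 * on pre-C99 compilers. The `sizeof *buf` form keeps the size
 * expression correct if the element type ever changes. */
int main(void) {
    size_t n = 1024;                      /* illustrative element count */
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    buf[0] = 42.0;
    printf("first element: %f\n", buf[0]);
    free(buf);
    return 0;
}
```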
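
And here is a minimal sketch of doing the feature check in code rather than via dmesg, using GCC/Clang's `<cpuid.h>` intrinsic on x86. The AVX bit is only an example feature, not one the chapter specifically checks:

```c
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 returns the basic feature flags; ECX bit 28 reports AVX. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 28)))
        puts("AVX supported");
    else
        puts("AVX not reported by CPUID");
    return 0;
}
```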

@@ -4,15 +4,15 @@ Major CPU vendors provide a set of additional features to enhance performance an

### PEBS on Intel Platforms {#sec:secPEBS}

Similar to the Last Branch Record feature, PEBS is used while profiling the program to capture additional data with every collected sample. When a performance counter is configured for PEBS, the processor saves the set of additional data, which has a defined format and is called the PEBS record. The format of a PEBS record for the Intel Skylake CPU is shown in Figure @fig:PEBS_record. The record contains the state of general-purpose registers (`EAX`, `EBX`, `ESP`, etc.), `EventingIP`, `Data Linear Address`, and `Latency value`, which we'll discuss later. The content layout of a PEBS record varies across different microarchitectures; see [@IntelOptimizationManual, Volume 3B, Chapter 20 Performance Monitoring].
Contributor Author

sneaky will-vs-we'll! boom
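
As a side note on the quoted paragraph: on Linux, a counter ends up configured for precise (PEBS-backed) sampling through the `precise_ip` field of `perf_event_open`. A minimal sketch, assuming a cycles event and an arbitrary sample period (neither is taken from the book's example):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100003;                  /* one sample per ~100k cycles */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
    attr.precise_ip = 2;   /* non-zero requests precise (PEBS) samples; 0 allows skid */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Profile the calling thread on any CPU, no event group, no flags. */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }
    /* A real profiler would mmap the ring buffer and parse the samples here. */
    close(fd);
    return 0;
}
```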

@@ -1,14 +1,14 @@
### TMA on ARM Platforms

ARM CPU architects also have developed a TMA performance analysis methodology for their processors, which we will discuss next. ARM calls it "Topdown" in their documentation [@ARMNeoverseV1TopDown], so we will use their naming. At the time of writing this chapter (late 2023), Topdown is only supported on cores designed by ARM, e.g. Neoverse N1 and Neoverse V1, and their derivatives, e.g. Ampere Altra and AWS Graviton3. Refer to the list of major CPU microarchitectures at the end of this book if you need to refresh your memory on ARM chip families. Processors designed by Apple don't support the ARM Topdown performance analysis methodology yet.
Contributor Author

what you have here works, but it violates a very common idiom and sounds weird

@@ -67,7 +67,7 @@ Stage 2 (uarch metrics)

In the command above, the option `-n BackendBound` collects all the metrics associated with the `Backend Bound` category as well as its descendants. The description for every metric in the output is given in [@ARMNeoverseV1TopDown]. Note that they are quite similar to what we have discussed in [@sec:PerfMetricsCaseStudy], so you may want to revisit it as well.

- We don't have a goal of optimizing the benchmark, rather we want to characterize performance bottlenecks. However, if given such a task, here is how our analysis could continue. There is a substantial number of `L1 Data TLB` misses (3.8 MPKI), but then 90% of those misses hit in L2 TLB (see `L2 Unified TLB Miss Ratio`). All in all, only 0.1% of all TLB misses result in a page table walk (see `DTLB Walk Ratio`), which suggests that it is not our primary concern, although a quick experiment that utilizes huge pages is still worth it.
+ We don't have a goal of optimizing the benchmark, rather we want to characterize performance bottlenecks. However, if given such a task, here is how our analysis could continue. There are a substantial number of `L1 Data TLB` misses (3.8 MPKI), but then 90% of those misses hit in L2 TLB (see `L2 Unified TLB Miss Ratio`). All in all, only 0.1% of all TLB misses result in a page table walk (see `DTLB Walk Ratio`), which suggests that it is not our primary concern, although a quick experiment that utilizes huge pages is still worthwhile.
Contributor Author

definite number disagreement
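
As an aside on the "quick experiment that utilizes huge pages" mentioned in the hunk above, one low-effort way to run it on Linux is a transparent-huge-page hint on the hot buffer. A minimal sketch, assuming THP is available in madvise mode; the buffer size and 2 MiB alignment are illustrative:

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

enum { BUF_BYTES = 64 << 20 };   /* 64 MiB working set, purely illustrative */

int main(void) {
    /* Align to 2 MiB so the kernel can back the region with huge pages. */
    void *buf = aligned_alloc(2 * 1024 * 1024, BUF_BYTES);
    if (buf == NULL)
        return 1;
    /* Ask for transparent huge pages on this range; fewer pages means
       fewer DTLB entries and fewer page-table walks. */
    madvise(buf, BUF_BYTES, MADV_HUGEPAGE);
    /* ... run the benchmark's hot loop over buf and re-check the DTLB metrics ... */
    free(buf);
    return 0;
}
```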
