diff --git a/biblio.bib b/biblio.bib
index d313794c49..e1d03794a0 100644
--- a/biblio.bib
+++ b/biblio.bib
@@ -200,7 +200,7 @@ @book{Hennessy
 @misc{fogOptimizeCpp,
   title={Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms},
   author={Fog, Agner},
-  year={2004},
+  year={2023},
   url = {https://www.agner.org/optimize/optimizing_cpp.pdf},
 }
 
@@ -209,15 +209,7 @@ @article{fogMicroarchitecture
   author={Fog, Agner},
   journal={Copenhagen University College of Engineering},
   url = {https://www.agner.org/optimize/microarchitecture.pdf},
-  year={2012}
-}
-
-@article{fogInstructions,
-  title={Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs},
-  author={Fog, Agner and others},
-  journal={Copenhagen University College of Engineering},
-  url = {https://www.agner.org/optimize/instruction_tables.pdf},
-  year={2011}
+  year={2023}
 }
 
 @article{GoogleProfiling,
diff --git a/chapters/8-Optimizing-Memory-Accesses/8-1 Optimizing Memory Accesses.md b/chapters/8-Optimizing-Memory-Accesses/8-1 Optimizing Memory Accesses.md
index 8340e6dbf3..0a9de3a711 100644
--- a/chapters/8-Optimizing-Memory-Accesses/8-1 Optimizing Memory Accesses.md
+++ b/chapters/8-Optimizing-Memory-Accesses/8-1 Optimizing Memory Accesses.md
@@ -12,6 +12,6 @@ The statement that memory hierarchy performance is critical can be exacerbated b
 Indeed, a variable can be fetched from the smallest L1 cache in just a few clock cycles, but it can take more than three hundred clock cycles to fetch the variable from DRAM if it is not in the CPU cache. From a CPU perspective, a last-level cache miss feels like a *very* long time, especially if the processor is not doing any useful work during that time. Execution threads may also be starved when the system is highly loaded with threads accessing memory at a very high rate and there is no available memory bandwidth to satisfy all loads and stores promptly.
 
-When an application executes a large number of memory accesses and spends significant time waiting for them to finish, such an application is characterized as being bounded by memory. It means that to further improve its performance, we likely need to improve how we access memory, reduce the number of such accesses or upgrade the memory subsystem itself.
+When an application executes a large number of memory accesses and spends significant time waiting for them to finish, such an application is characterized as being bound by memory. It means that to further improve its performance, we likely need to improve how we access memory, reduce the number of such accesses, or upgrade the memory subsystem itself. In the TMA methodology, the `Memory Bound` metric estimates the fraction of pipeline slots where the CPU is likely stalled due to demand load or store instructions. The first step in solving such a performance problem is to locate the memory accesses that contribute to the high `Memory Bound` metric (see [@sec:secTMA_Intel]). Once the guilty memory accesses are identified, several optimization strategies can be applied. In this chapter, we will discuss techniques to improve memory access patterns.
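+
+For illustration, here is a small sketch of code that is typically bound by memory: traversing a linked list whose nodes are scattered in memory. The `Node` type and `sumList` function below are hypothetical examples, not code from a particular workload. Each load depends on the address produced by the previous one, so the processor spends most of its time waiting for data to arrive, and such code usually shows a high `Memory Bound` metric.
+
+~~~~ {.cpp}
+struct Node {
+  long  payload;
+  Node* next;
+};
+
+// Sum the payloads of a list. If the nodes do not fit in caches and are
+// scattered across the heap, almost every `n->next` dereference misses in
+// the caches, so the loop is limited by memory latency, not by computation.
+long sumList(const Node* n) {
+  long sum = 0;
+  while (n) {
+    sum += n->payload;
+    n = n->next;
+  }
+  return sum;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~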
diff --git a/chapters/8-Optimizing-Memory-Accesses/8-2 Cache-Friendly Data Structures copy.md b/chapters/8-Optimizing-Memory-Accesses/8-2 Cache-Friendly Data Structures copy.md
index be5d4d09d1..6041da3f59 100644
--- a/chapters/8-Optimizing-Memory-Accesses/8-2 Cache-Friendly Data Structures copy.md
+++ b/chapters/8-Optimizing-Memory-Accesses/8-2 Cache-Friendly Data Structures copy.md
@@ -1,24 +1,29 @@
 ## Cache-Friendly Data Structures {#sec:secCacheFriendly}
 
-[TODO]: Elaborate.
+Writing cache-friendly algorithms and data structures is one of the key items in the recipe for a well-performing application. The key pillars of cache-friendly code are the principles of temporal and spatial locality that we described in [@sec:MemHierar]. The goal here is to have a predictable memory access pattern and to store data efficiently.
 
-Writing cache-friendly algorithms and data structures, is one of the key items in the recipe for a well-performing application. The key pillar of cache-friendly code is the principles of temporal and spatial locality that we described in [@sec:MemHierar]. The goal here is to allow required data to be fetched from caches efficiently. When designing cache-friendly code, it's helpful to think in terms of cache lines, not only individual variables and their location in memory.
+The cache line is the smallest unit of data that can be transferred between the cache and the main memory. When designing cache-friendly code, it's helpful to think in terms of cache lines, not only individual variables and their location in memory.
+
+Next, we will discuss several techniques to make data structures more cache-friendly.
 
 ### Access Data Sequentially.
 
-[TODO]: Elaborate
+The best way to exploit the spatial locality of the caches is to make sequential memory accesses. By doing so, we enable the HW prefetching mechanism (see [@sec:HwPrefetch]) to recognize the memory access pattern and bring in the next chunk of data ahead of time. An example of row-major versus column-major traversal is shown in [@lst:CacheFriend]. Notice that there is only one tiny change in the code (the `col` and `row` subscripts are swapped), but it has a significant impact on performance.
 
-The best way to exploit the spatial locality of the caches is to make sequential memory accesses. By doing so, we allow the HW prefetcher (see [@sec:HwPrefetch]) to recognize the memory access pattern and bring in the next chunk of data ahead of time. An example of a C-code that does such cache-friendly accesses is shown on [@lst:CacheFriend]. The code is "cache-friendly" because it accesses the elements of the matrix in the order in which they are laid out in memory ([row-major traversal](https://en.wikipedia.org/wiki/Row-_and_column-major_order)[^6]). Swapping the order of indexes in the array (i.e., `matrix[column][row]`) will result in column-major order traversal of the matrix, which does not exploit spatial locality and hurts performance.
+The code on the left is not cache-friendly because it strides over `NCOLS` elements on every iteration of the inner loop. This results in a very inefficient use of caches. In contrast, the code on the right accesses the elements of the matrix in the order in which they are laid out in memory. Row-major traversal exploits spatial locality and is cache-friendly.
 
 Listing: Cache-friendly memory accesses.
 
 ~~~~ {#lst:CacheFriend .cpp}
-for (row = 0; row < NUMROWS; row++)
-  for (column = 0; column < NUMCOLUMNS; column++)
-    matrix[row][column] = row + column;
+// Column-major order                  // Row-major order
+for (row = 0; row < NROWS; row++)      for (row = 0; row < NROWS; row++)
+  for (col = 0; col < NCOLS; col++)      for (col = 0; col < NCOLS; col++)
+    matrix[col][row] = row + col;   =>     matrix[row][col] = row + col;
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The example presented in [@lst:CacheFriend] is classical, but usually, real-world applications are much more complicated than this. Sometimes you need to go an additional mile to write cache-friendly code. For instance, the standard implementation of binary search in a sorted large array does not exploit spatial locality since it tests elements in different locations that are far away from each other and do not share the same cache line. The most famous way of solving this problem is storing elements of the array using the Eytzinger layout [@EytzingerArray]. The idea of it is to maintain an implicit binary search tree packed into an array using the BFS-like layout, usually seen with binary heaps. If the code performs a large number of binary searches in the array, it may be beneficial to convert it to the Eytzinger layout.
+The example presented in [@lst:CacheFriend] is classical, but real-world applications are usually much more complicated than this. Sometimes you need to go the extra mile to write cache-friendly code. If the data is not laid out in memory in a way that is optimal for the algorithm, you may need to rearrange it first.
+
+Consider a standard implementation of binary search in a large sorted array: on each iteration, you access the middle element of the remaining range, compare it with the value you are searching for, and go either left or right. This algorithm does not exploit spatial locality since it tests elements at locations that are far away from each other and do not share the same cache line. The best-known way of solving this problem is to store the elements of the array using the Eytzinger layout [@EytzingerArray]. The idea is to maintain an implicit binary search tree packed into an array using a BFS-like layout, as is usually seen with binary heaps. If the code performs a large number of binary searches in the array, it may be beneficial to convert it to the Eytzinger layout.
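+
+One possible way to build and search such a layout is sketched below. The `eytzinger` and `lowerBound` helper names, the 1-based indexing, and the `int` element type are illustrative choices for this sketch, not the only way to implement the layout.
+
+~~~~ {.cpp}
+#include <cstddef>
+#include <vector>
+
+// Fill `out` (of size N+1; out[0] is unused) from the sorted array `in` so
+// that the element at index k has its children at indices 2k and 2k+1.
+size_t eytzinger(const std::vector<int>& in, std::vector<int>& out,
+                 size_t i = 0, size_t k = 1) {
+  if (k <= in.size()) {
+    i = eytzinger(in, out, i, 2 * k);     // left subtree
+    out[k] = in[i++];                     // current node
+    i = eytzinger(in, out, i, 2 * k + 1); // right subtree
+  }
+  return i;
+}
+
+// Return the index in `out` of the first element >= x, or 0 if there is none.
+// The frequently accessed top levels of the implicit tree occupy the first
+// cache lines of the array, so they tend to stay in caches across searches.
+size_t lowerBound(const std::vector<int>& out, int x) {
+  size_t k = 1, result = 0;
+  while (k < out.size()) {
+    if (out[k] >= x) { result = k; k = 2 * k; } // answer is here or to the left
+    else             { k = 2 * k + 1; }         // answer is to the right
+  }
+  return result;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~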
 
 ### Use Appropriate Containers.
@@ -30,48 +35,33 @@ Additionally, choose the data storage, bearing in mind what the code will do with it.
 
 ### Packing the Data.
 
-[TODO]: Cosmetics
-
-Memory hierarchy utilization can be improved by making the data more compact. There are many ways to pack data. One of the classic examples is to use bitfields. An example of code when packing data might be profitable is shown on [@lst:PackingData1]. If we know that `a`, `b`, and `c` represent enum values which take a certain number of bits to encode, we can reduce the storage of the struct `S` (see [@lst:PackingData2]).
-
-Listing: Packing Data: baseline struct.
+Utilization of data caches can also be improved by making the data more compact. There are many ways to pack data. One of the classic examples is to use bitfields. An example where packing data might be profitable is shown in [@lst:DataPacking]. If we know that `a`, `b`, and `c` represent enum values that take a certain number of bits to encode, we can reduce the storage of the struct `S`.
 
-~~~~ {#lst:PackingData1 .cpp}
-struct S {
-  unsigned a;
-  unsigned b;
-  unsigned c;
-}; // S is `sizeof(unsigned int) * 3` bytes
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Listing: Packing Data: packed struct.
+Listing: Data Packing
 
-~~~~ {#lst:PackingData2 .cpp}
-struct S {
-  unsigned a:4;
-  unsigned b:2;
-  unsigned c:2;
-}; // S is only 1 byte
+~~~~ {#lst:DataPacking .cpp}
+// S is `sizeof(unsigned int) * 3` bytes   // S is only `sizeof(unsigned int)` bytes
+struct S {                                 struct S {
+  unsigned a;                                unsigned a:4;
+  unsigned b;                 =>             unsigned b:2;
+  unsigned c;                                unsigned c:2;
+};                                         };
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-This greatly reduces the amount of memory transferred back and forth and saves cache space. Keep in mind that this comes with the cost of accessing every packed element. Since the bits of `b` share the same machine word with `a` and `c`, compiler need to perform a `>>` (shift right) and `&` (AND) operation to load it. Similarly, `<<` (shift left) and `|` (OR) operations are needed to store the value back. Packing the data is beneficial in places where additional computation is cheaper than the delay caused by inefficient memory transfers.
+This greatly reduces the amount of memory transferred back and forth and saves cache space. Keep in mind that this comes with the cost of accessing every packed element. Since the bits of `a`, `b`, and `c` now share a single machine word, the compiler needs to perform additional bit manipulation operations to load and store them. For example, to load `b`, the compiler generates a shift right (`>>`) and a logical AND (`&`) to extract its two bits from the word it shares with `a` and `c`. Similarly, shift left (`<<`) and logical OR (`|`) operations are needed to store the value back into the packed format. Data packing is beneficial in places where additional computation is cheaper than the delay caused by inefficient memory transfers.
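+
+As a rough sketch, the equivalent bit manipulation written by hand could look as follows. The `load_b`/`store_b` helper names are illustrative, and the exact shift amounts and masks depend on the bitfield layout chosen by the compiler (here we assume `a` occupies the lowest four bits and `b` the next two).
+
+~~~~ {.cpp}
+unsigned load_b(unsigned word) {
+  // Shift the bits of `b` down past `a`, then mask away the bits of `c`.
+  return (word >> 4) & 0x3u;
+}
+
+unsigned store_b(unsigned word, unsigned b) {
+  // Clear the old bits of `b`, then OR in the new value shifted into place.
+  return (word & ~(0x3u << 4)) | ((b & 0x3u) << 4);
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~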
 
-Also, a programmer can reduce the memory usage by rearranging fields in a struct or class when it avoids padding added by a compiler (see example in [@lst:PackingData3]). The reason for a compiler to insert unused bytes of memory (pads) is to allow efficient storing and fetching of individual members of a struct. In the example, the size of `S1` can be reduced if its members are declared in the order of decreasing their sizes.
+A programmer can also reduce memory usage by rearranging the fields of a struct or class in a way that avoids padding added by the compiler. Compilers insert unused bytes of memory (padding) to enable efficient storing and fetching of individual struct members. In the example in [@lst:AvoidPadding], the size of `S` can be reduced if its members are declared in decreasing order of their sizes.
 
 Listing: Avoid compiler padding.
 
-~~~~ {#lst:PackingData3 .cpp}
-struct S1 {
-  bool b;
-  int i;
-  short s;
-}; // S1 is `sizeof(int) * 3` bytes
-
-struct S2 {
-  int i;
-  short s;
-  bool b;
-}; // S2 is `sizeof(int) * 2` bytes
+~~~~ {#lst:AvoidPadding .cpp}
+// S is `sizeof(int) * 3` bytes   // S is `sizeof(int) * 2` bytes
+struct S {                        struct S {
+  bool b;                           int i;
+  int i;              =>            short s;
+  short s;                          bool b;
+};                                };
+
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 ### Field Reordering
@@ -212,8 +202,6 @@ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ae
 
 [TODO]: Trim footnotes
 
-[^5]: The same applies to memory deallocation.
-[^6]: Row- and column-major order - [https://en.wikipedia.org/wiki/Row-_and_column-major_order](https://en.wikipedia.org/wiki/Row-_and_column-major_order).
 [^8]: Blog article "Vector of Objects vs Vector of Pointers" by B. Filipek - [https://www.bfilipek.com/2014/05/vector-of-objects-vs-vector-of-pointers.html](https://www.bfilipek.com/2014/05/vector-of-objects-vs-vector-of-pointers.html).
 [^13]: Linux manual page for `memalign` - [https://linux.die.net/man/3/memalign](https://linux.die.net/man/3/memalign).
 [^14]: Generating aligned memory - [https://embeddedartistry.com/blog/2017/02/22/generating-aligned-memory/](https://embeddedartistry.com/blog/2017/02/22/generating-aligned-memory/).
diff --git a/chapters/8-Optimizing-Memory-Accesses/8-3 Dynamic Memory Allocation.md b/chapters/8-Optimizing-Memory-Accesses/8-3 Dynamic Memory Allocation.md
index 200fef538d..f0dd9b1d80 100644
--- a/chapters/8-Optimizing-Memory-Accesses/8-3 Dynamic Memory Allocation.md
+++ b/chapters/8-Optimizing-Memory-Accesses/8-3 Dynamic Memory Allocation.md
@@ -16,6 +16,7 @@ Intel CPUs have a Data Linear Address HW feature (see [@sec:sec_PEBS_DLA]) that
 
 [TODO]: Trim footnotes
 
+[^5]: The same applies to memory deallocation.
 [^9]: Usually, people tune for the size of the L2 cache since it is not shared between the cores.
 [^10]: Blog article "Detecting false sharing" - [https://easyperf.net/blog/2019/12/17/Detecting-false-sharing-using-perf#2-tune-the-code-for-better-utilization-of-cache-hierarchy](https://easyperf.net/blog/2019/12/17/Detecting-false-sharing-using-perf#2-tune-the-code-for-better-utilization-of-cache-hierarchy).
 [^11]: In Intel processors `CPUID` instruction is described in [@IntelOptimizationManual, Volume 2]