[Chapter8] Working on cache-friendly data structures
dendibakh committed Mar 14, 2024
1 parent 8f0d3b9 commit 8c2e466
Showing 3 changed files with 70 additions and 10 deletions.
@@ -4,8 +4,6 @@ typora-root-url: ..\..\img

# Optimizing Memory Accesses {#sec:MemBound}

Modern computers are still built on the classical von Neumann architecture, which decouples the CPU, memory, and input/output units. Nowadays, operations with memory (loads and stores) account for the largest portion of performance bottlenecks and power consumption, so it is no surprise that we start with this category.

The statement that memory hierarchy performance is critical is illustrated by Figure @fig:CpuMemGap, which shows the growth of the gap in performance between memory and processors. The vertical axis, on a logarithmic scale, plots the CPU-DRAM performance gap. The memory baseline is the latency of memory access of 64 KB DRAM chips from 1980. Typical DRAM performance improves by 7% per year, while CPUs enjoy 20-50% improvement per year. According to this figure, processor performance has plateaued, but even then, the gap remains. [@Hennessy]
@@ -32,8 +32,6 @@ Additionally, choose the data storage, bearing in mind what the code will do with it

[TODO]: Cosmetics

Memory hierarchy utilization can be improved by making the data more compact. There are many ways to pack data; one of the classic examples is to use bitfields. An example of code where packing data might be profitable is shown in [@lst:PackingData1]. If we know that `a`, `b`, and `c` represent enum values that require only a certain number of bits to encode, we can reduce the storage of the struct `S` (see [@lst:PackingData2]).

Listing: Packing Data: baseline struct.
@@ -76,6 +74,10 @@ struct S2 {
}; // S2 is `sizeof(int) * 2` bytes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Field Reordering

[TODO]: include example of using data-type profiling (https://lwn.net/Articles/955709/). Find a good example for a case study using `perf mem`.
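
Until a full case study is in place, here is a minimal sketch of the idea; the struct and field names are hypothetical. On a typical 64-bit target, ordering members from largest to smallest alignment removes the padding bytes that the compiler would otherwise insert:

```cpp
#include <cstdint>

struct BadOrder {
  bool    enabled; // 1 byte + 3 bytes of padding
  int32_t count;   // 4 bytes
  bool    dirty;   // 1 byte + 3 bytes of padding
  int32_t id;      // 4 bytes
};                 // sizeof(BadOrder) == 16

struct GoodOrder {
  int32_t count;   // 4 bytes
  int32_t id;      // 4 bytes
  bool    enabled; // 1 byte
  bool    dirty;   // 1 byte + 2 bytes of tail padding
};                 // sizeof(GoodOrder) == 12
```

Besides shrinking each object, reordering can also group fields that are accessed together so that they land on the same cache line.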

### Aligning and Padding. {#sec:secMemAlign}

[TODO]: Cosmetics. Mention that vtune tracks it with the `Split Loads` metric.
@@ -142,11 +144,71 @@ One of the most important areas for alignment considerations is the SIMD code.
__m512 * ptr = new __m512[N];
```
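
When the data is not naturally expressed in terms of vector types, an over-aligned buffer can be requested explicitly. Below is a minimal sketch, assuming C++17 (`std::aligned_alloc`); the helper name and the 64-byte alignment are chosen for the example:

```cpp
#include <cstddef>
#include <cstdlib>   // std::aligned_alloc, std::free

// Allocate a 64-byte-aligned buffer of n floats. std::aligned_alloc
// expects the requested size to be a multiple of the alignment,
// so round the byte count up first.
float* AllocAlignedFloats(std::size_t n) {
  constexpr std::size_t kAlign = 64;  // cache line / AVX-512 vector size
  std::size_t bytes = ((n * sizeof(float) + kAlign - 1) / kAlign) * kAlign;
  return static_cast<float*>(std::aligned_alloc(kAlign, bytes));
}
// The returned buffer must be released with std::free().
```

For objects with static or automatic storage duration, the same effect can be achieved by putting `alignas(64)` on the declaration.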

### Other Data Structure Reorganization Techniques

[TODO]: to be written

* **structure splitting**

If only a subset of the fields is accessed in the hot path, splitting the structure keeps those hot fields densely packed in memory, which improves cache-line and memory-bandwidth utilization. A simple example:
```cpp
struct Point {
  int X;
  int Y;
  int Z;
  /*many other fields*/
};
std::vector<Point> points;

=>

struct PointCoords {
  int X;
  int Y;
  int Z;
};
struct PointInfo {
  /*many other fields*/
};
std::vector<PointCoords> pointCoords;
std::vector<PointInfo> pointInfos;
```

* **pointer inlining**

Moving a frequently accessed field of a pointed-to structure directly into the structure that references it avoids an extra pointer dereference (and a likely cache miss) on the hot path:
```cpp
struct GraphEdge {
  unsigned int from;
  unsigned int to;
  GraphEdgeProperties* prop;
};
struct GraphEdgeProperties {
  float weight;
  std::string label;
  // ...
};

=>

struct GraphEdge {
  unsigned int from;
  unsigned int to;
  float weight;
  GraphEdgeProperties* prop;
};
struct GraphEdgeProperties {
  std::string label;
  // ...
};
```
This pattern comes from one of the open-source graph analytics packages. Use data-type profiling to find such opportunities. Recent kernel history is full of commits that reorder, pad, or pack structure fields to improve performance, for example:

* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=54ff8ad69c6e93c0767451ae170b41c000e565dd
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5598d6ae62626d261b046a2f19347c38681ff51
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aee79d4e5271cee4ffa89ed830189929a6272eb8
[TODO]: Trim footnotes
@@ -1,12 +1,12 @@
## Dynamic Memory Allocation.

[TODO]: Elaborate. Add reference to heaptrack.

First of all, there are many drop-in replacements for `malloc` that are faster, more scalable,[^15] and better address [fragmentation](https://en.wikipedia.org/wiki/Fragmentation_(computing))[^20] problems. You can gain a few percent of performance just by using a non-standard memory allocator. A typical issue with dynamic memory allocation arises at startup, when threads race with each other trying to allocate their memory regions at the same time.[^5] Two of the most popular memory allocation libraries are [jemalloc](http://jemalloc.net/)[^17] and [tcmalloc](https://github.com/google/tcmalloc)[^18].

Secondly, it is possible to speed up allocations using custom allocators, for example, [arena allocators](https://en.wikipedia.org/wiki/Region-based_memory_management)[^16]. One of their main advantages is low overhead, since such allocators don't execute system calls for every memory allocation. Another advantage is high flexibility: developers can implement their own allocation strategies based on the memory region provided by the OS. One simple strategy could be to maintain two different allocators with their own arenas (memory regions): one for the hot data and one for the cold data. Keeping hot data together creates opportunities for it to share cache lines, which improves memory bandwidth utilization and spatial locality. It also improves TLB utilization, since hot data occupies fewer memory pages. Also, custom memory allocators can use thread-local storage to implement per-thread allocation and get rid of any synchronization between threads. This becomes useful when an application is based on a thread pool and does not spawn a large number of threads.
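
To make the idea concrete, here is a minimal sketch of a bump (arena) allocator; the class name and interface are made up for illustration and omit details a production allocator would need (growth, thread safety, alignment of the backing region):

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal bump allocator: grab one big region up front and hand out
// chunks by advancing an offset. No per-allocation system calls and no
// individual deallocations; the whole arena is released at once.
class Arena {
 public:
  explicit Arena(std::size_t capacity)
      : base_(static_cast<char*>(std::malloc(capacity))),
        capacity_(capacity), offset_(0) {}
  ~Arena() { std::free(base_); }

  void* allocate(std::size_t size,
                 std::size_t align = alignof(std::max_align_t)) {
    // align must be a power of two; round the offset up to it.
    std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
    if (aligned + size > capacity_)
      return nullptr;                  // arena exhausted
    offset_ = aligned + size;
    return base_ + aligned;
  }

 private:
  char* base_;
  std::size_t capacity_;
  std::size_t offset_;
};
```

With such a building block, one could keep two arenas, e.g., `Arena hot(1 << 20)` and `Arena cold(1 << 20)`, and route allocations of frequently accessed objects to the first one, as described above.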

## Tune the Code for Memory Hierarchy.

[TODO]: Elaborate more
