[Chapter3] Small fixes.
dendibakh committed Sep 14, 2024
1 parent 90ccf68 commit 6ffd668
Showing 2 changed files with 4 additions and 4 deletions.
2 changes: 1 addition & 1 deletion chapters/3-CPU-Microarchitecture/3-3 Exploiting ILP.md
@@ -28,7 +28,7 @@ Scoreboarding was first implemented in the CDC6600 processor in the 1960s. Its m

To eliminate false dependencies, the Tomasulo algorithm makes use of register renaming, which we discussed in the previous section. Because of that, performance is greatly improved compared to scoreboarding. However, a sequence of instructions that carry a RAW dependency, also known as a *dependency chain*, is still problematic for OOO execution, because there is no increase in the ILP after register renaming since all the RAW dependencies are preserved. Dependency chains are often found in loops (loop carried dependency) where the current loop iteration depends on the results produced on the previous iteration.

- Modern processors implement dynamic scheduling techniques that are derivatives of Tomasulo's original algorithm and include the Reorder Buffer (ROB) and the Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors, it has several hundred entries. Typically, the size of the ROB determines how far ahead the hardware can look for scheduling instructions independently. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when instructions are placed in the ROB.
+ Two other key components for implementing dynamic scheduling are the Reorder Buffer (ROB) and the Reservation Station (RS). The ROB is a circular buffer that keeps track of the state of each instruction, and in modern processors, it has several hundred entries. Typically, the size of the ROB determines how far ahead the hardware can look for independent instructions to schedule. Instructions are inserted in the ROB in program order, can execute out of order, and retire in program order. Register renaming is done when instructions are placed in the ROB.

From the ROB, instructions are inserted in the RS, which has much fewer entries. Once instructions are in the RS, they wait for their input operands to become available. When inputs are available, instructions can be issued to the appropriate execution unit. So instructions can be executed in any order once their operands become available and are not tied to the program order any longer. Modern processors are becoming wider (can execute many instructions in one cycle) and deeper (larger ROB, RS, and other buffers), which demonstrates that there is a lot of potential to uncover more ILP in production applications.

6 changes: 3 additions & 3 deletions chapters/3-CPU-Microarchitecture/3-8 Modern CPU design.md
@@ -80,13 +80,13 @@ In case the L2 miss is confirmed, the load continues to wait for the results of

When a store modifies a memory location, the processor needs to load the full cache line, change it, and then write it back to memory. If the address to write is not in the cache, it goes through a very similar mechanism as with loads to bring that data in. The store cannot be complete until the data is written to the cache hierarchy.

- Of course, there are a few optimizations done for store operations as well. First, if we're dealing with a store or multiple adjacent stores (also known as *streaming stores*) that modify an entire cache line, there is no need to read the data first as all of the bytes will be clobbered anyway. So, the processor will try to combine writes to fill an entire cache line. If this succeeds no memory read operation is needed at all.
+ Of course, there are a few optimizations done for store operations as well. First, if we're dealing with a store or multiple adjacent stores (also known as *streaming stores*) that modify an entire cache line, there is no need to read the data first as all of the bytes will be clobbered anyway. So, the processor will try to combine writes to fill an entire cache line. If this succeeds, no memory read operation is needed.

Second, write combining enables multiple stores to be assembled and written further out in the cache hierarchy as a unit. So, if multiple stores modify the same cache line, only one memory write will be issued to the memory subsystem. All these optimizations are done inside the Store Buffer. A store instruction copies the data that will be written from a register into the Store Buffer. From there it may be written to the L1 cache or it may be combined with other stores to the same cache line. The Store Buffer capacity is limited, so it can hold requests for partial writing to a cache line only for some time. However, while the data sits in the Store Buffer waiting to be written, other load instructions can read the data straight from the store buffers (store-to-load forwarding).

- Finally, if we happen to read data before overwriting it, the cache line typically stays in the cache. This behavior can be altered with the help of a *non-temporal* store, a special CPU instruction that will not keep the modified line in the cache. It is useful in situations when we know we will not need the data once we have changed it. Non-temporal stores help to utilize cache space more efficiently by not evicting other data that might be needed soon.
+ Finally, there are cases when we can improve cache utilization by using so-called *non-temporal* memory accesses. If we execute a partial store (e.g., we overwrite 8 bytes in a cache line), we need to read the cache line first. This new cache line will displace some other line in the cache. However, if we know that we won't need this data again, it would be better not to allocate space in the cache for that line. Non-temporal memory accesses are special CPU instructions that do not keep the fetched line in the cache and drop it immediately after use.

- During a typical program execution, there could be dozens of memory accesses in flight. In most high-performance processors, the order of load and store operations is not necessarily required to be the same as the program order, which is known as a _weakly ordered memory model_. For optimization purposes, the processor can reorder memory read and write operations. Consider a situation when a load runs into a cache miss and has to wait until the data comes from memory. The processor allows subsequent loads to proceed ahead of the load that is waiting for the data. This allows later loads to finish before the earlier load and doesn't unnecessarily block the execution. Such load/store reordering enables the memory units to process multiple memory accesses in parallel, which translates directly into higher performance.
+ During a typical program execution, there could be dozens of memory accesses in flight. In most high-performance processors, the order of load and store operations is not necessarily required to be the same as the program order, which is known as a _weakly ordered memory model_. For optimization purposes, the processor can reorder memory read and write operations. Consider a situation when a load runs into a cache miss and has to wait until the data comes from memory. The processor allows subsequent loads to proceed ahead of the load that is waiting for the data. This allows later loads to finish before the earlier load and doesn't unnecessarily block the execution. Such load/store reordering enables memory units to process multiple memory accesses in parallel, which translates directly into higher performance.

There are a few exceptions. Just like with dependencies through regular arithmetic instructions, there are memory dependencies through loads and stores. In other words, a load can depend on an earlier store and vice-versa. First of all, stores cannot be reordered with older loads:

