Chapter 3 edits (#70)
* 3-1: number agreement, ARM v8 becomes Armv8-A

* 3-2: markdown syntax, indefinite article

* ch3: Model 91

* chapter4: target, not just direction

* ch3-4: number agreement

* 3-6: simplify to 'per transfer'

* ch3: small things

* 3.8: kill duplicated sentences, number issues

* normalize front-end to frontend

* 3: as is -> as-is

* 3-9: monotype 'perf'
dankamongmen authored Sep 14, 2024
1 parent b0a70e8 commit 49cda28
Showing 10 changed files with 77 additions and 77 deletions.
8 changes: 4 additions & 4 deletions chapters/3-CPU-Microarchitecture/3-1 ISA.md
@@ -1,9 +1,9 @@
## Instruction Set Architecture

-The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] ARM v8 and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i.
+The instruction set architecture (ISA) is the contract between the software and the hardware, which defines the rules of communication. Intel x86-64,[^1] Armv8-A and RISC-V are examples of current-day ISAs that are widely deployed. All of these are 64-bit architectures, i.e., all address computations use 64 bits. ISA developers and CPU architects typically ensure that software or firmware conforming to the specification will execute on any processor built using the specification. Widely deployed ISAs also typically ensure backward compatibility such that code written for the GenX version of a processor will continue to execute on GenX+i.

-Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM where the operands are explicitly specified, and memory is accessed only using load and store instructions. The X86 ISA is a register-memory architecture, where operations can be performed on registers, as well as memory operands. In addition to providing the basic functions in an ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to enhance their ISA to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance.
+Most modern architectures can be classified as general-purpose register-based, load-store architectures, such as RISC-V and ARM where the operands are explicitly specified, and memory is accessed only using load and store instructions. The X86 ISA is a register-memory architecture, where operations can be performed on registers, as well as memory operands. In addition to providing the basic functions in an ISA such as load, store, control and scalar arithmetic operations using integers and floating-point, the widely deployed architectures continue to augment their ISAs to support new computing paradigms. These include enhanced vector processing instructions (e.g., Intel AVX2, AVX512, ARM SVE, RISC-V "V" vector extension) and matrix/tensor instructions (Intel AMX, ARM SME). Software mapped to use these advanced instructions typically provides orders of magnitude improvement in performance.
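
To make the distinction concrete, here is a sketch in the pseudo-assembly notation this chapter uses later (the mnemonics and the `[X]` memory operand are illustrative, not taken from any real ISA). Incrementing a value in memory takes three instructions on a load-store machine, but can be a single instruction on a register-memory machine:

```
; load-store style (RISC-V, ARM): memory is accessed only via loads and stores
R1 = LOAD [X]
R1 = R1 ADD 1
STORE [X], R1

; register-memory style (x86): arithmetic may take a memory operand directly
ADD [X], 1
```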

-Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. With the fast-evolving field of machine learning and AI, the industry has a renewed interest in alternative numeric formats for variables to drive significant performance improvements. Research has shown that machine learning models perform just as well, using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8), 16-bit floating-point (fp16, bf16) in the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
+Modern CPUs support 32-bit and 64-bit precision for floating-point and integer arithmetic operations. With the fast-evolving fields of machine learning and AI, the industry has a renewed interest in alternative numeric formats to drive significant performance improvements. Research has shown that machine learning models perform just as well using fewer bits to represent variables, saving on both compute and memory bandwidth. As a result, several CPU franchises have recently added support for lower precision data types such as 8-bit integers (int8) and 16-bit floating-point (fp16, bf16) to the ISA, in addition to the traditional 32-bit and 64-bit formats for arithmetic operations.
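
To see why fewer bits translate into more throughput, consider vector arithmetic. This is a sketch with hypothetical mnemonics and a hypothetical 512-bit vector register `Z0`; the lane counts simply follow from the element width:

```
; one operation on a 512-bit vector register processes:
VADD.fp32 Z0, Z1, Z2   ; 16 lanes of 32-bit floats
VADD.bf16 Z0, Z1, Z2   ; 32 lanes of 16-bit floats -- 2x the work per instruction
VADD.int8 Z0, Z1, Z2   ; 64 lanes of 8-bit integers -- 4x, plus 4x less memory traffic
```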

-[^1]: In the book we sometimes write x86 for brevity, but we assume x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999.
+[^1]: In the book we sometimes write x86 for brevity, but we assume x86-64, which is a 64-bit version of the x86 instruction set, first announced in 1999.
4 changes: 2 additions & 2 deletions chapters/3-CPU-Microarchitecture/3-11 Chapter summary.md
@@ -6,8 +6,8 @@
* The details of the implementation are encapsulated in the term CPU "microarchitecture". This topic has been researched by thousands of computer scientists for a long time. Through the years, many smart ideas were invented and implemented in mass-market CPUs. The most notable are pipelining, out-of-order execution, superscalar engines, speculative execution and SIMD processors. All these techniques help exploit Instruction-Level Parallelism (ILP) and improve single-threaded performance.
* In parallel with single-threaded performance, hardware designers began pushing multi-threaded performance. The vast majority of modern client-facing devices have a processor containing multiple cores. Some processors double the number of observable CPU cores with the help of Simultaneous Multithreading (SMT). SMT enables multiple software threads to run simultaneously on the same physical core using shared resources. A more recent technique in this direction is called "hybrid" processors, which combine different types of cores in a single package to better support a diversity of workloads.
* The memory hierarchy in modern computers includes several levels of cache that reflect different tradeoffs in speed of access vs. size. L1 cache tends to be closest to a core, fast but small. The L3/LLC cache is slower but also bigger. DDR is the predominant DRAM technology used in most platforms. DRAM modules vary in the number of ranks and memory width, which may have a slight impact on system performance. Processors may have multiple memory channels to access more than one DRAM module simultaneously.
-* Virtual memory is the mechanism for sharing physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, and on ARM is 16KB. Only the page address gets translated, the offset within the page is used as is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are hardware features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and hardware page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]).
-* We looked at the design of Intel's recent GoldenCove microarchitecture. Logically, the core is split into a Front End and a Back End. The Front-End consists of a Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the CPU Back End. The Back-End consists of the OOO engine, execution units, the load-store unit, the L1-D cache, and the TLB hierarchy.
+* Virtual memory is the mechanism for sharing physical memory with all the processes running on the CPU. Programs use virtual addresses in their accesses, which get translated into physical addresses. The memory space is split into pages. The default page size on x86 is 4KB, and on ARM is 16KB. Only the page address gets translated, the offset within the page is used as-is. The OS keeps the translation in the page table, which is implemented as a radix tree. There are hardware features that improve the performance of address translation: mainly the Translation Lookaside Buffer (TLB) and hardware page walkers. Also, developers can utilize Huge Pages to mitigate the cost of address translation in some cases (see [@sec:secDTLB]).
+* We looked at the design of Intel's recent GoldenCove microarchitecture. Logically, the core is split into a Front End and a Back End. The Frontend consists of a Branch Predictor Unit (BPU), L1-I cache, instruction fetch and decode logic, and the IDQ, which feeds instructions to the CPU Back End. The Back End consists of the OOO engine, execution units, the load-store unit, the L1-D cache, and the TLB hierarchy.
* Modern processors have performance monitoring features that are encapsulated into a Performance Monitoring Unit (PMU). This unit is built around a concept of Performance Monitoring Counters (PMC) that enables observation of specific events that happen while a program is running, for example, cache misses and branch mispredictions.

\sectionbreak
8 changes: 4 additions & 4 deletions chapters/3-CPU-Microarchitecture/3-2 Pipelining.md
@@ -1,6 +1,6 @@
## Pipelining

-Pipelining is the foundational technique used to make CPUs fast wherein multiple instructions are overlapped during their execution. Pipelining in CPUs drew inspiration from the automotive assembly lines. The processing of instructions is divided into stages. The stages operate in parallel, working on different parts of different instructions. DLX is a relatively simple architecture designed by John L. Hennessy and David A. Patterson in 1994. As defined in [@Hennessy], it has a 5-stage pipeline which consists of:
+Pipelining is a foundational technique used to make CPUs fast wherein multiple instructions are overlapped during their execution. Pipelining in CPUs drew inspiration from automotive assembly lines. The processing of instructions is divided into stages. The stages operate in parallel, working on different parts of different instructions. DLX is a relatively simple architecture designed by John L. Hennessy and David A. Patterson in 1994. As defined in [@Hennessy], it has a 5-stage pipeline which consists of:

1. Instruction fetch (IF)
2. Instruction decode (ID)
@@ -35,9 +35,9 @@ In real implementations, pipelining introduces several constraints that limit th
R2 = R1 ADD 2
```

-There is a RAW dependency for register R1. If we take the value directly after addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes, and the value will be written to the register file. Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing becomes.
+There is a RAW dependency for register R1. If we take the value directly after addition `R0 ADD 1` is done (from the `EXE` pipeline stage), we don't need to wait until the `WB` stage finishes (when the value will be written to the register file). Bypassing helps to save a few cycles. The longer the pipeline, the more effective bypassing becomes.
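
The effect can be sketched as a pipeline diagram, assuming the 5-stage pipeline introduced earlier; the exact number of stall cycles depends on when the register file is written and read, so take the bubble count as illustrative:

```
; without bypassing: the dependent EXE waits until WB updates the register file
              cycle:  1    2    3    4    5    6    7
R1 = R0 ADD 1         IF   ID   EXE  MEM  WB
R2 = R1 ADD 2              IF   ID   --   --   EXE  MEM ...

; with bypassing: the EXE result is forwarded directly into the next EXE
R1 = R0 ADD 1         IF   ID   EXE  MEM  WB
R2 = R1 ADD 2              IF   ID   EXE  MEM  WB
```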

-A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*. It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural state, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:
+A *write-after-read* (WAR) hazard requires a dependent write to execute after a read. It occurs when instruction `x+1` writes a source before instruction `x` reads the source, resulting in the wrong new value being read. A WAR hazard is not a true dependency and is eliminated by a technique called *register renaming*. It is a technique that abstracts logical registers from physical registers. CPUs support register renaming by keeping a large number of physical registers. Logical (*architectural*) registers, the ones that are defined by the ISA, are just aliases over a wider register file. With such decoupling of the *architectural* state, solving WAR hazards is simple: we just need to use a different physical register for the write operation. For example:

```
; machine code, WAR hazard ; after register renaming
@@ -61,4 +61,4 @@ In real implementations, pipelining introduces several constraints that limit th

* **Control hazards**: are caused by pipelining branches and other instructions that change the program flow. The branch condition that determines the direction of the branch (taken vs. not taken) is resolved in the execute pipeline stage. As a result, the fetch of the next instruction cannot be pipelined unless the control hazard is eliminated. Techniques such as dynamic branch prediction and speculative execution, described in the next section, are used to overcome control hazards.
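
A minimal sketch in the same pseudo-assembly (the `BRANCH_NZ` mnemonic and the `loop` label are illustrative):

```
loop:
R1 = R1 SUB 1
BRANCH_NZ R1, loop   ; taken vs. not taken is known only after EXE
R2 = R2 ADD 1        ; the fetch of this instruction (or of `loop` again)
                     ; stalls unless the branch direction is predicted
```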

-\lstset{linewidth=\textwidth}
+\lstset{linewidth=\textwidth}
