[Fixing TODOs] part23
dendibakh committed Aug 11, 2024
1 parent ed3c618 commit c3118ea
Showing 9 changed files with 34 additions and 28 deletions.
6 changes: 4 additions & 2 deletions chapters/0-Preface/0-2 Preface.md
@@ -30,7 +30,9 @@ I joined Intel in 2017, but even before that I never shied away from software op

I sincerely hope that this book will help you learn low-level performance analysis, and, if you make your application faster as a result, I will consider my mission accomplished.

You will find that I use "we" instead of "I" in many places in the book. This is because I received a lot of help from other people. The PDF version of this book and the "Performance Ninja" online course are available for free. This is my way to give back to the community. The full list of contributors can be found at the end of the book in the "Acknowledgements" section.
You will find that I use "we" instead of "I" in many places in the book. This is because I received a lot of help from other people. The full list of contributors can be found at the end of the book in the "Acknowledgements" section.

The PDF version of this book and the "Performance Ninja" online course are available for free. This is my way of giving back to the community.

## Target Audience {.unlisted .unnumbered}

@@ -40,7 +42,7 @@ This book will also be useful for any developer who wants to understand the perf

Readers are expected to have a minimal background in C/C++ programming languages to understand the book's examples. The ability to read basic x86/ARM assembly is desired but is not a strict requirement. I also expect familiarity with basic concepts of computer architecture and operating systems like central processor, memory, process, thread, virtual and physical memory, context switch, etc. If any of the mentioned terms are new to you, I suggest studying this material first.

I suggest you read the book chapter by chapter, starting from the beginning. If you consider yourself a beginner in performance analysis, I do not recommend skipping chapters. After you finish reading, this book can be used as a reference or a checklist for optimizing software applications. The second part of the book can be a source of ideas for code optimizations.
I suggest you read the book chapter by chapter, starting from the beginning. If you consider yourself a beginner in performance analysis, I do not recommend skipping chapters. After you finish reading, you can use this book as a source of ideas whenever you face a performance issue and it's not immediately clear how to fix it. You can skim through the second part of the book to see which optimization techniques can be applied to your code.

[TODO]: put a link to an errata webpage

9 changes: 7 additions & 2 deletions chapters/1-Introduction/1-0 Introduction.md
@@ -2,15 +2,20 @@

They say, "Performance is king". It was true a decade ago, and it certainly is now. According to [@Domo2017], in 2017, the world was creating 2.5 quintillion[^1] bytes of data every day, and as predicted in [@Statista2024], it will reach 400 quintillion bytes per day in 2024. In our increasingly data-centric world, the growth of information exchange fuels the need for both faster software and faster hardware.

Software programmers have had an "easy ride" for decades, thanks to Moore’s law. It used to be the case that some software vendors preferred to wait for a new generation of hardware to speed up their software products and did not spend human resources on making improvements in the code. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down.
Software programmers have had an "easy ride" for decades, thanks to Moore’s law. It used to be the case that some software vendors preferred to wait for a new generation of hardware to speed up their software products and did not spend human resources on making improvements in their code. By looking at Figure @fig:50YearsProcessorTrend, we can see that single-threaded[^2] performance growth is slowing down. From 1990 to 2000, single-threaded performance grew by a factor of approximately 25 to 30, based on SPECint benchmarks. The increase in CPU frequency was the key factor driving performance growth.

![50 Years of Microprocessor Trend Data. *© Image by K. Rupp via karlrupp.net*. Original data up to the year 2010 was collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2021 by K. Rupp.](../../img/intro/50-years-processor-trend.png){#fig:50YearsProcessorTrend width=100%}

The original interpretation of Moore's law is still standing, though, as transistor count in modern processors maintains its trajectory. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that will have more than 100 logical cores on a single CPU socket. This is very impressive, unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.
However, from 2000 to 2010, single-threaded CPU performance growth was more modest compared to the previous decade (approximately 4 to 5 times). Clock speed stagnated due to a combination of power consumption and heat dissipation challenges, limitations in voltage scaling (Dennard Scaling[^3]), and other fundamental problems. Despite slower clock speed improvements, architectural advancements continued, including better branch prediction, deeper pipelines, larger caches, and more efficient execution units.

From 2010 to 2020, single-threaded performance grew only by about 2 to 3 times. During this period, CPU manufacturers began to focus more on multi-core processors and parallelism rather than solely increasing single-threaded performance.

The original interpretation of Moore's law is still standing, as transistor count in modern processors maintains its trajectory. For instance, the number of transistors in Apple chips grew from 16 billion in M1 to 20 billion in M2, to 25 billion in M3, to 28 billion in M4 in a span of roughly four years. The growth in transistor count enables manufacturers to add more cores to a processor. As of 2024, you can buy a high-end server processor that will have more than 100 logical cores on a single CPU socket. This is very impressive; unfortunately, it doesn't always translate into better performance. Very often, application performance doesn't scale with extra CPU cores.

When it's no longer the case that each hardware generation provides a significant performance boost, we must start paying more attention to how fast our code runs. When seeking ways to improve performance, developers should not rely on hardware. Instead, they should start optimizing the code of their applications.

> “Software today is massively inefficient; it’s become prime time again for software programmers to get really good at optimization.” - Marc Andreessen, the US entrepreneur and investor (a16z Podcast)
[^1]: A quintillion is a thousand raised to the power of six (10^18^).
[^2]: Single-threaded performance is the performance of a single hardware thread inside a CPU core when measured in isolation.
[^3]: Dennard Scaling - [https://en.wikipedia.org/wiki/Dennard_scaling](https://en.wikipedia.org/wiki/Dennard_scaling)
4 changes: 2 additions & 2 deletions chapters/1-Introduction/1-1 Why Software Is Slow.md
@@ -1,6 +1,6 @@
## Why Is Software Slow?

If all the software in the world would magically start utilizing all available hardware resources efficiently, then this book would not exist. We would not need any changes on the software side and would rely on what existing processors have to offer. But you already know that the reality is different, right? The reality is that modern software is *massively* inefficient. A regular server system in a public cloud, typically runs poorly optimized code, consuming more power than it could have consumed, which increases carbon emissions and contributes to other environmental issues. If we could make all software run two times faster, this would reduce the carbon footprint of computing by a factor of two.
If all the software in the world magically started utilizing all available hardware resources efficiently, then this book would not exist. We would not need any changes on the software side and would rely on what existing processors have to offer. But you already know that the reality is different, right? The reality is that modern software is *massively* inefficient. A regular server system in a public cloud typically runs poorly optimized code, consuming more power than it could have, which increases carbon emissions and contributes to other environmental issues. If we could make all software run two times faster, this would potentially reduce the carbon footprint of computing by a factor of two.

The authors of the paper [@Leisersoneaam9744] provide an excellent example that illustrates the performance gap between "default" and highly optimized software. Table @tbl:PlentyOfRoom summarizes speedups from performance engineering a program that multiplies two 4096-by-4096 matrices. The end result of applying several optimizations is a program that runs over 60,000 times faster. The reason for providing this example is not to pick on Python or Java (which are great languages), but rather to dispel the belief that software has "good enough" performance by default. The majority of programs are within rows 1-5. The potential for source-code-level improvements is significant.
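
To give a concrete flavor of what a source-code-level improvement can look like, below is a minimal, illustrative sketch in C++ (not the exact sequence of optimizations from the cited paper): a textbook triple-loop matrix multiplication next to a variant with interchanged loops that accesses memory with better cache locality. The function names, the row-major layout, and the `i-k-j` loop order are assumptions made for this illustration.

```cpp
#include <vector>

// Naive "textbook" version: the innermost loop walks B column by column,
// striding across memory, which causes frequent cache misses for large N.
void matmul_naive(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int N) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
      double sum = 0.0;
      for (int k = 0; k < N; ++k)
        sum += A[i * N + k] * B[k * N + j];
      C[i * N + j] = sum;
    }
}

// Same arithmetic with interchanged loops (i-k-j); C must be zero-initialized.
// The innermost loop now reads B and writes C sequentially, which is much more
// cache-friendly and easier for a compiler to vectorize.
void matmul_ikj(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& C, int N) {
  for (int i = 0; i < N; ++i)
    for (int k = 0; k < N; ++k) {
      const double a = A[i * N + k];
      for (int j = 0; j < N; ++j)
        C[i * N + j] += a * B[k * N + j];
    }
}
```

On typical hardware, the interchanged version alone can run several times faster than the naive one, even before parallelization, blocking, or explicit vectorization are applied.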

@@ -30,7 +30,7 @@ Table: Speedups from performance engineering a program that multiplies two 4096-
So, let's talk about what prevents systems from achieving optimal performance by default. Here are some of the most important factors:

1. **CPU limitations**: it's so tempting to ask: "*Why doesn't hardware solve all our problems?*" Modern CPUs execute instructions at incredible speed and are getting better with every generation. But still, they cannot do much if the instructions used to perform the job are suboptimal or even redundant. Processors cannot magically transform suboptimal code into something that performs better. For example, if we implement a sorting routine using the BubbleSort algorithm, a CPU will not make any attempt to recognize it and use a better alternative, for example, QuickSort. It will blindly execute whatever it was told to do.
2. **Compiler limitations**: "*But isn't it what compilers are supposed to do? Why don't compilers solve all our problems?*" Indeed, compilers are amazingly smart nowadays, but can still generate suboptimal code. Compilers are great at eliminating redundant work, but when it comes to making more complex decisions like vectorization, etc., they may not generate the best possible code. Performance experts often can come up with a clever way to vectorize a loop, which would be extremely hard for a traditional compiler. When compilers have to make a decision whether to perform a code transformation or not, they rely on complex cost models and heuristics, which may not work for every possible scenario. For example, there is no binary "yes" or "no" answer to the question of whether a compiler should always inline a function into the place where it's called. It usually depends on many factors which a compiler should take into account. Additionally, compilers cannot perform optimizations unless they are certain it is safe to do so, and it does not affect the correctness of the resulting machine code. It may be very difficult for compiler developers to ensure that a particular optimization will generate correct code under all possible circumstances, so they often have to be conservative and refrain from doing some optimizations. Finally, compilers generally do not attempt "heroic" optimizations, like transforming data structures used by a program.
2. **Compiler limitations**: "*But isn't that what compilers are supposed to do? Why don't compilers solve all our problems?*" Indeed, compilers are amazingly smart nowadays, but they can still generate suboptimal code. Compilers are great at eliminating redundant work, but when it comes to making more complex decisions like vectorization, they may not generate the best possible code. Performance experts can often come up with a clever way to vectorize a loop, which would be extremely hard for a traditional compiler. When compilers have to decide whether to perform a code transformation or not, they rely on complex cost models and heuristics, which may not work for every possible scenario. For example, there is no binary "yes" or "no" answer to the question of whether a compiler should always inline a function into the place where it's called. It usually depends on many factors which a compiler should take into account. Additionally, compilers cannot perform optimizations unless they are certain it is safe to do so and that it does not affect the correctness of the resulting machine code. It may be very difficult for a compiler to prove that an optimization will generate correct code under all possible circumstances, so compilers often have to be conservative and refrain from some optimizations. Finally, compilers generally do not attempt "heroic" optimizations, like transforming the data structures used by a program.
3. **Algorithmic complexity analysis limitations**: some developers are overly obsessed with algorithmic complexity analysis, which leads them to choose a popular algorithm with the optimal algorithmic complexity, even though it may not be the most efficient for a given problem. Consider two sorting algorithms, InsertionSort and QuickSort. The latter clearly wins in terms of Big O notation for the average case: InsertionSort is O(N^2^) while QuickSort is only O(N log N). Yet for relatively small sizes of `N` (up to 50 elements), InsertionSort outperforms QuickSort, as illustrated by the sketch that follows this list. Complexity analysis cannot account for all the branch prediction and caching effects of various algorithms, so people just encapsulate them in an implicit constant `C`, which can sometimes have a drastic impact on performance. Blindly trusting Big O notation without testing on the target workload could lead developers down an incorrect path. So, the best-known algorithm for a certain problem is not necessarily the most performant in practice for every possible input.
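
As a minimal sketch of the last point, the hybrid sort below (a common technique, not an example taken from this book's code) falls back to InsertionSort for small subranges and uses a QuickSort-style partition otherwise. The function name and the cutoff of 32 elements are assumptions chosen for illustration; the right threshold should be found by measuring on the target workload.

```cpp
#include <algorithm>
#include <vector>

// Sorts the range [lo, hi) of v. For small subranges, InsertionSort tends to win
// despite its O(N^2) worst case: it is cache- and branch-predictor-friendly and
// has almost no constant overhead per element.
void hybrid_sort(std::vector<int>& v, int lo, int hi) {
  const int kCutoff = 32;                  // illustrative threshold; tune empirically
  if (hi - lo <= kCutoff) {
    for (int i = lo + 1; i < hi; ++i) {    // InsertionSort
      int key = v[i];
      int j = i - 1;
      while (j >= lo && v[j] > key) { v[j + 1] = v[j]; --j; }
      v[j + 1] = key;
    }
    return;
  }
  // QuickSort-style partition around the middle element.
  int pivot = v[lo + (hi - lo) / 2];
  int i = lo, j = hi - 1;
  while (i <= j) {
    while (v[i] < pivot) ++i;
    while (v[j] > pivot) --j;
    if (i <= j) std::swap(v[i++], v[j--]);
  }
  hybrid_sort(v, lo, j + 1);
  hybrid_sort(v, i, hi);
}
```

Standard library sort implementations use the same idea: they switch to InsertionSort once a partition becomes small enough, precisely because Big O analysis hides the constant factors that dominate at small `N`.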

In addition to the limitations described above, there are overheads created by programming paradigms. Coding practices that prioritize code clarity, readability, and maintainability often come at a potential performance cost. Highly generalized and reusable code can introduce unnecessary copies, runtime checks, function calls, memory allocations, and so on. For instance, polymorphism in object-oriented programming is implemented using virtual functions, which introduce a performance overhead.[^1]
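
As a small illustration of such paradigm overhead (the class names below are made up for this example), a virtual call is dispatched through the object's vtable with an indirect branch, which the compiler usually cannot inline unless it can prove the dynamic type; a direct call on a concrete type can be inlined and optimized freely.

```cpp
#include <cstdint>

struct Shape {
  virtual ~Shape() = default;
  virtual std::int64_t area() const = 0;   // resolved through the vtable at run time
};

struct Rect : Shape {
  std::int64_t w, h;
  Rect(std::int64_t w, std::int64_t h) : w(w), h(h) {}
  std::int64_t area() const override { return w * h; }
};

// Indirect call: the compiler generally cannot inline shape.area(),
// because the dynamic type of 'shape' is unknown at compile time.
std::int64_t total_area(const Shape& shape, int n) {
  std::int64_t sum = 0;
  for (int i = 0; i < n; ++i) sum += shape.area();
  return sum;
}

// Direct call on a concrete type: easily inlined, so the compiler can
// hoist the multiplication out of the loop or remove the loop entirely.
std::int64_t total_area_direct(const Rect& rect, int n) {
  std::int64_t sum = 0;
  for (int i = 0; i < n; ++i) sum += rect.area();
  return sum;
}
```

The point is not to avoid virtual functions altogether, but to be aware of the cost they add on hot paths.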