[Fixing TODOs] part22
dendibakh committed Aug 11, 2024
1 parent 09378fa commit ed3c618
Showing 5 changed files with 35 additions and 25 deletions.
@@ -73,6 +73,8 @@ In contrast, if you have a loop that performs a lot of _independent_ operations,

When you analyze the machine code of one of your hot loops, you may find that multiple instructions are assigned to the same execution port. This situation is known as _execution port contention_. The challenge is to find ways of substituting some of these instructions with ones that are not assigned to the same port. For example, on Intel processors, if you're heavily bottlenecked on `port5`, you may find that two instructions on `port0` are better than one instruction on `port5`. This is often not an easy task, and it requires deep ISA and microarchitecture knowledge. When you are struggling, seek help on specialized forums. Also, keep in mind that port assignments may change in future CPU generations, so consider using CPU dispatch to isolate the effect of your code changes.
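
As a rough illustration of the dispatch mechanism (not code from the book), here is a minimal C++ sketch that selects between two hypothetical versions of a hot loop at runtime using the GCC/Clang `__builtin_cpu_supports` built-in; the function names and loop bodies are placeholders:

```cpp
#include <cstddef>

// Two hypothetical versions of the same reduction. In a real scenario they
// would differ in which execution ports their instruction mixes occupy.
static float sumV1(const float* a, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += a[i];
  return s;
}
static float sumV2(const float* a, std::size_t n) {
  float s0 = 0.0f, s1 = 0.0f;                       // two accumulators
  for (std::size_t i = 0; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }
  if (n & 1) s0 += a[n - 1];
  return s0 + s1;
}

// Resolve the implementation once, based on the host CPU. Finer-grained
// microarchitecture checks (e.g., __builtin_cpu_is) are also possible.
float sum(const float* a, std::size_t n) {
  static const auto impl = __builtin_cpu_supports("avx2") ? sumV2 : sumV1;
  return impl(a, n);
}
```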

### Case Study: When FMA Instructions Hurt Performance {.unlisted .unnumbered}

In [@sec:FMAThroughput], we looked at an example where the throughput of FMA instructions becomes critical. Now let's take a look at another example, this time involving FMA latency. In [@lst:FMAlatency] on the left, we have the `sqSum` function, which computes the sum of every element squared. On the right, we present the corresponding machine code generated by Clang-18 when compiled with `-O3 -march=core-avx2`. Notice that we didn't use `-ffast-math`, perhaps because we want to maintain bit-exact results across multiple platforms. That's why the code was not autovectorized by the compiler.
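
The listing body is collapsed in this diff view. As a sketch of the kind of code being discussed (an assumption, not the book's exact listing), a scalar `sqSum` could look as follows; because each iteration adds into the same `sum` variable, the fused multiply-adds form a dependency chain whose speed is bounded by FMA latency:

```cpp
#include <cstddef>

// Sum of squares. Every fused multiply-add depends on the previous value
// of `sum`, so the loop is limited by FMA latency rather than throughput.
float sqSum(const float* a, std::size_t n) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i)
    sum += a[i] * a[i];
  return sum;
}
```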

Listing: FMA latency
@@ -139,7 +139,7 @@ Once you confirm subnormal values are there, you can enable the FTZ and DAZ mode
* __DAZ__ (Denormals Are Zero). Any denormal inputs are replaced by zero before use.
* __FTZ__ (Flush To Zero). Any outputs that would be denormal are replaced by zero.

When they are enabled, there is no need for costly handling of subnormal values in a CPU floating-point arithmetic. In x86-based platforms, there are two separate bit fields in the `MXCSR`, global control and status register. In ARM Aarch64, two modes are controlled with `FZ`` and `AH` bits of the `FPCR` control register. If you compile your application with `-ffast-math`, you have nothing to worry about, the compiler will automatically insert the required code to enable both flags at the start of the program. The `-ffast-math` compiler option is a little overloaded, so GCC developers created a separate `-mdaz-ftz` option that only controls the behavior of subnormal values. If you'd rather control it from the source code, [@lst:EnableFTZDAZ] shows an example that you can use. If you choose this option, avoid frequent changes to the `MXCSR` register because the operation is relatively expensive. A read of the MXCSR register has a fairly long latency, and a write to the register is a serializing instruction.
When they are enabled, there is no need for costly handling of subnormal values in CPU floating-point arithmetic. On x86-based platforms, these are two separate bit fields in `MXCSR`, the global control and status register. On ARM AArch64, the two modes are controlled by the `FZ` and `AH` bits of the `FPCR` control register. If you compile your application with `-ffast-math`, you have nothing to worry about: the compiler will automatically insert the required code to enable both flags at the start of the program. The `-ffast-math` compiler option is a little overloaded, so GCC developers created a separate `-mdaz-ftz` option that only controls the behavior of subnormal values. If you'd rather control it from the source code, [@lst:EnableFTZDAZ] shows an example that you can use. If you choose this option, avoid frequent changes to the `MXCSR` register because the operation is relatively expensive: a read of the `MXCSR` register has a fairly long latency, and a write to it is a serializing instruction.

Listing: Enabling FTZ and DAZ modes manually
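
The listing itself is collapsed in this view. A minimal sketch of enabling both modes from source code on x86, using the SSE control-register intrinsics (an assumed reconstruction, not necessarily the book's exact listing), might look like this:

```cpp
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

void enableFtzDaz() {
  // Both helpers read-modify-write MXCSR, which is per-thread state.
  // Call this once at startup (and in each worker thread), not in hot loops.
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // FTZ
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // DAZ
}
```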

@@ -51,7 +51,7 @@ To confirm that frequency throttling is one of the main reasons for performance

![Thread Count Scalability chart for Blender and Clang with Turbo Boost disabled. Frequency throttling is a major roadblock to achieving good thread count scaling.](../../img/mt-perf/ScalabilityNoTurboChart.png){#fig:ScalabilityNoTurboChart width=100%}

Going back to the main chart shown in Figure fig:ScalabilityMainChart, for the Clang workload, the tipping point of performance scaling is around 10 threads. This is the point where the frequency throttling starts to have a significant impact on performance, and the benefit of adding additional threads is smaller than the penalty of running at a lower frequency.
Going back to the main chart shown in Figure @fig:ScalabilityMainChart, for the Clang workload, the tipping point of performance scaling is around 10 threads. This is the point where frequency throttling starts to have a significant impact on performance, and the benefit of adding more threads is smaller than the penalty of running at a lower frequency.

### Zstandard {.unlisted .unnumbered}

@@ -108,20 +108,23 @@ To determine the root cause of poor scaling, we collected TMA metrics (see [@sec
-----------------------------------------------------------------------------
Metric                               1 thread  2 threads 3 threads 4 threads
------------------------------------ --------- --------- --------- ---------
TMA::Memory Bound                       34.6      53.7      59.0      65.4
(% of pipeline slots)

TMA::DRAM Memory Bandwidth              71.7      83.9      87.0      91.3
(% of cycles)

Memory Bandwidth Utilization            20-22     25-28     27-30     27-30
(range, GB/s)
-----------------------------------------------------------------------------

Table: Performance metrics for CloverLeaf workload. {#tbl:CloverLeaf_metrics}

As you can see from those numbers, the pressure on the memory subsystem kept increasing as we added more threads. An increase in the *Memory Bound* metric indicates that threads increasingly spend more time waiting for data and do less useful work. An increase in the *DRAM Memory Bandwidth* metric further highlights that performance is hurt due to approaching bandwidth limits. The *DRAM Mem BW Use* metric indicates the range total of total memory bandwidth utilization while CloverLeaf was running. We captured these numbers by looking at the memory bandwidth utilization chart in VTune's platform view as shown in Figure @fig:CloverLeafMemBandwidth.
As you can see from those numbers, the pressure on the memory subsystem kept increasing as we added more threads. An increase in the *TMA::Memory Bound* metric indicates that threads spend an increasing amount of time waiting for data and do less useful work. An increase in the *TMA::DRAM Memory Bandwidth* metric further highlights that performance is hurt by approaching the bandwidth limits. The *Memory Bandwidth Utilization* metric indicates the range of total memory bandwidth utilization while CloverLeaf was running. We captured these numbers by looking at the memory bandwidth utilization chart in VTune's platform view, as shown in Figure @fig:CloverLeafMemBandwidth.

![VTune's platform view of running CloverLeaf with 3 threads.](../../img/mt-perf/CloverLeafMemBandwidth.png){#fig:CloverLeafMemBandwidth width=100%}

Let's put those numbers into perspective, the maximum theoretical memory bandwidth of our platform is `38.4 GB/s`. However, as we measured in [@sec:MemLatBw], the maximum memory bandwidth that can be achieved in practice is `35 GB/s`. With just a single thread, the memory bandwidth utilization reaches `2/3` of the practical limit. CloverLeaf fully saturates the memory bandwidth with three threads. Even when all 16 threads are active, *DRAM Mem BW Use* doesn't go above `30 GB/s`, which is `86%` of the practical limit.
Let's put those numbers into perspective: the maximum theoretical memory bandwidth of our platform is `38.4 GB/s`. However, as we measured in [@sec:MemLatBw], the maximum memory bandwidth that can be achieved in practice is `35 GB/s`. With just a single thread, the memory bandwidth utilization reaches `2/3` of the practical limit. CloverLeaf fully saturates the memory bandwidth with three threads. Even when all 16 threads are active, *Memory Bandwidth Utilization* doesn't go above `30 GB/s`, which is `86%` of the practical limit.
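
Spelling out the arithmetic behind these percentages (using the measured `22 GB/s` and `30 GB/s` figures against the `35 GB/s` practical limit):

$$\frac{22~\text{GB/s}}{35~\text{GB/s}} \approx 0.63 \approx \frac{2}{3}, \qquad \frac{30~\text{GB/s}}{35~\text{GB/s}} \approx 0.86.$$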

To confirm our hypothesis, we swapped the two `8 GB DDR4 2400 MT/s` memory modules for two DDR4 modules of the same capacity but faster speed: `3200 MT/s`. This brings the theoretical memory bandwidth of the system to `51.2 GB/s` and the practical maximum to `45 GB/s`. The resulting performance boost grows with the number of threads used and is in the range of 10% to 33%. When running CloverLeaf with 16 threads, faster memory modules provide the expected 33% performance improvement, which matches the ratio of the memory bandwidth increase (`3200 / 2400 = 1.33`). But even with a single thread, there is a 10% performance improvement. This means that there are moments when CloverLeaf fully saturates the memory bandwidth even with a single thread.
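
The theoretical figures quoted above are consistent with a dual-channel configuration and an 8-byte bus per channel (an assumption about the platform inferred from the numbers, not stated in the text):

$$2 \times 2400~\text{MT/s} \times 8~\text{B} = 38.4~\text{GB/s}, \qquad 2 \times 3200~\text{MT/s} \times 8~\text{B} = 51.2~\text{GB/s}.$$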

@@ -44,7 +44,7 @@ Our second piece of advice is to avoid static partitioning on systems with asymm

In the final example, we switch to dynamic partitioning. With dynamic partitioning, chunks are distributed to threads dynamically: each thread processes a chunk of elements, then requests another chunk, until no chunks remain to be distributed. Figure @fig:OmpDynamic shows the result of using dynamic partitioning by dividing the array into 16 chunks. With this scheme, each task becomes more granular, which enables the OpenMP runtime to balance the work even when P-cores run two times faster than E-cores. However, notice that there is still some idle time on E-cores.

Performance can be slightly improved if we partition the work into 128 chunks instead of 16. But don't make the jobs too small, otherwise it will result in increased management overhead. The result summary of our experiments is shown in Table [@tbl:TaskSchedulingResults]. Partitioning the work into 128 chunks turns out to be the sweet spot for our example. Even though our example is very simple, learning from it can be applied to production-grade multithreaded software.
Performance can be slightly improved if we partition the work into 128 chunks instead of 16. But don't make the chunks too small; otherwise, this will result in increased management overhead. A summary of our experiments is shown in Table [@tbl:TaskSchedulingResults]. Partitioning the work into 128 chunks turns out to be the sweet spot for our example. Even though our example is very simple, the learnings from it can be applied to production-grade multithreaded software.
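
For reference, a minimal OpenMP sketch of this kind of dynamic partitioning (the array size, chunk count, and loop body are illustrative, not the book's exact benchmark):

```cpp
#include <cstddef>

// Compile with -fopenmp. With schedule(dynamic, chunk), each thread grabs
// the next chunk when it finishes its current one, so faster P-cores
// naturally end up processing more chunks than E-cores.
void processArray(float* data, std::size_t n) {
  const std::size_t chunk = (n >= 128) ? n / 128 : 1;  // ~128 chunks
  #pragma omp parallel for schedule(dynamic, chunk)
  for (std::size_t i = 0; i < n; ++i) {
    data[i] = data[i] * data[i];  // placeholder for the real work
  }
}
```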

------------------------------------------------------------------------------------------------
Affinity Static Dynamic, Dynamic, Dynamic, Dynamic,
39 changes: 22 additions & 17 deletions chapters/9-Optimizing-Computations/9-5 Compiler Intrinsics.md
@@ -65,23 +65,28 @@ Notice the explicit handling of remainders after the loop processes multiples of

Highway supports over 200 operations, which can be grouped into the following categories:

* Initialization
* Getting/setting lanes
* Getting/setting blocks
* Printing
* Tuples
* Arithmetic
* Logical
* Masks
* Comparisons
* Memory
* Cache control
* Type conversion
* Combine
* Swizzle/permute
* Swizzling within 128-bit blocks
* Reductions
* Crypto
\begin{multicols}{2}
\begin{itemize}
\tightlist
\item Initialization
\item Getting/setting lanes
\item Getting/setting blocks
\item Printing
\item Tuples
\item Arithmetic
\item Logical
\item Masks
\item Comparisons
\item Memory
\item Cache control
\item Type conversion
\item Combine
\item Swizzle/permute
\item Swizzling within 128-bit blocks
\item Reductions
\item Crypto
\end{itemize}
\end{multicols}

For the full list of operations, see its documentation [^13] and [FAQ](https://github.com/google/highway/blob/master/g3doc/faq.md). You can also experiment with it in the online [Compiler Explorer](https://gcc.godbolt.org/z/zP7MYe9Yf).
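
To give a flavor of the API, here is a small kernel written against Highway's documented operations (`ScalableTag`, `Lanes`, `LoadU`, `MulAdd`, `StoreU`); it is a sketch in the spirit of the library's own examples, not a listing from the book:

```cpp
#include <cstddef>
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

// x[i] = mul[i] * x[i] + add[i], using the widest vectors available on the
// target; the tail is handled with an explicit scalar remainder loop.
void MulAddLoop(const float* mul, const float* add, float* x, std::size_t size) {
  const hn::ScalableTag<float> d;
  const std::size_t N = hn::Lanes(d);
  std::size_t i = 0;
  for (; i + N <= size; i += N) {
    const auto m = hn::LoadU(d, mul + i);
    const auto a = hn::LoadU(d, add + i);
    hn::StoreU(hn::MulAdd(m, hn::LoadU(d, x + i), a), d, x + i);
  }
  for (; i < size; ++i) x[i] = mul[i] * x[i] + add[i];  // remainder
}
```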
Other libraries include Eigen, nsimd, SIMDe, VCL, and xsimd. Note that a C++ standardization effort that started with the Vc library resulted in `std::experimental::simd`, but it provides a very limited set of operations and, as of this writing, is supported only by GCC 11.