Improve JIT loop optimizations #65342

BruceForstall · 2022-02-15T00:29:15Z

RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. This meta-issue collects various links to the most important identified issues in one place, so they can be easily seen without searching the entire GitHub issue database. This issue is long-term. Specific issues will be created to identify work that will be included in each release.

Release-specific issues:

.NET 9 loop optimization work: Improve JIT loop optimizations (.NET 9) #93144
.NET 8 loop optimization work: Improve JIT loop optimizations (.NET 8) #77032
.NET 7 loop optimization work: Improve JIT loop optimizations (.NET 7) #55235
.NET 6 loop optimization work: Improve JIT loop optimizations (.NET 6) #43549

If an item is implemented, it will be removed from this list (so this issue should only contain continuing loop optimization improvement opportunities).

Existing Optimizations

Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.

Loop Recognition and Canonicalization

RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.

Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.

Certain loops do not get recorded in optLoopTable #43713: describes two cases where loops are missed due to various issues.
[JIT] Asm difference for F# and C# methods #58941: F# loops are not recognized.
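
As an illustration of the SCC-based approach suggested above, here is a minimal Python sketch of Tarjan's algorithm over a toy control-flow graph. The graph encoding (a dict from block name to successor names) and the helper names `find_sccs` and `loops_from_sccs` are assumptions made for this example; they are not RyuJIT data structures or APIs. Any SCC that contains a cycle is reported as a loop, whether or not it is a natural (single-entry) loop.

```python
def find_sccs(cfg):
    """Return the strongly connected components of `cfg` (Tarjan's algorithm).

    `cfg` maps each block name to a list of successor block names.
    """
    index = {}        # discovery order of each block
    lowlink = {}      # smallest discovery index reachable from the block
    stack = []        # visited blocks whose SCC has not been closed yet
    on_stack = set()
    sccs = []
    counter = 0

    def visit(block):
        nonlocal counter
        index[block] = lowlink[block] = counter
        counter += 1
        stack.append(block)
        on_stack.add(block)
        for succ in cfg.get(block, []):
            if succ not in index:
                visit(succ)
                lowlink[block] = min(lowlink[block], lowlink[succ])
            elif succ in on_stack:
                lowlink[block] = min(lowlink[block], index[succ])
        if lowlink[block] == index[block]:
            # `block` is the root of an SCC: pop its members off the stack.
            scc = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                scc.append(member)
                if member == block:
                    break
            sccs.append(scc)

    for block in cfg:
        if block not in index:
            visit(block)
    return sccs


def loops_from_sccs(cfg):
    """Keep only SCCs that contain a cycle (multi-block, or a self-edge)."""
    return [
        sorted(scc) for scc in find_sccs(cfg)
        if len(scc) > 1 or scc[0] in cfg.get(scc[0], [])
    ]


# Toy CFG (hypothetical): B and C form a two-entry cycle, so neither block
# dominates the other and there is no natural loop; D has a self-loop.
cfg = {
    "entry": ["B", "C"],
    "B": ["C"],
    "C": ["B", "D"],
    "D": ["D", "exit"],
    "exit": [],
}
loops = loops_from_sccs(cfg)
```

In this toy graph the B/C cycle is found even though a recognizer limited to natural loops would not report it, which is the kind of case the SCC approach is meant to cover.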

Multi-dimensional arrays

Multi-dimensional (MD) arrays are listed in this loop optimization issue because optimizing MD access is most valuable in the context of loop optimization. The first steps to improvement were implemented with #70271. Follow-up work:

Loop Cloning

This optimization creates two copies of a loop, one with bounds checks and one without, and executes one of them at runtime based on some condition. Several issues have been identified with this optimization. One recurring theme is unnecessary loop cloning, where we first clone a loop and then eliminate range checks from both copies.

Loop cloning driven by type tests #65206
JIT: examples where loop cloning is not useful #8558
Poor loop optimization in BilinearInterpol benchmark #31831
loop cloning and pgo #48850. Remaining: use PGO data to influence cost/benefit analysis of deciding to clone a loop.
If compReturnBB is unreachable we should remove it #48740 (comment): poor tracking of return blocks impacts loop cloning.
Consider hoisting of class init checks for loop cloning and inversion #49102
Support loop cloning of class member arrays #77071 Clone arrays with class member arrays.
If there are several different kinds of cloning criteria (array bounds and type tests, say) we currently require that we be able to satisfy them all in order to clone. In particular array bounds require increasing loops with suitable exit relops, and so we will fail cloning even if we could have just left the array aspects alone and cloned for type tests. Not sure how often this happens but if we add even more kinds of cloning conditions then we might see this fairly often.
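
The shape of the loop cloning transformation described at the top of this section can be sketched at source level as follows. This is only a Python model under stated assumptions: `checked_load` is a hypothetical stand-in for an array access with an explicit bounds check, and the cloning condition `n <= len(arr)` is a simplified example of the runtime test, not the condition RyuJIT actually emits.

```python
def checked_load(arr, i):
    """Hypothetical stand-in for an array access with a bounds check."""
    if i < 0 or i >= len(arr):
        raise IndexError(i)
    return arr[i]


def sum_prefix_original(arr, n):
    total = 0
    for i in range(n):
        total += checked_load(arr, i)   # bounds check on every iteration
    return total


def sum_prefix_cloned(arr, n):
    if n <= len(arr):
        # Fast clone: the single up-front test proves every index is in range.
        total = 0
        for i in range(n):
            total += arr[i]             # models an access with no bounds check
        return total
    # Slow clone: identical to the original loop, checks retained.
    total = 0
    for i in range(n):
        total += checked_load(arr, i)
    return total
```

If a later phase can prove the checks redundant in the slow clone as well, the two copies become equivalent and the cloning only cost code size, which is the "unnecessary loop cloning" theme mentioned above.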

Loop Unrolling

The existing loop unrolling phase only does full unrolls, and only for SIMD loops: current heuristic is that the loop bounds test must be a SIMD element count. The impact of the optimization is currently very limited but in general it's a high-impact optimization with the right heuristics.

Implement loop peeling #93142
Loop unrolling support in RyuJIT #4248
JIT optimization: loop unrolling #8107
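
For readers unfamiliar with the terminology, the following Python sketch contrasts a full unroll (possible when the trip count is a known constant) with a partial unroll by a factor of four plus a remainder loop. The function names and the unroll factor are assumptions chosen for illustration; the sketch shows the general technique at source level and is not a description of RyuJIT's heuristics.

```python
def dot4_loop(a, b):
    total = 0
    for i in range(4):                  # constant trip count
        total += a[i] * b[i]
    return total


def dot4_fully_unrolled(a, b):
    # Full unroll: the loop and its bounds test disappear entirely.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]


def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total


def sum_partially_unrolled(values):
    total = 0
    n = len(values)
    i = 0
    # Main loop: four elements per iteration, one loop test per four adds.
    while i + 4 <= n:
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    # Remainder loop: handles the last n % 4 elements.
    while i < n:
        total += values[i]
        i += 1
    return total
```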

Loop Invariant Code Hoisting

This phase attempts to hoist code that will produce the same value on each iteration of the loop to the pre-header. There is at least one (and likely more) correctness issue:

JIT: Loop hoisting re-ordering exceptions #6639

And multiple issues about limitations of the algorithm:

JIT: limitations in hoisting (loop invariant code motion) #35735
JIT: Loop hoisting inhibited by phase-ordering issue #6554
RyuJIT: Loop hoist invariant struct field accesses #7265
RyuJIT: missed opportunity for LICM #6666
Indexer.Set of List is much slower than Array #29091 (comment)
In addition, we should strongly consider hoisting conditionally executed trees.
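
A minimal Python sketch of hoisting to a pre-header is shown below, together with one general hazard: hoisting an expression that can throw changes behavior if the loop would have run zero times. This is a generic illustration under assumed names (`scale_original`, `scale_hoisted`, `scale_hoisted_unsafe`); it is not necessarily the specific problem tracked in the correctness issue listed above.

```python
def scale_original(values, num, den):
    out = []
    for v in values:
        out.append(v * (num // den))    # num // den is recomputed every iteration
    return out


def scale_hoisted(values, num, den):
    if not values:
        return []                       # zero-trip guard: num // den is never evaluated
    factor = num // den                 # "pre-header": computed once before the loop
    out = []
    for v in values:
        out.append(v * factor)
    return out


def scale_hoisted_unsafe(values, num, den):
    # Hoisted without a guard: raises ZeroDivisionError for den == 0 even
    # when `values` is empty and the original loop would not have thrown.
    factor = num // den
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

The guarded version preserves the original behavior for empty inputs; the unguarded version introduces an exception the original program never raised.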

Missing Optimizations

Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).

Induction Variable Widening

Induction variable widening eliminates unnecessary widening conversions of int32-sized induction variables to the int64-sized registers used in address modes. On AMD64, this eliminates unnecessary movsxd instructions prior to array dereferencing. It is expected that better induction variable analysis would also allow for better use of arm64 post-increment addressing modes.

RyuJIT: Index Variable Widening optimization for array accesses #7312
RyuJIT: Implement induction variable analysis #93143

Loop Unswitching

Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body, and placing a version of the loop inside each of the if and else clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.
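
A source-level Python sketch of the transformation just described follows. The function names and the `negate` flag are hypothetical; the point is only that a loop-invariant condition tested on every iteration becomes a single test outside two specialized copies of the loop.

```python
def apply_original(values, negate):
    out = []
    for v in values:
        if negate:                      # loop-invariant condition, tested every iteration
            out.append(-v)
        else:
            out.append(v)
    return out


def apply_unswitched(values, negate):
    out = []
    if negate:                          # condition tested once, outside the loop
        for v in values:
            out.append(-v)
    else:
        for v in values:
            out.append(v)
    return out
```

As with loop cloning, the cost is a duplicated loop body, so a real implementation would need a cost/benefit heuristic.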

Benefits

It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team analyzed JIT microbenchmarks (benchstones, SciMark, etc.) several years ago. The analysis contains estimates of the perf improvement from several of these optimizations (each is a low single-digit percentage). Real code is also likely to have hot loops that will benefit from improved loop optimizations.

The benchmarks and other metrics we will measure to show the benefits are TBD.

category:planning
theme:loop-opt
skill-level:expert
cost:large
impact:medium


ghost · 2022-02-15T00:29:21Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. This meta-issue collects various links to the most important identified issues in one place, so they can be easily seen without searching the entire GitHub issue database. This issue is long-term. Specific issues will be created to identify work that will be included in each release.

Specific issues so far:

.NET 6 loop optimization work: Improve JIT loop optimizations (.NET 6) #43549
.NET 7 loop optimization work: Improve JIT loop optimizations (.NET 7) #55235

If an item is implemented, it will be removed from this list (so this issue should only contain continuing loop optimization improvement opportunities).

Existing Optimizations

Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.

Loop Recognition

RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.

Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.

Certain loops do not get recorded in optLoopTable #43713: describes two cases where loops are missed due to various issues.
[JIT] Asm difference for F# and C# methods #58941: F# loops are not recognized.

Multi-dimensional arrays

Code generated for multi-dimensional array expressions is sub-optimal, and much worse than for single-dimensional arrays. #60785 describes a set of problems and details improvements that should be made.

Loop Cloning

This optimization creates two copies of a loop, one with bounds checks and one without, and executes one of them at runtime based on some condition. Several issues have been identified with this optimization. One recurring theme is unnecessary loop cloning, where we first clone a loop and then eliminate range checks from both copies.

Loop cloning driven by type tests #65206
RyuJIT's loop cloning optimization has questionable CQ #4929
JIT: examples where loop cloning is not useful #8558
Poor loop optimization in BilinearInterpol benchmark #31831
loop cloning and pgo #48850. Remaining: use PGO data to influence cost/benefit analysis of deciding to clone a loop.
If compReturnBB is unreachable we should remove it #48740 (comment): poor tracking of return blocks impacts loop cloning.
Support loop cloning with struct arrays #48897
Consider hoisting of class init checks for loop cloning and inversion #49102

Loop Unrolling

The existing phase only does full unrolls, and only for SIMD loops: current heuristic is that the loop bounds test must be a SIMD element count. The impact of the optimization is currently very limited but in general it's a high-impact optimization with the right heuristics.

Loop unrolling support in RyuJIT #4248
JIT optimization: loop unrolling #8107

Loop Invariant Code Hoisting

This phase attempts to hoist code that will produce the same value on each iteration of the loop to the pre-header. There is at least one (and likely more) correctness issue:

JIT: Loop hoisting re-ordering exceptions #6639

And multiple issues about limitations of the algorithm:

JIT: limitations in hoisting (loop invariant code motion) #35735
JIT: Loop hoisting inhibited by phase-ordering issue #6554
RyuJIT: Loop hoist invariant struct field accesses #7265
RyuJIT: missed opportunity for LICM #6666
Indexer.Set of List is much slower than Array #29091 (comment)

Loop optimization hygiene

Loop optimizations need to work well with the rest of the compiler phases and IR invariants, such as with PGO.

Loop opts should not be recomputing pred lists from scratch #49030. Remaining phases to fix: optFindNaturalLoops, optUnrollLoops, fgInsertGCPolls.

Missing Optimizations

Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).

Induction Variable Widening

Induction variable widening eliminates unnecessary widening conversions of int32-sized induction variables to the int64-sized registers used in address modes. On AMD64, this eliminates unnecessary movsxd instructions prior to array dereferencing.

RyuJIT: Index Variable Widening optimization for array accesses #7312

Strength Reduction

Strength reduction replaces expensive operations with equivalent but less expensive operations.

Strength reduction for add operations performed power of 2 times #34938
ARM64: loop array indexing inefficiencies #34810
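
A classic instance of strength reduction in loops, sketched here in Python under assumed names, replaces a per-iteration multiplication of the induction variable by a running addition. This illustrates the general technique only and is not taken from the issues listed above.

```python
def offsets_multiply(count, stride):
    out = []
    for i in range(count):
        out.append(i * stride)          # one multiplication per iteration
    return out


def offsets_strength_reduced(count, stride):
    out = []
    offset = 0                          # invariant: offset == i * stride
    for _ in range(count):
        out.append(offset)
        offset += stride                # multiplication replaced by an addition
    return out
```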

Loop Unswitching

Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body, and placing a version of the loop inside each of the if and else clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.

Benefits

It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team analyzed JIT microbenchmarks (benchstones, SciMark, etc.) several years ago. The analysis contains estimates of the perf improvement from several of these optimizations (each is a low single-digit percentage). Real code is also likely to have hot loops that will benefit from improved loop optimizations.

The benchmarks and other metrics we will measure to show the benefits are TBD.

category:planning
theme:loop-opt
skill-level:expert
cost:large

Author:	BruceForstall
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	Future

BruceForstall added the area-CodeGen-coreclr label Feb 15, 2022

BruceForstall added this to the Future milestone Feb 15, 2022

dotnet-issue-labeler bot added the untriaged label Feb 15, 2022

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

JulieLeeMSFT removed the untriaged label Feb 16, 2022

This was referenced Apr 20, 2022

Hoist the invariants out of multi-level nested loops #68061

Merged

Handle more scenarios for loop cloning #67930

Merged

ddrinka mentioned this issue Jul 18, 2022

Optimized reader reviewed ddrinka/ApacheOrcDotNet#9

Merged

BruceForstall mentioned this issue Oct 13, 2022

Improve JIT loop optimizations (.NET 8) #77032

Closed

4 tasks

dubiousconst282 mentioned this issue Jan 2, 2023

Loop Strength Reduction dubiousconst282/DistIL#12

Closed

5 tasks

AndyAyersMS mentioned this issue Mar 3, 2023

Profile Synthesis Work Items #82964

Closed

21 tasks

BruceForstall mentioned this issue Oct 6, 2023

Improve JIT loop optimizations (.NET 9) #93144

Closed

21 tasks
