Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve JIT loop optimizations #65342

Open
24 tasks
BruceForstall opened this issue Feb 15, 2022 · 1 comment
Open
24 tasks

Improve JIT loop optimizations #65342

BruceForstall opened this issue Feb 15, 2022 · 1 comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@BruceForstall
Copy link
Member

BruceForstall commented Feb 15, 2022

RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. This meta-issue collects various links to the most important identified issues in one place, so they can be easily seen without searching the entire GitHub issue database. This issue is long-term. Specific issues will be created to identify work that will be included in each release.

Release-specific issues:

If an item is implemented, it will be removed from this list (so this issue should only contain continuing loop optimization improvement opportunities).

Existing Optimizations

Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.

Loop Recognition and Canonicalization

RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.

Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.

Multi-dimensional arrays

Multi-dimensional (MD) arrays are listed in this loop optimization issue because optimizing MD access is most valuable in the context of loop optimization. The first steps to improvement were implemented with #70271. Follow-up work:

Loop Cloning

This optimization creates two copies of a loop: one with bounds checks and one without bounds checks and executes one of them at runtime based on some condition. Several issues have been identified with this optimizations. One recurring theme is unnecessary loop cloning where we first clone a loop and then eliminate range checks from both copies.

Loop Unrolling

The existing loop unrolling phase only does full unrolls, and only for SIMD loops: current heuristic is that the loop bounds test must be a SIMD element count. The impact of the optimization is currently very limited but in general it's a high-impact optimization with the right heuristics.

Loop Invariant Code Hoisting

This phase attempts to hoist code that will produce the same value on each iteration of the loop to the pre-header. There is
at least one (and likely more) correctness issue:

And multiple issues about limitations of the algorithm:

Missing Optimizations

Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).

Induction Variable Widening

Induction variable widening eliminates unnecessary widening converts from int32 sized induction variables to int64 size address mode register uses. On AMD64, this eliminates unnecessary movsxd instructions prior to array dereferencing. It is expected that better induction variable analysis would also allow for better arm64 post-increment addressing mode usage.

Loop Unswitching

Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body, and placing a version of the loop inside each of the if and else clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.

Benefits

It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team has done analysis of JIT microbenchmarks (benchstones, SciMark, etc.) several years ago. The analysis contains estimates of perf improvement from several of these optimizations (each is low single digit %). Real code is also likely to have hot loops that will benefit from improved loop optimizations.

The benchmarks and other metrics we will measure to show the benefits is TBD.

category:planning
theme:loop-opt
skill-level:expert
cost:large
impact:medium

@BruceForstall BruceForstall added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 15, 2022
@BruceForstall BruceForstall added this to the Future milestone Feb 15, 2022
@ghost
Copy link

ghost commented Feb 15, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

RyuJIT has several loop optimization phases that have various issues (both correctness and performance) and can be significantly improved. RyuJIT also lacks some loop optimizations that have been shown to benefit various use cases. This meta-issue collects various links to the most important identified issues in one place, so they can be easily seen without searching the entire GitHub issue database. This issue is long-term. Specific issues will be created to identify work that will be included in each release.

Specific issues so far:

If an item is implemented, it will be removed from this list (so this issue should only contain continuing loop optimization improvement opportunities).

Existing Optimizations

Below is a list of the existing loop-related RyuJIT phases and a short description of the improvement opportunities.

Loop Recognition

RyuJIT currently has lexical-based loop recognition and only recognizes natural loops. We should consider replacing it with a standard Tarjan SCC algorithm that classifies all loops. Then we can extend some loop optimizations to also work on non-natural loops.

Even if we continue to use the current algorithm, we should verify that it catches the maximal set of natural loops; it is believed that it misses some natural loops.

Multi-dimensional arrays

Code generated for multi-dimensional array expressions is sub-optimal, and much worse than for single-dimensional arrays. #60785 describes a set of problems and details improvements that should be made.

Loop Cloning

This optimization creates two copies of a loop: one with bounds checks and one without bounds checks and executes one of them at runtime based on some condition. Several issues have been identified with this optimizations. One recurring theme is unnecessary loop cloning where we first clone a loop and then eliminate range checks from both copies.

Loop Unrolling

The existing phase only does full unrolls, and only for SIMD loops: current heuristic is that the loop bounds test must be a SIMD element count. The impact of the optimization is currently very limited but in general it's a high-impact optimization with the right heuristics.

Loop Invariant Code Hoisting

This phase attempts to hoist code that will produce the same value on each iteration of the loop to the pre-header. There is
at least one (and likely more) correctness issue:

And multiple issues about limitations of the algorithm:

Loop optimization hygiene

Loop optimizations need to work well with the rest of the compiler phases and IR invariants, such as with PGO.

Missing Optimizations

Several major optimizations are missing even though we have evidence of their effectiveness (at least on microbenchmarks).

Induction Variable Widening

Induction variable widening eliminates unnecessary widening converts from int32 sized induction variables to int64 size address mode register uses. On AMD64, this eliminates unnecessary movsxd instructions prior to array dereferencing.

Strength Reduction

Strength reduction replaces expensive operations with equivalent but less expensive operations.

Loop Unswitching

Loop unswitching moves a conditional from inside a loop to outside of it by duplicating the loop's body, and placing a version of the loop inside each of the if and else clauses of the conditional. It has elements of both Loop Cloning and Loop Invariant Code Motion.

Benefits

It's easy to show the benefit of improved loop optimizations on microbenchmarks. For example, the team has done analysis of JIT microbenchmarks (benchstones, SciMark, etc.) several years ago. The analysis contains estimates of perf improvement from several of these optimizations (each is low single digit %). Real code is also likely to have hot loops that will benefit from improved loop optimizations.

The benchmarks and other metrics we will measure to show the benefits is TBD.

category:planning
theme:loop-opt
skill-level:expert
cost:large

Author: BruceForstall
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: Future

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

2 participants