JIT: limitations in hoisting (loop invariant code motion) #35735

AndyAyersMS · 2020-05-02T01:06:19Z

Have been looking into #13811 and have found that the current implementation of loop invariant code motion has some awkward limitations.

In particular if the invariant computations are distributed across statements connected by temps, only the first computation in the chain ends up getting hoisted. In the particular example from #13811 the invariant chain was:

         Vector128<byte> result = CreateScalarUnsafe(value);
         return Avx2.BroadcastScalarToVector128(result);

where value was constant. This ended up in a loop after some inlining. Only the CreateScalarUnsafe gets hoisted.

Note the chains can be arbitrary computation and involve more than two statements.

When hoisting we walk statement by statement looking for hoistable subtrees. Local assignments are not considered hoistable -- only their right hand sides. If we hoist a tree we produce an unconsumed copy in the preheader and let CSE come along later and clean things up.

When the analysis gets to the second statement in a dependent chain, it sees the def for the local conveying the value from the first statement as loop varying, and so does not hoist.

We could try fixing this in a variety of ways:

forward substitution might be able to glue together trees connected by single-def single use temps, however it is a big hammer, potentially tricky to get right, and costly to run in full generality
we could try and fuse these trees in the importer, say if we see back to back stloc/ldloc and no other references to the local
we could fix hoisting to handle this case, with a few options:
- we could check if the subtree's VN is already hoisted, and so effectively do forward sub for the temp -- then let CSE clean all this up like we do now; this would potentially end up with quadratic amounts of cloning, though in practice, it might be acceptable;
- we could hoist assignments; this requires some care and rewiring of SSA which might be risky
- we could introduce new temps and/or modify the unconsumed hoisted tree to write to a temp, and use that to propagate the hoisted value from the first clone to second clone.

I am trying to assess how often we see this; it is a bit tricky because while I can spot the second link being blocked I can't easily tell how long the chains are so anything beyond that is harder to spot.

Rough guess based on some crude prototyping is around 2700 hoistable expressions that are second links in the usual FX diff set. There are 152 in the crossgen of SPC, including some sort and span methods.

I'm encouraged enough that I will build a more realistic prototype.

category:cq
theme:loop-opt
skill-level:expert
cost:large

The text was updated successfully, but these errors were encountered:

EgorBo · 2020-05-02T09:31:12Z

Rough guess based on some crude prototyping is around 2700 hoistable expressions that are second links in the usual FX diff set.

I guess BCL developers try to hoist such simd expressions by hands everywhere so the diff could be way higher.

Btw, is there an issue for constant folding for vectors? e,g. Sse.Add(Vector128<float>.One, Vector128<float>.One) (not sure it's profitable though)

AndyAyersMS · 2020-05-02T16:43:41Z

I guess BCL developers try to hoist such simd expressions by hands everywhere

Perhaps? This is the kind of thing that is easy to overlook or take for granted. I haven't looked at the types of computations being hoisted.

I was curious how much hoisting was going on in general (even with current limitations) to try and so did a bit more instrumentation. There are

38378 hoistable trees that look profitable
19431 hoistable trees that are deemed unprofitable

So second level hoisting would increased the amount of hoistable expressions by around 5%.

The unprofitable percentage seems high and is worth deeper exploration. There are hits in some of the key BCL routines, eg (random sample from my instrumentation run):

System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000276] in Dictionary`2:FindValue(__Canon):byref:this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000120] in Dictionary`2:FindValue(__Canon):byref:this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000170] in Collection`1:System.Collections.ICollection.CopyTo(Array,int):this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000170] in ReadOnlyCollection`1:System.Collections.ICollection.CopyTo(Array,int):this
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000088] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,__Canon,IComparer`1):int
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000057] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,ubyte,IComparer`1):int
System.Private.CoreLib.dasm:@@ LICM NOTPROFITABLE at [000064] in ArraySortHelper`1:InternalBinarySearch(ref,int,int,ubyte,IComparer`1):int

Also realized there are a few other issues to consider here:

the optimizer is evaluating the profitability of hoisting expression by expression. This means that if / when we do fix the dependent hoisting limitation, we may block hoisting of a profitable dependent because the hoistable expression it depends on does not look profitable on its own
profitability involves a register pressure computation, and evaluating profitability of expressions in isolation may mis-state their overall impact to pressure. Say we have 3 loop invariant computations A, B, and C, with A and B hoistable currently, and C dependent on both A and B. The pressure cost of hoisting A and B may be greater than the cost of hoisting A, B, and C. To "solve" this we need to know how the results of these computations are used; if the only use of A and B is in C then hoisting all 3 is cheaper; if A and B have other uses in the loop then hoisting all 3 is more costly.
this argues perhaps that hoisting should run as a multi-pass algorithm: for each loop first find the hoistable expressions and their inter-dependencies; then determine profitable sets of expressions to hoist and hoist them. Also track SSA use counts so we can refine cost estimates for dependent sets of expressions.
it looks like we only ever hoist out one loop level, so if an expression in a loop nest is invariant in all the loops, it is still executed more often then necessary (see first example below). Multi-level hosting is more complex so it would be good to assess how often this can happen before attempting any sort of serious implementation effort. You might be able to approximate this by running loop opt repeatedly (rebuilding ssa, etc in between) and watching how far expressions can migrate.
hoisting runs outer->inner. This seems like a bad choice for two reasons. First, it makes multi level hoisting trickier. I think multi level hoists would fall out fairly naturally if we ran hoisting from inner to outer, as invariant expressions could move out loop by loop as we work our way from inner to outer. Second, we should prioritize spending the limited "pressure budget" on improving inner loop codegen, since inner loops are likely to execute more than outer loops; as things are now, outer loop candidates are given preferential treatment.
simply changing hoist order to inner->outer is not sufficient. We also will need to prune away unreachable edges left behind by loop rotation (fgOptWhileLoop). In the double nest below each loop is rotated and so creates the proper conditions for one level of hoisting, but at the time we run hoisting, we still see a path in the outer loop that avoids the inner loop; this path blocks hoisting of fully invariant expressions from inner all the way to top level.

AndyAyersMS · 2020-05-18T20:11:56Z

Some examples.

Computation invariant in both inner and outer loop:

[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInBoth(int y)
{
    int r = 0;
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            r += y * y;
        }
    }
    return r;
}

The multiply is only hoisted up one level, not two:

G_M48869_IG03:
       4533C0               xor      r8d, r8d
       448BC9               mov      r9d, ecx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M48869_IG04:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M48869_IG04
						;; bbWeight=16    PerfScore 28.00
G_M48869_IG05:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CE3                 jl       SHORT G_M48869_IG03

Computations invariant in inner and outer loop, second one dependent on the first.

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int InvariantInBothDependent(int y)
    {
        int r = 0;
        for (int i = 0; i < 10; i++)
        {
            for (int j = 0; j < 10; j++)
            {
                int t = y * y;
                r += y * t + 1;
            }
        }
        return r;
    }

Here the independent statment is hoisted one level, the dependent statement is not hoisted.

G_M60708_IG03:
       4533C0               xor      r8d, r8d
       448BC9               mov      r9d, ecx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M60708_IG04:
       458BD1               mov      r10d, r9d
       440FAFD1             imul     r10d, ecx
       418D440201           lea      eax, [r10+rax+1]
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CEB                 jl       SHORT G_M60708_IG04
						;; bbWeight=16    PerfScore 76.00
G_M60708_IG05:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CDA                 jl       SHORT G_M60708_IG03

Same invariant computation in both nested loops:

[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInnerTwo(int y)
{
    int r = 0;
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            r += i * y;
        }

        for (int j = 0; j < 10; j++)
        {
            r += i * y;
        }
    }
    return r;
}

Computations are hoisted, but not CSE'd:

G_M31873_IG03:
       4533C0               xor      r8d, r8d
       448BCA               mov      r9d, edx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M31873_IG04:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M31873_IG04
						;; bbWeight=16    PerfScore 28.00
G_M31873_IG05:
       4533C0               xor      r8d, r8d
       448BCA               mov      r9d, edx
       440FAFC9             imul     r9d, ecx
						;; bbWeight=4    PerfScore 10.00
G_M31873_IG06:
       4103C1               add      eax, r9d
       41FFC0               inc      r8d
       4183F80A             cmp      r8d, 10
       7CF4                 jl       SHORT G_M31873_IG06
						;; bbWeight=16    PerfScore 28.00
G_M31873_IG07:
       FFC2                 inc      edx
       83FA0A               cmp      edx, 10
       7CCD                 jl       SHORT G_M31873_IG03

AndyAyersMS · 2020-05-19T01:04:25Z

See also related issues:

hoisting may re-order exceptions: JIT: Loop hoisting re-ordering exceptions #6639
hoisting blocked by intra-loop dependence: JIT: Loop hoisting inhibited by phase-ordering issue #6554, RyuJIT: Loop hoist invariant struct field accesses #7265
multi-loop hoisting: RyuJIT: missed opportunity for LICM #6666

AndyAyersMS · 2021-03-30T01:41:10Z

Going to reassign to Bruce. Guessing we won't address this in 6.0.

kunalspathak · 2021-08-16T17:34:20Z

Example where we don't hoist constants out of the loop - example 1 and example 2.

am11 · 2021-08-19T18:20:25Z

LICM / LIH doesn't kick in for pure methods, even when invariant is self-assigned, e.g.

static int GetX()
{
    int x = 42;
    // int y = 0;
    for (int i = 0; i < 100_000; ++i)
      // y = x;
      x = x;
    // return y;
    return x;
}

EgorBo · 2021-08-19T18:36:53Z

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

am11 · 2021-08-19T18:55:03Z

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

Inlining 42 at callsites would be an additional bonus. :)
but I was expecting that in this isolated example, the method body would first reduce to just:

GetX()
        mov     eax, 42
        ret

EgorBo · 2021-08-19T19:15:59Z

@am11 do you mean you expect GetX method to be recognized as pure and all callers should hoist calls to GetX ?

Inlining 42 at callsites would be an additional bonus. :)
but I was expecting that in this isolated example, the method body would first reduce to just:
GetX()
        mov     eax, 42
        ret

There is a separate issue here - we currently never remove empty loops: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHiUGAWQAUASi492PaROncAZtAbCAlgDsMDFQF4ADAG4tAHnK6DWgNQXRHAL6yG96raA==

am11 · 2021-09-04T10:51:44Z

we currently never remove empty loops

btw, we do remove empty loop if we return from it. e.g. with main (2a1eaa2):

    static int GetY(int x)
    {
        int y = 0;
        for (int i = 0; i < 100_000; ++i)
        {
            y = x;
            return y; // line 7
        }
        return y;
    }

comment out line 7, so it goes from:

G_M39264_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M39264_IG02:
       mov      eax, edi
						;; bbWeight=1    PerfScore 0.25
G_M39264_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 3, prolog size 0, PerfScore 1.55, instruction count 2, allocated bytes for code 3 (MethodHash=40e0669f) for method Program:GetY(int):int
; ============================================================

to

; Total bytes of code 21, prolog size 4, PerfScore 12.35, instruction count 11, allocated bytes for code 21 (MethodHash=40e0669f) for method Program:GetY(int):int

kunalspathak · 2022-05-11T22:47:31Z

Computations are hoisted, but not CSE'd:

This is done now.

G_M28431_IG03:
       xor      r8d, r8d
       mov      r9d, edx
       imul     r9d, ecx
       align    [0 bytes for IG04]
						;; size=10 bbWeight=4    PerfScore 10.00
G_M28431_IG04:
       add      eax, r9d
       inc      r8d
       cmp      r8d, 10
       jl       SHORT G_M28431_IG04
						;; size=12 bbWeight=16    PerfScore 28.00
G_M28431_IG05:
       xor      r8d, r8d
       align    [3 bytes for IG06]
						;; size=6 bbWeight=4    PerfScore 2.00
G_M28431_IG06:
       add      eax, r9d
       inc      r8d
       cmp      r8d, 10
       jl       SHORT G_M28431_IG06
						;; size=12 bbWeight=16    PerfScore 28.00
G_M28431_IG07:
       inc      edx
       cmp      edx, 10
       jl       SHORT G_M28431_IG03
						;; size=7 bbWeight=4    PerfScore 6.00

AndyAyersMS added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 2, 2020

AndyAyersMS added this to the 5.0 milestone May 2, 2020

AndyAyersMS self-assigned this May 2, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 2, 2020

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 2, 2020

JulieLeeMSFT modified the milestones: 5.0, Future May 20, 2020

AndyAyersMS mentioned this issue Jun 12, 2020

[Perf -40%] Benchstone.BenchF.Simpsn.Test #37802

Closed

BruceForstall mentioned this issue Oct 17, 2020

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

AndyAyersMS mentioned this issue Oct 24, 2020

Optimize Math.Pow(x, c) where c is 2, 1, -1 or 0 #31978

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall modified the milestones: Future, 6.0.0 Nov 14, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 14, 2020

AndyAyersMS modified the milestones: 6.0.0, Future Mar 30, 2021

AndyAyersMS assigned BruceForstall and unassigned AndyAyersMS Mar 30, 2021

AndyAyersMS mentioned this issue Apr 9, 2021

Performance decrease when using interfaces #7291

Closed

BruceForstall mentioned this issue Jul 6, 2021

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

AndyAyersMS mentioned this issue Sep 17, 2021

JIT: update debug dumping for loop hoisting #59287

Merged

AndyAyersMS mentioned this issue Nov 10, 2021

Hoisting the invariant out of multi-level nested loops #61420

Closed

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations #65342

Open

20 tasks

kunalspathak mentioned this issue Apr 14, 2022

Hoist the invariants out of multi-level nested loops #68061

Merged

4 tasks

AndyAyersMS mentioned this issue Feb 6, 2023

Don't clone a large tree in impRuntimeLookupToTree #81472

Closed

AndyAyersMS mentioned this issue Mar 8, 2023

JIT: Expand runtime lookups in a late phase #81635

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: limitations in hoisting (loop invariant code motion) #35735

JIT: limitations in hoisting (loop invariant code motion) #35735

AndyAyersMS commented May 2, 2020 •

edited by BruceForstall

Loading

EgorBo commented May 2, 2020

AndyAyersMS commented May 2, 2020 •

edited

Loading

AndyAyersMS commented May 18, 2020

AndyAyersMS commented May 19, 2020

AndyAyersMS commented Mar 30, 2021

kunalspathak commented Aug 16, 2021

am11 commented Aug 19, 2021

EgorBo commented Aug 19, 2021

am11 commented Aug 19, 2021

EgorBo commented Aug 19, 2021 •

edited

Loading

am11 commented Sep 4, 2021

kunalspathak commented May 11, 2022

JIT: limitations in hoisting (loop invariant code motion) #35735

JIT: limitations in hoisting (loop invariant code motion) #35735

Comments

AndyAyersMS commented May 2, 2020 • edited by BruceForstall Loading

EgorBo commented May 2, 2020

AndyAyersMS commented May 2, 2020 • edited Loading

AndyAyersMS commented May 18, 2020

AndyAyersMS commented May 19, 2020

AndyAyersMS commented Mar 30, 2021

kunalspathak commented Aug 16, 2021

am11 commented Aug 19, 2021

EgorBo commented Aug 19, 2021

am11 commented Aug 19, 2021

EgorBo commented Aug 19, 2021 • edited Loading

am11 commented Sep 4, 2021

kunalspathak commented May 11, 2022

AndyAyersMS commented May 2, 2020 •

edited by BruceForstall

Loading

AndyAyersMS commented May 2, 2020 •

edited

Loading

EgorBo commented Aug 19, 2021 •

edited

Loading