JIT: limitations in hoisting (loop invariant code motion) #35735
Comments
I guess BCL developers try to hoist such SIMD expressions by hand everywhere, so the diff could be much higher. Btw, is there an issue for constant folding for vectors? e.g. |
Perhaps? This is the kind of thing that is easy to overlook or take for granted. I haven't looked at the types of computations being hoisted. I was curious how much hoisting was going on in general (even with current limitations), so I did a bit more instrumentation. There are
So second-level hoisting would increase the number of hoistable expressions by around 5%. The unprofitable percentage seems high and is worth deeper exploration. There are hits in some of the key BCL routines, e.g. (random sample from my instrumentation run):
Also realized there are a few other issues to consider here:
|
Some examples. Computation invariant in both inner and outer loop:
[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInBoth(int y)
{
int r = 0;
for (int i = 0; i < 10; i++)
{
for (int j = 0; j < 10; j++)
{
r += y * y;
}
}
return r;
}
The multiply is only hoisted up one level, not two:
G_M48869_IG03:
4533C0 xor r8d, r8d
448BC9 mov r9d, ecx
440FAFC9 imul r9d, ecx
;; bbWeight=4 PerfScore 10.00
G_M48869_IG04:
4103C1 add eax, r9d
41FFC0 inc r8d
4183F80A cmp r8d, 10
7CF4 jl SHORT G_M48869_IG04
;; bbWeight=16 PerfScore 28.00
G_M48869_IG05:
FFC2 inc edx
83FA0A cmp edx, 10
7CE3 jl SHORT G_M48869_IG03
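For comparison, here is a rough source-level sketch of what hoisting the multiply up two levels would amount to. The class name `HoistByHand`, the method `InvariantInBothHoisted`, and the local `yy` are hypothetical names introduced only for this illustration; the JIT does its hoisting on its own IR, not by rewriting C#.

```csharp
using System.Runtime.CompilerServices;

static class HoistByHand
{
    // Same shape as the InvariantInBoth example above.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int InvariantInBoth(int y)
    {
        int r = 0;
        for (int i = 0; i < 10; i++)
        {
            for (int j = 0; j < 10; j++)
            {
                r += y * y;
            }
        }
        return r;
    }

    // Hypothetical hand-hoisted variant: y * y is computed once,
    // ahead of the outer loop, instead of once per outer iteration.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int InvariantInBothHoisted(int y)
    {
        int yy = y * y; // invariant in both loops
        int r = 0;
        for (int i = 0; i < 10; i++)
        {
            for (int j = 0; j < 10; j++)
            {
                r += yy;
            }
        }
        return r;
    }
}
```

Both methods return the same value for every `y` (the arithmetic is unchecked in both), so comparing their disassembly is one way to see how far the multiply actually moves.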
Computations invariant in inner and outer loop, second one dependent on the first:
[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInBothDependent(int y)
{
int r = 0;
for (int i = 0; i < 10; i++)
{
for (int j = 0; j < 10; j++)
{
int t = y * y;
r += y * t + 1;
}
}
return r;
}
Here the independent statement is hoisted one level; the dependent statement is not hoisted.
G_M60708_IG03:
4533C0 xor r8d, r8d
448BC9 mov r9d, ecx
440FAFC9 imul r9d, ecx
;; bbWeight=4 PerfScore 10.00
G_M60708_IG04:
458BD1 mov r10d, r9d
440FAFD1 imul r10d, ecx
418D440201 lea eax, [r10+rax+1]
41FFC0 inc r8d
4183F80A cmp r8d, 10
7CEB jl SHORT G_M60708_IG04
;; bbWeight=16 PerfScore 76.00
G_M60708_IG05:
FFC2 inc edx
83FA0A cmp edx, 10
7CDA jl SHORT G_M60708_IG03
Same invariant computation in both nested loops:
[MethodImpl(MethodImplOptions.NoInlining)]
static int InvariantInnerTwo(int y)
{
int r = 0;
for (int i = 0; i < 10; i++)
{
for (int j = 0; j < 10; j++)
{
r += i * y;
}
for (int j = 0; j < 10; j++)
{
r += i * y;
}
}
return r;
}
Computations are hoisted, but not CSE'd:
G_M31873_IG03:
4533C0 xor r8d, r8d
448BCA mov r9d, edx
440FAFC9 imul r9d, ecx
;; bbWeight=4 PerfScore 10.00
G_M31873_IG04:
4103C1 add eax, r9d
41FFC0 inc r8d
4183F80A cmp r8d, 10
7CF4 jl SHORT G_M31873_IG04
;; bbWeight=16 PerfScore 28.00
G_M31873_IG05:
4533C0 xor r8d, r8d
448BCA mov r9d, edx
440FAFC9 imul r9d, ecx
;; bbWeight=4 PerfScore 10.00
G_M31873_IG06:
4103C1 add eax, r9d
41FFC0 inc r8d
4183F80A cmp r8d, 10
7CF4 jl SHORT G_M31873_IG06
;; bbWeight=16 PerfScore 28.00
G_M31873_IG07:
FFC2 inc edx
83FA0A cmp edx, 10
7CCD jl SHORT G_M31873_IG03 |
See also related issues:
|
Going to reassign to Bruce. Guessing we won't address this in 6.0. |
LICM / LIH doesn't kick in for pure methods, even when the invariant is self-assigned, e.g.
static int GetX()
{
int x = 42;
// int y = 0;
for (int i = 0; i < 100_000; ++i)
// y = x;
x = x;
// return y;
return x;
} |
@am11 do you mean you expect |
Inlining 42 at callsites would be an additional bonus. :)
GetX()
mov eax, 42
ret |
There is a separate issue here - we currently never remove empty loops: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHiUGAWQAUASi492PaROncAZtAbCAlgDsMDFQF4ADAG4tAHnK6DWgNQXRHAL6yG96raA== |
btw, we do remove an empty loop if we return from it, e.g. with main (2a1eaa2):
static int GetY(int x)
{
int y = 0;
for (int i = 0; i < 100_000; ++i)
{
y = x;
return y; // line 7
}
return y;
}
Comment out line 7, and it goes from:
G_M39264_IG01:
;; bbWeight=1 PerfScore 0.00
G_M39264_IG02:
mov eax, edi
;; bbWeight=1 PerfScore 0.25
G_M39264_IG03:
ret
;; bbWeight=1 PerfScore 1.00
; Total bytes of code 3, prolog size 0, PerfScore 1.55, instruction count 2, allocated bytes for code 3 (MethodHash=40e0669f) for method Program:GetY(int):int
; ============================================================
to
; Total bytes of code 21, prolog size 4, PerfScore 12.35, instruction count 11, allocated bytes for code 21 (MethodHash=40e0669f) for method Program:GetY(int):int |
This is done now.
G_M28431_IG03:
xor r8d, r8d
mov r9d, edx
imul r9d, ecx
align [0 bytes for IG04]
;; size=10 bbWeight=4 PerfScore 10.00
G_M28431_IG04:
add eax, r9d
inc r8d
cmp r8d, 10
jl SHORT G_M28431_IG04
;; size=12 bbWeight=16 PerfScore 28.00
G_M28431_IG05:
xor r8d, r8d
align [3 bytes for IG06]
;; size=6 bbWeight=4 PerfScore 2.00
G_M28431_IG06:
add eax, r9d
inc r8d
cmp r8d, 10
jl SHORT G_M28431_IG06
;; size=12 bbWeight=16 PerfScore 28.00
G_M28431_IG07:
inc edx
cmp edx, 10
jl SHORT G_M28431_IG03
;; size=7 bbWeight=4 PerfScore 6.00 |
Have been looking into #13811 and have found that the current implementation of loop invariant code motion has some awkward limitations.
In particular if the invariant computations are distributed across statements connected by temps, only the first computation in the chain ends up getting hoisted. In the particular example from #13811 the invariant chain was:
where value was constant. This ended up in a loop after some inlining. Only the CreateScalarUnsafe gets hoisted. Note the chains can be arbitrary computation and involve more than two statements.
When hoisting we walk statement by statement looking for hoistable subtrees. Local assignments are not considered hoistable -- only their right hand sides. If we hoist a tree we produce an unconsumed copy in the preheader and let CSE come along later and clean things up.
When the analysis gets to the second statement in a dependent chain, it sees the def for the local conveying the value from the first statement as loop varying, and so does not hoist.
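To make the blocking behavior concrete, here is a toy model of the analysis described above. It is only a sketch under loud assumptions: `Stmt`, `HoistModel`, `SinglePass`, and `FixedPoint` are hypothetical names, each local is assumed to be assigned at most once in the loop body, and none of this reflects the JIT's actual data structures. `SinglePass` mimics the current behavior (any local defined in the loop counts as loop varying), while `FixedPoint` shows one conceivable alternative in which the def of an already-invariant statement stops counting as varying.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy statement: "Def = f(Uses)". Not a JIT data structure.
record Stmt(string Def, string[] Uses);

static class HoistModel
{
    // Mimics the current limitation: every local assigned in the loop
    // is treated as loop varying, so only first links are hoistable.
    public static List<string> SinglePass(IReadOnlyList<Stmt> body)
    {
        var definedInLoop = new HashSet<string>(body.Select(s => s.Def));
        return body
            .Where(s => !s.Uses.Any(u => definedInLoop.Contains(u)))
            .Select(s => s.Def)
            .ToList();
    }

    // Hypothetical alternative: iterate until nothing changes, treating
    // the def of a hoistable statement as invariant from then on.
    public static List<string> FixedPoint(IReadOnlyList<Stmt> body)
    {
        var varying = new HashSet<string>(body.Select(s => s.Def));
        var hoisted = new List<string>();
        bool changed = true;
        while (changed)
        {
            changed = false;
            foreach (var s in body)
            {
                if (varying.Contains(s.Def) && !s.Uses.Any(u => varying.Contains(u)))
                {
                    varying.Remove(s.Def); // its value no longer varies
                    hoisted.Add(s.Def);
                    changed = true;
                }
            }
        }
        return hoisted;
    }
}
```

For a loop body modeled as `t = y * y; u = y * t; r = r + u`, `SinglePass` reports only `t`, while `FixedPoint` reports `t` and then `u`; `r` is never reported because it depends on its own previous value.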
We could try fixing this in a variety of ways:
I am trying to assess how often we see this; it is a bit tricky because while I can spot the second link being blocked I can't easily tell how long the chains are so anything beyond that is harder to spot.
Rough guess based on some crude prototyping is around 2700 hoistable expressions that are second links in the usual FX diff set. There are 152 in the crossgen of SPC, including some sort and span methods.
I'm encouraged enough that I will build a more realistic prototype.
category:cq
theme:loop-opt
skill-level:expert
cost:large