
JIT: Reorder indirs on arm64 to make them amenable to ldp optimization #92768

Merged (21 commits) on Oct 6, 2023

Conversation

@jakobbotsch (Member) commented Sep 28, 2023

This PR adds an ARM64 optimization in lowering that tries to reorder indirections to be near each other, making them amenable to the ldp optimization done in the backend.

Example:

public struct Point
{
    public long X, Y;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool Equals(in Point left, in Point right)
    {
        return (left.X == right.X) & (left.Y == right.Y);
    }
}

Before:

G_M28859_IG02:  ;; offset=0x0008
            ldr     x2, [x0]
            ldr     x3, [x1]
            ldr     x0, [x0, #0x08]
            ldr     x1, [x1, #0x08]
            cmp     x2, x3
            ccmp    x0, x1, 0, eq
            cset    x0, eq

After:

G_M28859_IG02:  ;; offset=0x0008
            ldp     x2, x0, [x0]
            ldp     x3, x1, [x1]
            cmp     x2, x3
            ccmp    x0, x1, 0, eq
            cset    x0, eq
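The backend's pairing condition is worth spelling out: two loads can fuse into a single ldp only when they go through the same base register at offsets exactly one access size apart. A minimal sketch of that predicate (illustrative Python, not the JIT's actual check, which also constrains types, volatility, and immediate-offset ranges):

```python
def amenable_to_ldp(base1, off1, base2, off2, access_size):
    """Two loads can fuse into one ldp when they use the same base
    register and sit at consecutive, access-size-apart offsets."""
    if base1 != base2:
        return False
    lo, hi = sorted((off1, off2))
    # The second load must start exactly where the first one ends.
    return hi - lo == access_size

# Point.Equals: ldr [x0] + ldr [x0, #0x08] with 8-byte fields -> ldp
print(amenable_to_ldp("x0", 0x00, "x0", 0x08, 8))  # True
print(amenable_to_ldp("x0", 0x00, "x1", 0x08, 8))  # False: different bases
```

In the Point.Equals example above, the pairs ([x0], [x0, #0x08]) and ([x1], [x1, #0x08]) both satisfy this, which is why both pairs collapse into ldp instructions.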

The transformation works by moving the trees that are not part of the data flow of the second indirection past it (after doing the necessary interference checks to prove that the reordering is legal). For example, for the case above, the JIT dump illustrates what happens:

[000009] and [000002] are indirs off the same base with offsets +004 and +000
  ..and they are amenable to ldp optimization
  ..and it is close by. Trying to move the following range (where * are nodes part of the data flow):

   N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   byref  V00 arg0         u:1 $80
                                                               ┌──▌  t0     byref  
1. N002 (  3,  2) [000002] ---XG------                    t2 =IND       int    <l:$200, c:$201>
   N003 (  1,  1) [000003] -----------                    t3 =    LCL_VAR   byref  V01 arg1         u:1 $81
                                                               ┌──▌  t3     byref  
   N004 (  3,  2) [000005] ---XG------                    t5 =IND       int    <l:$202, c:$203>
                                                               ┌──▌  t2     int    
                                                               ├──▌  t5     int    
   N005 ( 10,  5) [000006] ---XG------                    t6 =EQ        int    <l:$207, c:$206>
*  N006 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   byref  V00 arg0         u:1 (last use) $80
                                                               ┌──▌  t7     byref  
*  N008 (  3,  4) [000018] -c---------                   t18 =LEA(b+4)  byref 
                                                               ┌──▌  t18    byref  
2. N009 (  4,  3) [000009] n---GO-----                    t9 =IND       int    <l:$208, c:$209>

Have conservative interference with last indir. Trying a smarter interference check...
Interference checks passed. Moving nodes that are not part of data flow tree

Result:

   N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   byref  V00 arg0         u:1 $80
                                                               ┌──▌  t0     byref  
1. N002 (  3,  2) [000002] ---XG------                    t2 =IND       int    <l:$200, c:$201>
*  N006 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   byref  V00 arg0         u:1 (last use) $80
                                                               ┌──▌  t7     byref  
*  N008 (  3,  4) [000018] -c---------                   t18 =LEA(b+4)  byref 
                                                               ┌──▌  t18    byref  
2. N009 (  4,  3) [000009] n---GO-----                    t9 =IND       int    <l:$208, c:$209>
   N003 (  1,  1) [000003] -----------                    t3 =    LCL_VAR   byref  V01 arg1         u:1 $81
                                                               ┌──▌  t3     byref  
   N004 (  3,  2) [000005] ---XG------                    t5 =IND       int    <l:$202, c:$203>
                                                               ┌──▌  t2     int    
                                                               ├──▌  t5     int    
   N005 ( 10,  5) [000006] ---XG------                    t6 =EQ        int    <l:$207, c:$206>

In this case we needed to reorder the indirection with the [000005] IND and [000006] EQ nodes.
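Conceptually, the move operates on the linear node order: mark the nodes feeding the second indirection, then slide everything unmarked between the two indirs to after the second one, making the two indirs adjacent. A toy model over a plain list (illustrative only; the real implementation works on the LIR range in lower.cpp and performs interference checks before moving anything):

```python
def reorder_past_indir(nodes, first, second, in_dataflow):
    """Move every node strictly between `first` and `second` that is
    NOT part of `second`'s data flow to just after `second`, so the
    two indirs end up separated only by `second`'s operands."""
    i, j = nodes.index(first), nodes.index(second)
    between = nodes[i + 1 : j]
    kept  = [n for n in between if in_dataflow(n)]      # the '*' nodes
    moved = [n for n in between if not in_dataflow(n)]  # reordered past
    return nodes[: i + 1] + kept + [second] + moved + nodes[j + 1 :]

# Mirrors the first jitdump: IND #2's data flow is {LCL_VAR t7, LEA t18}.
order = ["IND1", "LCL t3", "IND t5", "EQ t6", "LCL t7", "LEA t18", "IND2"]
flow  = {"LCL t7", "LEA t18"}
print(reorder_past_indir(order, "IND1", "IND2", flow.__contains__))
# -> ['IND1', 'LCL t7', 'LEA t18', 'IND2', 'LCL t3', 'IND t5', 'EQ t6']
```

The printed order matches the "Result" dump above: the [000003], [000005], and [000006] trees end up after the second indirection.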

The optimization also supports reordering stores with loads to unlock opportunities like #64815. To do that, there is a (somewhat ad hoc) alias analysis based on TYP_REF and locals. For example, for the case in #64815:

public class Body { public double x, y, z, vx, vy, vz, mass; }

public static void Foo(double dt, Body[] bodies)
{
    foreach (var b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}

the JIT dump is:

[000033] and [000029] are indirs off the same base with offsets +040 and +016
  ..but at wrong offset
[000033] and [000021] are indirs off the same base with offsets +040 and +032
  ..and they are amenable to ldp optimization
  ..and it is close by. Trying to move the following range (where * are nodes part of the data flow):

   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 =LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
1. N009 (  4,  3) [000021] n---GO-----                   t21 =IND       double <l:$484, c:$485>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 =MUL       double <l:$489, c:$488>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 =ADD       double <l:$48d, c:$48c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 =LEA(b+8)  byref 
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
   N015 ( 24, 19) [000025] nA-XGO-----STOREIND  double <l:$206, c:$205>
                  [000116] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]
   N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 =LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
2. N009 (  4,  3) [000033] n---GO-----                   t33 =IND       double <l:$492, c:$493>

Have conservative interference with last indir. Trying a smarter interference check...
Cannot interfere with [000025] since they are off the same local V04 and indir range [040..048) does not interfere with store range [008..016)
Interference checks passed. Moving nodes that are not part of data flow tree

Result:

   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 =LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
1. N009 (  4,  3) [000021] n---GO-----                   t21 =IND       double <l:$484, c:$485>
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 =LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
2. N009 (  4,  3) [000033] n---GO-----                   t33 =IND       double <l:$492, c:$493>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 =MUL       double <l:$489, c:$488>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 =ADD       double <l:$48d, c:$48c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 =LEA(b+8)  byref 
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
   N015 ( 24, 19) [000025] nA-XGO-----STOREIND  double <l:$206, c:$205>
                  [000116] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]
   N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100

In this case we needed to reorder a much larger range of trees -- the [000022] to [000030] trees had to be moved past the indirection. That involves proving that the [000033] indir does not interfere with the [000025] store.
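The local-based alias check seen in the dump ("indir range [040..048) does not interfere with store range [008..016)") reduces, for two accesses off the same local, to a half-open interval overlap test. A sketch under that assumption (hypothetical helper, not the JIT's actual code, which must first prove both accesses are off the same local):

```python
def ranges_interfere(indir_off, indir_size, store_off, store_size):
    """Half-open byte ranges [off, off+size) overlap iff each one
    starts before the other ends."""
    return (indir_off < store_off + store_size and
            store_off < indir_off + indir_size)

# The dump's case: load [040..048) vs store [008..016) off the same V04.
print(ranges_interfere(40, 8, 8, 8))   # False: disjoint, safe to reorder
print(ranges_interfere(40, 8, 36, 8))  # True: overlap, must not reorder
```

Because the load at [040..048) and the store at [008..016) are provably disjoint, the load can safely be hoisted above the store.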

I want to look into handling LCL_FLD and stores (STOREIND/STORE_LCL_FLD) as a follow-up.

Fixes #64815

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 28, 2023
@ghost ghost assigned jakobbotsch Sep 28, 2023
@ghost commented Sep 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Hacky prototype for #92756 to see the diffs.

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak (Member)

Can it get the example in #64815?

@jakobbotsch (Member, Author) commented Sep 29, 2023

@kunalspathak Sadly no (edit: this is now handled), due to the store -- our interference checking isn't smart enough to realize that

        FD000470          str     d16, [x3,#8]
        FD400870          ldr     d16, [x3,#16]

are ok to reorder. It gives up here:

// Can we move it past the 'indir'? We can ignore effect flags when
// checking this, as we know the previous indir will make it non-faulting
// and keep the ordering correct. This makes use of the fact that
// non-volatile indirs have ordering side effects only for the "suppressed
// NRE" case.
// We still need some interference checks to ensure that the indir reads
// the same value even after reordering.
if (m_scratchSideEffects.InterferesWith(comp, indir, GTF_EMPTY, true))
{
    JITDUMP("Giving up due to interference with [%06u]\n", Compiler::dspTreeID(indir));
    UnmarkTree(indir);
    return false;
}

It's probably not that difficult to make this particular check smarter. (Of course we could also make InterferesWith itself smarter, but that's a harder and more expensive change.)

@jakobbotsch (Member, Author)

@kunalspathak I pushed a new commit that makes the interference check smarter. We now get the case in that issue; diff:

@@ -1,20 +1,18 @@
-G_M62947_IG03:  ;; offset=0x0020
+G_M62947_IG03:  ;; offset=0x001C
             ldr     x3, [x0, w1, UXTW #3]
-            ldr     d16, [x3, #0x08]
-            ldr     d17, [x3, #0x20]
-            fmul    d17, d0, d17
-            fadd    d16, d16, d17
+            ldp     d16, d17, [x3, #0x08]
+            ldp     d18, d19, [x3, #0x20]
+            fmul    d18, d0, d18
+            fadd    d16, d16, d18
             str     d16, [x3, #0x08]
-            ldr     d16, [x3, #0x10]
-            ldr     d17, [x3, #0x28]
-            fmul    d17, d0, d17
-            fadd    d16, d16, d17
+            fmul    d16, d0, d19
+            fadd    d16, d17, d16
             str     d16, [x3, #0x10]
             ldr     d16, [x3, #0x18]
             ldr     d17, [x3, #0x30]
             fmul    d17, d0, d17
             fadd    d16, d16, d17
             str     d16, [x3, #0x18]
             add     w1, w1, #1
             cmp     w2, w1
             bgt     G_M62947_IG03

@jakobbotsch (Member, Author)

Of course we need the STOREIND counterpart to also get the stp :-)

@jakobbotsch (Member, Author)

Some data on why cases get rejected (1 = rejected due to that reason):

Volatile
     <=          0 ===>   37438 count ( 99% of total)
      1 ..       1 ===>     166 count (100% of total)

Too far
     <=          0 ===>   32988 count ( 88% of total)
      1 ..       1 ===>    4450 count (100% of total)

Interference
     <=          0 ===>   28934 count ( 87% of total)
      1 ..       1 ===>    4054 count (100% of total)

This is over libraries_tests.run. Could try to adjust the distance we search to see if we can get some more of those "Too far" cases...

@jakobbotsch (Member, Author) commented Sep 29, 2023

With a search distance of 16 on smoke_tests:

Volatile
     <=          0 ===>    3829 count ( 99% of total)
      1 ..       1 ===>      10 count (100% of total)

Too far
     <=          0 ===>    3684 count ( 96% of total)
      1 ..       1 ===>     145 count (100% of total)

Interference
     <=          0 ===>    3513 count ( 95% of total)
      1 ..       1 ===>     171 count (100% of total)

With 32:

Volatile
     <=          0 ===>    3829 count ( 99% of total)
      1 ..       1 ===>      10 count (100% of total)

Too far
     <=          0 ===>    3796 count ( 99% of total)
      1 ..       1 ===>      33 count (100% of total)

Interference
     <=          0 ===>    3543 count ( 93% of total)
      1 ..       1 ===>     253 count (100% of total)

So that allowed another 112 cases to pass on to the interference analysis, but there we rejected 82 of them... so it's probably not worth increasing the distance.

@jakobbotsch (Member, Author)

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

Comment on lines 8214 to 8217
if (!ind->TypeIs(TYP_INT, TYP_LONG, TYP_DOUBLE, TYP_SIMD16))
{
    return false;
}
@jakobbotsch (Member, Author):

We need to fix #92766 to add TYP_FLOAT here.

Member:

Are small types supported?

@neon-sunset (Contributor) commented Oct 6, 2023:

Perhaps this needs to have TYP_SIMD8 too? (on a rare occasion there is Vector64-based code supported by ARM64)

Member:

Perhaps this needs to have TYP_SIMD8 too? (on a rare occasion there is Vector64-based code supported by ARM64)

True, wondering if TYP_SIMD8 was considered in this list?

@jakobbotsch (Member, Author):

Are small types supported?

Nope, there is no small variant of ldp, I believe.

Perhaps this needs to have TYP_SIMD8 too? (on a rare occasion there is Vector64-based code supported by ARM64)

True, wondering if TYP_SIMD8 was considered in this list?

Yep, this can have TYP_SIMD8 -- I forgot about those, thanks for spotting this.

Review thread on src/coreclr/jit/lower.cpp (resolved).
@EgorBo (Member) commented Oct 6, 2023

Nice diffs!

@BruceForstall (Member)

cc @a74nh

@BruceForstall (Member)

Looks like there are some interesting regressions where spilling happens, and where the ldp optimization doesn't actually occur.

@BruceForstall (Member) left a review:

Very cool optimization. Just a few nits/questions/comments.

Review threads on src/coreclr/jit/sideeffects.h and src/coreclr/jit/lower.cpp (resolved).

for (GenTree* cur = prevIndir->gtNext; cur != indir; cur = cur->gtNext)
{
    if (cur->IsCall() || (cur->OperIsStoreBlk() && (cur->AsBlk()->gtBlkOpKind == GenTreeBlk::BlkOpKindHelper)))
Member:

I am wondering if we should move this check into the for-loop above where you check the distance between prevIndir and indir.

@jakobbotsch (Member, Author):

Seems reasonable -- moved it.

FWIW, we might be able to remove this heuristic after we merge #92744


if ((cur->gtLIRFlags & LIR::Flags::Mark) != 0)
{
    // Part of data flow of 'indir', so we will be moving past this node.
Member:

Suggested change:
- // Part of data flow of 'indir', so we will be moving past this node.
+ // Part of data flow of 'indir', so we will be moving it upward past this node.

@jakobbotsch (Member, Author):

The suggestion doesn't seem quite right to me, but I rephrased it in a different way which hopefully makes more sense.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 6, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 6, 2023
* Move IsVolatile check into OptimizeForLdp
* Allow the optimization for SIMD8 indirs
* Make fldSeq parameter of gtPeelOffsets optional
* Only check for GT_LCL_VARs; reorder some checks for TP purposes
* Avoid adding an entry on successful optimization
* Update some comments and JITDUMP
* Add convenience constructor
@jakobbotsch (Member, Author) commented Oct 6, 2023

Looks like there are some interesting regressions where spilling happens, and where the ldp opt. doesn't actually occur.

Some of the failures to combine into ldp were because we had already combined the previous indir with another one. I added an early-out for this case to skip adding the "middle" indir to the m_blockIndirs list.

Remaining regressions seem to be mostly about register allocation, but I think it's expected to see some regressions due to suboptimal register allocation when doing this kind of reordering (since the transformation lengthens the lifetime of the moved LIR temp).

@kunalspathak (Member) left a review:

LGTM.

@AndyAyersMS (Member)

Do you have a sense that moving the second indir "earlier" is better than moving the first indir "later"? (or even moving both of them somewhere in between)?

I think it's generally desirable to do loads as soon as you know you're going to need them, so as to hide load latency; so moving the second one up seems like the right approach. But there may be cases with less register pressure if we moved the first one later, or, if the loads are not on any critical path, the CPU might be better off spending its limited resources on something else.

Conventional wisdom is that "scheduling" is no longer really critical with today's highly OOO CPUs so perhaps none of this matters and compacting the code better is the key aspect.

@jakobbotsch (Member, Author)

Do you have a sense that moving the second indir "earlier" is better than moving the first indir "later"? (or even moving both of them somewhere in between)?

The decision here was mainly driven by the fact that, implementation-wise, it is much easier to move the second indirection and its operands backwards than to find all transitive users (including through locals) of the first indirection and move them forwards past the second indirection.

@jakobbotsch (Member, Author)

superpmi-diffs failure is the "out of space" one.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Successfully merging this pull request may close these issues.

Arm64: Evaluate if it is possible to combine subsequent field loads in a single load
6 participants