Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Reorder stores to make them amenable to stp optimization #102133

Merged
merged 8 commits into from
May 15, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented May 12, 2024

This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Example:

class Body { public double x, y, z, vx, vy, vz, mass; }

static void Advance(double dt, Body[] bodies)
{
    foreach (Body b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}

Diff:

@@ -1,18 +1,17 @@
-G_M55007_IG04:  ;; offset=0x001C
+G_M55007_IG04:  ;; offset=0x0020
             ldr     x3, [x0, w1, UXTW #3]
             ldp     d16, d17, [x3, #0x08]
             ldp     d18, d19, [x3, #0x20]
             fmul    d18, d0, d18
             fadd    d16, d16, d18
-            str     d16, [x3, #0x08]
-            fmul    d16, d0, d19
-            fadd    d16, d17, d16
-            str     d16, [x3, #0x10]
+            fmul    d18, d0, d19
+            fadd    d17, d17, d18
+            stp     d16, d17, [x3, #0x08]
             ldr     d16, [x3, #0x18]
             ldr     d17, [x3, #0x30]
             fmul    d17, d0, d17
             fadd    d16, d16, d17
             str     d16, [x3, #0x18]
             add     w1, w1, #1
             cmp     w2, w1
             bgt     G_M55007_IG04

This generalizes the indir reordering optimization (that currently only
triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of
the second indirection needs its own reordering with the previous
indirection. The existing logic works by reordering all nodes between
the first and second indirection that are unrelated to the second
indirection's computation to happen after it. Once that is done we know
that there are no uses of the first indirection's result between it and
the second indirection, so after doing the necessary interference checks
we can safely move the previous indirection to happen after the data
node of the second indirection.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 12, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo
Copy link
Member

EgorBo commented May 12, 2024

Nice! I see you're hitting similar size regressions that I did too, e.g.:

             fmov    s16, #1.0000
-            stp     s16, s16, [x0, #0x08]
+            fmov    s17, #1.0000
+            stp     s16, s17, [x0, #0x08]

presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64

@jakobbotsch
Copy link
Member Author

Nice! I see you're hitting similar size regressions that I did too, e.g.:

             fmov    s16, #1.0000
-            stp     s16, s16, [x0, #0x08]
+            fmov    s17, #1.0000
+            stp     s16, s17, [x0, #0x08]

presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64

There is a simple workaround (this PR already has it for integer constants). It just doesn't work for float constants because GenTree::Compare does not handle GT_CNS_DBL. I'll fix it in a follow up

// For some reason LSRA is not able to reuse a constant if both LIR
// temps are live simultaneously, so skip moving in those cases and
// expect LSRA to reuse the constant instead.
if (!indir->Data()->IsCnsIntOrI() || !GenTree::Compare(indir->Data(), prevIndir->Data()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it make an exception for 0 value? since ZR reg shouldn't be a problem for LSRA

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 0 the constant will be contained in the STOREIND, so whether we move it or not shouldn't matter since it won't result in any instructions generated between the two indirs regardless.

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch jakobbotsch marked this pull request as ready for review May 13, 2024 17:46
@jakobbotsch
Copy link
Member Author

The failures look known.

cc @dotnet/jit-contrib PTAL @EgorBo @kunalspathak

Diffs.

I wonder if there are any worries with stp similar to the ldp issue on Apple Silicon. If so I guess we can handle it as part of that.

FYI @a74nh @SwapnilGaikwad

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?

src/coreclr/jit/lower.cpp Show resolved Hide resolved
@jakobbotsch
Copy link
Member Author

do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?

No, the first indirection needs to happen after the data node of the second indirection so that we don't end up with something between the two indirections (which would break the str,str -> stp emitter peephole).

Here is the relevant JITDUMP for the example shared above:

[000037] and [000025] are indirs off the same base with offsets +016 and +008
  ..and they are amenable to ldp/stp optimization
  ..and they are close. Trying to move the following range (where * are nodes part of the data flow):

   N001 (  1,  1) [000015] -----------                   t15 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t15    ref    
   N003 (  3,  4) [000073] -c---------                   t73 =LEA(b+8)  byref 
                                                               ┌──▌  t73    byref  
   N004 (  4,  3) [000017] ---XG------                   t17 =IND       double <l:$441, c:$442>
*  N001 (  1,  1) [000027] -----------                   t27 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t27    ref    
*  N003 (  3,  4) [000079] -c---------                   t79 =LEA(b+16) byref 
                                                               ┌──▌  t79    byref  
*  N004 (  4,  3) [000029] n---GO-----                   t29 =IND       double <l:$44f, c:$450>
   N005 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   double V00 arg0         u:1 $100
   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 =LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
   N009 (  4,  3) [000021] n---GO-----                   t21 =IND       double <l:$444, c:$445>
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 =LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
*  N009 (  4,  3) [000033] n---GO-----                   t33 =IND       double <l:$452, c:$453>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 =MUL       double <l:$449, c:$448>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 =ADD       double <l:$44d, c:$44c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 =LEA(b+8)  byref 
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
1. N015 ( 24, 19) [000025] nA-XGO-----STOREIND  double <l:$206, c:$205>
                  [000109] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]
*  N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100
                                                               ┌──▌  t30    double 
                                                               ├──▌  t33    double 
*  N010 ( 10,  8) [000034] ----GO-----                   t34 =MUL       double <l:$457, c:$456>
                                                               ┌──▌  t29    double 
                                                               ├──▌  t34    double 
*  N011 ( 19, 15) [000035] ----GO-----                   t35 =ADD       double <l:$45b, c:$45a>
*  N012 (  1,  1) [000026] -----------                   t26 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t26    ref    
*  N014 (  3,  4) [000077] -c---------                   t77 =LEA(b+16) byref 
                                                               ┌──▌  t77    byref  
                                                               ├──▌  t35    double 
2. N015 ( 24, 19) [000037] nA--GO-----STOREIND  double <l:$206, c:$205>

Interference checks passed: can move unrelated nodes past second indir.
Moving nodes that are not part of data flow of [000037]

Result:

   N001 (  1,  1) [000015] -----------                   t15 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t15    ref    
   N003 (  3,  4) [000073] -c---------                   t73 =LEA(b+8)  byref 
                                                               ┌──▌  t73    byref  
   N004 (  4,  3) [000017] ---XG------                   t17 =IND       double <l:$441, c:$442>
*  N001 (  1,  1) [000027] -----------                   t27 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t27    ref    
*  N003 (  3,  4) [000079] -c---------                   t79 =LEA(b+16) byref 
                                                               ┌──▌  t79    byref  
*  N004 (  4,  3) [000029] n---GO-----                   t29 =IND       double <l:$44f, c:$450>
   N005 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   double V00 arg0         u:1 $100
   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 =LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
   N009 (  4,  3) [000021] n---GO-----                   t21 =IND       double <l:$444, c:$445>
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 =LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
*  N009 (  4,  3) [000033] n---GO-----                   t33 =IND       double <l:$452, c:$453>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 =MUL       double <l:$449, c:$448>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 =ADD       double <l:$44d, c:$44c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 =LEA(b+8)  byref 
*  N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100
                                                               ┌──▌  t30    double 
                                                               ├──▌  t33    double 
*  N010 ( 10,  8) [000034] ----GO-----                   t34 =MUL       double <l:$457, c:$456>
                                                               ┌──▌  t29    double 
                                                               ├──▌  t34    double 
*  N011 ( 19, 15) [000035] ----GO-----                   t35 =ADD       double <l:$45b, c:$45a>
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
1. N015 ( 24, 19) [000025] nA-XGO-----STOREIND  double <l:$206, c:$205>
*  N012 (  1,  1) [000026] -----------                   t26 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t26    ref    
*  N014 (  3,  4) [000077] -c---------                   t77 =LEA(b+16) byref 
                                                               ┌──▌  t77    byref  
                                                               ├──▌  t35    double 
2. N015 ( 24, 19) [000037] nA--GO-----STOREIND  double <l:$206, c:$205>
                  [000109] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]

Here you can see that [000025] was moved past [000035], the data node of the second indir.

@@ -9119,7 +9134,7 @@ bool Lowering::OptimizeForLdp(GenTreeIndir* ind)
bool Lowering::TryMakeIndirsAdjacent(GenTreeIndir* prevIndir, GenTreeIndir* indir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The baseAddr for both prevIndir and indir is assumed to be same? Can we have an assert for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that easy to assert (it won't be exactly the same, just something that we expect the emitter peephole to kick in for). But it's also not a precondition for this function that the addresses must be related in some way -- the function works fine even without that. So I don't think there is a good reason to try to assert it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean will we mistakenly use offsets of different base address and combine them, leading to bad codegen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reordering here doesn't combine the stores or loads. It just puts them next to each other. Combining them is done by the peephole in the emitter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, what I meant is we might end up reordering unrelated indir and prevIndir but the peephole emitter will make sure that we combine the correct matching indir and prevIndir only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this function is still correct even if you ask it to make two indirs off of unrelated addresses adjacent. There is no correctness requirement that the addresses relate in some way; this is not a precondition for the function. Hence why I don't see a good reason to try to assert that (and furthermore, it isn't easy to assert it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Ok, just wanted to confirm my understanding.

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, diffs are definitely better than in #102126

@jakobbotsch
Copy link
Member Author

@kunalspathak Any other feedback or are you ok with merging this as is?

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?

@jakobbotsch
Copy link
Member Author

Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?

Yep, they look known.

@jakobbotsch jakobbotsch merged commit a930ef5 into dotnet:main May 15, 2024
123 of 132 checks passed
@jakobbotsch jakobbotsch deleted the make-store-indirs-adjacent branch May 15, 2024 21:25
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
…#102133)

This generalizes the indir reordering optimization (that currently only
triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of
the second indirection needs its own reordering with the previous
indirection. The existing logic works by reordering all nodes between
the first and second indirection that are unrelated to the second
indirection's computation to happen after it. Once that is done we know
that there are no uses of the first indirection's result between it and
the second indirection, so after doing the necessary interference checks
we can safely move the previous indirection to happen after the data
node of the second indirection.

Example:
```csharp
class Body { public double x, y, z, vx, vy, vz, mass; }

static void Advance(double dt, Body[] bodies)
{
    foreach (Body b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}
```

Diff:
```diff
@@ -1,18 +1,17 @@
-G_M55007_IG04:  ;; offset=0x001C
+G_M55007_IG04:  ;; offset=0x0020
             ldr     x3, [x0, w1, UXTW #3]
             ldp     d16, d17, [x3, #0x08]
             ldp     d18, d19, [x3, #0x20]
             fmul    d18, d0, d18
             fadd    d16, d16, d18
-            str     d16, [x3, #0x08]
-            fmul    d16, d0, d19
-            fadd    d16, d17, d16
-            str     d16, [x3, #0x10]
+            fmul    d18, d0, d19
+            fadd    d17, d17, d18
+            stp     d16, d17, [x3, #0x08]
             ldr     d16, [x3, #0x18]
             ldr     d17, [x3, #0x30]
             fmul    d17, d0, d17
             fadd    d16, d16, d17
             str     d16, [x3, #0x18]
             add     w1, w1, #1
             cmp     w2, w1
             bgt     G_M55007_IG04
```
@github-actions github-actions bot locked and limited conversation to collaborators Jun 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants