JIT: Reorder stores to make them amenable to stp optimization #102133
Conversation
This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes. The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.
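For intuition, here is a minimal sketch of that scheme over a toy LIR-like list (hypothetical `Node` type and helpers, not the actual `Lowering::TryMakeIndirsAdjacent` implementation; the interference checks are elided):

```cpp
#include <algorithm>
#include <list>
#include <unordered_set>

struct Node { int id; };

// Sketch: 'range' models execution order, 'prevIndir' appears before 'indir',
// and 'indirDataFlow' is the (transitive) set of nodes that indir's address
// and data computation depend on.
void MakeIndirsAdjacent(std::list<Node*>& range, Node* prevIndir, Node* indir,
                        const std::unordered_set<Node*>& indirDataFlow)
{
    auto prevIt     = std::find(range.begin(), range.end(), prevIndir);
    auto indIt      = std::find(range.begin(), range.end(), indir);
    auto afterIndir = std::next(indIt);

    // Step 1: move every node between the indirs that is unrelated to the
    // second indir's computation to after it. Afterwards, nothing left
    // between prevIndir and indir uses prevIndir's result.
    for (auto it = std::next(prevIt); it != indIt;)
    {
        auto cur = it++;
        if (indirDataFlow.count(*cur) == 0)
        {
            range.splice(afterIndir, range, cur); // keeps relative order
        }
    }

    // Step 2: after the interference checks (elided), move prevIndir to
    // immediately before indir, i.e. after the second indir's data node,
    // leaving the two indirs adjacent for the emitter's ldp/stp peephole.
    range.splice(indIt, range, prevIt);
}
```

Step 1 is what guarantees no use of the first indirection's result remains between the two; step 2 then only needs the interference checks to move the previous indirection past the second one's data computation.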
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Azure Pipelines successfully started running 3 pipeline(s).
Nice! I see you're hitting similar size regressions that I did too, e.g.:

```diff
  fmov s16, #1.0000
- stp  s16, s16, [x0, #0x08]
+ fmov s17, #1.0000
+ stp  s16, s17, [x0, #0x08]
```

Presumably this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64.
There is a simple workaround (this PR already has it for integer constants). It just doesn't work for float constants because
src/coreclr/jit/lower.cpp (outdated):

```cpp
// For some reason LSRA is not able to reuse a constant if both LIR
// temps are live simultaneously, so skip moving in those cases and
// expect LSRA to reuse the constant instead.
if (!indir->Data()->IsCnsIntOrI() || !GenTree::Compare(indir->Data(), prevIndir->Data()))
```
Should it make an exception for a 0 value, since the ZR reg shouldn't be a problem for LSRA?
For 0 the constant will be contained in the STOREIND, so whether we move it or not shouldn't matter since it won't result in any instructions generated between the two indirs regardless.
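For illustration, a minimal self-contained sketch of that reasoning (hypothetical types, not the actual GenTree API):

```cpp
// Model of the point above: an integral-zero data operand is "contained" in
// the STOREIND, meaning arm64 codegen stores straight from the zero register
// (xzr/wzr) instead of materializing the constant. So whether or not a zero
// constant is moved, no instruction ends up between the two stores.
struct DataNode
{
    bool      isIntegralConst;
    long long value;

    bool IsIntegralZero() const { return isIntegralConst && value == 0; }
};

// Hypothetical helper: moving the data node only matters when it would
// actually generate an instruction.
bool MovingDataMatters(const DataNode& data)
{
    return !data.IsIntegralZero();
}
```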
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).

The failures look known. cc @dotnet/jit-contrib

PTAL @EgorBo @kunalspathak. I wonder if there are any worries with
> so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?
No, the first indirection needs to happen after the data node of the second indirection so that we don't end up with something between the two indirections (which would break the optimization).

Here is the relevant JITDUMP for the example shared above:

```
[000037] and [000025] are indirs off the same base with offsets +016 and +008
..and they are amenable to ldp/stp optimization
..and they are close. Trying to move the following range (where * are nodes part of the data flow):
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
Interference checks passed: can move unrelated nodes past second indir.
Moving nodes that are not part of data flow of [000037]
Result:
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
```

Here you can see that [000025] was moved to after [000035], the data node of the second indirection.
```diff
@@ -9119,7 +9134,7 @@ bool Lowering::OptimizeForLdp(GenTreeIndir* ind)
 bool Lowering::TryMakeIndirsAdjacent(GenTreeIndir* prevIndir, GenTreeIndir* indir)
```
The `baseAddr` for both `prevIndir` and `indir` is assumed to be the same? Can we have an assert for it?
It's not that easy to assert (it won't be exactly the same, just something that we expect the emitter peephole to kick in for). But it's also not a precondition for this function that the addresses must be related in some way -- the function works fine even without that. So I don't think there is a good reason to try to assert it.
I mean, will we mistakenly use offsets of different base addresses and combine them, leading to bad codegen?
This reordering here doesn't combine the stores or loads. It just puts them next to each other. Combining them is done by the peephole in the emitter.
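For intuition, a minimal sketch of the adjacency test such a peephole performs before fusing two stores into a `stp` (hypothetical instruction structs, not the actual emitter code; the real `stp` encoding also has immediate-range limits that are ignored here):

```cpp
#include <cstdint>
#include <optional>

struct StoreInsn
{
    int     srcReg;    // register being stored
    int     baseReg;   // address base register
    int32_t offset;    // immediate offset from the base
    int     sizeBytes; // access size, e.g. 4 or 8
};

struct PairInsn
{
    int     srcReg1;
    int     srcReg2;
    int     baseReg;
    int32_t offset;
};

// Two stores can fuse into one stp only if they use the same base register,
// the same access size, and the second offset is exactly one access past
// the first. Reordering merely makes candidates adjacent; this check is
// what decides whether they actually combine.
std::optional<PairInsn> TryFuseToStp(const StoreInsn& prev, const StoreInsn& cur)
{
    if (prev.baseReg == cur.baseReg && prev.sizeBytes == cur.sizeBytes &&
        cur.offset == prev.offset + prev.sizeBytes)
    {
        return PairInsn{prev.srcReg, cur.srcReg, prev.baseReg, prev.offset};
    }
    return std::nullopt; // not adjacent: leave the two stores as they are
}
```

So even if the reordering puts two unrelated indirs next to each other, the peephole's base/offset check means only genuinely matching pairs get combined.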
Sure, what I meant is we might end up reordering unrelated `indir` and `prevIndir`, but the peephole emitter will make sure that we combine the correct matching `indir` and `prevIndir` only.
Yes, but this function is still correct even if you ask it to make two indirs off of unrelated addresses adjacent. There is no correctness requirement that the addresses relate in some way; this is not a precondition for the function. Hence why I don't see a good reason to try to assert that (and furthermore, it isn't easy to assert it).
> Yes
Ok, just wanted to confirm my understanding.
LGTM, diffs are definitely better than in #102126
@kunalspathak Any other feedback or are you ok with merging this as is?
Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?
Yep, they look known.
…#102133)

This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Example:

```csharp
class Body
{
    public double x, y, z, vx, vy, vz, mass;
}

static void Advance(double dt, Body[] bodies)
{
    foreach (Body b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}
```

Diff:

```diff
@@ -1,18 +1,17 @@
-G_M55007_IG04:  ;; offset=0x001C
+G_M55007_IG04:  ;; offset=0x0020
         ldr     x3, [x0, w1, UXTW #3]
         ldp     d16, d17, [x3, #0x08]
         ldp     d18, d19, [x3, #0x20]
         fmul    d18, d0, d18
         fadd    d16, d16, d18
-        str     d16, [x3, #0x08]
-        fmul    d16, d0, d19
-        fadd    d16, d17, d16
-        str     d16, [x3, #0x10]
+        fmul    d18, d0, d19
+        fadd    d17, d17, d18
+        stp     d16, d17, [x3, #0x08]
         ldr     d16, [x3, #0x18]
         ldr     d17, [x3, #0x30]
         fmul    d17, d0, d17
         fadd    d16, d16, d17
         str     d16, [x3, #0x18]
         add     w1, w1, #1
         cmp     w2, w1
         bgt     G_M55007_IG04
```