JIT: Reorder stores to make them amenable to stp optimization #102133
Conversation
This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes. The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.
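For intuition, here is a minimal sketch of that scheme over a toy LIR-like list (hypothetical `Node` type and helpers, not the actual `Lowering::TryMakeIndirsAdjacent` implementation; the interference checks are elided):

```cpp
#include <algorithm>
#include <list>
#include <unordered_set>

struct Node { int id; };

// Sketch: 'range' models execution order, 'prevIndir' appears before 'indir',
// and 'indirDataFlow' is the (transitive) set of nodes that indir's address
// and data computation depend on.
void MakeIndirsAdjacent(std::list<Node*>& range, Node* prevIndir, Node* indir,
                        const std::unordered_set<Node*>& indirDataFlow)
{
    auto prevIt     = std::find(range.begin(), range.end(), prevIndir);
    auto indIt      = std::find(range.begin(), range.end(), indir);
    auto afterIndir = std::next(indIt);

    // Step 1: move every node between the indirs that is unrelated to the
    // second indir's computation to after it. Afterwards, nothing left
    // between prevIndir and indir uses prevIndir's result.
    for (auto it = std::next(prevIt); it != indIt;)
    {
        auto cur = it++;
        if (indirDataFlow.count(*cur) == 0)
        {
            range.splice(afterIndir, range, cur); // keeps relative order
        }
    }

    // Step 2: after the interference checks (elided), move prevIndir to
    // immediately before indir, i.e. after the second indir's data node,
    // leaving the two indirs adjacent for the emitter's ldp/stp peephole.
    range.splice(indIt, range, prevIt);
}
```

Step 1 is what guarantees no use of the first indirection's result remains between the two; step 2 then only needs the interference checks to move the previous indirection past the second one's data computation.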
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

Azure Pipelines successfully started running 3 pipeline(s).
Nice! I see you're hitting similar size regressions that I did too, e.g.:

```diff
  fmov s16, #1.0000
- stp  s16, s16, [x0, #0x08]
+ fmov s17, #1.0000
+ stp  s16, s17, [x0, #0x08]
```

Presumably this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64.
There is a simple workaround (this PR already has it for integer constants). It just doesn't work for float constants because
src/coreclr/jit/lower.cpp (outdated):

```cpp
// For some reason LSRA is not able to reuse a constant if both LIR
// temps are live simultaneously, so skip moving in those cases and
// expect LSRA to reuse the constant instead.
if (!indir->Data()->IsCnsIntOrI() || !GenTree::Compare(indir->Data(), prevIndir->Data()))
```
Should it make an exception for a 0 value, since the ZR reg shouldn't be a problem for LSRA?
For 0 the constant will be contained in the STOREIND, so whether we move it or not shouldn't matter since it won't result in any instructions generated between the two indirs regardless.
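For illustration, a minimal self-contained sketch of that reasoning (hypothetical types, not the actual GenTree API):

```cpp
// Model of the point above: an integral-zero data operand is "contained" in
// the STOREIND, meaning arm64 codegen stores straight from the zero register
// (xzr/wzr) instead of materializing the constant. So whether or not a zero
// constant is moved, no instruction ends up between the two stores.
struct DataNode
{
    bool      isIntegralConst;
    long long value;

    bool IsIntegralZero() const { return isIntegralConst && value == 0; }
};

// Hypothetical helper: moving the data node only matters when it would
// actually generate an instruction.
bool MovingDataMatters(const DataNode& data)
{
    return !data.IsIntegralZero();
}
```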
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).

The failures look known. cc @dotnet/jit-contrib

PTAL @EgorBo @kunalspathak. I wonder if there are any worries with
> so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?
No, the first indirection needs to happen after the data node of the second indirection so that we don't end up with something between the two indirections (which would break the optimization).

Here is the relevant JITDUMP for the example shared above:

```
[000037] and [000025] are indirs off the same base with offsets +016 and +008
..and they are amenable to ldp/stp optimization
..and they are close. Trying to move the following range (where * are nodes part of the data flow):
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
Interference checks passed: can move unrelated nodes past second indir.
Moving nodes that are not part of data flow of [000037]
Result:
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
```

Here you can see that [000025] was moved to after [000035], the data node of the second indirection.
```diff
@@ -9119,7 +9134,7 @@ bool Lowering::OptimizeForLdp(GenTreeIndir* ind)
 bool Lowering::TryMakeIndirsAdjacent(GenTreeIndir* prevIndir, GenTreeIndir* indir)
```
The `baseAddr` for both `prevIndir` and `indir` is assumed to be the same? Can we have an assert for it?
It's not that easy to assert (it won't be exactly the same, just something that we expect the emitter peephole to kick in for). But it's also not a precondition for this function that the addresses must be related in some way -- the function works fine even without that. So I don't think there is a good reason to try to assert it.
I mean, will we mistakenly use offsets of different base addresses and combine them, leading to bad codegen?
This reordering here doesn't combine the stores or loads. It just puts them next to each other. Combining them is done by the peephole in the emitter.
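For intuition, a minimal sketch of the adjacency test such a peephole performs before fusing two stores into a `stp` (hypothetical instruction structs, not the actual emitter code; the real `stp` encoding also has immediate-range limits that are ignored here):

```cpp
#include <cstdint>
#include <optional>

struct StoreInsn
{
    int     srcReg;    // register being stored
    int     baseReg;   // address base register
    int32_t offset;    // immediate offset from the base
    int     sizeBytes; // access size, e.g. 4 or 8
};

struct PairInsn
{
    int     srcReg1;
    int     srcReg2;
    int     baseReg;
    int32_t offset;
};

// Two stores can fuse into one stp only if they use the same base register,
// the same access size, and the second offset is exactly one access past
// the first. Reordering merely makes candidates adjacent; this check is
// what decides whether they actually combine.
std::optional<PairInsn> TryFuseToStp(const StoreInsn& prev, const StoreInsn& cur)
{
    if (prev.baseReg == cur.baseReg && prev.sizeBytes == cur.sizeBytes &&
        cur.offset == prev.offset + prev.sizeBytes)
    {
        return PairInsn{prev.srcReg, cur.srcReg, prev.baseReg, prev.offset};
    }
    return std::nullopt; // not adjacent: leave the two stores as they are
}
```

So even if the reordering puts two unrelated indirs next to each other, the peephole's base/offset check means only genuinely matching pairs get combined.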
Sure, what I meant is we might end up reordering unrelated `indir` and `prevIndir`, but the peephole emitter will make sure that we combine the correct matching `indir` and `prevIndir` only.
Yes, but this function is still correct even if you ask it to make two indirs off of unrelated addresses adjacent. There is no correctness requirement that the addresses relate in some way; this is not a precondition for the function. Hence why I don't see a good reason to try to assert that (and furthermore, it isn't easy to assert it).
> Yes
Ok, just wanted to confirm my understanding.
LGTM, diffs are definitely better than in #102126
@kunalspathak Any other feedback or are you ok with merging this as is?
Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?
Yep, they look known.
…#102133)

This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Example:

```csharp
class Body
{
    public double x, y, z, vx, vy, vz, mass;
}

static void Advance(double dt, Body[] bodies)
{
    foreach (Body b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}
```

Diff:

```diff
@@ -1,18 +1,17 @@
-G_M55007_IG04:  ;; offset=0x001C
+G_M55007_IG04:  ;; offset=0x0020
         ldr     x3, [x0, w1, UXTW #3]
         ldp     d16, d17, [x3, #0x08]
         ldp     d18, d19, [x3, #0x20]
         fmul    d18, d0, d18
         fadd    d16, d16, d18
-        str     d16, [x3, #0x08]
-        fmul    d16, d0, d19
-        fadd    d16, d17, d16
-        str     d16, [x3, #0x10]
+        fmul    d18, d0, d19
+        fadd    d17, d17, d18
+        stp     d16, d17, [x3, #0x08]
         ldr     d16, [x3, #0x18]
         ldr     d17, [x3, #0x30]
         fmul    d17, d0, d17
         fadd    d16, d16, d17
         str     d16, [x3, #0x18]
         add     w1, w1, #1
         cmp     w2, w1
         bgt     G_M55007_IG04
```