Investigate unnecessary vector register moves #33975

echesakov · 2020-03-23T17:57:59Z

Copied from @TamarChristinaArm reply in #33749 (comment)

Also am I missing something? I see

        6E084509          mov     v9.d[0], v8.d[1]
...
        6E180528          mov     v8.d[1], v9.d[0]

With no actual usages of v8 or v9. and it does the same odd thing again later

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]
        6E18056A          mov     v10.d[1], v11.d[0]

v9 and v11 are untouched. it only uses v8 and v10.

category:cq
theme:hardware-intrinsics
skill-level:intermediate
cost:medium

tannergooding · 2020-03-23T18:27:57Z

Is v8 possibly a register that needs to be saved due to the method call?

echesakov · 2020-03-23T18:57:05Z

Is v8 possibly a register that needs to be saved due to the method call?

It could be a case since even though it's a non-volatile register (to be precise, its low 64 bits are non-volatile) we still need to preserve its high 64 bits if they are actually used somewhere.

TamarChristinaArm · 2020-03-24T12:24:18Z

Is v8 possibly a register that needs to be saved due to the method call?

It could be a case since even though it's a non-volatile register (to be precise, its low 64 bits are non-volatile) we still need to preserve its high 64 bits if they are actually used somewhere.

That's right, but in the second example the value is used before it's restored

        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]

so if the parent did clobber the top bits the result would be wrong. So looks like a dataflow bug..

Also I think the RA should avoid using v8-v15 as much as possible around function calls. This would save it from having to do the save and restore. In C code I think this situation would be rare.. but since there's a lot of software fallbacks you'll see this situation a lot more.

TamarChristinaArm · 2020-03-24T12:26:18Z

or if the JIT know which registers are used in the function call it can safely not do the spill/restore at all. I don't know if it has that information for LRA?

TamarChristinaArm · 2020-03-24T13:06:31Z

Also what's the general preference here? stack space or performance? if stack space is a concern than using the ASIMD MOV is the best option (It's really an INS in this case), but if performance then you're better off just spilling them as a STR has a lower latency and just spill the entire thing. In this case the JIT could have done an STP as well and ended up with 1/4th of the overhead that it has now.

echesakov · 2020-03-24T18:47:08Z

Also I think the RA should avoid using v8-v15 as much as possible around function calls. This would save it from having to do the save and restore. In C code I think this situation would be rare.. but since there's a lot of software fallbacks you'll see this situation a lot more.

I agree this could be an option, but not sure if this can be done easily. @CarolEidt probably knows.

or if the JIT know which registers are used in the function call it can safely not do the spill/restore at all. I don't know if it has that information for LRA?

I believe generally such information is not available to the JIT due to top-down compilation approach.
I remember that @erozenfeld looked in the past for doing such optimization in AOT and even there is was tricky due to some registers being used/trashed by stubs.

Also what's the general preference here? stack space or performance? if stack space is a concern than using the ASIMD MOV is the best option (It's really an INS in this case), but if performance then you're better off just spilling them as a STR has a lower latency and just spill the entire thing. In this case the JIT could have done an STP as well and ended up with 1/4th of the overhead that it has now.

I didn't know that spilling could be cheaper than register move. @BruceForstall @briansull Do you see any issues with JIT doing this instead of ASIMD MOV for volatile SIMD&FP registers around a function call?

CarolEidt · 2020-03-24T19:24:54Z

I think the RA should avoid using v8-v15 as much as possible around function calls

So you're saying it would be better to save & restore the full vector, rather than doing the partial save & restore?

The use of v8 after the call, but before the restore, definitely looks like a bug.

TamarChristinaArm · 2020-03-24T19:26:19Z

I didn't know that spilling could be cheaper than register move. @BruceForstall @briansull Do you see any issues with JIT doing this instead of ASIMD MOV for volatile SIMD&FP registers around a function call?

Actually I take that back, this is a bit nuanced.. while the stores can end up being cheaper as soon as you have more than 1 register to save. The loads are a lot more expensive.. so until you spill a LOT, the INS would be cheaper.

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

CarolEidt · 2020-03-24T19:31:22Z

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

I believe that what it will do is spill it to a volatile register (since those are always available), and then store that upper half to memory. It might be reasonable to add support for spilling to memory if the callee-save registers are all occupied.

TamarChristinaArm · 2020-03-24T19:36:44Z

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

I believe that what it will do is spill it to a volatile register (since those are always available), and then store that upper half to memory. It might be reasonable to add support for spilling to memory if the callee-save registers are all occupied.

Ah ok, that sounds reasonable. Then I think if we can avoid using v8-v15 as much as possible then this case should be less of an issue. the INS look like the best option if there are free registers.

I suspect that once all the helper functions have some naive AArch64 code behind them that having a non-vector call in between vector code would be much rarer... unless they're vector function calls but that would ideally use the vector APCs anyway.

so the only bug here is the use before restore of v8.

CarolEidt · 2020-08-08T00:36:59Z

@echesakovMSFT @TamarChristinaArm - unless I'm looking at the wrong thing, it seems that we are no longer generating the same code for these BitArray methods, so the issue doesn't occur. Do we know of other cases where we see this kind of use-before-restore?

echesakov · 2020-08-08T00:56:08Z

@CarolEidt We might not see it now because Vector128.Create was intrinsified at some point after the issue was opened.
I am wondering if it is still possible to repro the unnecessary moves from volatile to non-volatile registers by writing a method that has intrinsics usages interleaved with a call to a function (that won't be inlined).

TamarChristinaArm · 2020-08-10T18:41:41Z

Yeah what @echesakovMSFT should work. Could probably increase the chances of getting this exact issue by increasing register pressure.

echesakov · 2020-09-15T17:45:52Z

I can take a look at this

echesakov · 2020-10-06T23:02:56Z

I was trying to find instances of the "use-before-restore" issue and I couldn't find them even in the original PR #33749

The code that I quoted on the first message of the thread was copied from Tamar's comment #33749 (comment) and I believe corresponds to BitArray.CopyTo method. However, in the jitDump.txt file I collected back then that code looks correct:

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        6E180528          mov     v8.d[1], v9.d[0]
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E18056A          mov     v10.d[1], v11.d[0]

Upper 64 bits of v8 were restored before using in and instruction.

Unless, I misunderstood something, I don't think there exists such issue in the JIT.

@TamarChristinaArm @CarolEidt

TamarChristinaArm · 2020-10-07T02:53:19Z

@echesakovMSFT hmm no I think I agree.. I don't know how that bit changed during copy and paste, but clearly it's not the same sequence as in the original comment you made. I usually copy them locally to make inline comments and something must have flipped..

So this looks like a bogus report.. Sorry for the noise, I'll double check these from now on to make sure.

CarolEidt · 2020-10-13T16:19:08Z

@echesakovMSFT - should this now be closed?

echesakov · 2020-10-13T16:24:12Z

@CarolEidt I was looking into another aspect of the issue - whether the mov-s are necessary or not, so I kept the issue open.
So far, I haven't seen anything that is unnecessary but I am thinking how to mitigate the mov-s of upper 64-bits in the jitted code around helpers. Perhaps, we can implement some of them in assembly and guarantee calling convention that doesn't alter any of the SIMD registers?

CarolEidt · 2020-10-13T16:40:00Z

Perhaps, we can implement some of them in assembly and guarantee calling convention that doesn't alter any of the SIMD registers?

That sounds like a great idea!

echesakov · 2021-07-08T20:21:34Z

Closing since the original issue was confirmed to be not an issue.
As for trying an idea in #33975 (comment) we won't have time for this in .NET 6.

echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Mar 23, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 23, 2020

BruceForstall added this to the Future milestone Apr 4, 2020

BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2020

echesakov self-assigned this Sep 15, 2020

echesakov mentioned this issue Oct 20, 2020

[Arm64] Planned JIT work in .NET 6 #43629

Closed

29 tasks

echesakov closed this as completed Jul 8, 2021

ghost locked as resolved and limited conversation to collaborators Aug 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate unnecessary vector register moves #33975

Investigate unnecessary vector register moves #33975

echesakov commented Mar 23, 2020 •

edited by BruceForstall

Loading

tannergooding commented Mar 23, 2020

echesakov commented Mar 23, 2020

TamarChristinaArm commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020 •

edited

Loading

echesakov commented Mar 24, 2020

CarolEidt commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

CarolEidt commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

CarolEidt commented Aug 8, 2020

echesakov commented Aug 8, 2020

TamarChristinaArm commented Aug 10, 2020

echesakov commented Sep 15, 2020

echesakov commented Oct 6, 2020

TamarChristinaArm commented Oct 7, 2020

CarolEidt commented Oct 13, 2020

echesakov commented Oct 13, 2020

CarolEidt commented Oct 13, 2020

echesakov commented Jul 8, 2021

Investigate unnecessary vector register moves #33975

Investigate unnecessary vector register moves #33975

Comments

echesakov commented Mar 23, 2020 • edited by BruceForstall Loading

tannergooding commented Mar 23, 2020

echesakov commented Mar 23, 2020

TamarChristinaArm commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020 • edited Loading

echesakov commented Mar 24, 2020

CarolEidt commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

CarolEidt commented Mar 24, 2020

TamarChristinaArm commented Mar 24, 2020

CarolEidt commented Aug 8, 2020

echesakov commented Aug 8, 2020

TamarChristinaArm commented Aug 10, 2020

echesakov commented Sep 15, 2020

echesakov commented Oct 6, 2020

TamarChristinaArm commented Oct 7, 2020

CarolEidt commented Oct 13, 2020

echesakov commented Oct 13, 2020

CarolEidt commented Oct 13, 2020

echesakov commented Jul 8, 2021

echesakov commented Mar 23, 2020 •

edited by BruceForstall

Loading

TamarChristinaArm commented Mar 24, 2020 •

edited

Loading