Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate unnecessary vector register moves #33975

Closed
echesakov opened this issue Mar 23, 2020 · 20 comments
Closed

Investigate unnecessary vector register moves #33975

echesakov opened this issue Mar 23, 2020 · 20 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@echesakov
Copy link
Contributor

echesakov commented Mar 23, 2020

Copied from @TamarChristinaArm reply in #33749 (comment)

Also am I missing something? I see

        6E084509          mov     v9.d[0], v8.d[1]
...
        6E180528          mov     v8.d[1], v9.d[0]

With no actual usages of v8 or v9. and it does the same odd thing again later

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]
        6E18056A          mov     v10.d[1], v11.d[0]

v9 and v11 are untouched. it only uses v8 and v10.

category:cq
theme:hardware-intrinsics
skill-level:intermediate
cost:medium

@echesakov echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Mar 23, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 23, 2020
@tannergooding
Copy link
Member

Is v8 possibly a register that needs to be saved due to the method call?

@echesakov
Copy link
Contributor Author

Is v8 possibly a register that needs to be saved due to the method call?

It could be a case since even though it's a non-volatile register (to be precise, its low 64 bits are non-volatile) we still need to preserve its high 64 bits if they are actually used somewhere.

@TamarChristinaArm
Copy link
Contributor

Is v8 possibly a register that needs to be saved due to the method call?

It could be a case since even though it's a non-volatile register (to be precise, its low 64 bits are non-volatile) we still need to preserve its high 64 bits if they are actually used somewhere.

That's right, but in the second example the value is used before it's restored

        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E180528          mov     v8.d[1], v9.d[0]

so if the parent did clobber the top bits the result would be wrong. So looks like a dataflow bug..

Also I think the RA should avoid using v8-v15 as much as possible around function calls. This would save it from having to do the save and restore. In C code I think this situation would be rare.. but since there's a lot of software fallbacks you'll see this situation a lot more.

@TamarChristinaArm
Copy link
Contributor

or if the JIT know which registers are used in the function call it can safely not do the spill/restore at all. I don't know if it has that information for LRA?

@TamarChristinaArm
Copy link
Contributor

TamarChristinaArm commented Mar 24, 2020

Also what's the general preference here? stack space or performance? if stack space is a concern than using the ASIMD MOV is the best option (It's really an INS in this case), but if performance then you're better off just spilling them as a STR has a lower latency and just spill the entire thing. In this case the JIT could have done an STP as well and ended up with 1/4th of the overhead that it has now.

@echesakov
Copy link
Contributor Author

Also I think the RA should avoid using v8-v15 as much as possible around function calls. This would save it from having to do the save and restore. In C code I think this situation would be rare.. but since there's a lot of software fallbacks you'll see this situation a lot more.

I agree this could be an option, but not sure if this can be done easily. @CarolEidt probably knows.

or if the JIT know which registers are used in the function call it can safely not do the spill/restore at all. I don't know if it has that information for LRA?

I believe generally such information is not available to the JIT due to top-down compilation approach.
I remember that @erozenfeld looked in the past for doing such optimization in AOT and even there is was tricky due to some registers being used/trashed by stubs.

Also what's the general preference here? stack space or performance? if stack space is a concern than using the ASIMD MOV is the best option (It's really an INS in this case), but if performance then you're better off just spilling them as a STR has a lower latency and just spill the entire thing. In this case the JIT could have done an STP as well and ended up with 1/4th of the overhead that it has now.

I didn't know that spilling could be cheaper than register move. @BruceForstall @briansull Do you see any issues with JIT doing this instead of ASIMD MOV for volatile SIMD&FP registers around a function call?

@CarolEidt
Copy link
Contributor

I think the RA should avoid using v8-v15 as much as possible around function calls

So you're saying it would be better to save & restore the full vector, rather than doing the partial save & restore?

The use of v8 after the call, but before the restore, definitely looks like a bug.

@TamarChristinaArm
Copy link
Contributor

I didn't know that spilling could be cheaper than register move. @BruceForstall @briansull Do you see any issues with JIT doing this instead of ASIMD MOV for volatile SIMD&FP registers around a function call?

Actually I take that back, this is a bit nuanced.. while the stores can end up being cheaper as soon as you have more than 1 register to save. The loads are a lot more expensive.. so until you spill a LOT, the INS would be cheaper.

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

@CarolEidt
Copy link
Contributor

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

I believe that what it will do is spill it to a volatile register (since those are always available), and then store that upper half to memory. It might be reasonable to add support for spilling to memory if the callee-save registers are all occupied.

@TamarChristinaArm
Copy link
Contributor

But how does the JIT handle this case and high register pressure? I'm slightly worried that it'll spill another register so it can do the saving of the top part..

I believe that what it will do is spill it to a volatile register (since those are always available), and then store that upper half to memory. It might be reasonable to add support for spilling to memory if the callee-save registers are all occupied.

Ah ok, that sounds reasonable. Then I think if we can avoid using v8-v15 as much as possible then this case should be less of an issue. the INS look like the best option if there are free registers.

I suspect that once all the helper functions have some naive AArch64 code behind them that having a non-vector call in between vector code would be much rarer... unless they're vector function calls but that would ideally use the vector APCs anyway.

so the only bug here is the use before restore of v8.

@BruceForstall BruceForstall added this to the Future milestone Apr 4, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Apr 4, 2020
@CarolEidt
Copy link
Contributor

@echesakovMSFT @TamarChristinaArm - unless I'm looking at the wrong thing, it seems that we are no longer generating the same code for these BitArray methods, so the issue doesn't occur. Do we know of other cases where we see this kind of use-before-restore?

@echesakov
Copy link
Contributor Author

@CarolEidt We might not see it now because Vector128.Create was intrinsified at some point after the issue was opened.
I am wondering if it is still possible to repro the unnecessary moves from volatile to non-volatile registers by writing a method that has intrinsics usages interleaved with a call to a function (that won't be inlined).

@TamarChristinaArm
Copy link
Contributor

Yeah what @echesakovMSFT should work. Could probably increase the chances of getting this exact issue by increasing register pressure.

@echesakov
Copy link
Contributor Author

I can take a look at this

@echesakov echesakov self-assigned this Sep 15, 2020
@echesakov
Copy link
Contributor Author

I was trying to find instances of the "use-before-restore" issue and I couldn't find them even in the original PR #33749

The code that I quoted on the first message of the thread was copied from Tamar's comment #33749 (comment) and I believe corresponds to BitArray.CopyTo method. However, in the jitDump.txt file I collected back then that code looks correct:

        6E084509          mov     v9.d[0], v8.d[1]
        6E08454B          mov     v11.d[0], v10.d[1]
        97FFF096          bl      System.Runtime.Intrinsics.Vector128:<Create>g__SoftwareFallback|23_0(int):System.Runtime.Intrinsics.Vector128`1[Int32]
        4E003810          zip1    v16.16b, v0.16b, v0.16b
        4E103A10          zip1    v16.16b, v16.16b, v16.16b
        4E103A11          zip1    v17.16b, v16.16b, v16.16b
        6E180528          mov     v8.d[1], v9.d[0]
        4E281E31          and     v17.16b, v17.16b, v8.16b
        6E18056A          mov     v10.d[1], v11.d[0]

Upper 64 bits of v8 were restored before using in and instruction.

Unless, I misunderstood something, I don't think there exists such issue in the JIT.

@TamarChristinaArm @CarolEidt

@TamarChristinaArm
Copy link
Contributor

@echesakovMSFT hmm no I think I agree.. I don't know how that bit changed during copy and paste, but clearly it's not the same sequence as in the original comment you made. I usually copy them locally to make inline comments and something must have flipped..

So this looks like a bogus report.. Sorry for the noise, I'll double check these from now on to make sure.

@CarolEidt
Copy link
Contributor

@echesakovMSFT - should this now be closed?

@echesakov
Copy link
Contributor Author

@CarolEidt I was looking into another aspect of the issue - whether the mov-s are necessary or not, so I kept the issue open.
So far, I haven't seen anything that is unnecessary but I am thinking how to mitigate the mov-s of upper 64-bits in the jitted code around helpers. Perhaps, we can implement some of them in assembly and guarantee calling convention that doesn't alter any of the SIMD registers?

@CarolEidt
Copy link
Contributor

Perhaps, we can implement some of them in assembly and guarantee calling convention that doesn't alter any of the SIMD registers?

That sounds like a great idea!

@echesakov
Copy link
Contributor Author

Closing since the original issue was confirmed to be not an issue.
As for trying an idea in #33975 (comment) we won't have time for this in .NET 6.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

No branches or pull requests

6 participants