Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Fast Tail Call Optimization for ARM32 #66282

Merged
merged 13 commits into from
Mar 13, 2022

Conversation

clamp03
Copy link
Member

@clamp03 clamp03 commented Mar 7, 2022

  • Not use fast tail call when callee use split struct argument
  • Not use fast tail call when callee use non-standard calling convention
  • Not use fast tail call when it overwrites stack which will be passed to callee.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 7, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 7, 2022
@ghost
Copy link

ghost commented Mar 7, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details
  • Not use fast tail call when callee use split struct argument
  • Not use fast tail call when callee use non-standard calling convention
  • Not use fast tail call when it overwrites stack which will be passed to callee.
Author: clamp03
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@BruceForstall
Copy link
Member

@jakobbotsch PTAL
cc @dotnet/jit-contrib

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks like a good start! There are tests that you should enable in src/tests/issues.targets now (look under arm32/ctrl-f for "tail").

We have an ABI stress test that among other things tests tailcalls, but it is disabled on ARM32 for an unrelated reason right now. The way it works is that it generates a lot of callers that do tailcalls with random permutations of arguments/values. When running it I see failures with this PR. You can build the test with the following (adjusting to Linux if you are on Linux):

./dotnet.cmd build .\src\tests\JIT\Stress\ABI\tailcalls_do.csproj -c Release

This should print the path of the built .dll to the console (note that the .dll might be put under a non-arm32 directory, but it works fine on arm32). After that you can run it on arm32 with an arm32 corerun. To test tailcalls you should invoke it as follows:

<path to corerun> <path to tailcalls_do.dll> --tailcalls --num-calls 10000

For me this gives:

Mismatch in tailcall: expected 1164956936, got 1917288685
Int32 ABIStress_TailCaller276(S7U, S9U, System.Numerics.Vector`1[System.Int32], Int16, S16U, Int32, S8P, S4P, S13U, S4P, Byte, S4U, S15U, S4U, S1P, Single, S12U, Byte, S14U, S15U, S8P)
Int32 ABIStress_TailCallee7876(S9U, Int64)

To drill into that you can invoke it with the --caller-index arg, e.g.:

$env:COMPlus_JitDisasm="ABIStress_TailCaller276"
<path to corerun> <path to tailcalls_do.dll> --tailcalls --caller-index 276

which gives

Stressing tailcalls
OSVersion: Microsoft Windows NT 10.0.22000.0
OSArchitecture: Arm64
ProcessArchitecture: Arm
Selecting win86 ABI
; Assembly listing for method DynamicClass:ABIStress_TailCaller276(ABIStress.S7U,ABIStress.S9U,System.Numerics.Vector`1[Int32],short,ABIStress.S16U,int,ABIStress.S8P,ABIStress.S4P,ABIStress.S13U,ABIStress.S4P,ubyte,ABIStress.S4U,ABIStress.S15U,ABIStress.S4U,ABIStress.S1P,float,ABIStress.S12U,ubyte,ABIStress.S14U,ABIStress.S15U,ABIStress.S8P):int
; Emitting BLENDED_CODE for generic ARM CPU - Windows
; optimized code
; sp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    do-not-enreg[S] single-def
;  V01 arg1         [V01,T00] (  5,  5   )  struct (12) [sp+10H]   do-not-enreg[SFA] multireg-arg single-def
;* V02 arg2         [V02    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def double-align
;* V03 arg3         [V03    ] (  0,  0   )   short  ->  zero-ref    single-def
;* V04 arg4         [V04    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def
;* V05 arg5         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def
;* V06 arg6         [V06    ] (  0,  0   )  struct ( 8) zero-ref    single-def double-align
;* V07 arg7         [V07    ] (  0,  0   )  struct ( 4) zero-ref    single-def
;* V08 arg8         [V08    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def
;* V09 arg9         [V09    ] (  0,  0   )  struct ( 4) zero-ref    single-def
;* V10 arg10        [V10    ] (  0,  0   )   ubyte  ->  zero-ref    single-def
;* V11 arg11        [V11    ] (  0,  0   )  struct ( 4) zero-ref    do-not-enreg[S] single-def
;* V12 arg12        [V12    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def
;* V13 arg13        [V13    ] (  0,  0   )  struct ( 4) zero-ref    do-not-enreg[S] single-def
;* V14 arg14        [V14    ] (  0,  0   )  struct ( 4) zero-ref    single-def
;* V15 arg15        [V15    ] (  0,  0   )   float  ->  zero-ref    single-def
;* V16 arg16        [V16    ] (  0,  0   )  struct (12) zero-ref    do-not-enreg[S] single-def
;* V17 arg17        [V17    ] (  0,  0   )   ubyte  ->  zero-ref    single-def
;* V18 arg18        [V18    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def
;* V19 arg19        [V19    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[S] single-def
;  V20 arg20        [V20    ] (  2,  2   )  struct ( 8) [sp+B8H]   single-def double-align
;# V21 OutArgs      [V21    ] (  1,  1   )  lclBlk ( 0) [sp+00H]   "OutgoingArgSpace"
;* V22 tmp1         [V22    ] (  0,  0   )    long  ->  zero-ref    do-not-enreg[] single-def V06.F0(offs=0x00) P-DEP "field V06.F0 (fldOffset=0x0)"
;* V23 tmp2         [V23    ] (  0,  0   )     int  ->  zero-ref    do-not-enreg[] single-def V07.F0(offs=0x00) P-DEP "field V07.F0 (fldOffset=0x0)"
;* V24 tmp3         [V24    ] (  0,  0   )     int  ->  zero-ref    do-not-enreg[] single-def V09.F0(offs=0x00) P-DEP "field V09.F0 (fldOffset=0x0)"
;* V25 tmp4         [V25    ] (  0,  0   )   ubyte  ->  zero-ref    do-not-enreg[] single-def V14.F0(offs=0x00) P-DEP "field V14.F0 (fldOffset=0x0)"
;  V26 tmp5         [V26,T01] (  2,  2   )    long  ->  [sp+B8H]   do-not-enreg[] single-def V20.F0(offs=0x00) P-DEP "field V20.F0 (fldOffset=0x0)"
;
; Lcl frame size = 4

G_M379_IG01:              ;; offset=0000H
000000  B40F           push    {r0,r1,r2,r3}
000002  B508           push    {r3,lr}
                                                ;; bbWeight=1    PerfScore 2.00
G_M379_IG02:              ;; offset=0004H
000004  982E           ldr     r0, [sp+0xb8]
000006  992F           ldr     r1, [sp+0xbc]
000008  9006           str     r0, [sp+0x18]
00000A  9107           str     r1, [sp+0x1c]
00000C  9804           ldr     r0, [sp+0x10]
00000E  9905           ldr     r1, [sp+0x14]
000010  9A06           ldr     r2, [sp+0x18]
000012  F64C 33AD      movw    r3, 0xcbad
000016  F6C0 5356      movt    r3, 0xd56
                                                ;; bbWeight=1    PerfScore 9.00
G_M379_IG03:              ;; offset=001AH
00001A  B001           add     sp, 4
00001C  F85D EB04      pop     lr
000020  B004           add     sp, 16
000022  4718           bx      r3               // DynamicClass:ABIStress_TailCallee7876(ABIStress.S9U,long):int
                                                ;; bbWeight=1    PerfScore 4.00

; Total bytes of code 36, prolog size 4, PerfScore 18.60, instruction count 15, allocated bytes for code 36 (MethodHash=425efe84) for method DynamicClass:ABIStress_TailCaller276(ABIStress.S7U,ABIStress.S9U,System.Numerics.Vector`1[Int32],short,ABIStress.S16U,int,ABIStress.S8P,ABIStress.S4P,ABIStress.S13U,ABIStress.S4P,ubyte,ABIStress.S4U,ABIStress.S15U,ABIStress.S4U,ABIStress.S1P,float,ABIStress.S12U,ubyte,ABIStress.S14U,ABIStress.S15U,ABIStress.S8P):int
; ============================================================

Mismatch in tailcall: expected 1528035826, got -949328163
Int32 ABIStress_TailCaller276(S7U, S9U, System.Numerics.Vector`1[System.Int32], Int16, S16U, Int32, S8P, S4P, S13U, S4P, Byte, S4U, S15U, S4U, S1P, Single, S12U, Byte, S14U, S15U, S8P)
Int32 ABIStress_TailCallee7876(S9U, Int64)
1 tailcalls tested
  Done with 1 mismatches

There is also the --verbose arg that makes it print values of all args that were passed through the calls, which can be helpful.

If you look at the jit dump ($env:COMPlus_JitDump="ABIStress_TailCaller276") you can see the IL at the top:

IL_0000  fe 09 01 00       ldarg        0x1
IL_0004  fe 09 14 00       ldarg        0x14
IL_0008  7b 02 00 00 04    ldfld        0x4000002
IL_000d  fe 14             tail.       
IL_000f  28 03 00 00 0a    call         0xA000003
IL_0014  2a                ret         

So the code passes the second parameter and a field of the 21st parameter to a tailcall. It looks like during the tailcall it places the second argument too early which overrides the first parameter before it is used. We handle these cases in LowerFastTailCall so this is where I would start to look (and once it works you can remove the restrictions you added in fgCanFastTailCall).

src/coreclr/jit/codegenarmarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/codegenlinear.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lsraarmarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/morph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/morph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/morph.cpp Outdated Show resolved Hide resolved
@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 7, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Mar 7, 2022
@clamp03
Copy link
Member Author

clamp03 commented Mar 8, 2022

@jakobbotsch Wow!! Thank you for the comments.

I already checked the tailcall_do (and more) test before the push.
It doesn't fail in my test environment (raspberry pi 4 arm32 / linux).

$ ./tailcalls_do.sh 
BEGIN EXECUTION
/home/dheon/data/tests/coreclr/Linux.arm.Release/Tests/Core_Root//corerun -p System.Reflection.Metadata.MetadataUpdater.IsSupported=false tailcalls_do.dll '--tailcalls' '--num-calls' '1000' '--no-ctrlc-summary'
Stressing tailcalls
OSVersion: Unix 5.4.51.7
OSArchitecture: Arm
ProcessArchitecture: Arm
Selecting armhf ABI.
50 callers done (50 successful tailcalls tested)
100 callers done (100 successful tailcalls tested)
150 callers done (150 successful tailcalls tested)
200 callers done (200 successful tailcalls tested)
250 callers done (250 successful tailcalls tested)
300 callers done (300 successful tailcalls tested)
350 callers done (350 successful tailcalls tested)
400 callers done (400 successful tailcalls tested)
450 callers done (450 successful tailcalls tested)
500 callers done (500 successful tailcalls tested)
550 callers done (550 successful tailcalls tested)
600 callers done (599 successful tailcalls tested)
650 callers done (650 successful tailcalls tested)
700 callers done (699 successful tailcalls tested)
750 callers done (750 successful tailcalls tested)
800 callers done (800 successful tailcalls tested)
850 callers done (850 successful tailcalls tested)
900 callers done (900 successful tailcalls tested)
950 callers done (949 successful tailcalls tested)
1000 callers done (1000 successful tailcalls tested)
1000 tailcalls tested
  Done with 0 mismatches
Expected: 100
Actual: 100
END EXECUTION - PASSED

In my guess, it fails in arm32 windows.

I will update one-by-one based on your comments.
Thank you so much!!!

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 8, 2022
@jakobbotsch
Copy link
Member

jakobbotsch commented Mar 8, 2022

It doesn't fail in my test environment (raspberry pi 4 arm32 / linux).

I can see than on Windows we end up selecting win86 ABI. It means the permutations tested are different.
You can try to hardcode this choice (just return it from SelectAbi() always) or you can try to run with a much larger --num-calls. I saw multiple failures when I increased it so I would expect you to hit one as well if you run it for long enough.

I believe the win-arm32 and linux-arm32 ABIs are practically the same so I would not expect only one of them to have a bug.

@clamp03
Copy link
Member Author

clamp03 commented Mar 8, 2022

It doesn't fail in my test environment (raspberry pi 4 arm32 / linux).

I can see than on Windows we end up selecting win86 ABI. It means the permutations tested are different. You can try to hardcode this choice (just return it from SelectAbi() always) or you can try to run with a much larger --num-calls. I saw multiple failures when I increased it so I would expect you to hit one as well if you run it for long enough.

I believe the win-arm32 and linux-arm32 ABIs are practically the same so I would not expect only one of them to have a bug.

I will check with a much larger --num-calls.
Thank you.

@clamp03
Copy link
Member Author

clamp03 commented Mar 8, 2022

I checked with 10,000 --num-calls. I will test with 100,000 --num-calls tonight. :)
Thank you.

@jakobbotsch
Copy link
Member

jakobbotsch commented Mar 8, 2022

I took another look at this and I am more and more confused about the reliance on the pre spill mask, especially considering it changes depending on many factors (e.g. if profiler is attached). Also, it is hard for me to have confidence in the new ARM32 logic given the many calculations it makes with frame sizes/alignment that can change based on subtle factors.

My understanding is that the ARM32 support of this feature is more difficult than other platforms if we want to support it when the caller stack frame has a parameter split between registers and stack. The reason being that the existing logic assumes that there is a single parameter that can be used as the "base" local for the incoming stack area.

Have you considered disallowing tailcalls out of these functions? I see that this PR already disallows tailcalls into such functions, so hopefully it will not be a significant regression to also disallow tailcalls out of them based on the same reasoning.

If you disallow this I believe the existing logic that is there already will pretty much just work. Then existing logic in the backend of the JIT handles calculating the proper offsets and we do not have to deal with complicated frame size/alignment logic in the tail call logic. Does it sound reasonable?

We do not have an existing easy way to check if the caller has a split parameter, but it should not be too hard to add that information when we go through the user args like this: jakobbotsch@21a43ab
Then you can check this flag in fgCanFastTailCall.

@clamp03
Copy link
Member Author

clamp03 commented Mar 10, 2022

@jakobbotsch Sorry for late response. I was OOO.

If you disallow this I believe the existing logic that is there already will pretty much just work. Then existing logic in the backend of the JIT handles calculating the proper offsets and we do not have to deal with complicated frame size/alignment logic in the tail call logic. Does it sound reasonable?

Thank you so much. It is too hard to support all complex cases. I agree with you!
I will fix and push again.

@clamp03
Copy link
Member Author

clamp03 commented Mar 10, 2022

Tested tailcall_d and tailcall_do with 100,000 num-calls on linux/arm32 and passed all.

@jakobbotsch
Copy link
Member

It looks great to me now, but looks like there are some profiler-related tests that are failing, can you take a look?

@clamp03
Copy link
Member Author

clamp03 commented Mar 10, 2022

Sure!! :)

- Not use fast tail call when callee use split struct argument
- Not use fast tail call when callee use non-standard calling convention
- Not use fast tail call when it overwrites stack which will be passed to callee.
@jakobbotsch
Copy link
Member

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member

@clamp03 I have pushed a commit to your branch that fixes the stress failures. The problem was that the GS security cookie check uses the r12 register in the epilog, so if we allocated the call target to this register it would cause problems.

Since r12 is the only non-parameter volatile register there is no possible register to use when the function does a GS security cookie check if the callee then has 4 parameters passed in registers. Therefore I have disabled fast tailcalls out of functions with GS security cookie checks.

@jakobbotsch
Copy link
Member

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Copy link
Member

Failures in libraries test runs (x64/arm64) are both #66540. Jitstress failures are all #66174. Outerloop failures are #63858, #66368 and #66174. Looks like this is good to go, thank you!

@jakobbotsch jakobbotsch merged commit 51b90cc into dotnet:main Mar 13, 2022
@clamp03
Copy link
Member Author

clamp03 commented Mar 13, 2022

@jakobbotsch Thank you so much!!!

@clamp03
Copy link
Member Author

clamp03 commented Mar 17, 2022

@jakobbotsch
In our company, we will release our products with .NET 6 on 2022.
I found that there are many patches are merged to improve JIT for .NET 7.
Actually, I hope our products also contains these good patches as much as possible. :)
Do you have any plan to backport some patches to .NET 6?
I am really appreciated about hard work of you and your team.
Thank you.

@jakobbotsch
Copy link
Member

@clamp03 We generally do not backport performance improvements, only fixes to bugs that were hit in real customer scenarios. So we do not plan to backport those improvements to .NET 6.

@clamp03
Copy link
Member Author

clamp03 commented Mar 17, 2022

@jakobbotsch It is okay. Thank you!!!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants