-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Fast Tail Call Optimization for ARM32 #66282
Conversation
clamp03
commented
Mar 7, 2022
- Not use fast tail call when callee use split struct argument
- Not use fast tail call when callee use non-standard calling convention
- Not use fast tail call when it overwrites stack which will be passed to callee.
Tagging subscribers to this area: @JulieLeeMSFT Issue Details
|
@jakobbotsch PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks like a good start! There are tests that you should enable in src/tests/issues.targets
now (look under arm32/ctrl-f for "tail").
We have an ABI stress test that among other things tests tailcalls, but it is disabled on ARM32 for an unrelated reason right now. The way it works is that it generates a lot of callers that do tailcalls with random permutations of arguments/values. When running it I see failures with this PR. You can build the test with the following (adjusting to Linux if you are on Linux):
./dotnet.cmd build .\src\tests\JIT\Stress\ABI\tailcalls_do.csproj -c Release
This should print the path of the built .dll to the console (note that the .dll might be put under a non-arm32 directory, but it works fine on arm32). After that you can run it on arm32 with an arm32 corerun. To test tailcalls you should invoke it as follows:
<path to corerun> <path to tailcalls_do.dll> --tailcalls --num-calls 10000
For me this gives:
Mismatch in tailcall: expected 1164956936, got 1917288685
Int32 ABIStress_TailCaller276(S7U, S9U, System.Numerics.Vector`1[System.Int32], Int16, S16U, Int32, S8P, S4P, S13U, S4P, Byte, S4U, S15U, S4U, S1P, Single, S12U, Byte, S14U, S15U, S8P)
Int32 ABIStress_TailCallee7876(S9U, Int64)
To drill into that you can invoke it with the --caller-index
arg, e.g.:
$env:COMPlus_JitDisasm="ABIStress_TailCaller276"
<path to corerun> <path to tailcalls_do.dll> --tailcalls --caller-index 276
which gives
Stressing tailcalls
OSVersion: Microsoft Windows NT 10.0.22000.0
OSArchitecture: Arm64
ProcessArchitecture: Arm
Selecting win86 ABI
; Assembly listing for method DynamicClass:ABIStress_TailCaller276(ABIStress.S7U,ABIStress.S9U,System.Numerics.Vector`1[Int32],short,ABIStress.S16U,int,ABIStress.S8P,ABIStress.S4P,ABIStress.S13U,ABIStress.S4P,ubyte,ABIStress.S4U,ABIStress.S15U,ABIStress.S4U,ABIStress.S1P,float,ABIStress.S12U,ubyte,ABIStress.S14U,ABIStress.S15U,ABIStress.S8P):int
; Emitting BLENDED_CODE for generic ARM CPU - Windows
; optimized code
; sp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) struct ( 8) zero-ref do-not-enreg[S] single-def
; V01 arg1 [V01,T00] ( 5, 5 ) struct (12) [sp+10H] do-not-enreg[SFA] multireg-arg single-def
;* V02 arg2 [V02 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def double-align
;* V03 arg3 [V03 ] ( 0, 0 ) short -> zero-ref single-def
;* V04 arg4 [V04 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def
;* V05 arg5 [V05 ] ( 0, 0 ) int -> zero-ref single-def
;* V06 arg6 [V06 ] ( 0, 0 ) struct ( 8) zero-ref single-def double-align
;* V07 arg7 [V07 ] ( 0, 0 ) struct ( 4) zero-ref single-def
;* V08 arg8 [V08 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def
;* V09 arg9 [V09 ] ( 0, 0 ) struct ( 4) zero-ref single-def
;* V10 arg10 [V10 ] ( 0, 0 ) ubyte -> zero-ref single-def
;* V11 arg11 [V11 ] ( 0, 0 ) struct ( 4) zero-ref do-not-enreg[S] single-def
;* V12 arg12 [V12 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def
;* V13 arg13 [V13 ] ( 0, 0 ) struct ( 4) zero-ref do-not-enreg[S] single-def
;* V14 arg14 [V14 ] ( 0, 0 ) struct ( 4) zero-ref single-def
;* V15 arg15 [V15 ] ( 0, 0 ) float -> zero-ref single-def
;* V16 arg16 [V16 ] ( 0, 0 ) struct (12) zero-ref do-not-enreg[S] single-def
;* V17 arg17 [V17 ] ( 0, 0 ) ubyte -> zero-ref single-def
;* V18 arg18 [V18 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def
;* V19 arg19 [V19 ] ( 0, 0 ) struct (16) zero-ref do-not-enreg[S] single-def
; V20 arg20 [V20 ] ( 2, 2 ) struct ( 8) [sp+B8H] single-def double-align
;# V21 OutArgs [V21 ] ( 1, 1 ) lclBlk ( 0) [sp+00H] "OutgoingArgSpace"
;* V22 tmp1 [V22 ] ( 0, 0 ) long -> zero-ref do-not-enreg[] single-def V06.F0(offs=0x00) P-DEP "field V06.F0 (fldOffset=0x0)"
;* V23 tmp2 [V23 ] ( 0, 0 ) int -> zero-ref do-not-enreg[] single-def V07.F0(offs=0x00) P-DEP "field V07.F0 (fldOffset=0x0)"
;* V24 tmp3 [V24 ] ( 0, 0 ) int -> zero-ref do-not-enreg[] single-def V09.F0(offs=0x00) P-DEP "field V09.F0 (fldOffset=0x0)"
;* V25 tmp4 [V25 ] ( 0, 0 ) ubyte -> zero-ref do-not-enreg[] single-def V14.F0(offs=0x00) P-DEP "field V14.F0 (fldOffset=0x0)"
; V26 tmp5 [V26,T01] ( 2, 2 ) long -> [sp+B8H] do-not-enreg[] single-def V20.F0(offs=0x00) P-DEP "field V20.F0 (fldOffset=0x0)"
;
; Lcl frame size = 4
G_M379_IG01: ;; offset=0000H
000000 B40F push {r0,r1,r2,r3}
000002 B508 push {r3,lr}
;; bbWeight=1 PerfScore 2.00
G_M379_IG02: ;; offset=0004H
000004 982E ldr r0, [sp+0xb8]
000006 992F ldr r1, [sp+0xbc]
000008 9006 str r0, [sp+0x18]
00000A 9107 str r1, [sp+0x1c]
00000C 9804 ldr r0, [sp+0x10]
00000E 9905 ldr r1, [sp+0x14]
000010 9A06 ldr r2, [sp+0x18]
000012 F64C 33AD movw r3, 0xcbad
000016 F6C0 5356 movt r3, 0xd56
;; bbWeight=1 PerfScore 9.00
G_M379_IG03: ;; offset=001AH
00001A B001 add sp, 4
00001C F85D EB04 pop lr
000020 B004 add sp, 16
000022 4718 bx r3 // DynamicClass:ABIStress_TailCallee7876(ABIStress.S9U,long):int
;; bbWeight=1 PerfScore 4.00
; Total bytes of code 36, prolog size 4, PerfScore 18.60, instruction count 15, allocated bytes for code 36 (MethodHash=425efe84) for method DynamicClass:ABIStress_TailCaller276(ABIStress.S7U,ABIStress.S9U,System.Numerics.Vector`1[Int32],short,ABIStress.S16U,int,ABIStress.S8P,ABIStress.S4P,ABIStress.S13U,ABIStress.S4P,ubyte,ABIStress.S4U,ABIStress.S15U,ABIStress.S4U,ABIStress.S1P,float,ABIStress.S12U,ubyte,ABIStress.S14U,ABIStress.S15U,ABIStress.S8P):int
; ============================================================
Mismatch in tailcall: expected 1528035826, got -949328163
Int32 ABIStress_TailCaller276(S7U, S9U, System.Numerics.Vector`1[System.Int32], Int16, S16U, Int32, S8P, S4P, S13U, S4P, Byte, S4U, S15U, S4U, S1P, Single, S12U, Byte, S14U, S15U, S8P)
Int32 ABIStress_TailCallee7876(S9U, Int64)
1 tailcalls tested
Done with 1 mismatches
There is also the --verbose
arg that makes it print values of all args that were passed through the calls, which can be helpful.
If you look at the jit dump ($env:COMPlus_JitDump="ABIStress_TailCaller276"
) you can see the IL at the top:
IL_0000 fe 09 01 00 ldarg 0x1
IL_0004 fe 09 14 00 ldarg 0x14
IL_0008 7b 02 00 00 04 ldfld 0x4000002
IL_000d fe 14 tail.
IL_000f 28 03 00 00 0a call 0xA000003
IL_0014 2a ret
So the code passes the second parameter and a field of the 21st parameter to a tailcall. It looks like during the tailcall it places the second argument too early which overrides the first parameter before it is used. We handle these cases in LowerFastTailCall
so this is where I would start to look (and once it works you can remove the restrictions you added in fgCanFastTailCall
).
@jakobbotsch Wow!! Thank you for the comments. I already checked the tailcall_do (and more) test before the push.
In my guess, it fails in arm32 windows. I will update one-by-one based on your comments. |
I can see than on Windows we end up selecting win86 ABI. It means the permutations tested are different. I believe the win-arm32 and linux-arm32 ABIs are practically the same so I would not expect only one of them to have a bug. |
I will check with a much larger |
I checked with 10,000 |
I took another look at this and I am more and more confused about the reliance on the pre spill mask, especially considering it changes depending on many factors (e.g. if profiler is attached). Also, it is hard for me to have confidence in the new ARM32 logic given the many calculations it makes with frame sizes/alignment that can change based on subtle factors. My understanding is that the ARM32 support of this feature is more difficult than other platforms if we want to support it when the caller stack frame has a parameter split between registers and stack. The reason being that the existing logic assumes that there is a single parameter that can be used as the "base" local for the incoming stack area. Have you considered disallowing tailcalls out of these functions? I see that this PR already disallows tailcalls into such functions, so hopefully it will not be a significant regression to also disallow tailcalls out of them based on the same reasoning. If you disallow this I believe the existing logic that is there already will pretty much just work. Then existing logic in the backend of the JIT handles calculating the proper offsets and we do not have to deal with complicated frame size/alignment logic in the tail call logic. Does it sound reasonable? We do not have an existing easy way to check if the caller has a split parameter, but it should not be too hard to add that information when we go through the user args like this: jakobbotsch@21a43ab |
@jakobbotsch Sorry for late response. I was OOO.
Thank you so much. It is too hard to support all complex cases. I agree with you! |
Tested tailcall_d and tailcall_do with 100,000 num-calls on linux/arm32 and passed all. |
It looks great to me now, but looks like there are some profiler-related tests that are failing, can you take a look? |
Sure!! :) |
- Not use fast tail call when callee use split struct argument - Not use fast tail call when callee use non-standard calling convention - Not use fast tail call when it overwrites stack which will be passed to callee.
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
@clamp03 I have pushed a commit to your branch that fixes the stress failures. The problem was that the GS security cookie check uses the Since |
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
@jakobbotsch Thank you so much!!! |
@jakobbotsch |
@clamp03 We generally do not backport performance improvements, only fixes to bugs that were hit in real customer scenarios. So we do not plan to backport those improvements to .NET 6. |
@jakobbotsch It is okay. Thank you!!! |