-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Improve call args interference checking when stores are involved #97409
JIT: Improve call args interference checking when stores are involved #97409
Conversation
- Fix the propagation of `GTF_ASG` during call args morphing - Introduce `Compiler::gtHaveStoreInterference` that can check whether two trees interfere with each other due to a store in one tree that stores to a local read by the other tree - Use the new helper when checking for whether we should reverse `GTF_STOREIND` nodes - Use the new helper when deciding whether previous args need to be evaluated to temps because we see an argument with an embedded store (typically a call). Fix dotnet#13758
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue Details
Fix #13758
|
This comment was marked as outdated.
This comment was marked as outdated.
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 3 pipeline(s). |
jitstress failures are #97437 |
cc @dotnet/jit-contrib PTAL @AndyAyersMS Diffs. Wins across the board from enabling this optimization and also doing it in MinOpts. Initially I just wanted to fix #13758. However, fixing the propagation causes several regressions in places that are using
and switched them to more precise checking. The first seems to be by far the most impactful. I've looked at most of the benchmarks.run_pgo FullOpts regressions. Almost all of them are because the removal of these temps means less live range splitting for LSRA. This doesn't usually result in new spills, but it does mean we sometimes pick registers that take more space to encode, or where we end up needing another reg-reg mov. For example, most regressions are similar in spirit to the following: ; Assembly listing for method Microsoft.Cci.MetadataWriter:GetTypeAttributes(Microsoft.Cci.ITypeDefinition):int:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; Final local variable assignments
;
; V00 this [V00,T00] ( 3, 3 ) ref -> rcx this class-hnd single-def <Microsoft.Cci.MetadataWriter>
-; V01 arg1 [V01,T01] ( 3, 3 ) ref -> rdx class-hnd single-def <Microsoft.Cci.ITypeDefinition>
+; V01 arg1 [V01,T01] ( 3, 3 ) ref -> rax class-hnd single-def <Microsoft.Cci.ITypeDefinition>
; V02 OutArgs [V02 ] ( 1, 1 ) struct (32) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
; V03 tmp1 [V03 ] ( 2, 4 ) struct (48) [rsp+0x28] do-not-enreg[XS] addr-exposed "by-value struct argument" <Microsoft.CodeAnalysis.Emit.EmitContext>
-; V04 tmp2 [V04,T02] ( 2, 4 ) ref -> rdx single-def "argument with side effect"
;
; Lcl frame size = 88
G_M23091_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
sub rsp, 88
vzeroupper
- ;; size=7 bbWeight=1 PerfScore 1.25
-G_M23091_IG02: ; bbWeight=1, gcrefRegs=0006 {rcx rdx}, byrefRegs=0000 {}, byref, nogc
- ; gcrRegs +[rcx rdx]
+ mov rax, rdx
+ ; gcrRegs +[rax]
+ ;; size=10 bbWeight=1 PerfScore 1.50
+G_M23091_IG02: ; bbWeight=1, gcrefRegs=0003 {rax rcx}, byrefRegs=0000 {}, byref, nogc
+ ; gcrRegs +[rcx]
vmovdqu ymm0, ymmword ptr [rcx+0xD0]
vmovdqu ymmword ptr [rsp+0x28], ymm0
vmovdqu xmm0, xmmword ptr [rcx+0xF0]
vmovdqu xmmword ptr [rsp+0x48], xmm0
;; size=28 bbWeight=1 PerfScore 11.00
G_M23091_IG03: ; bbWeight=1, extend
- mov rcx, rdx
lea rdx, [rsp+0x28]
- ; gcrRegs -[rdx]
+ mov rcx, rax
call [<unknown method>]
- ; gcrRegs -[rcx]
+ ; gcrRegs -[rax rcx]
; gcr arg pop 0
nop
;; size=15 bbWeight=1 PerfScore 4.00
G_M23091_IG04: ; bbWeight=1, epilog, nogc, extend
add rsp, 88
ret
;; size=5 bbWeight=1 PerfScore 1.25
-; Total bytes of code 55, prolog size 7, PerfScore 17.50, instruction count 12, allocated bytes for code 55 (MethodHash=af30a5cc) for method Microsoft.Cci.MetadataWriter:GetTypeAttributes(Microsoft.Cci.ITypeDefinition):int:this (Tier1)
+; Total bytes of code 58, prolog size 7, PerfScore 17.75, instruction count 13, allocated bytes for code 58 (MethodHash=af30a5cc) for method Microsoft.Cci.MetadataWriter:GetTypeAttributes(Microsoft.Cci.ITypeDefinition):int:this (Tier1)
; ============================================================ which results in the one additional reg-reg mov. |
@@ -2148,7 +2139,7 @@ bool Compiler::optRedundantRelop(BasicBlock* const block) | |||
|
|||
for (unsigned int i = 0; i < definedLocalsCount; i++) | |||
{ | |||
if (gtHasRef(prevTreeData, definedLocals[i])) | |||
if (gtTreeHasLocalRead(prevTreeData, definedLocals[i])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a small bug fix here -- gtHasRef
does not take promotion into account.
Diff results for #97409Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 2,501,156 contexts (1,003,806 MinOpts, 1,497,350 FullOpts). MISSED contexts: base: 4,060 (0.16%), diff: 4,061 (0.16%) Overall (-1,073,216 bytes)
MinOpts (-911,184 bytes)
FullOpts (-162,032 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,595,003 contexts (1,052,329 MinOpts, 1,542,674 FullOpts). MISSED contexts: base: 3,628 (0.14%), diff: 3,632 (0.14%) Overall (-1,002,204 bytes)
MinOpts (-727,753 bytes)
FullOpts (-274,451 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 2,262,707 contexts (930,876 MinOpts, 1,331,831 FullOpts). MISSED contexts: base: 3,256 (0.14%), diff: 3,258 (0.14%) Overall (-803,772 bytes)
MinOpts (-652,688 bytes)
FullOpts (-151,084 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 2,318,205 contexts (931,543 MinOpts, 1,386,662 FullOpts). MISSED contexts: base: 2,687 (0.12%), diff: 2,689 (0.12%) Overall (-780,956 bytes)
MinOpts (-643,252 bytes)
FullOpts (-137,704 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,492,895 contexts (983,689 MinOpts, 1,509,206 FullOpts). MISSED contexts: base: 3,899 (0.16%), diff: 3,916 (0.16%) Overall (-2,840,990 bytes)
MinOpts (-2,692,403 bytes)
FullOpts (-148,587 bytes)
Details here Assembly diffs for linux/arm ran on windows/x86Diffs are based on 2,237,690 contexts (827,812 MinOpts, 1,409,878 FullOpts). MISSED contexts: 74,588 (3.23%) Overall (-277,852 bytes)
MinOpts (-137,640 bytes)
FullOpts (-140,212 bytes)
Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,296,119 contexts (841,817 MinOpts, 1,454,302 FullOpts). MISSED contexts: base: 5,093 (0.22%), diff: 5,251 (0.23%) Overall (-182,725 bytes)
MinOpts (-105,402 bytes)
FullOpts (-77,323 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.17% to -0.02%)
MinOpts (-0.36% to +0.01%)
FullOpts (-0.13% to -0.02%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.06% to -0.01%)
MinOpts (-0.09% to +0.01%)
FullOpts (-0.06% to -0.02%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.13% to -0.04%)
MinOpts (-0.38% to +0.01%)
FullOpts (-0.13% to -0.02%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.13% to -0.02%)
MinOpts (-0.38% to +0.01%)
FullOpts (-0.13% to -0.02%)
Throughput diffs for windows/x64 ran on windows/x64Overall (-0.31% to -0.08%)
MinOpts (-0.45% to +0.01%)
FullOpts (-0.31% to -0.06%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (-0.03% to +0.02%)
MinOpts (-0.03% to +0.02%)
FullOpts (-0.03% to +0.01%)
Throughput diffs for windows/x86 ran on windows/x86Overall (-0.01% to +0.03%)
MinOpts (-0.03% to +0.03%)
FullOpts (-0.01% to +0.03%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.17% to -0.02%)
MinOpts (-0.36% to +0.00%)
FullOpts (-0.14% to -0.02%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.07% to -0.01%)
MinOpts (-0.10% to +0.00%)
FullOpts (-0.07% to -0.02%)
Details here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good.
I wonder if it would be more efficient to generalize this to sets of locals? You could have the same kind of cap where if the sets grow too large you just bail out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to approve, comments above are for possible follow up.
Seems likely it would be, e.g. I could factor |
GTF_ASG
during call args morphingCompiler::gtMayHaveStoreInterference
that can check whether two trees interfere with each other due to a store in one tree that stores to a local read by the other treeGT_STOREIND
nodesoptRedundantRelop
Fix #13758