-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimization to remove redundant zero initializations. #36918
Conversation
x64 framework pmi diffs:
|
x64 benchmarks pmi diffs:
|
Example where the optimization removed a prolog zero initialization:
G_M5523_IG01:
push rbp
sub rsp, 48
lea rbp, [rsp+30H]
- xor rax, rax
- mov qword ptr [rbp-08H], rax
G_M5523_IG02:
mov r8, bword ptr [rdx]
mov bword ptr [rbp-08H], r8
mov qword ptr [rbp-10H], r8
mov r8d, dword ptr [rdx+8]
mov rdx, qword ptr [rbp-10H]
call Kernel32:GetEnvironmentVariable(String,long,int):int
nop
G_M5523_IG03:
lea rsp, [rbp]
pop rbp
ret
-; Total bytes of code 47, prolog size 16, PerfScore 17.95, (MethodHash=18b8ea6c) for method Kernel32:GetEnvironmentVariable(String,Span`1):int
+; Total bytes of code 41, prolog size 10, PerfScore 16.10, (MethodHash=18b8ea6c) for method Kernel32:GetEnvironmentVariable(String,Span`1):int
|
Example where the optimization removed a prolog zero initialization for one variable and and an explicit zero initialization for another variable:
G_M43056_IG01:
push rbp
sub rsp, 64
lea rbp, [rsp+40H]
xor rax, rax
mov qword ptr [rbp-08H], rax
- mov qword ptr [rbp-10H], rax
G_M43056_IG02:
- xor r8, r8
- mov gword ptr [rbp-08H], r8
mov gword ptr [rbp-10H], rcx
lea rcx, [rbp-10H]
mov r8, gword ptr [rbp-10H]
mov r8, qword ptr [r8+32]
lea rax, [rbp-08H]
lea r9, bword ptr [rbp-20H]
mov qword ptr [r9], rcx
mov qword ptr [r9+8], r8
lea rcx, bword ptr [rbp-20H]
movzx rdx, dl
mov r8, rax
call RuntimeAssembly:GetCodeBase(QCallAssembly,bool,StringHandleOnStack)
mov rax, gword ptr [rbp-08H]
G_M43056_IG03:
lea rsp, [rbp]
pop rbp
ret
-; Total bytes of code 83, prolog size 20, PerfScore 26.05, (MethodHash=dd2357cf) for method RuntimeAssembly:GetCodeBase(bool):String:this
+; Total bytes of code 72, prolog size 16, PerfScore 22.70, (MethodHash=dd2357cf) for method RuntimeAssembly:GetCodeBase(bool):String:this
|
Regressions are all cases where we initialize less memory in the prolog but that results in more static instructions. We initialize in a loop with 48 bytes zero-ed on each iteration so we end up with the minimum number of static instructions when we initialize 0 mod 48 bytes:
G_M36805_IG01:
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp, 176
vzeroupper
vxorps xmm4, xmm4
- mov rax, -144
- vmovdqa xmmword ptr [rsp+rax+B0H], xmm4
- vmovdqa xmmword ptr [rsp+rax+C0H], xmm4
- vmovdqa xmmword ptr [rsp+rax+D0H], xmm4
+ mov rax, -96
+ vmovdqa xmmword ptr [rsp+rax+80H], xmm4
+ vmovdqa xmmword ptr [rsp+rax+90H], xmm4
+ vmovdqa xmmword ptr [rsp+rax+A0H], xmm4
add rax, 48
jne SHORT -5 instr
+ mov qword ptr [rsp+80H], rax
mov rdi, rcx
mov rsi, rdx
mov rbx, r8
;; bbWeight=1 PerfScore 11.83
;; bbWeight=1 PerfScore 12.83
G_M36805_IG02:
mov rcx, gword ptr [rdi+496]
mov gword ptr [rsp+48H], rcx
xor eax, eax
mov dword ptr [rsp+50H], eax
mov ecx, dword ptr [rcx+20]
mov dword ptr [rsp+54H], ecx
vxorps xmm0, xmm0
vmovdqu xmmword ptr [rsp+58H], xmm0
mov qword ptr [rsp+68H], rax
;; bbWeight=1 PerfScore 9.58
G_M36805_IG03:
vmovdqu xmm0, xmmword ptr [rsp+48H]
vmovdqu xmmword ptr [rsp+88H], xmm0
vmovdqu xmm0, xmmword ptr [rsp+58H]
vmovdqu xmmword ptr [rsp+98H], xmm0
mov rcx, qword ptr [rsp+68H]
mov qword ptr [rsp+A8H], rcx
|
pin-icount of SPC crossgen shows 0.13% throughput regression. Note that this phase doesn't run with minopts. |
@dotnet/jit-contrib @AndyAyersMS PTAL |
Looks like an assert is failing? https://helix.dot.net/api/2019-06-17/jobs/51a5759f-188d-469c-b161-07b2bb034fbb/workitems/JIT.Directed/files/console.2520679e.log
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Left a few notes for clarification / consideration.
Did you look at handling more general flow patterns?
@@ -4690,6 +4690,8 @@ void Compiler::compCompile(void** methodCodePtr, ULONG* methodCodeSize, JitFlags | |||
DoPhase(this, PHASE_BUILD_SSA, &Compiler::fgSsaBuild); | |||
} | |||
|
|||
DoPhase(this, PHASE_ZERO_INITS, &Compiler::optRemoveRedundantZeroInits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it would be less confusing to run this before we build SSA, since it doesn't really leverage or involve SSA (other than skipping phi defs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization needs to run after liveness so that it can use lvTracked and lvLiveInOutOfHndlr. We run liveness when we build SSA.
{ | ||
if ((tree->gtFlags & GTF_CALL) != 0) | ||
{ | ||
hasGCSafePoint = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have some calls (GTF_CALL_M_SUPPRESS_GC_TRANSITION
) that are not safe points? I suppose it is conservatively correct to assume all calls are GC safepoints though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I added a check for calls with GTF_CALL_M_SUPPRESS_GC_TRANSITION
. No new diffs in framework and benchmarks.
src/coreclr/src/jit/optimizer.cpp
Outdated
(!GetInterruptible() && !hasGCSafePoint && !compMethodRequiresPInvokeFrame())) | ||
{ | ||
// The local hasn't been used and won't be reported to the gc between | ||
// the prolog and this explicit zero intialization. Therefore, it doesn't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't check for zero init, it could be any init, so update the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (!lclDsc->lvTracked && treeOp->gtOp2->IsIntegralConst() && | ||
(treeOp->gtOp2->AsIntCon()->IconValue() == 0)) | ||
{ | ||
bool bbInALoop = (block->bbFlags & BBF_BACKWARD_JUMP) != 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BBF_BACKWARD_JUMP may be too conservative? Seems like we should have enough graph analysis state lying around to detect loops more accurately. Might not be worth the trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked around and found that we call optMarkLoopBlocks
but that just sets the weights for the blocks and it doesn't feel right to use that here. I'll leave it for later to try to add a basic block flag and use it in this optimization. I doubt it will make much of a difference.
// We are guaranteed to have a zero initialization in the prolog and | ||
// the local hasn't been redefined between the prolog and this explicit | ||
// zero initialization so the assignment can be safely removed. | ||
if (tree == stmt->GetRootNode()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For non-root assignments can't we just bash the asg tree to a nop? Or if bashing during walking is too painful, keep track of these trees and bash them later?
Or are these rare enough that it's not worth trying to handle them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that and saw no new diffs in framework and benchmarks.
{ | ||
if (!lclDsc->HasGCPtr() || | ||
(!GetInterruptible() && !hasGCSafePoint && !compMethodRequiresPInvokeFrame())) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add to your comment above or below and explain why we're checking compMethodRequiresPInvokeFrame
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
case GT_LCL_FLD: | ||
case GT_LCL_VAR_ADDR: | ||
case GT_LCL_FLD_ADDR: | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For non-address exposed locals we only care about seeing defs, not uses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We care about seeing defs when removing explicit zero initializations and we care about seeing uses when marking locals with lvHasExplicitInit
. I tried a change where I tracked defs and uses separately and saw no new diffs in framework or benchmarks so will keep it simple.
9474931
to
997792a
Compare
I didn't try to handle more general flow patterns. I suspect it's not worth it as we'll find few cases and the optimization will be more expensive. I'll experiment with that outside of this PR. |
@AndyAyersMS I pushed a commit that addressed your feedback and also fixed several issues:
Only one change in framework diffs from what I quoted above. Current numbers: Framework:
Benchmarks:
PTAL |
This change adds a phase that iterates over basic blocks starting with the first basic block until there is no unique basic block successor or until it detects a loop. It keeps track of local nodes it encounters. When it gets to an assignment to a local variable or a local field, it checks whether the assignment is the first reference to the local (or to the parent of the local field), and, if so, it may do one of two optimizations: 1. If the local is untracked, the rhs of the assignment is 0, and the local is guaranteed to be fully initialized in the prolog, the explicit zero initialization is removed. 2. If the assignment is to a local (and not a field) and either the local has no gc pointers or there are no gc-safe points between the prolog and the assignment, it marks the local with lvHasExplicitInit which tells the codegen not to insert zero initialization for this local in the prolog. This addresses one of the examples in dotnet#2325 and 5 examples in dotnet#1007.
We should keep prolog zero initialization if the local is lvLiveInOutOfHndlr and some node between the prolog and the explicit assignment may throw. Some variables are never zero initialized in the prolog. This commit disables the optimization for them. Fixed the statement that incremented the ref count: msvc and clang had different behavior so test on Linux were failing. Code review feedback: relaxed the gc-safe point condition to recognize calls with suppressed gc transitions; fixed comments.
gcstress runs show the same failures with and without my changes except for WeakReferenceTest in Windows x64 checked 0x3 and 0xc runs and Windows x86 0x3 run. The symptoms are the same as in #36970. #37002 is the proposed fix. I tried to run this test locally and I see the same behavior with and without my changes. So, I'll ignore this failure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this version looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - only a couple of minor observations, but no need to change for this PR
@@ -13883,7 +13883,7 @@ void Compiler::impImportBlockCode(BasicBlock* block) | |||
bool bbIsReturn = (block->bbJumpKind == BBJ_RETURN) && | |||
(!compIsForInlining() || (impInlineInfo->iciBlock->bbJumpKind == BBJ_RETURN)); | |||
LclVarDsc* const lclDsc = lvaGetDesc(lclNum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but this code inexplicably usees both lclDsc
and lvaTable[lclNum]
LclVarDsc* const lclDsc = lvaGetDesc(lclNum); | ||
unsigned* pRefCount = refCounts.LookupPointer(lclNum); | ||
assert(pRefCount != nullptr); | ||
if (*pRefCount == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion - it might be worth a comment (or perhaps add to the comment below) that *pRefCount
can never be null, because the local node on the rhs of the assignment must have already been seend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add a comment with my next PR.
This is a follow-up to dotnet#36918. It addresses one of the examples in dotnet#1007 where we remove a struct zero initialization but fail to clean up a dead field assignment. The change is not to mark a dependently promoted field as untracked if we know that the struct local is no longer referenced. I also addressed a couple of late cosmetic review comments from dotnet#36918. No diffs in framework and benchmarks.
) This is a follow-up to #36918. It addresses one of the examples in #1007 where we remove a struct zero initialization but fail to clean up a dead field assignment. The change is not to mark a dependently promoted field as untracked if we know that the struct local is no longer referenced. I also addressed a couple of late cosmetic review comments from #36918. No diffs in framework and benchmarks.
This change adds a phase that iterates over basic blocks starting with the first
basic block until there is no unique basic block successor or until it detects a
loop. It keeps track of local nodes it encounters. When it gets to an assignment
to a local variable or a local field, it checks whether the assignment is the
first reference to the local (or to the parent of the local field), and, if so,
it may do one of two optimizations:
guaranteed to be fully initialized in the prolog, the explicit zero
initialization is removed.
gc pointers or there are no gc-safe points between the prolog and the
assignment, it marks the local with lvHasExplicitInit which tells the codegen
not to insert zero initialization for this local in the prolog.
This addresses one of the examples in #2325 and 5 examples in #1007.