RyuJIT generates redundant code when inlining string.IsNullOrEmpty #4207
Comments
Good find. The problem is we have a side-effecting dereference of the array in a dead assignment, which couldn't be removed as a dead store (see below). We should be able to remove this by teaching … Specifically, this bounds check can be used:

to eliminate the side effect (the …

And then, given that there are no longer any side effects post …
Sorry but I don't follow. The problem occurs even if no array bounds checks are involved:

```cs
static bool foobar(int x) {
    return x != 1 && x != 2;
}

static int Main(string[] args) {
    return foobar(args.Length) ? 42 : 0;
}
```

ASM code:

For better clarity here's the expected ASM code:
There are two issues here. One is the previous one and the other is because we have the IR as `bool x = (y != 2);` and we end up with redundant code. This should definitely be optimized better -- I will look into this.
In case anyone cares, I stumbled upon a funny workaround:

```cs
static bool foobar(int x) {
    return (x != 1 && x != 2) ? true : false;
}

static int Main(string[] args) {
    return foobar(args.Length) ? 42 : 0;
}
```

Who'd have thought that more complex C# code would produce better assembly code? 😄
It also happens when a boolean value is passed to an inlined method.
Test case to stress inlining, expression opts, and control flow simplification for booleans. The test case has 100 methods named Idxx. All 100 should generate identical code. On x64 Windows we expect to get the 4-byte sequence

```
0FB6C1   movzx rax, cl
C3       ret
```

Only 22 of the variants get this codegen; there are at least 12 other sequences ranging in size from 9 to 32 bytes of code. Likely touches on the same issues raised in #914.
This keeps popping up... let's see what the problem is. After inlining we have IR like:
This is pretty bad.
One of the duplicated branch nodes will go away, thanks to VN/assertion propagation detecting that it has a constant operand. The problem is the other branch node, which looks as expected:
Basically, the relevant relop node remains detached from the branch tree, so codegen has to materialize the bool value produced by the relop instead of just relying on the flags that the relop sets. One obvious solution is forward substitution. Unfortunately that's not something one can implement overnight. Still, perhaps some targeted forward substitution can be done, just for this case. The basic ingredient, SSA, already exists. The only potential problem is that it does not track the users of a node, so there's no way to know if a node is single use or not. But that's not so difficult to add. The bigger problem is that this also typically requires some code motion, and that's more difficult to do right. We could probably get away with moving only the relop and spilling its (non-constant) operands to lclvars so we don't have to move those as well. That is, if we have

```cs
bool b = a.Length == 0;
if (b) ...
```

just make it

```cs
int l = a.Length;
if (l == 0) ...
```

I suppose early prop could be used for something like this. And considering that branch nodes aren't that common (at one point I think I counted ~50,000 in corelib), the impact on throughput should be small. So if we have

```cs
bool b = a.Length == 0;
if (b) ...
```

and early prop finds that …

This avoids another SSA limitation that the JIT currently has - if you introduce a new lclvar after SSA, there's no mechanism to add it to the SSA graph.
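The targeted forward substitution described above can be sketched outside the JIT. Here is a toy model in Python (the tuple IR, names, and single-use test are my own simplifications, not RyuJIT code): a def is stitched into the immediately following statement when that statement contains its single use.

```python
# Toy model of single-use forward substitution (not RyuJIT code).
# A statement is (dest, expr); expr is a tuple ('op', operand...).
# dest None marks a statement used only for control flow (e.g. a branch).

def forward_substitute(stmts):
    out = []
    for dest, expr in stmts:
        if out and isinstance(expr, tuple):
            prev_dest, prev_expr = out[-1]
            # substitute only when the previous def has exactly one use,
            # and that use is in the immediately following statement
            uses = sum(1 for op in expr[1:] if op == prev_dest)
            if prev_dest is not None and uses == 1:
                ops = tuple(prev_expr if op == prev_dest else op
                            for op in expr[1:])
                out[-1] = (dest, (expr[0],) + ops)  # stitch the trees together
                continue
        out.append((dest, expr))
    return out

# bool b = a.Length == 0; if (b)   ==>   if (a.Length == 0)
stmts = [('b', ('==', 'a.Length', 0)),
         (None, ('jtrue', 'b'))]
print(forward_substitute(stmts))
# → [(None, ('jtrue', ('==', 'a.Length', 0)))]
```

A real implementation would of course have to verify single-use across the whole method, not just the adjacent statement, which is exactly where SSA use tracking comes in.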
Yeah, except that it's not also single-def so you have to create a new lclvar. Oh well, let's do that and see how it goes. Quick and dirty implementation produces this diff:

Typical diff looks as expected:

```diff
  call     Number:TryParseInt32IntegerStyle(struct,int,ref,byref):int
  test     eax, eax
- sete     al
- movzx    rax, al
- test     eax, eax
- jne      SHORT G_M8084_IG04
+ je       SHORT G_M8084_IG04
  G_M8084_IG03:
  xor      eax, eax
```
Throughput impact of my lousy implementation is practically non-existent: 0.06%. I have pending PRs that save 10 times that.
Being overly conservative and forwarding only the relop doesn't seem to be necessary. In many cases the SSA def is in the immediately preceding tree, so you can just stitch the two trees together without the typical code motion worries. That nearly doubles the diff improvement. Mostly due to containment and/or compare narrowing:

```diff
  mov      r14d, dword ptr [rbp-3CH]
- movzx    rcx, byte ptr [rdi+59]
- cmp      ecx, 2
+ cmp      byte ptr [rdi+59], 2
  jne      SHORT G_M6012_IG06
  mov      rcx, rsi
```

Also, it may be worth pointing out that this problem affects float comparisons as well. And in that case the impact is higher due to the increased complexity of float comparison bool result materialization:

```diff
  movss    xmm0, dword ptr [rdi+8]
  ucomiss  xmm0, xmm2
- setpo    al
- jpe      SHORT G_M23365_IG03
- sete     al
+ jpe      SHORT G_M23365_IG05
-G_M23365_IG03:
- movzx    rax, al
- test     eax, eax
- je       SHORT G_M23365_IG06
+ jne      SHORT G_M23365_IG05
  movss    xmm0, dword ptr [rdi+12]
```

Anyway, this seems pretty doable overall. Now I need to figure out how safe it is to rely on SSA use counts computed by SsaBuilder, considering that nothing in the JIT will update these counts once they're computed. But since early prop runs right after SsaBuilder this should probably be fine; so far I haven't encountered any situation where the use count is wrong during early prop. In the worst case, morphing invoked during early prop could theoretically add new uses. But then morph doesn't quite know about SSA, so it's unlikely that it ever does this. At the very least it would need to set the SSA number on new LclVar uses, and there's no trace of that.
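The use-count concern above can be made concrete with a toy sketch (my own simplification in Python, not SsaBuilder's actual representation): counts computed in one up-front pass are only trustworthy as long as no later phase adds uses behind their back.

```python
from collections import Counter

# Toy SSA use counting (not RyuJIT code): each statement is
# (def_name, operand_names); counts are computed once, up front.
def count_uses(stmts):
    uses = Counter()
    for _, operands in stmts:
        uses.update(operands)
    return uses

stmts = [('b1', ('a1',)),   # b1 = f(a1)
         (None, ('b1',)),   # if (b1)
         (None, ('a1',))]   # another use of a1
uses = count_uses(stmts)
print(uses['b1'], uses['a1'])  # → 1 2
# b1 is single use: safe to forward substitute its def into the branch.
# a1 has two uses: substituting its def would duplicate the computation.
# Any phase that adds a use after this pass silently invalidates the counts.
```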
Still WIP but good for a discussion, I hope - https://github.com/mikedn/coreclr/commits/relop-subst cc @AndyAyersMS

Quick summary:

This reduces FX diff by 14423 bytes (affects some 1600 methods). There are a few regressions as well. The largest one, 150 bytes, is caused by changes in register allocation that somehow increase the frame size. That results in more 32-bit address mode displacements being needed. Strangely, in the original implementation PIN showed a 0.06% regression and now it shows a 0.2% improvement. I'll have to double check that. Though it's not impossible that doing extra work to simplify the IR ends up resulting in a throughput improvement, it seems a bit too good to be true. Anyway, throughput impact is low. My main concern is that I need to morph the trees and morph could add new SSA uses, at least in theory. That would invalidate the computed use count. But then I don't see morph setting the SSA number on new lclvar nodes anywhere, which would imply that if morph does actually add new uses then we're already in a bad state, because existing early prop code already calls morph. Anyway, if this is a problem then I think there's a simple solution - put statements needing morphing in a list and morph only after early prop is complete.
Nice, does it also handle the 15 …
I have built a corelib without those workarounds and the diff improvement increases, a sign that the JIT did handle at least some of those cases. I'll have to check if it handles all of them. In general, I expect this to be pretty good at catching such cases, provided that the JIT does branch duplication as mentioned in a previous post. I did not know that it does that until now; I'll have to check to see how reliable that is. It's still possible to get this to work without branch duplication, but it requires quite a bit more work.
On a side note - I tried a version that is not restricted to just JTRUE & co. but looks at every statement and attempts to merge it with the previous one when possible. The diff then increases to ~80 KB. Oh well, I suppose it's not a surprise that even such basic forward substitution generates improvements...
This looks like a fairly reasonable approach. Wonder if the branch duplication is coming from …

Also curious you iterated on your tree merging to see how much benefit there might be in generally recombining split trees.
Yep, that was it, thanks!

So, I took a closer look at that and I had a nasty surprise.

and without the workaround it has 18:

And back when I created this issue it had only 15 bytes:

The reason why it increased to 18 is another workaround, `0u >= (uint)value.Length`, that makes Roslyn emit the perhaps curious sequence

🤷‍♂️
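For context on why `0u >= (uint)value.Length`-style casts show up at all: casting to unsigned lets one comparison stand in for two, which is the same trick used to elide bounds checks. A sketch in Python emulating 32-bit unsigned compare (the helper names are mine, and it assumes a non-negative length):

```python
# Illustration of the unsigned-compare trick (helper names are mine).
# For 32-bit ints with n >= 0, (uint)i < (uint)n is equivalent to
# 0 <= i < n, because a negative i wraps to a large unsigned value,
# so one unsigned compare replaces two signed ones.
def uint32(x):
    return x & 0xFFFFFFFF

def in_bounds_signed(i, n):
    return 0 <= i < n             # two comparisons

def in_bounds_unsigned(i, n):
    return uint32(i) < uint32(n)  # one comparison

for i in (-3, 0, 2, 5, 7):
    assert in_bounds_signed(i, 5) == in_bounds_unsigned(i, 5)
print("equivalent for all tested values")
```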
I was trying to look at the rest of the …

If the method containing the hack is not inlined then the hack is basically a deoptimization:

```diff
 G_M23366_IG02:
        test     rdx, rdx
-       jne      SHORT G_M23366_IG06
+       jne      SHORT G_M23366_IG04
        test     rcx, rcx
-       je       SHORT G_M23366_IG04
-       xor      eax, eax
+       sete     al
+       movzx    rax, al
 G_M23366_IG03:
        add      rsp, 40
        ret
 G_M23366_IG04:
-       mov      eax, 1
-G_M23366_IG05:
-       add      rsp, 40
-       ret
-G_M23366_IG06:
        mov      gword ptr [rsp+30H], rcx
        mov      rcx, rdx
        mov      rdx, gword ptr [rsp+30H]
        call     [SortVersion:Equals(ref):bool:this]
        movzx    rax, al
-G_M23366_IG07:
+G_M23366_IG05:
        add      rsp, 40
        ret
-; Total bytes of code 59, prolog size 5 for method SortVersion:op_Equality(ref,ref):bool
+; Total bytes of code 51, prolog size 5 for method SortVersion:op_Equality(ref,ref):bool
```

Same thing happens if the method is inlined in another method that simply returns the inlined method. Or more generally, whenever the result of the inlined method is not used in a flow control context. Anyway, my change seems to be working pretty well. With one exception:

```diff
 G_M33195_IG02:
-       cmp      gword ptr [rsi], 0
-       jne      SHORT G_M33195_IG05
+       mov      rax, gword ptr [rsi]
+       test     rax, rax
+       jne      SHORT G_M33195_IG04
        call     [CORINFO_HELP_READYTORUN_NEW]
```

Need to figure out why the value is loaded into a register instead of being compared directly like before.
By "side effects" do you mean inlining changes? You can force the jit to make the same set of inlining decisions before/after by creating and then consuming an inline replay file, though it's not plumbed through jit-diffs.
That only seems to apply to …
@AndyAyersMS Speaking of chains of moves, here's a diff from the above mentioned experiment:

```diff
        jmp      SHORT G_M12430_IG04
 G_M12430_IG03:
        lea      rax, bword ptr [rdx+12]
-       mov      rcx, rax
-       mov      edx, dword ptr [rdx+8]
-       mov      eax, edx
-       mov      rdx, rcx
-       mov      rcx, rdx
-       mov      edx, eax
-       mov      rax, rcx
-       mov      ecx, edx
+       mov      ecx, dword ptr [rdx+8]
 G_M12430_IG04:
        movdqu   xmm0, qword ptr [rsp+48H]
        movdqu   qword ptr [rsp+38H], xmm0
        lea      rdx, bword ptr [rsp+28H]
```
Change title to "Redundant code generated when inlining booleans"? For …

With the following diff between the two approaches, SslStream.get_IsAuthenticated():

```diff
        cmp      qword [rcx+0x8], 0x0
-       jz       L0058
+       jz       L003d
        mov      rax, [rcx+0x8]
        mov      rax, [rax+0x8]
        test     rax, rax
-       jz       L0058
+       jz       L003d
        mov      edx, [rax]
        test     byte [rax+0x10], 0x1
-       jnz      L003b
+       jnz      L003d
        add      rax, 0x18
        mov      rdx, [rax]
        test     rdx, rdx
-       jnz      L0037
+       jnz      L0031
        mov      rax, [rax+0x8]
        test     rax, rax
-       setz     al
-       movzx    eax, al
-       jmp      L0039
-       xor      eax, eax
-       jmp      L0040
-       mov      eax, 0x1
-       test     eax, eax
-       setz     al
-       movzx    eax, al
-       test     eax, eax
-       jz       L0058
+       jz       L003d
        cmp      qword [rcx+0x10], 0x0
-       jnz      L0058
+       jnz      L003d
        movzx    eax, byte [rcx+0x18]
        ret
        xor      eax, eax
        ret
```

The ternary is a less desirable approach as it reduces readability dotnet/corefx#40265 (comment)
Comment at #53143 (comment) suggests that this has been resolved and that our workaround code in …
I believe the …
The current code gen is (confirmed with Disasmo):

Not inlining

With ternary:

Without ternary:

The version without ternary contains one branch fewer.

Inlining

With ternary:

Without ternary:

The codegen is identical. I think this issue is solved now.
C# code:
ASM code:
Basically the IL code after inlining contains a `ceq` followed by a `brtrue`, and the JIT compiler doesn't seem capable of folding them.

Labels: category:cq, theme:optimization, skill-level:expert, cost:large
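The missing fold can be sketched as a toy peephole over an IL-like instruction list (an illustration of the idea only; the tuple encoding and function names are mine, and the JIT would perform this on its own IR rather than on IL):

```python
# Toy peephole: fold a compare that feeds a boolean branch into a single
# compare-and-branch, instead of materializing 0/1 and testing it.
FOLD = {('ceq', 'brtrue'): 'beq',      # branch if equal
        ('ceq', 'brfalse'): 'bne.un'}  # branch if not equal

def fold_compare_branch(il):
    out = []
    for instr in il:
        if out and (out[-1][0], instr[0]) in FOLD:
            # replace the compare with a direct conditional branch
            out[-1] = (FOLD[(out[-1][0], instr[0])], instr[1])
        else:
            out.append(instr)
    return out

il = [('ldarg.0',), ('ldc.i4.1',), ('ceq',), ('brtrue', 'L1')]
print(fold_compare_branch(il))
# → [('ldarg.0',), ('ldc.i4.1',), ('beq', 'L1')]
```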