Improve performance of Convert.ToInt32(bool) and related functions #16138
Conversation
Seems like something we ought to fix in the JIT. Branchless sequences for various idioms are well known and should really be part of the JIT's toolbox.
A search of corelib for the pattern turns up a number of existing call sites. As part of this change, consider switching some or all of those over to use Convert.ToInt32?
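For instance (an illustrative call site, not an actual corelib occurrence), such a switch might look like the following sketch:

```csharp
using System;

static class TernaryToConvertSketch
{
    // Illustrative only: the "cond ? 1 : 0" pattern replaced by Convert.ToInt32,
    // which (with this PR) would perform the conversion without a branch.
    static int CountMatches(int[] values, int target)
    {
        int count = 0;
        foreach (int v in values)
        {
            // Before: count += (v == target) ? 1 : 0;
            count += Convert.ToInt32(v == target);
        }
        return count;
    }
}
```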
```diff
@@ -769,7 +771,9 @@ public static byte ToByte(object value, IFormatProvider provider)

         public static byte ToByte(bool value)
         {
-            return value ? (byte)Boolean.True : (byte)Boolean.False;
+            // Ideally we'd call BoolToByte(!!value) to normalize any true value to 1,
+            // but JIT optimizes !! away. The pattern below defeats this optimization.
```
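For context, here is a minimal sketch of the kind of branchless conversion plus normalization being discussed. This is illustrative only: the PR's actual helper body isn't visible in this hunk, and the `Unsafe`-based reinterpretation is an assumption.

```csharp
using System.Runtime.CompilerServices;

static class BoolConversionSketch
{
    // Illustrative only: convert a bool to 0/1 without branching, normalizing any
    // non-zero underlying byte (a "true" by the IL definition) to exactly 1.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static byte BoolToByte(bool value)
    {
        int raw = Unsafe.As<bool, byte>(ref value);   // 0, or any non-zero value for true
        return (byte)(unchecked((uint)-raw) >> 31);   // non-zero => 1, zero => 0
    }
}
```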
How can the JIT optimize `!!` away? Isn't that a bug?
Hmm, haven't checked, but both the JIT and the C# compiler tend to assume that `bool` is either 0 or 1. It's a bit of a bug since the IL spec doesn't actually guarantee that, but this behavior is old enough that it has become the norm.
Checked, this is not the JIT's doing. The C# compiler optimizes away `!!`.
Yes, I was mistaken. The C# compiler optimizes away `!!`. I think this behavior is correct for the reason @mikedn calls out: the C# `!` operator is not the same as the C `!` operator, and the user-observable side effect of double-notting a bool is to keep the original value. So this code isn't working around a bug as much as it is trying to make up for a missing language feature. I should reword the comment.

Additionally, as far as I can tell the JIT doesn't actually have a concept of a `!` operator; the `not` opcode is instead equivalent to one's complement (`~`). So putting two `not` opcodes into the stream doesn't have the desired effect either.

I can't think of a more efficient way of performing this normalization. One option would be to not perform the normalization, but there's probably somebody somewhere who's somehow passing a true value other than 1 and relying on the existing logic.
> Additionally, as far as I can tell the JIT doesn't actually have a concept of a ! operator; the not opcode is instead equivalent to one's complement (~). So putting two not opcodes into the stream doesn't have the desired effect either.

The usual way to negate a 0/1 bool is `xor 1`. Funnily enough, the C# compiler assumes 0/1 bools but fails to perform this particular optimization.

> One option would be to not perform the normalization, but there's probably somebody somewhere who's somehow passing a true value other than 1 and relying on the existing logic.

I doubt that not performing the normalization would have any meaningful impact, considering that both the JIT (and AFAIR the JIT does some rather dubious things in this area) and the C# compiler perform various other optimizations assuming 0/1 bools. That ship sailed a long time ago.
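A C#-level sketch of the `xor 1` idiom (assuming the flag has already been materialized as a 0/1 integer):

```csharp
static class XorNegationSketch
{
    // Branchless logical negation of a value known to be 0 or 1:
    // 1 ^ 1 == 0 and 0 ^ 1 == 1, so no compare/branch is needed.
    static int Negate01(int zeroOrOne) => zeroOrOne ^ 1;
}
```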
> While C# does not clearly define the exact values that true and false are (just that there are only two values).

True. The C# spec does not explicitly define the value of `true` as any specific numerical value. However, the C# compiler (as well as F#, VB, COBOL, etc. ... every .NET language in existence) defines the literal `true` as the value `1` / `ldc.i4.1`. Scenarios where the literal `true` is imagined as any other numeric value are academic.

> Since the C# spec doesn't explicitly call out what these values are, but the implementation generally handles it as false is 0; true is not 0

Incorrect. The compiler assumes booleans can only be 1 or 0 and emits its logic based on that. Sure, there are times where C#'s emitted code will happen to also work with a 2+ boolean value, but that is just coincidence, not design.

> The bool type represents boolean logical quantities. The possible values of type bool are true and false.

Note that the Boolean type itself agrees with the C# compiler here on the numeric values of true and false, both in the constants it defines internally and in the values that are returned from TryParse.
> Scenarios where the literal true is imagined as any other numeric value are academic.

Potentially academic, but still in line with the IL definition:

> A bit pattern with any one or more bits set (analogous to a non-zero integer) denotes a value of true.

I do think it would be desirable, and within the realm of reason, to have the C# spec explicitly list the values it recognizes as true/false; otherwise, two independent language implementations may do it differently (one may do 0/1, like Roslyn, and the other may do 0/not 0, like the IL spec).

Also, while I still think fixing the C# spec to explicitly adopt the same definition of `true` (that is, `true` is any non-zero bit pattern) would be nice, it could also be considered a breaking change in the scenarios that @mikedn listed above, since the code may now execute differently from before.
The C# language specification is not specific to any particular runtime platform. The only observable consequence of placing something like that in the specification would be the behavior you get when you overlay a bool and a byte in unsafe code. Specifying that would be a peculiar departure from the normal situation in which the specification avoids telling you in detail the way unsafe code works.
> Yes, and at the same time the C# compiler transforms b1 && b2 into a bitwise AND. That is only valid for 0/1 bools.

@mikedn I was surprised to see that making this a bitwise operator is a Roslyn-specific behavior. Earlier versions of the C# compiler used short-circuiting logic. So this means technically there has already been a breaking change in the language, though the true number of people impacted is probably minimal. (Of course, there's still wonkiness with the other operators.)
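A small sketch of the hazard being described (assumes the System.Runtime.CompilerServices.Unsafe package; the exact lowering of `&&` depends on the compiler, so the comments are hedged):

```csharp
using System;
using System.Runtime.CompilerServices;

static class NonNormalizedBoolDemo
{
    static void Main()
    {
        // Overlay a byte on a bool: raw value 2 is "true" by the IL definition,
        // but it is not the 0/1 pattern the C# compilers assume.
        byte twoRaw = 2, oneRaw = 1;
        bool b1 = Unsafe.As<byte, bool>(ref twoRaw);
        bool b2 = Unsafe.As<byte, bool>(ref oneRaw);

        Console.WriteLine(b1);                            // typically prints True
        Console.WriteLine(Unsafe.As<bool, byte>(ref b1)); // 2: the raw bit pattern survived

        // If && is lowered to a bitwise AND of the raw values (as discussed above),
        // then 2 & 1 == 0 and 'both' is observed as false even though both operands
        // are logically true; with short-circuiting codegen it would remain true.
        bool both = b1 && b2;
        Console.WriteLine(both);
    }
}
```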
> I was surprised to see that making this a bitwise operator is a Roslyn-specific behavior

Note that the JIT does it too, but I'm not sure when it started doing it.
This looks like you are working around a JIT bug. We should fix the JIT bug instead.
```diff
@@ -191,5 +191,27 @@ static internal PinningHelper GetPinningHelper(Object o)
             typeof(ArrayPinningHelper).ToString(); // Type used by the actual method body
             throw new InvalidOperationException();
         }
+
+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        static internal byte BoolToByteNonNormalized(bool value)
```
Why does this return `byte`? `int` would make more sense. A `bool` is 1 byte in size only when stored in memory (e.g. in a class field or array element). It is never 1 byte on the evaluation stack.
Maybe this method should be added to `Unsafe`. It basically works around a C# language limitation (like other `Unsafe` methods do), and at the same time it is slightly unsafe due to the possibility of a `bool` not being only 0 or 1.
I was hoping that there might be some cases where RyuJIT might elide the `movzx` instruction if it was kept as an 8-bit integral type all the way throughout the call sites. I'm thinking of cases like `myByteArray[idx] = BoolToByte(condition);`, where everything would just compile to a single `mov byte ptr [rdx + rcx], eax` instruction. But I'm not married to returning an 8-bit integral type, and if the JIT is smart enough to make this all work anyway then all the better. :)
> I'm thinking of cases like myByteArray[idx] = BoolToByte(condition);, where everything would just compile to a single mov byte ptr [rdx + rcx], eax instruction.

I already sent a PR to fix that specific case. The JIT already eliminates `movzx` in some other cases.

At the same time, `movzx` is a cheap instruction (0 latency normally) so it shouldn't have much impact. But if you do see useless `movzx` being generated, file issues. There are a lot of things that the JIT can't do, but there are also quite a few things that the JIT could do, provided someone bothers with writing the necessary code.
@GrabYourPitchforks Check the examples here (https://github.com/dotnet/coreclr/issues/1306) to see if they are impacted; if they are, feel free to close the issue :)
@redknightlois Yes, it would help your Framework test case but not your IfThenElse test case. That would require a compiler or JIT change. See #16121 (comment) for some discussion on this point.
With if-conversion enabled in the JIT, the existing code generates:

```asm
test  cl, cl
setne al
movzx rax, al
```

The example from the linked issue generates:

```asm
test  r8d, r8d
setne al
movzx rax, al
mov   r9d, eax
shl   r9d, 2
```
I am optimistic that #16156 may finally address this once and for all, if the JIT is able to handle simple ternary cases like:

```csharp
int Compare(int x, int y) {
    return ((x >= y) ? 1 : 0) - ((x <= y) ? 1 : 0);
}
```

which in theory results in this codegen:

```asm
cmp   ecx, edx   ; ecx = x, edx = y
setge al         ; al = 0 if x < y, 1 if x >= y
movzx eax, al    ; elided by processor
cmp   ecx, edx   ; ecx = x, edx = y
setle bl         ; bl = 1 if x <= y, 0 if x > y
movzx ebx, bl    ; elided by processor
; At this point, three possibilities:
;   x < y  => eax = 0, ebx = 1
;   x > y  => eax = 1, ebx = 0
;   x == y => eax = 1, ebx = 1
sub   eax, ebx   ; eax contains result
```
Yep, it generates:

```asm
3BCA        cmp   ecx, edx
0F9DC0      setge al
0FB6C0      movzx rax, al
3BCA        cmp   ecx, edx
0F9EC2      setle dl
0FB6D2      movzx rdx, dl
2BC2        sub   eax, edx
```

But it's probably a bad idea. If you have code like […]

If I enable nested if-conversion we might be able to get something like:

```asm
3BCA        cmp   ecx, edx
B8FFFFFFFF  mov   eax, -1
0F9FC2      setg  dl
0FB6D2      movzx rdx, dl
0F4CD0      cmovl edx, eax
8BC2        mov   eax, edx
```
I'm not too worried about that TBH. The existing sort routines in the framework special-case when T is an integral type, so they don't even call […]
Except with work that's been done recently around inlining and […]
Shouldn't, since […]
You have no way of knowing how such methods are used. And we're talking about going from something like:

```asm
G_M55887_IG01:
       C5F877               vzeroupper
G_M55887_IG02:
       3BCA                 cmp ecx, edx
       7C0A                 jl SHORT G_M55887_IG05
G_M55887_IG03:
       3BCA                 cmp ecx, edx ; ??!
       C4E17A10442428       vmovss xmm0, dword ptr [rsp+28H]
G_M55887_IG04:
       C3                   ret
G_M55887_IG05:
       C4E17828C3           vmovaps xmm0, xmm3
G_M55887_IG06:
       C3                   ret
```

to something like:

```asm
G_M55887_IG01:
       C5F877               vzeroupper
G_M55887_IG02:
       3BCA                 cmp ecx, edx
       0F9DC0               setge al
       0FB6C0               movzx rax, al
       3BCA                 cmp ecx, edx
       0F9EC2               setle dl
       0FB6D2               movzx rdx, dl
       2BC2                 sub eax, edx
       85C0                 test eax, eax
       7C08                 jl SHORT G_M55887_IG04
       C4E17A10442428       vmovss xmm0, dword ptr [rsp+28H]
G_M55887_IG03:
       C3                   ret
G_M55887_IG04:
       C4E17828C3           vmovaps xmm0, xmm3
G_M55887_IG05:
       C3                   ret
```
And when the same thing is done for […]
The sort routines don't even use this implementation, and I assume that wouldn't change going forward. This whole exercise would be a no-op for that scenario. You could perhaps make the argument that methods like […]

But per @mikedn's earlier comment, if nested if-conversion ends up in scope, then there's nothing for us to do here anyway since the JIT can always do the right thing. :)
Sure they do. Not if […]
@GrabYourPitchforks Any idea when a fix for the bool-to-int branching code is going to be merged into trunk (even if not fixed at the JIT level)? I am redesigning an algorithm to be cache aware and have code like this:

```csharp
st1 |= *(ptr + 0) == *(ptr + offset + 0) ? 0ul : 1ul;
st2 |= *(ptr + 1) == *(ptr + offset + 1) ? 0ul : 1ul;
st3 |= *(ptr + 2) == *(ptr + offset + 2) ? 0ul : 1ul;
st4 |= *(ptr + 3) == *(ptr + offset + 3) ? 0ul : 1ul;
```

This loop is hurting my eyes (and performance):

```asm
st1 |= *(ulong*)(ptr + 0) == *(ulong*)(ptr + offset + 0) ? 0ul : 1ul;
48 8B 19              mov rbx,qword ptr [rcx]
4A 3B 1C 21           cmp rbx,qword ptr [rcx+r12]
74 1C                 je 00007FFA2B390E37
BB 01 00 00 00        mov ebx,1
EB 17                 jmp 00007FFA2B390E39
48 89 4D 28           mov qword ptr [rbp+28h],rcx
48 8B D8              mov rbx,rax
EB 6A                 jmp 00007FFA2B390E95
48 8B 9D B8 00 00 00  mov rbx,qword ptr [rbp+0B8h]
E9 2B 02 00 00        jmp 00007FFA2B391062
33 DB                 xor ebx,ebx
48 0B C3              or rax,rbx
48 8B D8              mov rbx,rax

st2 |= *(ulong*)(ptr + 8) == *(ulong*)(ptr + offset + 8) ? 0ul : 1ul;
48 8B 41 08           mov rax,qword ptr [rcx+8]
4A 3B 44 21 08        cmp rax,qword ptr [rcx+r12+8]
74 07                 je 00007FFA2B390E51
B8 01 00 00 00        mov eax,1
EB 02                 jmp 00007FFA2B390E53
33 C0                 xor eax,eax
4C 0B D0              or r10,rax

st3 |= *(ulong*)(ptr + 16) == *(ulong*)(ptr + offset + 16) ? 0ul : 1ul;
48 8B 41 10           mov rax,qword ptr [rcx+10h]
4A 3B 44 21 10        cmp rax,qword ptr [rcx+r12+10h]
74 07                 je 00007FFA2B390E68
B8 01 00 00 00        mov eax,1
EB 02                 jmp 00007FFA2B390E6A
33 C0                 xor eax,eax
4C 0B D8              or r11,rax

st4 |= *(ulong*)(ptr + 24) == *(ulong*)(ptr + offset + 24) ? 0ul : 1ul;
48 8B 41 18           mov rax,qword ptr [rcx+18h]
4A 3B 44 21 18        cmp rax,qword ptr [rcx+r12+18h]
74 07                 je 00007FFA2B390E7F
B8 01 00 00 00        mov eax,1
EB 02                 jmp 00007FFA2B390E81
33 C0                 xor eax,eax
4C 0B C8              or r9,rax
```
@redknightlois There are no plans to merge this PR if #16156 is resolved in a timely manner. Check with those reviewers to see what their timelines are. And just in case this does fall through, I can post a workaround for your scenario when I'm not on mobile.
@redknightlois Consider this as a temporary workaround until the JIT changes come online:

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static byte ToByte(bool value) {
    return Unsafe.As<bool, byte>(ref value);
}

st1 |= ToByte(*(ptr + 0) != *(ptr + offset + 0));
// ...
```
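One caveat, in line with the normalization discussion above: `Unsafe.As<bool, byte>` simply reinterprets the underlying byte, so it performs no 0/1 normalization. For bools produced directly by comparisons, as in the snippet above, that distinction doesn't matter.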
It looks like these aren't actually zero-latency moves, at least on current Intel CPUs. They used to be zero latency on Ivy Bridge, and only when the destination register is different from the source register. That's unfortunate; the alternative is to zero out the destination register of SETcc using an XOR, but that's problematic to do in the JIT's IR.
This really belongs in #16156.
The current implementation of `Convert.ToInt32(bool)` uses branching to return 1 or 0 to its caller, making it prone to branch mispredictions. This PR makes the implementation branchless. In my Win10 x64 testbed application, this results in a 370% increase in tps for calls to this routine when the input values are random. (Before: 1,780 ms for 500MM calls. After: 480 ms for 500MM calls.)

This might seem esoteric, but it matters because it can be used as a springboard to help developers write their own high-performance branchless routines. For example, the statement `int skipLastChunk = isFinalBlock ? 4 : 0;` in the Base64 decoder routine can become branchless by being rewritten as `int skipLastChunk = Convert.ToInt32(isFinalBlock) << 2;`.
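A sketch of that rewrite in isolation (assuming `Convert.ToInt32(bool)` keeps its documented guarantee of returning exactly 0 or 1):

```csharp
using System;

static class Base64SkipSketch
{
    // Illustrative only: compute the 0-or-4 skip amount without branching on the flag.
    static int ComputeSkipLastChunk(bool isFinalBlock)
    {
        // Before: return isFinalBlock ? 4 : 0;   (data-dependent branch)
        // After:  Convert.ToInt32 yields 0 or 1; << 2 scales it to 0 or 4.
        return Convert.ToInt32(isFinalBlock) << 2;
    }
}
```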
Disassembly of `Convert.ToInt32(bool)` before the change (amd64): […]

Disassembly after the change: […]

For non-random values (e.g., the test inputs are all true or all false), the modified code performs within +/- 5% of the original code; I don't see a real difference above the normal noise.

Open issue: you can eke out even higher performance if you're willing to forgo the normalization logic. I didn't want to risk changing the existing `Convert.ToInt32(bool)` behavior, where it's guaranteed to return 1 or 0.