Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve performance of Convert.ToInt32(bool) and related functions #16138

Closed

Conversation

GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Feb 1, 2018

The current implementation of Convert.ToInt32(bool) uses branching to return 1 or 0 to its caller, making it prone to branch mispredictions. This PR makes the implementation branchless. In my Win10 x64 testbed application, this results in a 370% increase in tps for calls to this routine when the input values are random. (Before: 1,780 ms for 500MM calls. After: 480 ms for 500MM calls.)

This might seem esoteric, but it matters because it can be used as a springboard to help developers write their own high-performance branchless routines. For example, the statement int skipLastChunk = isFinalBlock ? 4 : 0; in the Base64 decoder routine can utilize this to become branchless by being rewritten as int skipLastChunk = Convert.ToInt32(isFinalBlock) << 2;.

Disassembly of Convert.ToInt32(bool) before the change (amd64):

; assume r8b contains 'value' and edi will receive the result
  test r8d, r8d
  jne IF_TRUE
IF_FALSE:
  xor r8d, r8d
  jmp FINISHED
IF_TRUE:
  mov r8d, 1
FINISHED:
  mov edi, r8d

Disassembly after the change:

  test r8d, r8d
  setne dil
  movzx edi, dil

For non-random values (e.g., all test inputs are already true or false), the modified code performs +/- 5% of the original code. I don't see a real difference above the normal noise.

Open issue: you can eke out even higher performance if you're willing to forego the normalization logic. I didn't want to risk changing the existing Convert.ToInt32(bool) behavior where it's guaranteed to return 1 or 0.

@AndyAyersMS
Copy link
Member

Seems like something we ought to fix in the jit.

Branchless sequences for various idioms are well known and should really be part of the jit's toolbox.

@stephentoub
Copy link
Member

A search of corelib for the pattern ? 1 : 0 yields a handful of results:

  D:\repos\coreclr\src\mscorlib\shared\System\Number.Formatting.cs(1546):            section = FindSection(format, dig[0] == 0 ? 2 : number.sign ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\shared\System\Number.NumberBuffer.cs(24):            public bool sign { get => _sign != 0; set => _sign = value ? 1 : 0; }
  D:\repos\coreclr\src\mscorlib\shared\System\Globalization\Calendar.cs(804):                return ((TwoDigitYearMax / 100 - (year > TwoDigitYearMax % 100 ? 1 : 0)) * 100 + year);
  D:\repos\coreclr\src\mscorlib\shared\System\Globalization\GregorianCalendarHelper.cs(634):                return ((twoDigitYearMax / 100 - (y > twoDigitYearMax % 100 ? 1 : 0)) * 100 + y);
  D:\repos\coreclr\src\mscorlib\shared\System\Globalization\HijriCalendar.cs(165):                NumDays += 354 + (IsLeapYear(NumYearsLeft, CurrentEra) ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\shared\System\IO\BinaryWriter.cs(119):            _buffer[0] = (byte)(value ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\shared\System\IO\PathHelper.Windows.cs(76):                int lastSeparator = specialPath ? 1 : 0;
  D:\repos\coreclr\src\mscorlib\shared\System\IO\PathInternal.Unix.cs(27):            return path.Length > 0 && IsDirectorySeparator(path[0]) ? 1 : 0;
  D:\repos\coreclr\src\mscorlib\shared\System\IO\UnmanagedMemoryAccessor.cs(396):        public void Write(long position, bool value) => Write(position, (byte)(value ? 1 : 0));
  D:\repos\coreclr\src\mscorlib\shared\System\Text\UTF8Encoding.cs(2501):                   UTF8_CODEPAGE + (_emitUTF8Identifier ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\src\System\Enum.cs(1205):            return InternalBoxEnum(rtType, value ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\src\System\Globalization\CompareInfo.Windows.cs(54):                            bIgnoreCase ? 1 : 0);
  D:\repos\coreclr\src\mscorlib\src\System\Reflection\Emit\CustomAttributeBuilder.cs(485):                        writer.Write((byte)((bool)value ? 1 : 0));
  D:\repos\coreclr\src\mscorlib\src\System\Reflection\Emit\DynamicILGenerator.cs(714):            initLocals = (m_method.InitLocals) ? 1 : 0;
  D:\repos\coreclr\src\mscorlib\src\System\Threading\ManualResetEventSlim.cs(121):                UpdateStateAtomically(((value) ? 1 : 0) << SignalledState_ShiftCount, SignalledState_BitMask);
  D:\repos\coreclr\src\mscorlib\src\System\Threading\Tasks\TPLETWProvider.cs(290):                    Int32 isExceptionalInt = IsExceptional ? 1 : 0;

As part of this change, consider switching some or all of those over to use Convert.ToInt32?

@@ -769,7 +771,9 @@ public static byte ToByte(object value, IFormatProvider provider)

public static byte ToByte(bool value)
{
return value ? (byte)Boolean.True : (byte)Boolean.False;
// Ideally we'd call BoolToByte(!!value) to normalize any true value to 1,
// but JIT optimizes !! away. The pattern below defeats this optimization.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can JIT optimize !! away? Is not it a bug?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, haven't checked but both the JIT and the C# compiler tend to assume that bool is either 0 or 1. It's a bit of the bug since the IL spec doesn't actually guarantee that but then this behavior is old enough that it has become the norm.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked, this is not JIT's doing. The C# compiler optimizes away !!.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was mistaken. The C# compiler optimizes away !!. I think this behavior is correct for the reason @mikedn calls out: the C# ! operator is not the same as the C ! operator, and the user-observable side effect of double-notting a bool is to keep the original value. So this code isn't working around a bug as much as it is trying to make up for a missing language feature. I should reword the comment.

Additionally, as far as I can tell the JIT doesn't actually have a concept of a ! operator; the not opcode is instead equivalent to one's complement (~). So putting two not opcodes into the stream doesn't have the desired effect either.

I can't think of a more efficient way of performing this normalization. One option would be to not perform the normalization, but there's probably somebody somewhere who's somehow passing a true value other than 1 and relying on the existing logic.

Copy link

@mikedn mikedn Feb 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, as far as I can tell the JIT doesn't actually have a concept of a ! operator; the not opcode is instead equivalent to one's complement (~). So putting two not opcodes into the stream doesn't have the desired effect either.

The usual way to negate a 0/1 bool is xor 1. Funnily enough, the C# compilers assumes 0/1 bools but fails to perform this particular optimization.

One option would be to not perform the normalization, but there's probably somebody somewhere who's somehow passing a true value other than 1 and relying on the existing logic.

I doubt that not performing the normalization would have any meaningful impact considering that both the JIT (and AFAIR the JIT does some rather dubious things in this area) and the C# compilers perform various other optimizations assuming 0/1 bools. The ship has sailed a long time ago.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While C# does not clearly define the exact values that true and false are (just that there are only two values).

True. The C# spec does not explicitly define the value of true to any specific numerical value. However the C# compiler (as well as F#, VB, Cobol, etc ... every .NET language in existance) defines the literal true as the value 1 / ldc.i4.1.

Scenarios where the literal true is imagined as any other numeric value are academic.

Since the C# spec doesn't explicitly call out what these values are, but the implementation generally handles it as false is 0; true is not 0

Incorrect. The compiler assumes booleans can only be 1 or 0 and emits it's logic based on that. Sure there are times where C#'s emitted code will happen to also work with a 2+ boolean value but that is just coincidence, not design.

The bool type represents boolean logical quantities. The possible values of type bool are true and false.

Note that the boolean type itself agrees with the C# compiler here on the numeric values of true and false. Both in the constants it defines internally and the values that are returned from TryParse.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scenarios where the literal true is imagined as any other numeric value are academic.

Potentially academic, but still inline with the IL definition:

A bit pattern with any one or more bits set (analogous to a non-zero integer) denotes a value of true.

I do think it would be desirable, and within the realm of reason, to have the C# spec explicitly list the values it recognizes as true/false; otherwise, two independent language implementations may do it differently (one may do 0/1, like Roslyn; and the other may do 0/not 0, like the IL spec).

Also, while I still think fixing the C# spec to explicitly have the same definition of true (that is, true is a non-zero bit pattern) would be nice. It could also be considered a breaking change in the scenarios that @mikedn listed above, since the code may now execute differently from before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C# language specification is not specific to any particular runtime platform. The only observable consequence of placing something like that in the specification would be the behavior you get when you overlay a bool and a byte in unsafe code. Specifying that would be a peculiar departure from the normal situation in which the specification avoids telling you in detail the way unsafe code works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and the same time the C# compile transforms b1 && b2 into a bitwise and. That is only valid for 0/1 bools.

@mikedn I was surprised to see that making this a bitwise operator is a Roslyn-specific behavior. Earlier versions of the C# compiler use short-circuiting logic. So this means technically there has already been a breaking change in the language, though the true number of people impacted is probably minimal. (Of course, there's still wonkiness with the other operators.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised to see that making this a bitwise operator is a Roslyn-specific behavior

Note that the JIT does it too. But I'm not sure when it started doing it.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like you are working around a JIT bug. We should fix the JIT bug instead.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like you are working around a JIT bug. We should fix the JIT bug instead.

@@ -191,5 +191,27 @@ static internal PinningHelper GetPinningHelper(Object o)
typeof(ArrayPinningHelper).ToString(); // Type used by the actual method body
throw new InvalidOperationException();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static internal byte BoolToByteNonNormalized(bool value)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this return byte? int would make more sense. A bool is 1 byte in size only when stored in memory (e.g. in a class field or array element). It is never 1 byte on the evaluation stack.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this method should be added to Unsafe. It basically works around a C# language limitation (like other Unsafe methods do) and at the same time it is slightly unsafe due to the possibility of a bool not being only 0 or 1.

Copy link
Member Author

@GrabYourPitchforks GrabYourPitchforks Feb 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that there might be some cases where RyuJIT might elide the movzx instruction if it was kept as an 8-bit integral type all the way throughout the call sites. I'm thinking of cases like myByteArray[idx] = BoolToByte(condition);, where everything would just compile to a single mov byte ptr [rdx + rcx], eax instruction. But I'm not married to returning an 8-bit integral type, and if the JIT is smart enough to make this all work anyway then all the better. :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of cases like myByteArray[idx] = BoolToByte(condition);, where everything would just compile to a single mov byte ptr [rdx + rcx], eax instruction.

I already sent a PR to fix that specific case. The JIT already eliminates movzx in some other cases.

At the same time, movzx is a cheap instruction (0 latency normally) so it shouldn't have much impact. But if you do see useless movzx being generated, file issues. There are a lot of things that the JIT can't do but there are also quite a few things that the JIT could do, provided someone bothers with writing the necessary code.

@redknightlois
Copy link

@GrabYourPitchforks Check the examples here (https://github.com/dotnet/coreclr/issues/1306) to see if they are impacted, if there are, feel free to close the issue :)

@GrabYourPitchforks
Copy link
Member Author

@redknightlois Yes, it would help your Framework test case but not your IfThenElse test case. That would require a compiler or JIT change. See #16121 (comment) for some discussion on this point.

@mikedn
Copy link

mikedn commented Feb 5, 2018

With if-conversion enabled in the JIT the existing Convert.ToInt32(bool) implementation generates:

test     cl, cl
setne    al
movzx    rax, al

The example from Base64.DecodeFromUtf8 generates

test     r8d, r8d
setne    al
movzx    rax, al
mov      r9d, eax
shl      r9d, 2

@GrabYourPitchforks
Copy link
Member Author

I am optimistic that #16156 may finally address this once and for all. If the JIT is able to handle simple ternary cases like Convert.ToInt32(bool), we can address the more "complex" cases like Int32.CompareTo(Int32) individually:

int Compare(int x, int y) {
  return ((x >= y) ? 1 : 0) - ((x <= y) ? 1 : 0);
}

Which in theory results in this codegen:

cmp ecx, edx ; ecx = x, edx = y
setge al ; al = 0 if x < y, 1 if x >= y
movzx eax, al ; elided by processor
cmp ecx, edx ; ecx = x, edx = y
setle bl ; bl = 1 if x <= y, 0 if x > y
movzx ebx, bl ; elided by processor
; At this point, three possibilities:
; x < y => eax = 0, ebx = 1
; x > y => eax = 1, ebx = 0
; x == y => eax = 1, ebx = 1
sub eax, ebx ; eax contains result

@mikedn
Copy link

mikedn commented Feb 5, 2018

Which in theory results in this codegen:

Yep, it generates:

       3BCA                 cmp      ecx, edx
       0F9DC0               setge    al
       0FB6C0               movzx    rax, al
       3BCA                 cmp      ecx, edx
       0F9EC2               setle    dl
       0FB6D2               movzx    rdx, dl
       2BC2                 sub      eax, edx

But it's probably a bad idea. If you have code like if (Compare(x, y) > 0) Swap(x, y) and Compare inlines (and probably you hope to inline) then the result is messy since the expression is too complicated for the JIT to figure out that it really means x > y.

If I enable nested if-conversion we might be able to get something like:

       3BCA                 cmp      ecx, edx
       B8FFFFFFFF           mov      eax, -1
       0F9FC2               setg     dl
       0FB6D2               movzx    rdx, dl
       0F4CD0               cmovl    edx, eax
       8BC2                 mov      eax, edx

@GrabYourPitchforks
Copy link
Member Author

But it's probably a bad idea. If you have code like if (Compare(x, y) > 0) Swap(x, y) and Compare inlines (and probably you hope to inline) then the result is messy since the expression is too complicated for the JIT to figure out that it really means x > y.

I'm not too worried about that TBH. The existing sort routines in the framework special-case when T is an integral type, so they don't even call Int32.CompareTo. The only callers who presumably would ever be using this method are going through the IComparable interface, which implies no inlining anyway.

@stephentoub
Copy link
Member

stephentoub commented Feb 5, 2018

The only callers who presumably would ever be using this method are going through the IComparable interface, which implies no inlining anyway.

Except with work that's been done recently around inlining and EqualityComparer<T>.Default , would it not impact uses of EqualityComparer<T>.Default where T == int?
#14125

@GrabYourPitchforks
Copy link
Member Author

Except with work that's been done recently around inlining and EqualityComparer.Default , would it not impact uses of EqualityComparer.Default where T == int?

Shouldn't, since IEquatable<T> is a separate interface and its implementation is untouched.

@mikedn
Copy link

mikedn commented Feb 5, 2018

The existing sort routines in the framework special-case when T is an integral type, so they don't even call Int32.CompareTo. The only callers who presumably would ever be using this method are going through the IComparable interface, which implies no inlining anyway.

You have no way of knowing how such methods are used. And we're talking about going from something like:

G_M55887_IG01:
       C5F877               vzeroupper
G_M55887_IG02:
       3BCA                 cmp      ecx, edx
       7C0A                 jl       SHORT G_M55887_IG05
G_M55887_IG03:
       3BCA                 cmp      ecx, edx ; ??!
       C4E17A10442428       vmovss   xmm0, dword ptr [rsp+28H]
G_M55887_IG04:
       C3                   ret
G_M55887_IG05:
       C4E17828C3           vmovaps  xmm0, xmm3
G_M55887_IG06:
       C3                   ret

to something like:

G_M55887_IG01:
       C5F877               vzeroupper
G_M55887_IG02:
       3BCA                 cmp      ecx, edx
       0F9DC0               setge    al
       0FB6C0               movzx    rax, al
       3BCA                 cmp      ecx, edx
       0F9EC2               setle    dl
       0FB6D2               movzx    rdx, dl
       2BC2                 sub      eax, edx
       85C0                 test     eax, eax
       7C08                 jl       SHORT G_M55887_IG04
       C4E17A10442428       vmovss   xmm0, dword ptr [rsp+28H]
G_M55887_IG03:
       C3                   ret
G_M55887_IG04:
       C4E17828C3           vmovaps  xmm0, xmm3
G_M55887_IG05:
       C3                   ret

@stephentoub
Copy link
Member

stephentoub commented Feb 5, 2018

Shouldn't, since IEquatable is a separate interface and its implementation is untouched.

And when the same thing is done for Comparer<T>.Default?

@GrabYourPitchforks
Copy link
Member Author

And when the same thing is done for Comparer.Default?

The sort routines don't even use this implementation, and I assume that wouldn't change going forward. This whole exercise would be a no-op for that scenario.

You could perhaps make the argument that methods like Enumerable.Min<T> could be impacted (if the compiler somehow chose the generic overload instead of the int-specific one, which I'm having a difficult time reconciling how this would happen in a real world application).

But per @mikedn's earlier comment, if nested if-conversion ends up in scope, then there's nothing for us to do here anyway since the JIT can always do the right thing. :)

@stephentoub
Copy link
Member

stephentoub commented Feb 5, 2018

The sort routines don't even use this implementation

Sure they do. Not if T==int, but if for example T=ValueTuple<int,whatever>, e.g.
https://source.dot.net/#System.Private.CoreLib/shared/System/ValueTuple.cs,576
That would use Int32.CompareTo.

@redknightlois
Copy link

@GrabYourPitchforks Any idea when a fix for the bool to int branching code is going to be merged into trunk (even if not fixed at the JIT level).

I am redesigning an algorithm to be cache aware and have code like this:

st1 |= *(ptr + 0) == *(ptr + offset + 0) ? 0ul : 1ul;
st2 |= *(ptr + 1) == *(ptr + offset + 1) ? 0ul : 1ul;
st3 |= *(ptr + 2) == *(ptr + offset + 2) ? 0ul : 1ul;
st4 |= *(ptr + 3) == *(ptr + offset + 3) ? 0ul : 1ul;

This loop is hurting my eyes (and performance):

                        st1 |= *(ulong*)(ptr + 0) == *(ulong*)(ptr + offset + 0) ? 0ul : 1ul;
48 8B 19             mov         rbx,qword ptr [rcx]  
4A 3B 1C 21          cmp         rbx,qword ptr [rcx+r12]  
74 1C                je          00007FFA2B390E37  
BB 01 00 00 00       mov         ebx,1  
EB 17                jmp         00007FFA2B390E39  
48 89 4D 28          mov         qword ptr [rbp+28h],rcx  
48 8B D8             mov         rbx,rax  
EB 6A                jmp         00007FFA2B390E95  
48 8B 9D B8 00 00 00 mov         rbx,qword ptr [rbp+0B8h]  
E9 2B 02 00 00       jmp         00007FFA2B391062  
33 DB                xor         ebx,ebx  
48 0B C3             or          rax,rbx  
48 8B D8             mov         rbx,rax  
       st2 |= *(ulong*)(ptr + 8) == *(ulong*)(ptr + offset + 8) ? 0ul : 1ul;
48 8B 41 08          mov         rax,qword ptr [rcx+8]  
4A 3B 44 21 08       cmp         rax,qword ptr [rcx+r12+8]  
74 07                je          00007FFA2B390E51  
B8 01 00 00 00       mov         eax,1  
EB 02                jmp         00007FFA2B390E53  
33 C0                xor         eax,eax  
4C 0B D0             or          r10,rax  
       st3 |= *(ulong*)(ptr + 16) == *(ulong*)(ptr + offset + 16) ? 0ul : 1ul;
48 8B 41 10          mov         rax,qword ptr [rcx+10h]  
4A 3B 44 21 10       cmp         rax,qword ptr [rcx+r12+10h]  
74 07                je          00007FFA2B390E68  
B8 01 00 00 00       mov         eax,1  
EB 02                jmp         00007FFA2B390E6A  
33 C0                xor         eax,eax  
4C 0B D8             or          r11,rax  
       st4 |= *(ulong*)(ptr + 24) == *(ulong*)(ptr + offset + 24) ? 0ul : 1ul;
48 8B 41 18          mov         rax,qword ptr [rcx+18h]  
4A 3B 44 21 18       cmp         rax,qword ptr [rcx+r12+18h]  
74 07                je          00007FFA2B390E7F  
B8 01 00 00 00       mov         eax,1  
EB 02                jmp         00007FFA2B390E81  
33 C0                xor         eax,eax  
4C 0B C8             or          r9,rax  

@GrabYourPitchforks
Copy link
Member Author

@redknightlois There are no plans to merge this PR if #16156 is resolved in a timely manner. Check with those reviewers to see what their timelines are. And just in case this does fall through I can post a workaround for your scenario when I'm not on mobile.

@GrabYourPitchforks
Copy link
Member Author

@redknightlois Consider this as a temporary workaround until the JIT changes come online.

[MethodImpl(MethodImpOptions.AggressiveInlining)]
static byte ToByte(bool value) {
  return Unsafe.As<bool, byte>(ref value);
}

st1 |= ToByte(*(ptr + 0) != *(ptr + offset + 0));
// ...

@mikedn
Copy link

mikedn commented Feb 19, 2018

movzx eax, al ; elided by processor

It looks like these aren't actually zero latency moves, at least on current Intel CPUs. They used to be zero latency on Ivy Bridge and only when the destination register is different from the source register.

That's unfortunate, the alternative is to zero out the destination register of SETcc using a XOR but that's problematic to do in the JIT's IR.

@RussKeldorph RussKeldorph added area-VM * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Apr 13, 2018
@RussKeldorph RussKeldorph added this to the Future milestone Apr 13, 2018
@GrabYourPitchforks
Copy link
Member Author

This really belongs in #16156.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.