
cmov shim (used for Math.Min/Max) #16121

Closed · wants to merge 1 commit
Conversation

benaadams
Member

Mostly an experiment

/cc @mikedn @GrabYourPitchforks

@benaadams
Member Author

Changes String:Compare(ref,int,ref,int,int,ref,int):int from

G_M25396_IG10:
       mov      eax, dword ptr [rcx+8]
       sub      eax, edx
       cmp      r10d, eax
       jle      SHORT G_M25396_IG11
       jmp      SHORT G_M25396_IG12
G_M25396_IG11:
       mov      eax, r10d
G_M25396_IG12:
       mov      r11d, dword ptr [r8+8]
       sub      r11d, r9d
       cmp      r10d, r11d
       jle      SHORT G_M25396_IG13
       jmp      SHORT G_M25396_IG14
G_M25396_IG13:
       mov      r11d, r10d

To

G_M25400_IG10:
       mov      eax, dword ptr [rcx+8]
       sub      eax, edx
       cmp      r10d, eax
       setle    r11b
       movzx    r11, r11b
       neg      r11d
       mov      esi, r10d
       sub      esi, eax
       and      r11d, esi
       add      r11d, eax
       mov      eax, dword ptr [r8+8]
       sub      eax, r9d
       cmp      r10d, eax
       setle    sil
       movzx    rsi, sil
       neg      esi
       mov      edi, r10d
       sub      edi, eax
       and      esi, edi
       add      esi, eax
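
For reference, the setle/movzx/neg/and/add sequence above corresponds to a conditional-select shim roughly like the following. This is a reconstruction from the disassembly, not the actual JitHelpers.ConditionalSelect source, and the class and parameter names are my own:

using System.Runtime.CompilerServices;

internal static class CmovShim
{
    internal static int ConditionalSelect(bool condition, int whenTrue, int whenFalse)
    {
        // -1 when the condition is true, 0 when it is false (the setle/movzx/neg above).
        int mask = -Unsafe.As<bool, byte>(ref condition);
        // Blend via the mask (the sub/and/add above); Min(a, b) would be ConditionalSelect(a <= b, a, b).
        return ((whenTrue - whenFalse) & mask) + whenFalse;
    }
}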

But size regressions...

Total bytes of diff: 706 (0.02% of base)
    diff is a regression.

Total byte diff includes 38 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    1 unique methods,       38 unique bytes

Top file regressions by size (bytes):
         706 : System.Private.CoreLib.dasm (0.02% of base)

1 total files with size differences (0 improved, 1 regressed), 0 unchanged.

Top method regressions by size (bytes):
          46 : System.Private.CoreLib.dasm - EventSource:CreateManifestAndDescriptors(ref,ref,ref,int):ref
          38 : System.Private.CoreLib.dasm - JitHelpers:ConditionalSelect(bool,int,int):int (0/2 methods)
          28 : System.Private.CoreLib.dasm - ConcurrentDictionary`2:Clear():this (2 methods)
          27 : System.Private.CoreLib.dasm - String:CompareOrdinal(ref,int,ref,int,int):int
          23 : System.Private.CoreLib.dasm - FileStream:ReadAsyncInternal(struct,struct,byref):ref:this
          22 : System.Private.CoreLib.dasm - StringBuilder:CopyTo(int,struct,int):this
          22 : System.Private.CoreLib.dasm - MemberInfoCache`1:MergeWithGlobalList(ref):this
          20 : System.Private.CoreLib.dasm - ConcurrentDictionary`2:GrowTable(ref):this (2 methods)
          19 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,ref,int):int
          19 : System.Private.CoreLib.dasm - StringBuilder:ReplaceInPlaceAtChunk(byref,byref,long,int):this
          17 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,bool):int
          16 : System.Private.CoreLib.dasm - StringBuilder:.ctor(ref,struct):this
          16 : System.Private.CoreLib.dasm - StringBuilder:MakeRoom(int,int,byref,byref,bool):this
          15 : System.Private.CoreLib.dasm - ConcurrentQueue`1:EnqueueSlow(ref):this
          15 : System.Private.CoreLib.dasm - PerCoreLockedStacks:.ctor():this (3 methods)
          14 : System.Private.CoreLib.dasm - Math:Max(int,int):int (2 methods)
          14 : System.Private.CoreLib.dasm - Math:Min(int,int):int (2 methods)
          14 : System.Private.CoreLib.dasm - NameInfo:ReserveEventIDsBelow(int)
          14 : System.Private.CoreLib.dasm - StringBuilder:AppendCore(ref,int,int):ref:this
          14 : System.Private.CoreLib.dasm - StringBuilder:Replace(ushort,ushort,int,int):ref:this
          13 : System.Private.CoreLib.dasm - String:CompareOrdinal(struct,struct):int
          13 : System.Private.CoreLib.dasm - CalendarData:GetCalendars(ref,bool,ref):int
          12 : System.Private.CoreLib.dasm - PinnableBufferCache:TrimFreeListIfNeeded():bool:this
          12 : System.Private.CoreLib.dasm - EventSource:SendManifest(ref):bool:this
          12 : System.Private.CoreLib.dasm - CompareInfo:CompareOrdinalIgnoreCase(struct,struct):int

Top method improvements by size (bytes):
         -14 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,int):int
          -2 : System.Private.CoreLib.dasm - Number:TryInt32ToHexStr(int,ushort,int,struct,byref):bool
          -1 : System.Private.CoreLib.dasm - Number:TryUInt32ToDecStr(int,int,struct,byref):bool

61 total methods with size differences (3 improved, 58 regressed), 17092 unchanged.

@mikedn

mikedn commented Jan 31, 2018

But size regressions...

That's likely a relatively minor issue. The real problem is that when branch mispredictions aren't an issue, even a real cmov tends to be slightly slower. But this... well, it's unlikely to be good. And you're lucky that Unsafe.As doesn't just make things worse.

@mikedn

mikedn commented Jan 31, 2018

Presumably the String.Compare code you're showing is this:

int lengthA = length;
int lengthB = length;
if (strA != null)
    lengthA = Math.Min(lengthA, strA.Length - indexA);
if (strB != null)
    lengthB = Math.Min(lengthB, strB.Length - indexB);

If you have reasons to try to improve this, probably the first attempt would be to get rid of Math.Min and do something like:

if (strA.Length - indexA < lengthA)
    lengthA = strA.Length - indexA;

The x = Math.Min(x, y) form is unfortunately problematic.

And then, if branch prediction is an issue, you'd probably want to try something like the code below. Not sure it's completely correct, but if it works, this approach is preferable to a solution that requires a compare and a conversion from bool to int.

int delta = (strA.Length - indexA) - length;
delta = delta >> 31;
lengthA = (delta & (strA.Length - indexA)) | (~delta & lengthA);
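
For clarity, here is the same sign-of-difference select wrapped in a standalone helper (a minimal sketch of my own, assuming a - b does not overflow); in practice you would write it inline at the call site exactly as above rather than reintroduce a Min-style call:

static int BranchlessMin(int a, int b)
{
    int mask = (a - b) >> 31;        // -1 when a < b, 0 otherwise (assumes no overflow)
    return (mask & a) | (~mask & b); // picks a when mask is -1, b when mask is 0
}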

@redknightlois

redknightlois commented Jan 31, 2018

Our experience dealing with string comparison suggests @mikedn is right. The size checks are typically not the issue; the actual memory comparison is, and there the biggest bottleneck is memory latency. I would seriously kill for the ability to emit CMOV in the individual cases where you can certainly win, but in the string contexts where Math.Min/Math.Max appears in the framework it is probably not worth it.

EDIT: Now that you've used a C# shim, emitting the cmov itself may prove a potential improvement :).

@GrabYourPitchforks
Member

Yeah, for Math.Min I'm not convinced there's much value add since this pattern really works better with constant inputs. But there are other APIs that show substantial perf gains, e.g., +60% for Int32.CompareTo with a slightly tweaked pattern. I'll investigate PRs for those.

@benaadams
Member Author

Really I'd just like cmov 😄

@tannergooding
Member

tannergooding commented Jan 31, 2018

Really I'd just like cmov

It might be reasonable as a hardware intrinsic

@benaadams
Member Author

Just for interest, the codegen changes for a Min(Max(..., ...), ...) clamp, posted before I close so the image isn't lost:
[image: codegen diff for the clamp pattern, not preserved]

@jkotas
Member

jkotas commented Jan 31, 2018

It might be reasonable as a hardware intrinsic

I do not think so. It is something that the JIT should just pattern match.

@redknightlois

@jkotas Not always... there are cases where cmov is the right tool for the job, but unless you can tell the JIT to emit cmov when you really want it, JIT pattern matching won't solve the issue. So unless the JIT unilaterally decides that conditional operations should prefer cmov while if-then-else keeps the JNE pattern (I don't think it is even possible to tell the two apart from the IL), you still need a way to disambiguate the usage (intrinsic or not).

@jkotas
Member

jkotas commented Jan 31, 2018

Is there prior art where folks have to tell C/C++ compilers when to emit cmov?

@tannergooding
Member

Is there prior art where folks have to tell C/C++ compilers when to emit cmov

I believe most C/C++ compilers will do it automatically, but it also depends on several options (such as optimizing for size versus speed) and sometimes on more complex analysis, which they have time to do (since the conditions under which cmov is recommended are fairly strict).

@redknightlois

redknightlois commented Feb 1, 2018

@jkotas @tannergooding Actually you are asking yourself a trick question here.

If the underlying question is the strict one, "Can you force the C/C++ compiler to emit a cmov instruction?", the answer is NO. It will decide when to use cmov and when not to, based on its own analysis. Different compilers will do contradictory things for the same code.

However, if the underlying question is the looser one, "Can I, in C/C++, emit a cmov instruction when I need it?", then the answer is quite different.

Branchless Binary Search core loop: [EDIT: Wrong algorithm]

    __asm__ (
        "comiss %[xi_mid],%[z]\n"
        "cmovb %[mid],%[hi]\n"
        "cmovae %[mid],%[lo]\n"
        : [hi] "+r"(hi), [lo] "+r"(lo)
        : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
        : "cc"
    );

@AndyAyersMS
Member

I'm not aware of any such way to influence native compilers, but you never know.

The analysis for cmov isn't particularly complex or costly. Certain idioms are well known to be inherently unpredictable (max, min, etc) and you can start with things that look like those. But it's best done fairly late and the JIT's IR is not really as friendly here as it could be.

So the issue around cmov is more of ROI: it is a lot of work for something that has very selective impact.

@redknightlois

Adding to what @AndyAyersMS commented: a year or so ago I proposed being able to give the JIT extra information about which branch is expected, and also an 'unpredictable' hint like Assume.Unpredictable, where the JIT would try to emit a cmov instruction instead. That is a different approach that doesn't need an intrinsic to emit it directly.

@GrabYourPitchforks
Member

GrabYourPitchforks commented Feb 1, 2018

Here's an alternative Math.Max(int, int) implementation that's slightly more optimized than the current proposal.

        [NonVersionable]
        public static int Max(int val1, int val2)
        {
            // 0 when val1 >= val2 (true reinterpreted as byte 1), otherwise -1 (all bits set).
            var selectVal2Mask = JitHelpers.BoolToByteNonNormalized(val1 >= val2) - 1;
            // Blend: yields val1 when the mask is 0, val2 when the mask is -1.
            return (val2 & selectVal2Mask) + (val1 & ~selectVal2Mask);
        }

This gives roughly 2.8× the throughput on my machine for random inputs (old: 2,150 ms for 500MM iterations; new: 770 ms for 500MM iterations).

Of course, cmov would be better. ;)

@benaadams
Member Author

Based on JitHelpers.BoolToByteNonNormalized in #16138

@mikedn

mikedn commented Feb 1, 2018

I believe most C/C++ compilers will do it automatically, but it also depends on several options (such as optimizing for size versus speed)

All well-known C/C++ compilers are pretty aggressive at CMOV generation. Small/fast code options usually don't affect this, since CMOV tends to be more or less good for both. What I would expect to affect CMOV generation is something like PGO: if a compiler knows that a branch is rarely taken, it can fall back to normal conditional code.

But it's best done fairly late and the JIT's IR is not really as friendly here as it could be.

I don't know. I did it in lowering and the JIT's IR worked well enough. The only thing I found rather cumbersome is that the JIT's IR doesn't have instructions and operands but a soup of nodes that are both. This tends to make the pattern matching code slightly more complex. It probably would have helped if containment pulled contained nodes out of the linear order so they look more like "operands".

So the issue around cmov is more of ROI: it is a lot of work for something that has very selective impact.

I'm happy to do such work if the JIT team supports it by reviewing, suggesting improvements, etc. I already did an experiment about a year ago, but back then the JIT team provided basically no feedback...

This gives roughly 2.8× the throughput on my machine for random inputs (old: 2,150 ms for 500MM iterations; new: 770 ms for 500MM iterations).

How well does that work on not so random input (e.g. a mostly sorted array)?

@AndyAyersMS
Member

Re cmov: I worry about introducing mid-block flag consumers given that we don't track flag liveness. But I suppose, given decomposition of large int ops and unordered float compares, we're already in this boat. And it's also a blessing that we don't do other late peepholes, though the code cries out for it in places.

Sorry that we didn't get back to you about your earlier experiment. If you're willing to give it another go I'll make sure we pay attention.

@mikedn

mikedn commented Feb 1, 2018

I worry about introducing mid-block flag consumers given that we don't track flag liveness.

I worry about that too but it seems to work quite well in practice. At least for as long as LSRA doesn't start doing more fancy things...

And it's also a blessing that we don't do other late peepholes, though the code cries out for it in places.

Any examples? It would be useful to have a list to see what may potentially interfere with flags.

@GrabYourPitchforks
Member

How well does that work on not so random input (e.g. a mostly sorted array)?

@mikedn Assuming perfect branch prediction, the perf hit for the non-branching version is around -30%. The Max and Min functions are a bit interesting because frequently one of the parameters passed in is constant, and there are optimizations one can do if you can get this baked in at JIT time. I don't know how to make that work in straight-up IL however.
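
As a concrete illustration of the constant case (my own sketch, not code from the PR): clamping to zero needs neither a compare nor a bool-to-int conversion, just the sign mask of the single variable operand.

static int MaxWithZero(int x) => x & ~(x >> 31); // Max(x, 0): x when x >= 0, otherwise 0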

@mikedn

mikedn commented Feb 1, 2018

Assuming perfect branch prediction, the perf hit for the non-branching version is around -30%.

Right, with cmov the impact would be lower, 5-10% AFAIR.

The Max and Min functions are a bit interesting because frequently one of the parameters passed in is constant, and there are optimizations one can do if you can get this baked in at JIT time.

Yes, most of this revolves around the JIT recognizing c ? x : y. Once that is done it's pretty easy to recognize other patterns on top of that (e.g. turning c ? 0 : 4 into setcc/shl 2).

I need to dust off my if-conversion experiment. The biggest unknown about it is the impact on JIT throughput; back then I didn't have the necessary tools to measure it. Then maybe we can put this matter to rest one way or another.

@redknightlois

redknightlois commented Feb 1, 2018

@mikedn I have a benchmark I am working on that has this particular idiom.

We can use it for benchmarking:

st1 |= (*(ptr + 0) ^ *(ptr + offset + 0)) == 0 ? 0ul : 1ul;

Code generated running on master (2 days ago) is ugly as hell:

mov         rbp,qword ptr [rdx]  
xor         rbp,qword ptr [rdx+rcx*8]  
je          00007FFC5F8CA8EA  
mov         ebp,1  
jmp         00007FFC5F8CA8EC  
xor         ebp,ebp  
or          r9,rbp 

EDIT: @AndyAyersMS @jkotas any plan to expose JitHelpers as a public API? (If it is already exposed, what package do I import to use it?)

@mikedn

mikedn commented Feb 1, 2018

st1 |= (*(ptr + 0) ^ *(ptr + offset + 0)) == 0 ? 0ul : 1ul;

This doesn't even require CMOV; it should generate something like:

mov         rbp,qword ptr [rdx]  
xor         rbp,qword ptr [rdx+rcx*8]  
setne       al
movzx       eax, al ; actually optional, or can be narrowed to "or r9b, al"
or          r9, rax

@GrabYourPitchforks
Member

That code can use Convert.ToUInt64(bool) directly, which will use a branchless SETcc under the covers once the outstanding PR is merged.
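
Applied to the earlier snippet, the rewrite would look roughly like this (my own sketch; it assumes ptr is a ulong*, matching the 64-bit loads in the disassembly above, and offset is a long index):

// Accumulate a "differs" flag without a ternary: Convert.ToUInt64(bool) yields 0 or 1.
static unsafe ulong AccumulateDifference(ulong st1, ulong* ptr, long offset)
{
    st1 |= Convert.ToUInt64((*(ptr + 0) ^ *(ptr + offset + 0)) != 0);
    return st1;
}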

@redknightlois

redknightlois commented Feb 1, 2018

Yes, most of this revolves around the JIT recognizing c ? x : y. Once that is done it's pretty easy to recognize other patterns on top of that (e.g. turning c ? 0 : 4 into setcc/shl 2).

@mikedn yes I know... It is a special case of that last one. It probably should have read 'that idiom' instead of 'this idiom' to disambiguate.
@GrabYourPitchforks Awesome, looking forward to the PR being merged and showing up in master :)

@AndyAyersMS
Member

I think we're in somewhat murky waters here as far as what we are willing to do for performance.

On the one hand, and speaking as a jit developer, I'd really prefer that we invest in improving the capabilities of the jit.

The sorts of patterns discussed here and in other related issues are common and arise in both predictable and unexpected ways in user code (in particular, a lot of them only become visible after inlining). The optimal expansions are often machine- and context-dependent, and there are a lot of them.

Encapsulating these patterns via helpers/intrinsics/etc. leads to code that is studded with odd performance barnacles. It leads to a kind of performance mysticism, giving off the impression that one must resort to contortions and somewhat arcane practices to get good performance out of C#. And by presuming to handle the important perf cases this way, there's then a bit less incentive for the jit team to go in and fix the underlying issues, as "anyone who cares" has presumably already coded around the problems.

On the other hand, modern CPUs have a number of performance cliffs and the perf gap between naive codegen and optimal codegen can be quite large. So I get that it's frustrating as a developer to not be able to close this gap when performance really is a key factor.

In cases like the HW intrinsics where it's very unlikely the JIT will ever be able to organically use the full set of operations the machine provides, exposing the raw hardware capabilities is reasonable. But in cases where we're talking about more basic idiomatic patterns I'd prefer we hold off and give the JIT team a chance to step up and fix the problems.

On the (third?) hand we also need to be pragmatic and realize that the JIT team is relatively small and moves relatively slowly.

So I think that leads us to a case-by-case evaluation of various proposals, and here I would like to see us more clearly provide guidance on what sorts of perf justify what sorts of changes.

@jkotas
Member

jkotas commented Feb 1, 2018

But in cases where we're talking about more basic idiomatic patterns I'd prefer we hold off and give the JIT team a chance to step up and fix the problems.

I agree. These are good experiments to demonstrate the benefits, but I do not think we want to be adding cryptic code like this to the framework. We should wait for the JIT fix.

@GrabYourPitchforks
Member

I wonder if some of this work can be offloaded onto Roslyn instead of RyuJIT. For example, for the pattern (condition) ? 1 : 0, Roslyn could emit just the comparison followed by a stloc, which would cause today's JIT to emit the desired cmp, setcc, movzx sequence. Would that be a better allocation of resources?

@mikedn

mikedn commented Feb 5, 2018

I'm not sure where the String.Compare example came from; the one I found has some null checks that are missing from the presented example code. In any case, the Math.Min part becomes:

       test     rsi, rsi
       je       SHORT G_M30462_IG04
       mov      eax, dword ptr [rsi+8]
       sub      eax, ebx
       cmp      r12d, eax
       mov      r15d, eax
       cmovle   r15d, r12d
G_M30462_IG04:
       test     rdi, rdi
       je       SHORT G_M30462_IG05
       mov      eax, dword ptr [rdi+8]
       sub      eax, r14d
       cmp      r12d, eax
       cmovle   eax, r12d
       mov      r12d, eax
G_M30462_IG05:

@benaadams
Member Author

I'm not sure where the String.Compare example came from

String:Compare was mostly from looking at the regressions; not as an example of a good result 😄

@benaadams
Member Author

int Math.Clamp(int value, int min, int max)

Might be something to look at, and it's probably what the Min(Max(..., ...), ...) call sites should be using.
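
For reference, a minimal sketch of the equivalent composition (assuming the usual argument order value, min, max with min <= max):

static int Clamp(int value, int min, int max) => Math.Min(Math.Max(value, min), max);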

@benaadams
Member Author

Only found 2 that should be clamp, both in StringBuilder #16219

@benaadams
Member Author

Closing this, as the investigation into cmov has been reborn.

@benaadams benaadams closed this Feb 9, 2018
@benaadams benaadams deleted the cmov branch February 9, 2018 23:29
@EgorBo
Member

EgorBo commented Sep 11, 2019

I wonder why it can't be just:

[NonVersionable]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int Max(int val1, int val2)
{
    if (Sse41.IsSupported)
    {
        Vector128<int> vec1 = Vector128.CreateScalarUnsafe(val1);
        Vector128<int> vec2 = Vector128.CreateScalarUnsafe(val2);
        return Sse41.Max(vec1, vec2).ToScalar();
    }
    return (val1 >= val2) ? val1 : val2;
}
// or Avx2.Max and ymm

(vpmaxsd instead of cmov or branches)

A small benchmark: https://gist.github.com/EgorBo/9e99d6958f4db162c075badaac29b5f2
case 1: random data

|   Method |       Mean |     Error |    StdDev | Ratio |
|--------- |-----------:|----------:|----------:|------:|
|  MathMax | 3,331.9 ns |  5.557 ns |  5.198 ns |  1.00 |
| Sse41Max |   999.7 ns | 16.600 ns | 15.528 ns |  0.30 |
|  Avx2Max |   984.7 ns |  2.482 ns |  2.200 ns |  0.30 |

case 2: equal data

|   Method |     Mean |    Error |   StdDev | Ratio |
|--------- |---------:|---------:|---------:|------:|
|  MathMax | 822.2 ns | 4.274 ns | 3.998 ns |  1.00 |
| Sse41Max | 991.2 ns | 8.321 ns | 7.376 ns |  1.21 |
|  Avx2Max | 986.6 ns | 2.800 ns | 2.338 ns |  1.20 |

Some C++ compilers also do it: https://godbolt.org/z/8TDdoh
