
RyuJit: avoid conditional jumps using cmov and similar instructions #6749

Closed
svick opened this issue Oct 2, 2016 · 52 comments · Fixed by #81267
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization, tenet-performance (Performance related issue)
Milestone: Future

Comments

@svick (Contributor) commented Oct 2, 2016

Conditional jumps, especially those that are hard to predict, are fairly expensive, so they should be avoided where possible. One way to avoid them is to use conditional moves and similar instructions (like sete). As far as I can tell, RyuJit never does this, and I think it should.

For example, take these two methods:

[MethodImpl(MethodImplOptions.NoInlining)]
static long sete_or_mov(bool cond) {
    return cond ? 4 : 0;
}

[MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue) {
    long tmp1 = longValue & 0x00000000ffffffff;
    return tmp1 == 0 ? longValue : tmp1;
}

For both of them, RyuJit generates a conditional jump:

; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,   3  )    bool  ->  rcx
;  V01 tmp0         [V01,T01] (  3,   2  )     int  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M60330_IG01:

G_M60330_IG02:
       84C9                 test     cl, cl
       7504                 jne      SHORT G_M60330_IG03
       33C0                 xor      eax, eax
       EB05                 jmp      SHORT G_M60330_IG04

G_M60330_IG03:
       B804000000           mov      eax, 4

G_M60330_IG04:
       4863C0               movsxd   rax, eax

G_M60330_IG05:
       C3                   ret

; Total bytes of code 17, prolog size 0 for method Program:sete_or_mov(bool):long
; ============================================================
; Assembly listing for method Program:cmov(long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,   3.5)    long  ->  rcx
;  V01 loc0         [V01,T01] (  3,   2.5)    long  ->  rax
;# V02 OutArgs      [V02    ] (  1,   1  )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M53075_IG01:

G_M53075_IG02:
       B8FFFFFFFF           mov      eax, 0xFFFFFFFF
       4823C1               and      rax, rcx
       4885C0               test     rax, rax
       7401                 je       SHORT G_M53075_IG04

G_M53075_IG03:
       C3                   ret

G_M53075_IG04:
       488BC1               mov      rax, rcx

G_M53075_IG05:
       C3                   ret

; Total bytes of code 18, prolog size 0 for method Program:cmov(long):long
; ============================================================

For comparison, here are the same methods compiled with Clang and GCC at -O1 (via Compiler Explorer):

GCC 6.2:

sete_or_mov(bool):
        test    dil, dil
        setne   al
        movzx   eax, al
        sal     rax, 2
        ret
cmov(unsigned long):
        mov     eax, edi
        test    rax, rax
        cmove   rax, rdi
        ret

Clang 3.9.0:

sete_or_mov(bool):                       # @sete_or_mov(bool)
        movzx   eax, dil
        shl     rax, 2
        ret

cmov(unsigned long):                               # @cmov(unsigned long)
        mov     eax, edi
        mov     ecx, 4294967295
        and     rcx, rdi
        cmove   rax, rdi
        ret

category:cq
theme:basic-cq
skill-level:expert
cost:large
impact:small

@benaadams (Member) commented Oct 2, 2016

I think there is code for it, but @jamesqo noticed it was never getting executed in dotnet/coreclr#6647 (comment).

The function is called CodeGen::genCodeForQmarkWithCMOV.

@svick (Contributor, Author) commented Oct 2, 2016

@benaadams That code is in codegenlegacy.cpp, which, as far as I can tell, is the pre-RyuJit codegen and does not even support x64. So I'm not sure how relevant that is for RyuJit on x64.

@benaadams (Member)

Might be why it's not called, then 😆

@jamesqo (Contributor) commented Oct 2, 2016

@benaadams I believe I tested it with x86 builds without enabling the x86 RyuJIT (which is disabled by default). Feel free to check for yourself, though, in case I messed something up.

@benaadams (Member)

There does seem to be a lot of cmov-related code, though I've never seen it in the output.

@cmckinsey (Contributor)

We used to generate CMOV/SETCC from our internal QMark operator; however, we ended up expanding QMark into short-circuit flow after the importer to simplify handling in downstream phases. As I recall, the plan was to enable a more general if-conversion phase that would identify simple cmov/setcc patterns before lowering and reconstitute them. /cc @dotnet/jit-contrib @russellhadley

I believe JIT32 (used today on x86) does generate at least SETCC, if not CMOV.

Regardless of the current state, I think we all agree with the previous comments: both are pretty powerful features of the IA architecture and worth generating. They often result in smaller and faster code, without the possibility of expensive misprediction penalties.

@russellhadley (Contributor)

I agree, it's pretty important to handle both of these cases. In terms of priority, @svick, where is this coming from? Is it in a particular workload or benchmark? What's the opportunity here?

@svick (Contributor, Author) commented Oct 3, 2016

@russellhadley I noticed this when I tried to modify this code in Kestrel to use cmov and failed.

But that was just "Hey, this code could be faster! No? Too bad." I don't know whether it would actually have a measurable impact on the overall performance of Kestrel, so I don't think you should prioritize this based on me.

@russellhadley (Contributor)

@svick It's still Kestrel, and it's in one of the methods we want to get inlined into the MemoryAllocatorIterator::Seek routine and optimized, so it had some priority. Thanks for the heads up. :)

@choikwa (Contributor) commented Oct 3, 2016

@svick, that code is probably faster with bsf/tzcnt (count trailing zeros) and a logical shift left instead of cmov/conditionals. Whether the JIT knows to generate them is another question.
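
As an illustration, a minimal sketch of the branchless shape being suggested, written against the modern System.Numerics.BitOperations API (an assumption here; in 2016 this would have needed a de Bruijn table or a JIT intrinsic):

using System.Numerics;

static int FirstSetBitIndex(ulong mask)
{
    // tzcnt/bsf: index of the lowest set bit, no conditional jump involved
    return BitOperations.TrailingZeroCount(mask);
}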

@benaadams (Member)

@choikwa But ternary operators, for example, are probably a much more common pattern across the ecosystem and a better first step to improve; they're also likely simpler to recognise, since it's a language construct.

I have an improvement for that particular section of code in Kestrel in PR aspnet/KestrelHttpServer#1138.

@mikedn (Contributor) commented Oct 7, 2016

@cmckinsey @russellhadley Is anyone working on this? FYI, I'm trying to add support for CMOV to RyuJIT's codegen for a different issue. Maybe I can take a look at this one after that.

@pgavlin (Contributor) commented Oct 7, 2016

AFAIK nobody is looking into this at the moment. Feel free to dive in :)

@russellhadley (Contributor)

@mikedn I think @benaadams has provided a branchless version of the code that motivated this in Kestrel, but getting to the point where we can generate CMOVs is still interesting. @sivarv has been working on some ideas around how to improve this kind of instruction selection. It would be good to make sure this effort dovetails.

@mikedn (Contributor) commented Oct 29, 2016

FWIW I've made a prototype that survives diffing and testing. It detects simple half/full hammocks where the true/false BBs contain a single assignment (or a return) that involves only local variables and constants. It only works with i4 and i8 but with some care we might be able to handle small integer types.
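
As a hypothetical C# illustration of the shapes that description covers (method names invented for the example):

static void HalfHammock(bool cond, ref int x, int y)
{
    if (cond)      // "half hammock": a single guarded assignment
        x = y;     // convertible to x = select(cond, y, x), i.e. a cmov
}

static int FullHammock(bool cond, int y, int z)
{
    return cond ? y : z;  // "full hammock": both arms are locals or constants
}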

I'll probably try to extend this to indirs, though that's problematic due to side effects. A quick-and-dirty experiment that survives diffing could probably show whether there are enough opportunities for this to be useful.

The conversion is currently done when lowering GT_JTRUE. It could easily be moved to a separate phase before lowering, but I'm wondering whether this shouldn't be done much earlier. After conversion we have fewer basic blocks, and we may be able to eliminate some variables as well. We may also miss other optimizations if we do this conversion late; for example, #6748 requires CSE.

FX diff shows a 2k reduction in code size, but it could be better. I had to turn off an and-cmp-to-test optimization that caused a lot of trouble, and that results in some code size regressions. Also, using cmov can result in pretty bad code if variables are not already in registers. Something like:

mov eax, [ebp-8]
mov edx, 1
cmov eax, edx
mov [ebp-8], eax

Codegen for GT_SELECT should probably detect such cases and fall back to a jcc and mov [ebp-8], 1.

With the prototype, the two examples from the initial post generate:

G_M65174_IG02:
       84D2                 test     dl, dl
       0F95C0               setne    al
       0FB6C0               movzx    rax, al
       C1E002               shl      eax, 2
       4863C0               movsxd   rax, eax
G_M65174_IG03:
       C3                   ret

G_M65174_IG02:
       B8FFFFFFFF           mov      eax, 0xFFFFFFFF
       4823C2               and      rax, rdx
       4885C0               test     rax, rax
       480F44C2             cmove    rax, rdx
G_M65174_IG03:
       C3                   ret

The code is similar to what the native compilers generate, except for a couple of warts: the first example has a widening cast that can easily be removed, and the second example has a useless test that could be removed as well.

It's worth noting that Clang assumes that bools are 0/1 and generates better code for the first example. I don't think we can do that; CLR bools are 0/non-zero.
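
To see why that assumption would be unsafe here, a sketch that forges a non-0/1 bool with Unsafe.As (purely for demonstration; System.Runtime.CompilerServices.Unsafe is assumed):

using System.Runtime.CompilerServices;

byte raw = 2;                             // any non-zero value counts as true
bool b = Unsafe.As<byte, bool>(ref raw);  // legal: CLR bools are 0/non-zero
long clangStyle = (long)Unsafe.As<bool, byte>(ref b) << 2;  // the 0/1 trick yields 8
long actual = b ? 4 : 0;                  // the correct result is 4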

The Math.Max example from #4326 generates:

G_M65175_IG02:
       8B4208               mov      eax, dword ptr [rdx+8]
       83F82A               cmp      eax, 42
       BA2A000000           mov      edx, 42
       0F4CC2               cmovl    eax, edx

so we get rid of a useless jump by accident (those kinds of jumps can probably appear in other situations that aren't handled by if-conversion).

The bool conversion example from #4399 isn't currently detected but a variant of it works:

        for (int i = 0; i < N; i++)
            sum = (bits[i] ? 1 : 0) + sum;

generates

       470FB6440810         movzx    r8, byte  ptr [r8+r9+16]
       4585C0               test     r8d, r8d
       410F95C0             setne    r8b
       450FB6C0             movzx    r8, r8b
       4103C0               add      eax, r8d

The reason the original code (sum += bits[i] ? 1 : 0) isn't handled is that the IL stack is not empty at the entry to the true/false blocks, and as a result an additional assignment is added to those blocks.
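
In other words, the same computation in two shapes:

// handled: the IL stack is empty when the ternary's true/false blocks are entered
sum = (bits[i] ? 1 : 0) + sum;

// not handled: sum's value is already on the IL stack across the ternary,
// so the importer spills it with an extra assignment in each block
sum += bits[i] ? 1 : 0;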

With some additional work this could probably turn into:

       movzx    r8, byte  ptr [r8+r9+16]
       neg r8d
       adc eax, 0

The nice thing about if-conversion is that once you generate a GT_SELECT you can look around and detect all kind of patterns relatively easily.

I'll try to do some perf measurements, but the perf characteristics of converting branches are probably well known to anyone familiar with the subject. If the branch isn't predictable, you get good performance (I've seen numbers in the 2-5x range, but I have to double-check). If the branch is predictable, performance may regress slightly. Maybe such conversions should be driven by profile data, but it seems that all native compilers generate cmov by default.

@mikedn (Contributor) commented Oct 29, 2016

As a simple perf test, let's sum all the positive numbers in an array:

int sum = 0;
for (long i = 0; i < nums.Length; i++)
    sum += (nums[i] > 0 ? nums[i] : 0);
return sum;

As mentioned above, the use of += blocks if-conversion, so we don't get CMOV:

G_M65174_IG03:
       493BC8               cmp      rcx, r8
       731F                 jae      SHORT G_M65174_IG06
       448B4C8A10           mov      r9d, dword ptr [rdx+4*rcx+16]
       4585C9               test     r9d, r9d
       7F05                 jg       SHORT G_M65174_IG04
       4533C9               xor      r9d, r9d
       EB00                 jmp      SHORT G_M65174_IG04

G_M65174_IG04:
       4103C1               add      eax, r9d
       48FFC1               inc      rcx
       4C3BC1               cmp      r8, rcx
       7FE1                 jg       SHORT G_M65174_IG03

With 1,000,000 numbers this runs in 630us if all numbers are positive and 4300us if we have a random mix of positive and negative numbers. Poor branch prediction takes its toll.

If we change the sum expression so that if-conversion kicks in we get CMOV:

G_M65174_IG03:
       493BC8               cmp      rcx, r8
       731F                 jae      SHORT G_M65174_IG05
       448B4C8A10           mov      r9d, dword ptr [rdx+4*rcx+16]
       4533D2               xor      r10d, r10d
       4585C9               test     r9d, r9d
       450F4ECA             cmovle   r9d, r10d
       4103C1               add      eax, r9d
       48FFC1               inc      rcx
       4C3BC1               cmp      r8, rcx
       7FE1                 jg       SHORT G_M65174_IG03

Now we always get 720us no matter what the numbers are. We're slightly slower in the "all positive" case and 6 (yes, six!) times faster in the random case.

Should something be done about the slowdown we see in the "all positive" case? I don't know; as a developer, I would use the following code if the numbers are known to be mostly positive:

for (long i = 0; i < nums.Length; i++)
    if (nums[i] > 0)
        sum += nums[i];

If-conversion doesn't kick in, and this runs in 490us if all numbers are positive, which is faster than all of the above.

@redknightlois

@mikedn (Contributor) commented Dec 20, 2016

I know. The prototype I have does generate CMOV in the MinTernary method, but it doesn't yet work in the Ternary method because the result is assigned to an array element rather than a local variable.

@russellhadley (Contributor)

@mikedn For your prototype, did you just try to re-enable the qmark support (rather than expanding them early into flow)?

@mikedn (Contributor) commented Dec 20, 2016

@russellhadley Nope, I'm generating GT_SELCC nodes during GT_JTRUE lowering. GT_SELCC is a new operator I've added to represent CMOV. It's a binary operator that expects the condition flags to be set appropriately.

@russellhadley (Contributor)

@mikedn Does it take two full trees to be evaluated, or is it restricted to taking lclVars?

@mikedn (Contributor) commented Dec 20, 2016

@russellhadley lclVars and constants. It recognizes:

if (whatever) { lclX = lclY (or const); }
// generates STORE_LCL(lclX, SELCC<cond>(lclY, lclX))

if (whatever) { lclX = lclY (or const); } else { lclX = lclZ (or const); }
// generates STORE_LCL(lclX, SELCC<cond>(lclY, lclZ))

if (whatever) { return lclY (or const); } else { return lclZ (or const); }
// generates RETURN(SELCC<cond>(lclY, lclZ))

It should be possible to handle more patterns but there are all kinds of issues:

  • In the first case lclX could also be an indir, if we know that the address is not null; CMOV's memory access faults even if the condition is false.
  • In the second example lclX could be an indir but we have to ensure that both indirs have the same address. I suppose it should be possible to an extent but I haven't looked into this.
  • In all cases the RHS could probably be any side-effect-free tree, but that's problematic from a performance point of view, since it would be evaluated even if the condition is false. Maybe we could allow one additional simple operation (addition, bitwise ops, etc.).

I think (I haven't looked at it too much) that GT_QMARK allows more or less arbitrary trees, and that makes it problematic; it looks like some kind of control flow embedded in an expression tree. GT_SELCC looks like any other binary operator, except that it has a hidden dependency on the condition flags.

I mixed up my code with a few other experiments; I'll try to clean it up a bit tomorrow and push it to my GitHub fork for reference. Though at 500 lines of code (including empty lines) it may not be a pretty sight. It needs reorganization.

@redknightlois

Thinking about it: given that it's complicated for CMOV to catch everything, why not tackle only the 'easy' cases (where there is certainty of a performance improvement) and also provide an intrinsic version for the dubious cases?

@mikedn (Contributor) commented Dec 21, 2016

@redknightlois I'm not sure how an intrinsic would help. Once it works for local variables, it's up to you to ensure that the code has the right pattern to make use of CMOV. Or create a method like T Select<T>(bool condition, T trueValue, T falseValue) => condition ? trueValue : falseValue;. The moment you write Select(c, a[i], b[i]), you accept that both a[i] and b[i] will be evaluated.
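
Spelled out as a sketch (this Select helper is the suggestion above, not an existing API):

static T Select<T>(bool condition, T trueValue, T falseValue)
    => condition ? trueValue : falseValue;

static int Demo(bool c, int[] a, int[] b, int i)
{
    // Both a[i] and b[i] are loaded (and bounds-checked) before the call,
    // by ordinary argument-evaluation rules, regardless of c. That is the
    // shape CMOV needs: it only chooses between two already-computed values.
    return Select(c, a[i], b[i]);
}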

@sdmaclea (Contributor)

@mikedn Looks like the experimental branch was deleted. Do you have a draft somewhere?

@mikedn (Contributor) commented Sep 14, 2017

@sdmaclea See the "If conversion" commit in this branch: https://github.com/mikedn/coreclr/commits/relop-lower-2

@tannergooding (Member)

@mikedn, as for providing an intrinsic (a bit late, I know): I didn't think the runtime had a rule stating that an argument passed directly to a method (without being stored to a local in the interim) has to be evaluated; that's just the way it happens to work today, since we don't expose very many, if any, "intrinsics" like this.

That is, it should be entirely possible for the runtime to expose a method in Corlib that has a special contract where Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c. The method could be exposed in the new System.Runtime.Intrinsics namespace (or maybe System.Runtime.CompilerServices.Unsafe) and would be considered a "special" use-case for power-users (just like the rest of S.R.Intrinsics or S.R.CS.Unsafe).

That being said, just ensuring we recognize c ? a[i] : b[i] is probably good enough. The downside is that there is no guarantee it will be transformed, depending on various other constraints the JIT might be under (this has been brought up in various threads, including the one where I requested that we document recognized patterns 😉).

@mikedn (Contributor) commented Sep 15, 2017

@tannergooding I don't quite understand what you are saying. In particular, the "Select(c, a[i], b[i]) will only evaluate a[i] or b[i] based on the result of c" makes no sense. The whole point of adding the Select method I suggested is to allow people to clearly state that they are OK with both a[i] and b[i] being evaluated. The "special contract" you suggest seems to do exactly the opposite (and sounds like a C/C++ macro abomination).

That being said, just ensuring we recognize c ? a[i] : b[i] is probably good enough

It will never recognize that. It's simply impossible to express the semantics of such an expression using CMOV.

@pgavlin (Contributor) commented Sep 15, 2017

This is starting to go pretty far off the rails. Let's back up and answer @tannergooding's high-level question:

At this point I'm just trying to understand why this:

int Select(bool condition, int trueValue, int falseValue)
{
    return condition ? trueValue : falseValue;
}

int MyMethod(bool condition, int[] a, int[] b)
{
    return Select(condition, a[5], b[6]);
}

cannot be transformed to be:

int MyMethod(bool condition, int[] a, int[] b)
{
    return condition ? a[5] : b[6];
}

This transformation cannot be performed in the way that you suggest because it changes the behavior of the program. In the first case, the evaluation of condition, the array accesses, and any associated side-effects occur before the call to Select; in the second case, only one array access is performed, thus dropping the side-effects of the other access. As a rule, optimizations performed by the compiler must not affect the behavior of the compiled program. The only exceptions occur when the behavior of a program is undefined, in which case the compiler is allowed to do pretty much whatever it wants. This does not occur very often in IL, and in any case should generally be treated as a bug rather than a feature.
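
A concrete sketch of the behavioral difference, using the Select method quoted above (array contents invented for illustration):

bool condition = true;
int[] a = { 0, 1, 2, 3, 4, 5 };
int[] b = new int[3]; // too short for b[6]

// Throws IndexOutOfRangeException even though condition is true:
// a[5] and b[6] are both evaluated before Select is called.
int x = Select(condition, a[5], b[6]);

// Returns a[5]; b[6] is never evaluated, so no exception occurs.
int y = condition ? a[5] : b[6];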

@JosephTremoulet (Contributor)

@tannergooding,

I'm just trying to understand why this ... cannot be transformed to be ... My understanding of the optimizations the spec allows indicates it should be.

It's in the part of I.12.6.4 that you quoted. Adding emphasis (note "and exceptions"):

Conforming implementations of the CLI are free to execute programs using any technology that
guarantees, within a single thread of execution, that side-effects and exceptions generated by a
thread are visible in the order specified by the CIL. For this purpose only volatile operations
(including volatile reads) constitute visible side-effects. (Note that while only volatile operations
constitute visible side-effects, volatile operations also affect the visibility of non-volatile
references.)

The JIT must guarantee that the NullReferenceException or IndexOutOfRangeException thrown by the ldelem is visible.

@tannergooding (Member)

@pgavlin, @JosephTremoulet. Thanks.

Am I correct in understanding the subsequent section on E-relaxed methods to mean that, if a new CompilationRelaxations value were exposed, the JIT would be allowed to perform such a transformation and delay the check into the inlined code?

@JosephTremoulet (Contributor)

@tannergooding, by my reading, that would still run afoul of:

The check’s exception is thrown some time in the sequence, unless the sequence throws
another exception. When multiple relaxed checks fail, it is unspecified as to which
exception is thrown by the VES

@tannergooding (Member)

@JosephTremoulet. Hmm, it seems that the further explanation of relaxations in Annex F specifically calls out the scenario I am interested in:

If this method is relaxed, and the caller is also relaxed, then the caller would be allowed, in the
absence of constraining data or control dependences, to interleave the call with other instructions
in the caller. If one of those other interleaved instructions faults, then any or all of M’s side
effects might be suppressed. Thus, method M should be marked as strict if it is important to
prevent a fault from destroying the invariant.

This “interference” from the caller is potentially annoying, but seems to be intrinsic to any
definition of relaxed exceptions that permits both:

  1. instruction reordering and
  2. inlined method calls are at least as fast as manual inlining.

@tannergooding (Member)

In any case, thanks for helping me understand this better.

I'm going to create a new issue so I can continue the discussion without polluting this thread anymore.

I think, even if the runtime does not "technically" allow it today (although I think it might), it should support these types of optimizations.

@JosephTremoulet (Contributor)

@tannergooding, translating "caller" and "M" to your example, that's saying that (if both are relaxed) a fault in MyMethod may result in some arbitrary subset of Select's side-effects being suppressed. The fault in MyMethod itself must still be made visible.

@mikedn mikedn removed their assignment Jan 18, 2020
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@EgorBo (Member) commented Sep 20, 2020

HW_INTRINSIC-based implementation for CMOVnn: EgorBo@1271fe5

static int Test1(int x)
{
    return x == 42 ? 1000 : 2000;
}

static int Test2(int x, int a, int b)
{
    return x == 42 ? a : b;
}
; Method Tests:Test1(int):int
G_M56601_IG01:
G_M56601_IG02:
       83F92A               cmp      ecx, 42
       B8E8030000           mov      eax, 0x3E8
       BAD0070000           mov      edx, 0x7D0
       0F45C2               cmovne   eax, edx
G_M56601_IG03:
       C3                   ret      
; Total bytes of code: 17


; Method Tests:Test2(int,int,int):int
G_M50938_IG01:
G_M50938_IG02:
       83F92A               cmp      ecx, 42
       8BC2                 mov      eax, edx
       410F45C0             cmovne   eax, r8d
G_M50938_IG03:
       C3                   ret      
; Total bytes of code: 10

Works better with PGO (COMPlus_TieredPGO=1) 🙂:
[image: benchmark results with TieredPGO enabled]

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 5, 2020
jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jan 27, 2023
This adds support for contained compares in GT_SELECT nodes for xarch.
As part of this, also enables if-conversion on xarch.

Fix dotnet#6749
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jan 27, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 9, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Mar 11, 2023