RyuJIT generates redundant code when inlining string.IsNullOrEmpty #4207

Closed
Tracked by #93172
mikedn opened this issue May 2, 2015 · 23 comments · Fixed by #63095
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue

@mikedn
Contributor

mikedn commented May 2, 2015

C# code:

static int Main(string[] args) {
    return string.IsNullOrEmpty(args[0]) ? 42 : 0;
}

ASM code:

G_M1504_IG02:
       83790800             cmp      dword ptr [rcx+8], 0
       7628                 jbe      SHORT G_M1504_IG06
       488B4110             mov      rax, gword ptr [rcx+16]
       4885C0               test     rax, rax
       7415                 je       SHORT G_M1504_IG04
       83780800             cmp      dword ptr [rax+8], 0
       0F94C0               sete     al          ; redundant
       0FB6C0               movzx    rax, al     ; redundant
       85C0                 test     eax, eax    ; redundant
       7507                 jne      SHORT G_M1504_IG04
       33C0                 xor      eax, eax

Basically the IL code after inline contains a ceq followed by a brtrue and the JIT compiler doesn't seem capable of folding them.
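
To make the shape concrete, here is a rough C# sketch of what the inlined code amounts to when the `ceq` result is stored in a temporary and then tested, next to the form where the comparison feeds the branch directly. The class and method names are made up for illustration, and the C# compiler and the JIT are free to generate something different for either form.

```csharp
using System;

// Hypothetical names, for illustration only.
public static class InlinedShape
{
    // Roughly what the JIT sees after inlining: the comparison result is
    // materialized into a temporary (ceq), then the temporary is tested (brtrue).
    public static int Materialized(string s)
    {
        bool tmp;
        if (s == null)
            tmp = true;
        else
            tmp = s.Length == 0;   // ceq: shows up as sete/movzx
        if (tmp)                   // brtrue: shows up as test/jne
            return 42;
        return 0;
    }

    // The form we would like the generated code to correspond to:
    // branch directly on the comparison, no temporary.
    public static int Folded(string s)
    {
        if (s == null || s.Length == 0)
            return 42;
        return 0;
    }
}
```

Both methods return the same value for every input; only the amount of work done to get there differs.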

category:cq
theme:optimization
skill-level:expert
cost:large

@cmckinsey cmckinsey assigned cmckinsey and schellap and unassigned cmckinsey May 4, 2015
@schellap

Good find.

The problem is we have a side effecting dereference of the array in a dead assignment, which couldn't be removed as a dead store (see below). We should be able to remove this by teaching ValueNumber::GetArrForLenVn($280) how to retrieve the array reference out of a pointer to the first element and in AssertionProp comparing the retrieved reference($80) against the non-null assertion from the bound check (OAK_NO_THROW($80,$40)) to remove the side-effect.

Specifically, this bounds check can be used:

***** BB01, stmt 1 (top level)
     ( 10, 15) [000045] ------------             *  stmtExpr  void  (top level) (IL 0x000...  ???)
N008 (  4,  4) [000003] R--XG-------             |     /--*  indir     ref    <l:$105, c:$280>
N006 (  1,  1) [000054] ------------             |     |  |  /--*  const     long   16 Fseq[#FirstElem] $140
N007 (  3,  3) [000055] -------N----             |     |  \--*  +         byref  <l:$200, c:$240>
N005 (  1,  1) [000047] ------------             |     |     \--*  lclVar    ref    V00 arg0         u:2 (last use) $80
N009 ( 10, 15) [000058] ---XG-------             |  /--*  comma     ref    <l:$107, c:$106>
N004 (  6, 11) [000050] ---X--------             |  |  \--*  arrBndsChk void   $102
N002 (  3,  3) [000049] ---X--------             |  |     +--*  arrLen    int    $c0
N001 (  1,  1) [000001] ------------             |  |     |  \--*  lclVar    ref    V00 arg0         u:2 $80
N003 (  1,  1) [000002] ------------             |  |     \--*  const     int    0 $40
N011 ( 10, 15) [000044] -A-XG---R---             \--*  =         ref    <l:$107, c:$106>
N010 (  1,  1) [000043] D------N----                \--*  lclVar    ref    V02 tmp1         d:2 <l:$105, c:$280>

to eliminate the side-effect (the X flag on the arrLen)

***** BB02, stmt 3 (top level)
     ( 12,  8) [000041] ------------             *  stmtExpr  void  (top level) (IL 0x000...  ???)
N003 (  1,  1) [000037] ------------             |     /--*  const     int    0 $40
N004 (  8,  5) [000038] ---X--------             |  /--*  ==        int    <l:$2c3, c:$2c2>
N002 (  3,  3) [000036] ---X--------             |  |  \--*  arrLen    int    <l:$c2, c:$c1>
N001 (  1,  1) [000035] ------------             |  |     \--*  lclVar    ref    V02 tmp1         u:2 (last use) <l:$105, c:$280>
N006 ( 12,  8) [000040] -A-X----R---             \--*  =         bool   <l:$2c3, c:$2c2>
N005 (  3,  2) [000039] D------N----                \--*  lclVar    int    V01 tmp0         d:3 <l:$2c3, c:$2c2>

And then, given that there are no longer any side effects after AssertionProp, Liveness can remove the dead assignment to V01, as there is no further "use" anywhere else.

@mikedn
Contributor Author

mikedn commented Jun 11, 2015

Sorry but I don't follow. The problem occurs even if no array bound checks are involved:

static bool foobar(int x) {
    return x != 1 && x != 2;
}

static int Main(string[] args) {
    return foobar(args.Length) ? 42 : 0;
}

ASM code:

00007FFC1B234490  mov         eax,dword ptr [rcx+8]  
00007FFC1B234493  cmp         eax,1  
00007FFC1B234496  je          00007FFC1B2344A5  
00007FFC1B234498  cmp         eax,2  
00007FFC1B23449B  setne       al  
00007FFC1B23449E  movzx       eax,al  
00007FFC1B2344A1  test        eax,eax  
00007FFC1B2344A3  jne         00007FFC1B2344A8  
00007FFC1B2344A5  xor         eax,eax  
00007FFC1B2344A7  ret  
00007FFC1B2344A8  mov         eax,2Ah  
00007FFC1B2344AD  ret  

For better clarity here's the expected ASM code:

00007FFC1B234490  mov         eax,dword ptr [rcx+8]  
00007FFC1B234493  cmp         eax,1  
00007FFC1B234496  je          00007FFC1B2344A5  
00007FFC1B234498  cmp         eax,2  
00007FFC1B2344A3  jne         00007FFC1B2344A8  
00007FFC1B2344A5  xor         eax,eax  
00007FFC1B2344A7  ret  
00007FFC1B2344A8  mov         eax,2Ah  
00007FFC1B2344AD  ret

@schellap

There are two issues here. One is the previous one and the other is because we have the IR as:

bool x = (y != 2);
if (x)

and we end up with redundant code. This should definitely be optimized better -- I will look into this.

@mikedn
Contributor Author

mikedn commented Oct 11, 2015

In case anyone cares I stumbled upon a funny workaround:

static bool foobar(int x) {
    return (x != 1 && x != 2) ? true : false;
}
static int Main(string[] args) {
    return foobar(args.Length) ? 42 : 0;
}

00007FFEE1310480  mov         eax,dword ptr [rcx+8]  
00007FFEE1310483  cmp         eax,1  
00007FFEE1310486  je          00007FFEE131048D  
00007FFEE1310488  cmp         eax,2  
00007FFEE131048B  jne         00007FFEE1310490  
00007FFEE131048D  xor         eax,eax  
00007FFEE131048F  ret  
00007FFEE1310490  mov         eax,2Ah  
00007FFEE1310495  ret  

Who'd have thought that more complex C# code will produce better assembly code? 😄

@omariom
Contributor

omariom commented Feb 3, 2016

It also happens when a boolean value is passed to an inlined method.

AndyAyersMS referenced this issue in AndyAyersMS/coreclr May 4, 2016
Test case to stress inlining, expression opts, and control flow
simplification for booleans.

Test case has 100 methods named Idxx. All 100 should generate identical
code. On x64 windows we expect to get the 4 byte sequence
```
       0FB6C1               movzx    rax, cl
       C3                   ret
```

Only 22 of the variants get this codegen; there are at least 12 other
sequences ranging in size from 9 to 32 bytes of code.

Likely touches on the same issues raised in #914.
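
The referenced test is not reproduced in this thread. As a guess at the kind of variants it describes, a handful of ways to write a boolean identity function might look like the sketch below. The names follow the `Idxx` pattern from the commit message, but the bodies are assumptions, not the actual test.

```csharp
// Hypothetical variants of a boolean identity function; all are
// semantically identical, so ideally all would compile to the same code.
public static class BoolIdentity
{
    public static bool Id00(bool b) => b;
    public static bool Id01(bool b) => b ? true : false;
    public static bool Id02(bool b) => !b ? false : true;
    public static bool Id03(bool b) => b == true;
    public static bool Id04(bool b) => b != false;
    public static bool Id05(bool b) => !(!b);

    public static bool Id06(bool b)
    {
        if (b) return true;
        return false;
    }

    // Goes through two other variants, so the result depends on how well
    // the JIT cleans up after inlining them.
    public static bool Id07(bool b) => Id01(Id02(b));
}
```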
@mikedn
Contributor Author

mikedn commented Jan 3, 2019

This keeps popping up... let's see what the problem is. After inlining we have IR like:

***** BB04, stmt 1
               [000022] ------------              *  STMT      void  (IL 0x000...  ???)
               [000021] ------------              \--*  JTRUE     void  
               [000019] ------------                 |  /--*  CNS_INT   ref    null
               [000020] ------------                 \--*  EQ        int   
               [000001] ------------                    \--*  LCL_VAR   ref    V03 arg3         

------------ BB05 [000..001) -> BB07 (always), preds={} succs={BB07}

***** BB05, stmt 2
               [000037] ------------              *  STMT      void  (IL 0x000...  ???)
               [000033] ------------              |     /--*  CNS_INT   int    0
               [000034] ---X--------              |  /--*  EQ        int   
               [000032] ---X--------              |  |  \--*  ARR_LENGTH int   
               [000031] ------------              |  |     \--*  LCL_VAR   ref    V03 arg3         
               [000036] -A-X--------              \--*  ASG       bool  
               [000035] D------N----                 \--*  LCL_VAR   bool   V07 tmp1         

------------ BB06 [000..001), preds={} succs={BB07}

***** BB06, stmt 3
               [000028] ------------              *  STMT      void  (IL 0x000...  ???)
               [000025] ------------              |  /--*  CAST      int <- bool <- int
               [000024] ------------              |  |  \--*  CNS_INT   int    1
               [000027] -A----------              \--*  ASG       bool  
               [000026] D------N----                 \--*  LCL_VAR   bool   V07 tmp1         

------------ BB07 [???..???) -> BB03 (cond), preds={} succs={BB02,BB03}

***** BB07, stmt 4
               [000009] ------------              *  STMT      void  (IL   ???...  ???)
               [000008] --C---------              \--*  JTRUE     void  
               [000006] ------------                 |  /--*  CNS_INT   int    0
               [000007] --C---------                 \--*  NE        int   
               [000038] ------------                    \--*  LCL_VAR   int    V07 tmp1         

This is pretty bad.
After morphing, before SSA, it's a bit better. The branch node of interest has been duplicated so it's closer to the "detached" relop node:

------------ BB01 [000..008) -> BB04 (cond), preds={} succs={BB02,BB04}

***** BB01, stmt 1
     (  5,  5) [000022] ------------              *  STMT      void  (IL 0x000...  ???)
N004 (  5,  5) [000021] ------------              \--*  JTRUE     void  
N002 (  1,  1) [000019] ------------                 |  /--*  CNS_INT   ref    null
N003 (  3,  3) [000020] J------N----                 \--*  EQ        int   
N001 (  1,  1) [000001] ------------                    \--*  LCL_VAR   ref    V03 arg3         u:1

------------ BB02 [000..001) -> BB06 (cond), preds={BB01} succs={BB03,BB06}

***** BB02, stmt 2
     ( 12,  8) [000037] ------------              *  STMT      void  (IL 0x000...  ???)
N003 (  1,  1) [000033] ------------              |     /--*  CNS_INT   int    0
N004 (  8,  5) [000034] ---X--------              |  /--*  EQ        int   
N002 (  3,  3) [000032] ---X--------              |  |  \--*  ARR_LENGTH int   
N001 (  1,  1) [000031] ------------              |  |     \--*  LCL_VAR   ref    V03 arg3         u:1 (last use)
N006 ( 12,  8) [000036] -A-X----R---              \--*  ASG       bool  
N005 (  3,  2) [000035] D------N----                 \--*  LCL_VAR   int    V07 tmp1         d:2

***** BB02, stmt 3
     (  7,  6) [000044] ------------              *  STMT      void  (IL   ???...  ???)
N004 (  7,  6) [000040] ------------              \--*  JTRUE     void  
N002 (  1,  1) [000043] ------------                 |  /--*  CNS_INT   int    0
N003 (  5,  4) [000041] J------N----                 \--*  NE        int   
N001 (  3,  2) [000042] ------------                    \--*  LCL_VAR   int    V07 tmp1         u:2 (last use)

------------ BB03 [???..???) -> BB05 (always), preds={BB02} succs={BB05}

------------ BB04 [000..001) -> BB06 (cond), preds={BB01} succs={BB05,BB06}

***** BB04, stmt 4
     (  5,  4) [000028] ------------              *  STMT      void  (IL 0x000...  ???)
N001 (  1,  1) [000024] ------------              |  /--*  CNS_INT   int    1
N003 (  5,  4) [000027] -A------R---              \--*  ASG       bool  
N002 (  3,  2) [000026] D------N----                 \--*  LCL_VAR   int    V07 tmp1         d:1

***** BB04, stmt 5
     (  7,  6) [000009] ------------              *  STMT      void  (IL   ???...  ???)
N004 (  7,  6) [000008] ------------              \--*  JTRUE     void  
N002 (  1,  1) [000006] ------------                 |  /--*  CNS_INT   int    0
N003 (  5,  4) [000007] J------N----                 \--*  NE        int   
N001 (  3,  2) [000038] ------------                    \--*  LCL_VAR   int    V07 tmp1         u:1 (last use)

One of the duplicated branch nodes will go away, thanks to VN/assertion propagation that detects that it has a constant operand. The problem is the other branch node, that looks as expected:

***** BB02, stmt 2
     ( 12,  8) [000037] ------------              *  STMT      void  (IL 0x000...  ???)
N003 (  1,  1) [000033] ------------              |     /--*  CNS_INT   int    0 $43
N004 (  8,  5) [000034] ---X--------              |  /--*  EQ        int    $c3
N002 (  3,  3) [000032] ---X--------              |  |  \--*  ARR_LENGTH int    $c1
N001 (  1,  1) [000031] ------------              |  |     \--*  LCL_VAR   ref    V03 arg3         u:1 (last use) $80
N006 ( 12,  8) [000036] -A-X----R---              \--*  ASG       bool   $c3
N005 (  3,  2) [000035] D------N----                 \--*  LCL_VAR   int    V07 tmp1         d:2 $c2

***** BB02, stmt 3
     (  7,  6) [000044] ------------              *  STMT      void  (IL   ???...  ???)
N004 (  7,  6) [000040] ------------              \--*  JTRUE     void  
N002 (  1,  1) [000043] ------------                 |  /--*  CNS_INT   int    0 $43
N003 (  5,  4) [000041] J------N----                 \--*  NE        int    $c4
N001 (  3,  2) [000042] ------------                    \--*  LCL_VAR   int    V07 tmp1         u:2 (last use) $c2

Basically, the relevant relop node remains detached from the branch tree so codegen has to materialize the bool value produced by the relop, instead of just relying on the flags that the relop sets.
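
For readers who find the dumps hard to follow, the control flow can be modelled in C# with explicit `goto`s. This is only a source-level analogy of the two IR shapes above, with made-up names and the 42/0 results borrowed from the original repro. It is not what the JIT literally operates on.

```csharp
// Hypothetical model of the flow graph shapes shown in the dumps.
public static class BranchDuplication
{
    // Shape right after inlining: both arms assign tmp1 and then meet in a
    // single block that tests it (BB07 in the first dump).
    public static int Merged(string s)
    {
        bool tmp1;
        if (s == null) goto SetTrue;
        tmp1 = s.Length == 0;
        goto Test;
    SetTrue:
        tmp1 = true;
    Test:
        if (tmp1) return 42;
        return 0;
    }

    // Shape after the test has been copied into each predecessor.
    public static int Duplicated(string s)
    {
        bool tmp1;
        if (s == null) goto SetTrue;
        tmp1 = s.Length == 0;
        if (tmp1) goto Taken;      // relop is still detached from this branch
        goto NotTaken;
    SetTrue:
        tmp1 = true;
        if (tmp1) goto Taken;      // operand is a known constant, so this folds away
    NotTaken:
        return 0;
    Taken:
        return 42;
    }
}
```

In the duplicated form the remaining problem is local to one block: the assignment to `tmp1` and the branch that tests it sit next to each other.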

One obvious solution is forward substitution. Unfortunately that's not something one can implement overnight.

Still, perhaps some targeted forward substitution can be done, just for this case. The basic ingredient, SSA, already exists. The only potential problem is that it does not track the users of a node so there's no way to know if a node is single use or not. But that's not so difficult to add.

The bigger problem is that this also typically requires some code motion and that's more difficult to do right. We could probably get away with moving only the relop and spilling its (non-constant) operands to lclvars so we don't have to move those as well. That is, if we have

bool b = a.Length == 0;
if (b)

just make it

int l = a.Length;
if (l == 0)

I suppose early prop could be used for something like this. And considering that branch nodes aren't that common (at one point I think I counted ~50,000 in corelib) the impact on throughput should be small.

So if we have

bool b = a.Length == 0;
if (b)

and early prop finds that b has a single use then it's pretty simple:

  • retype b to int (it's single-use so that shouldn't be problematic)
  • assign a.Length to the retyped variable
  • replace that single use of b with b == 0.

This avoids another SSA limitation that the JIT currently has - if you introduce a new lclvar after SSA, there's no mechanism to add it to the SSA graph.

@mikedn
Contributor Author

mikedn commented Jan 3, 2019

> retype b to int (it's single-use so that shouldn't be problematic)

Yeah, except that it's not also single-def so you have to create a new lclvar. Oh well, let's do that and see how it goes.

Quick and dirty implementation produces this diff:

Total bytes of diff: -7768 (-0.04% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -1872 : Microsoft.CodeAnalysis.CSharp.dasm (-0.09% of base)
       -1825 : System.Private.CoreLib.dasm (-0.05% of base)
       -1326 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.06% of base)
       -1165 : System.Private.Xml.dasm (-0.04% of base)
        -302 : System.Private.DataContractSerialization.dasm (-0.04% of base)
38 total files with size differences (38 improved, 0 regressed), 91 unchanged.
Top method regressions by size (bytes):
          33 ( 0.81% of base) : System.Private.Xml.dasm - XmlSerializationWriterILGen:GenerateMembersElement(ref):ref:this
          32 ( 1.35% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:ComputeVariableType(ref,ref,ref,ref,byref,byref,ref):ref:this
          16 ( 0.28% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - AbstractFlowPass`1:VisitTryStatement(ref):ref:this (3 methods)
          13 ( 8.90% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMemberContainerTypeSymbol:ComparePartialMethodSignatures(ref,ref):bool:this
           6 ( 1.05% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:ShouldAddWinRTMembersForInterface(ref,ref,ref,ref,ref,ref,ref):bool
Top method improvements by size (bytes):
        -120 (-2.94% of base) : System.Private.DataContractSerialization.dasm - XmlBinaryReader:ReadArray(ref,ref,ref,int,int):int:this (20 methods)
         -80 (-4.84% of base) : System.Private.Xml.dasm - Parser:Parse(ref,int):bool:this
         -72 (-14.72% of base) : System.Private.CoreLib.dasm - PathInternal:NormalizeDirectorySeparators(ref):ref
         -68 (-3.85% of base) : System.Private.CoreLib.dasm - SpanHelpers:SequenceEqual(byref,byref,int):bool (2 methods)
         -62 (-2.31% of base) : System.Text.Encoding.CodePages.dasm - GB18030Encoding:GetChars(long,int,long,int,ref):int:this
Top method regressions by size (percentage):
          13 ( 8.90% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SourceMemberContainerTypeSymbol:ComparePartialMethodSignatures(ref,ref):bool:this
           5 ( 4.85% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundExpressionExtensions:ToStatement(ref):ref
           4 ( 3.51% of base) : System.Private.Xml.dasm - SourceInfo:Equals(ref):bool:this
           5 ( 3.33% of base) : System.Private.Xml.dasm - XmlDoubleSortKey:.ctor(double,ref):this
           4 ( 2.00% of base) : System.Private.Xml.dasm - QilTypeChecker:CheckFilter(ref):ref:this
Top method improvements by size (percentage):
         -56 (-19.38% of base) : System.Private.Xml.dasm - XmlILOptimizerVisitor:IsPrimitiveNumeric(ref):bool:this
         -56 (-19.38% of base) : System.Private.Xml.dasm - XmlILOptimizerVisitor:MatchesContentTest(ref):bool:this
         -19 (-18.63% of base) : System.Private.CoreLib.dasm - Delegate:RemoveAll(ref,ref):ref
          -8 (-17.78% of base) : Microsoft.CodeAnalysis.CSharp.dasm - <>c:<Analyze>b__4_0(struct):bool:this
          -8 (-17.39% of base) : System.Private.Xml.dasm - XPathConvert:SkipWhitespace(long):long
810 total methods with size differences (792 improved, 18 regressed), 112056 unchanged.

Typical diff looks as expected:

        call     Number:TryParseInt32IntegerStyle(struct,int,ref,byref):int
        test     eax, eax
-       sete     al
-       movzx    rax, al
-       test     eax, eax
-       jne      SHORT G_M8084_IG04
+       je       SHORT G_M8084_IG04 
 G_M8084_IG03:
        xor      eax, eax

@mikedn
Contributor Author

mikedn commented Jan 3, 2019

Throughput impact of my lousy implementation is practically non-existent: 0.06%. I have pending PRs that save 10 times that.

@mikedn
Contributor Author

mikedn commented Jan 3, 2019

Being overly conservative and forwarding only the relop doesn't seem to be necessary. In many cases the SSA def is in the immediately preceding tree so you can just stitch the two trees together without the typical code motion worries. That nearly doubles the diff improvement. Mostly due to containment and/or compare narrowing:

        mov      r14d, dword ptr [rbp-3CH]
-       movzx    rcx, byte  ptr [rdi+59]
-       cmp      ecx, 2
+       cmp      byte  ptr [rdi+59], 2
        jne      SHORT G_M6012_IG06
        mov      rcx, rsi

Also, it may be worth pointing out that this problem affects float comparisons as well. And in that case the impact is higher due to the increased complexity of float comparison bool result materialization:

        movss    xmm0, dword ptr [rdi+8]
        ucomiss  xmm0, xmm2
-       setpo    al
-       jpe      SHORT G_M23365_IG03
-       sete     al
+       jpe      SHORT G_M23365_IG05
-G_M23365_IG03:
-       movzx    rax, al
-       test     eax, eax
-       je       SHORT G_M23365_IG06
+       jne      SHORT G_M23365_IG05
        movss    xmm0, dword ptr [rdi+12]
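
The extra `setpo`/`jpe` steps exist because `ucomiss` reports an unordered result (either operand is NaN) by setting the parity flag along with the zero flag, so "equal" means ZF set and PF clear. The sketch below models that in C#. The helper names are hypothetical and the flag model is simplified to the two flags that matter here.

```csharp
// Hypothetical model of how float equality is derived from ucomiss flags.
public static class FloatCompare
{
    // Returns the zero flag and parity flag as ucomiss would set them:
    // an unordered comparison (NaN involved) sets both.
    public static (bool zf, bool pf) Ucomiss(float x, float y)
    {
        bool unordered = float.IsNaN(x) || float.IsNaN(y);
        bool zf = unordered || x == y;
        return (zf, unordered);
    }

    // Equality has to check two flags, which is why materializing the bool
    // takes more instructions than an integer compare.
    public static bool EqualViaFlags(float x, float y)
    {
        var (zf, pf) = Ucomiss(x, y);
        return !pf && zf;
    }
}
```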

Anyway, this seems pretty doable overall. Now I need to figure out how safe it is to rely on SSA use counts computed by SsaBuilder, considering that nothing in the JIT will update these counts once they're computed. But since early prop runs right after SsaBuilder this should probably be fine; so far I haven't encountered any situation where the use count is wrong during early prop.

In the worst case, morphing invoked during early prop could theoretically add new uses. But then morph doesn't quite know about SSA so it's unlikely that it ever does this. At the very least it will need to set the SSA number on new LclVar uses and there's no trace of that.

@mikedn
Contributor Author

mikedn commented Jan 6, 2019

Still WIP but good for a discussion, I hope - https://github.com/mikedn/coreclr/commits/relop-subst

cc @AndyAyersMS

Quick summary:

  • compute SSA use counts in SSA builder
  • traverse basic block list in earlyprop (well, actually right before it) and attempt to do basic forward substitution of JTRUE's relop operands
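
The two steps above can be sketched on a toy statement list. This is a minimal illustration in C#, not the code from the linked branch: the `Stmt` type, the `StmtKind` enum and the `Run` method are all invented here, locals are treated as SSA names with a single definition, and relops are kept as opaque strings.

```csharp
using System.Collections.Generic;

// Toy model only; none of these types exist in the JIT.
public static class ToyForwardSub
{
    public enum StmtKind { Asg, JTrue, OtherUse }

    public sealed class Stmt
    {
        public StmtKind Kind;
        public string Local;   // Asg: local being defined; otherwise: local being used
        public string Relop;   // Asg: comparison being stored; JTrue: "" until substituted

        public Stmt(StmtKind kind, string local, string relop = "")
        {
            Kind = kind;
            Local = local;
            Relop = relop;
        }
    }

    public static List<Stmt> Run(List<Stmt> block)
    {
        // Step 1: count the uses of each local.
        var uses = new Dictionary<string, int>();
        foreach (var s in block)
        {
            if (s.Kind != StmtKind.Asg)
                uses[s.Local] = uses.TryGetValue(s.Local, out int n) ? n + 1 : 1;
        }

        // Step 2: merge "tmp = relop; JTRUE(tmp)" when tmp has exactly one use
        // and its definition is the immediately preceding statement.
        var result = new List<Stmt>();
        foreach (var s in block)
        {
            bool canMerge = s.Kind == StmtKind.JTrue
                && result.Count > 0
                && result[result.Count - 1].Kind == StmtKind.Asg
                && result[result.Count - 1].Local == s.Local
                && uses[s.Local] == 1;

            if (canMerge)
            {
                string relop = result[result.Count - 1].Relop;
                result.RemoveAt(result.Count - 1);
                result.Add(new Stmt(StmtKind.JTrue, s.Local, relop));
            }
            else
            {
                result.Add(s);
            }
        }
        return result;
    }
}
```

Restricting the merge to adjacent statements is what sidesteps the code motion concerns: nothing can sit between the definition and the use, so nothing can be reordered.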

This reduces FX diff by 14423 bytes (affects some 1600 methods). There are a few regressions as well. The largest one, 150 bytes, is caused by changes in register allocation that somehow increase the frame size. That results in more 32 bit address mode displacements being needed.

Strangely, in the original implementation PIN showed a 0.06% regression and now it shows a 0.2% improvement. I'll have to double check that. Though it's not impossible that doing extra work to simplify the IR ends up resulting in throughput improvement, it seems a bit too good to be true. Anyway, throughput impact is low.

My main concern is that I need to morph the trees and morph could add new SSA uses, at least in theory. That would invalidate the computed use count. But then I don't see morph setting the SSA number on new lclvar nodes anywhere. That would imply that if morph does actually add new uses then we're already in a bad state, because existing early prop code already calls morph.

Anyway, if this is a problem then I think there's a simple solution - put statements needing morphing in a list and morph only after earlyprop is complete.

@benaadams
Member

Nice, does it also handle the 15 ? true : false and ? false : true workaround call sites? (Though could easily drop them in PR)

@mikedn
Contributor Author

mikedn commented Jan 7, 2019

I have built a corelib without those workarounds and the diff improvement increases, a sign that the JIT did handle at least some of those cases. I'll have to check if it handles all of them.

In general, I expect this to be pretty good at catching such cases, provided that the JIT does branch duplication as mentioned in a previous post. I did not know that it does that until now; I'll have to check how reliable that is.

It's still possible to get this to work without branch duplication, but it requires quite a bit more work.

@mikedn
Contributor Author

mikedn commented Jan 7, 2019

On a side note - I tried a version that is not restricted to just JTRUE & co. but looks at every statement and attempts to merge it with the previous one when possible. The diff then increases to ~80kbytes. Oh well, I suppose it's not a surprise that even such basic forward substitution generates improvements...

@AndyAyersMS
Member

This looks like a fairly reasonable approach.

Wonder if the branch duplication is coming from fgOptimizeUncondBranchToSimpleCond...

Also curious whether you iterated on your tree merging to see how much benefit there might be in generally recombining split trees.

@mikedn
Contributor Author

mikedn commented Jan 8, 2019

> Wonder if the branch duplication is coming from fgOptimizeUncondBranchToSimpleCond…

Yep, that was it, thanks!

> Nice, does it also handle the 15 ? true : false and ? false : true workaround call sites? (Though could easily drop them in PR)

So, I took a closer look at that and I had a nasty surprise. string.IsNullOrEmpty without the true : false workaround does not inline! With the workaround it has 16 bytes of IL:

	IL_0000: ldarg.0
	IL_0001: brfalse.s IL_000e
	IL_0003: ldc.i4.0
	IL_0004: ldarg.0
	IL_0005: callvirt instance int32 System.String::get_Length()
	IL_000a: bge.un.s IL_000e
	IL_000c: ldc.i4.0
	IL_000d: ret
	IL_000e: ldc.i4.1
	IL_000f: ret

and without the workaround it has 18:

	IL_0000: ldarg.0
	IL_0001: brfalse.s IL_0010
	IL_0003: ldc.i4.0
	IL_0004: ldarg.0
	IL_0005: callvirt instance int32 System.String::get_Length()
	IL_000a: clt.un
	IL_000c: ldc.i4.0
	IL_000d: ceq
	IL_000f: ret
	IL_0010: ldc.i4.1
	IL_0011: ret

And back when I created this issue it had only 15 bytes:

	IL_0000: ldarg.0
	IL_0001: brfalse.s IL_000d
	IL_0003: ldarg.0
	IL_0004: callvirt instance int32 System.String::get_Length()
	IL_0009: ldc.i4.0
	IL_000a: ceq
	IL_000c: ret
	IL_000d: ldc.i4.1
	IL_000e: ret

The reason why it increased to 18 is another workaround:

0u >= (uint)value.Length

that makes Roslyn emit the perhaps curious sequence

	IL_000a: clt.un
	IL_000c: ldc.i4.0
	IL_000d: ceq

🤷‍♂️
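
For what it's worth, the sequence is just `a >= b` written as `!(a < b)`: `clt.un` computes the unsigned less-than and `ldc.i4.0; ceq` negates it. The small C# sketch below (hypothetical class and method names) spells that out and shows that the workaround agrees with a plain `== 0` check for any `int` length.

```csharp
// Hypothetical helpers comparing the workaround with the plain check.
public static class UnsignedZeroCheck
{
    // The workaround as written in the source.
    public static bool Workaround(int length) => 0u >= (uint)length;

    // The same thing spelled out the way the IL computes it.
    public static bool AsEmitted(int length)
    {
        int lt = 0u < (uint)length ? 1 : 0;  // clt.un
        return lt == 0;                      // ldc.i4.0; ceq
    }

    // The straightforward check.
    public static bool Plain(int length) => length == 0;
}
```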

@mikedn
Contributor Author

mikedn commented Jan 13, 2019

I was trying to look at the rest of the true : false hacks in corelib. It's kind of difficult to look at the diffs because adding/removing those hacks is not without side effects, even with this experimental JIT change.

If the method containing the hack is not inlined then the hack is basically a deoptimization:

 G_M23366_IG02:
        test     rdx, rdx
-       jne      SHORT G_M23366_IG06
+       jne      SHORT G_M23366_IG04
        test     rcx, rcx
-       je       SHORT G_M23366_IG04
-       xor      eax, eax
+       sete     al
+       movzx    rax, al 
 G_M23366_IG03:
        add      rsp, 40
        ret      
 G_M23366_IG04:
-       mov      eax, 1
-G_M23366_IG05:
-       add      rsp, 40
-       ret      
-G_M23366_IG06:
        mov      gword ptr [rsp+30H], rcx
        mov      rcx, rdx
        mov      rdx, gword ptr [rsp+30H]
        call     [SortVersion:Equals(ref):bool:this]
        movzx    rax, al
-G_M23366_IG07:
+G_M23366_IG05:
        add      rsp, 40
        ret      
-; Total bytes of code 59, prolog size 5 for method SortVersion:op_Equality(ref,ref):bool
+; Total bytes of code 51, prolog size 5 for method SortVersion:op_Equality(ref,ref):bool

The same thing happens if the method is inlined in another method that simply returns the inlined method's result. Or more generally, whenever the result of the inlined method is not used in a flow control context.

Anyway, my change seems to be working pretty well. With one exception:

 G_M33195_IG02:
-       cmp      gword ptr [rsi], 0
-       jne      SHORT G_M33195_IG05
+       mov      rax, gword ptr [rsi]
+       test     rax, rax
+       jne      SHORT G_M33195_IG04
        call     [CORINFO_HELP_READYTORUN_NEW]

Need to figure out why the value is loaded into a register instead of being compared directly like before.

@AndyAyersMS
Member

By "side effects" do you mean inlining changes?

You can force the jit to make the same set of inlining decisions before/after by creating and then consuming an inline replay file, though it's not plumbed through jit-diffs.

@mikedn
Contributor Author

mikedn commented Jan 13, 2019

> By "side effects" do you mean inlining changes?

That only seems to apply to string.IsNullOrEmpty, which fails to inline if ? true : false is removed. Other uses of ? true : false don't seem to impact inlining; it's just that removing ? true : false produces some diffs that are not related to this issue.

@mikedn
Contributor Author

mikedn commented Jan 13, 2019

> On a side note - I tried a version that is not restricted to just JTRUE & co. but looks at every statement and attempts to merge it with the previous one when possible. The diff then increases to ~80kbytes. Oh well, I suppose it's not a surprise that even such basic forward substitution generates improvements...

@AndyAyersMS Speaking of chains of moves, here's a diff from the above mentioned experiment:

        jmp      SHORT G_M12430_IG04
 G_M12430_IG03:
        lea      rax, bword ptr [rdx+12]
-       mov      rcx, rax
-       mov      edx, dword ptr [rdx+8]
-       mov      eax, edx
-       mov      rdx, rcx
-       mov      rcx, rdx
-       mov      edx, eax
-       mov      rax, rcx
-       mov      ecx, edx
+       mov      ecx, dword ptr [rdx+8]
 G_M12430_IG04:
        movdqu   xmm0, qword ptr [rsp+48H]
        movdqu   qword ptr [rsp+38H], xmm0
        lea      rdx, bword ptr [rsp+28H]

@benaadams
Member

Change title to "Redundant code generated when inlining booleans"?

For SslStream.get_IsAuthenticated() in the following code (sharplab.io), the vestiges of surfacing the bools to a register remain, whereas after changing the methods to ternary expressions of true/false it optimizes correctly (sharplab.io).

With the following diff between the two approaches.

SslStream.get_IsAuthenticated()
    cmp qword [rcx+0x8], 0x0
-   jz L0058
+   jz L003d
    mov rax, [rcx+0x8]
    mov rax, [rax+0x8]
    test rax, rax
-   jz L0058
+   jz L003d
    mov edx, [rax]
    test byte [rax+0x10], 0x1
-   jnz L003b
+   jnz L003d
    add rax, 0x18
    mov rdx, [rax]
    test rdx, rdx         
-   jnz L0037     
+   jnz L0031
    mov rax, [rax+0x8]    
    test rax, rax    
-   setz al               
-   movzx eax, al         
-   jmp L0039             
-   xor eax, eax          
-   jmp L0040             
-   mov eax, 0x1          
-   test eax, eax         
-   setz al               
-   movzx eax, al         
-   test eax, eax         
-   jz L0058     
+   jz L003d              
    cmp qword [rcx+0x10], 0x0
-   jnz L0058
+   jnz L003d
    movzx eax, byte [rcx+0x18]
    ret
    xor eax, eax
    ret

The ternary is a less desirable approach as it reduces readability dotnet/corefx#40265 (comment)

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 1, 2020
@GrabYourPitchforks
Member

Comment at #53143 (comment) suggests that this has been resolved and that our workaround code in string.IsNullOrEmpty and other methods is no longer necessary. Can somebody confirm?

@SingleAccretion
Contributor

I believe the string.IsNullOrEmpty workaround (and other ? true : false ones) is still effective. What #53143 (comment) referred to was the fact that 0 >= (uint)_length; is no longer necessary, and could be written simply as _length <= 0 (or even _length == 0, but that I suspect will still have CQ issues).

@huoyaoyuan
Member

The current code gen is (confirmed with disasmo):

Not inlining

With ternary:

G_M47833_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M47833_IG02:
       test     rcx, rcx
       je       SHORT G_M47833_IG05
						;; bbWeight=1    PerfScore 1.25

G_M47833_IG03:
       cmp      dword ptr [rcx+8], 0
       je       SHORT G_M47833_IG05
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 2.12

G_M47833_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M47833_IG05:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.12

G_M47833_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 20

Without ternary:

G_M43467_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M43467_IG02:
       test     rcx, rcx
       je       SHORT G_M43467_IG05
						;; bbWeight=1    PerfScore 1.25

G_M43467_IG03:
       cmp      dword ptr [rcx+8], 0
       sete     al
       movzx    rax, al
						;; bbWeight=0.50 PerfScore 2.12

G_M43467_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M43467_IG05:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.12

G_M43467_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 22

The version without ternary contains one branch less.

Inlining

With ternary:

G_M27707_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M27707_IG02:
       test     rcx, rcx
       je       SHORT G_M27707_IG05
						;; bbWeight=1    PerfScore 1.25

G_M27707_IG03:
       cmp      dword ptr [rcx+8], 0
       je       SHORT G_M27707_IG05
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 2.12

G_M27707_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M27707_IG05:
       mov      eax, 42
						;; bbWeight=0.50 PerfScore 0.12

G_M27707_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 20

Without ternary:

G_M25673_IG01:
						;; bbWeight=1    PerfScore 0.00

G_M25673_IG02:
       test     rcx, rcx
       je       SHORT G_M25673_IG05
						;; bbWeight=1    PerfScore 1.25

G_M25673_IG03:
       cmp      dword ptr [rcx+8], 0
       je       SHORT G_M25673_IG05
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 2.12

G_M25673_IG04:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

G_M25673_IG05:
       mov      eax, 42
						;; bbWeight=0.50 PerfScore 0.12

G_M25673_IG06:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code: 20

The codegen is identical.

I think this issue is solved now.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 23, 2021
jasonsparc pushed a commit to srskokoro/srskokoro-maui that referenced this issue Feb 28, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 6, 2022