Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-optimal codegen with C# switch statement handling for some simple cases #10634

Open
voinokin opened this issue Jul 5, 2018 · 10 comments
Open
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@voinokin
Copy link

voinokin commented Jul 5, 2018

The check for switch statement input value range is duplicated in JIT for some simple cases.
Consider the following repro code:

static readonly ulong v = 2;
static ulong result = 0;

static void Main(string[] args)
{
    switch (v)
    {
        case 2: ++result; goto case 1;
        case 1: ++result; goto case 0;
        case 0: ++result; goto default;
        default: break;
    }

    Console.WriteLine("v={0} => result={1}", v, result);
}

IL produced:

.method private hidebysig static void  Main(string[] args) cil managed
{
  .entrypoint
  // Code size       105 (0x69)
  .maxstack  3
  .locals init (uint64 V_0)
  IL_0000:  ldsfld     uint64 _100GbSort.Program2::v
  IL_0005:  stloc.0
  IL_0006:  ldloc.0
  IL_0007:  dup
  IL_0008:  ldc.i4.2
  IL_0009:  conv.i8
  IL_000a:  ble.un.s   IL_000f    <=== **range check from C# compiler**
  IL_000c:  pop
  IL_000d:  br.s       IL_004a    <=== **range check from C# compiler**
  IL_000f:  conv.u4
  IL_0010:  switch     ( 
                        IL_003d,
                        IL_0030,
                        IL_0023)
  IL_0021:  br.s       IL_004a
  IL_0023:  ldsfld     uint64 _100GbSort.Program2::result
  IL_0028:  ldc.i4.1
  IL_0029:  conv.i8
  IL_002a:  add
  IL_002b:  stsfld     uint64 _100GbSort.Program2::result
  IL_0030:  ldsfld     uint64 _100GbSort.Program2::result
  IL_0035:  ldc.i4.1
  IL_0036:  conv.i8
  IL_0037:  add
  IL_0038:  stsfld     uint64 _100GbSort.Program2::result
  IL_003d:  ldsfld     uint64 _100GbSort.Program2::result
  IL_0042:  ldc.i4.1
  IL_0043:  conv.i8
  IL_0044:  add
  IL_0045:  stsfld     uint64 _100GbSort.Program2::result
  IL_004a:  ldstr      "v={0} => result={1}"
  IL_004f:  ldsfld     uint64 _100GbSort.Program2::v
  IL_0054:  box        [System.Runtime]System.UInt64
  IL_0059:  ldsfld     uint64 _100GbSort.Program2::result
  IL_005e:  box        [System.Runtime]System.UInt64
  IL_0063:  call       void [System.Console]System.Console::WriteLine(string,
                                                                      object,
                                                                      object)
  IL_0068:  ret
} // end of method Program2::Main

Disasm:

--- Program2.cs ---
    switch (v)
000007FE74D51450  push        rdi  
000007FE74D51451  push        rsi  
000007FE74D51452  sub         rsp,28h  
000007FE74D51456  mov         rcx,7FE74C34B30h  
000007FE74D51460  mov         edx,1  
000007FE74D51465  call        000007FED48404F0  
000007FE74D5146A  mov         rsi,qword ptr [7FE74C34B68h]  
000007FE74D51471  mov         rdi,rsi  
000007FE74D51474  cmp         rdi,2               <======== **range check from C# compiler**
000007FE74D51478  ja          000007FE74D514AC  
000007FE74D5147A  cmp         edi,2               <========= **unneeded check**
000007FE74D5147D  ja          000007FE74D514AC    <========= **unneeded check**
000007FE74D5147F  mov         eax,edi  
000007FE74D51481  lea         rdx,[7FE74D51500h]  
000007FE74D51488  mov         edx,dword ptr [rdx+rax*4]  
000007FE74D5148B  lea         rcx,[7FE74D51456h]  
000007FE74D51492  add         rdx,rcx  
000007FE74D51495  jmp         rdx  
    {
        case 2: ++result; goto case 1;
000007FE74D51497  inc         qword ptr [7FE74C34B70h]  
        case 1: ++result; goto case 0;
000007FE74D5149E  inc         qword ptr [7FE74C34B70h]  
        case 0: ++result; goto default;
000007FE74D514A5  inc         qword ptr [7FE74C34B70h]  
        default: break;
    }

    Console.WriteLine("v={0} => result={1}", v, result);
000007FE74D514AC  mov         rcx,7FED4171B00h  
000007FE74D514B6  call        000007FED48400F0  
000007FE74D514BB  mov         rdi,rax  
000007FE74D514BE  mov         qword ptr [rdi+8],rsi  
000007FE74D514C2  mov         rcx,7FED4171B00h  
000007FE74D514CC  call        000007FED48400F0  
000007FE74D514D1  mov         r8,rax  
000007FE74D514D4  mov         rdx,qword ptr [7FE74C34B70h]  
000007FE74D514DB  mov         qword ptr [r8+8],rdx  
000007FE74D514DF  mov         rdx,rdi  
000007FE74D514E2  mov         rcx,qword ptr [12533068h]  
000007FE74D514EA  call        000007FE74D51308  
000007FE74D514EF  nop  
000007FE74D514F0  add         rsp,28h  
000007FE74D514F4  pop         rsi  
000007FE74D514F5  pop         rdi  
000007FE74D514F6  ret 

category:cq
theme:inlining
skill-level:intermediate
cost:medium

@mikedn
Copy link
Contributor

mikedn commented Jul 5, 2018

That's a side effect of using long switch expressions, switch requires int so going from long to int require a check. The second check is then generated as part of the switch instruction itself and the JIT doesn't have any capability to deal with this type of redundant range checks.

@voinokin
Copy link
Author

voinokin commented Jul 5, 2018

Also, it looks like the presence of switch blocks method inlining (not in repro code, but such result is easy to get). Is there any better way to go ?

@mikedn
Copy link
Contributor

mikedn commented Jul 5, 2018

Yep, IL switch instruction currently blocks inlining and AggresiveInlining does not help in this case. If you want to inline a switch then I presume that it is small enough so perhaps use could use a couple of ifs instead? The C# compiler already does this in some cases, not sure why it didn't do it for the switch in your example. Could be another consequence of the use of long.

@voinokin
Copy link
Author

voinokin commented Jul 5, 2018

I've tried with ifs and it is slower when there are 5 cases in similar code - namely 3, 2, 1, 0, default.
Since my code allows to downconvert ulong to uint input value without the loss of functionality, and this eliminates one extra range check (no inlining though), I'm going to leave this issue in its current state. Perhaps some time in the future...
Thanks for explanations anyways!

@AndyAyersMS
Copy link
Member

Inlining methods with switches is not supported by the default inlining policy, but other policies allow it, so some level of support is plumbed through. Would be simple enough to also enable it by default, perhaps core-only for now. Given how conservative the default inline policy is, it would also likely require aggressive inlining attribution, though very small switches might make it past the profitability screen.

Just need to split the HAS_SWITCH case off on its own and not set propagate = true;:

https://github.com/dotnet/coreclr/blob/cd1232d47cc028ebf6d22621ce63903a7e5c0e94/src/jit/inlinepolicy.cpp#L306-L309

Would also be nice to track if when args/constants reach switches and give a boost to the profitability metrics.

@voinokin
Copy link
Author

voinokin commented Jul 5, 2018

Related question - is it somehow possible to switch to other inlining policy thru say app.config or alike? Would be nice for my hi-perf code.

@mikedn
Copy link
Contributor

mikedn commented Jul 5, 2018

Given how conservative the default inline policy is, it would also likely require aggressive inlining attribution

IMO that's perfectly reasonable as switches can be relatively large. What's not reasonable is that methods containing switches do not inline even if aggressive inlining is specified, that seems like a completely artificial limitation.

though very small switches might make it past the profitability screen

I did a quick test - I simply removed the observation. This resulted in one method being inlined in corelib - SpinLock:IsEnterDeprioritized. I assume that this means that it passed the profitability screen though it's kind of odd as the resulting code is pretty large:

       cmp      esi, 3
       ja       SHORT G_M61114_IG11
       mov      eax, esi
       lea      rdx, [reloc @RWD00]
       mov      edx, dword ptr [rdx+4*rax]
       lea      rcx, G_M61114_IG02
       add      rdx, rcx
       jmp      rdx
G_M61114_IG08:
       mov      eax, dword ptr [rdi+4]
       shr      eax, 16
       movzx    rax, ax
       test     eax, eax
       setne    al
       movzx    rax, al
       test     eax, eax
       jne      SHORT G_M61114_IG12
       jmp      SHORT G_M61114_IG11
G_M61114_IG09:
       cmp      word  ptr [rdi+4], 0
       setne    al
       movzx    rax, al
       test     eax, eax
       jne      SHORT G_M61114_IG12
       jmp      SHORT G_M61114_IG11
G_M61114_IG10:
       cmp      word  ptr [rdi+4], 1
       seta     al
       movzx    rax, al
       test     eax, eax
       jne      SHORT G_M61114_IG12
G_M61114_IG11:

@AndyAyersMS
Copy link
Member

@voinokin nothing officially supported, no.

You can experiment with an alternate policy I created called the ModelPolicy by setting the environment variable COMPlus_JitInlinePolicyModel=1. It's intended to have a more accurate cost/benefit model, but not to be more aggressive overall.

There are other policies available if you do a custom build of the jit, but they are mainly there to enable stress testing or experimental studies.

@Zhentar
Copy link

Zhentar commented Jul 10, 2018

@mikedn Roslyn's switch handling aims for well-packed jump tables; I think it will always pick a switch over ifs when it can get three+ contiguous entries.

I experimented with a number of different configurations trying to see if I could find a way to write this that the JIT would inline without telling it to be aggressive:

				public static bool TryParseBaseline(ReadOnlySpan<byte> source, out int value, out int bytesConsumed, char standardFormat = default)
				{
					switch (standardFormat)
					{
						case default(char):
						case 'g':
						case 'G':
						case 'd':
						case 'D':
							return TryParseInt32D(source, out value, out bytesConsumed);

						case 'n':
						case 'N':
							return TryParseInt32N(source, out value, out bytesConsumed);

						case 'x':
						case 'X':
							value = default;
							return TryParseUInt32X(source, out Unsafe.As<int, uint>(ref value), out bytesConsumed);

						default:
							return ThrowHelper.TryParseThrowFormatException(out value, out bytesConsumed);
					}
				}

It was always rejected for too many IL bytes or too many basic blocks, even when I coerced Roslyn into emitting switch IL for parts of it (and other variations were considered unprofitable), so banning switches from inlining based on expected size seems to be pretty redundant. It also seems somewhat counter-productive - I suspect a significant portion of switch statements (at least at the C# syntax level - in IL iterator blocks probably significantly influence the frequency) are like this one where the common case by far is the caller providing a constant for an argument being switched on.

@mikedn
Copy link
Contributor

mikedn commented Jul 10, 2018

Roslyn's switch handling aims for well-packed jump tables; I think it will always pick a switch over ifs when it can get three+ contiguous entries.

Something like that, it looks like it goes after > 50% switch table density - https://github.com/dotnet/roslyn/blob/d4dab355b96955aca5b4b0ebf6282575fad78ba8/src/Compilers/Core/Portable/CodeGen/SwitchIntegralJumpTableEmitter.cs#L65-L94

It's debatable, in some cases it should probably go for less, especially if the resulting jump table wouldn't be too large. It would be easier for the JIT to look at a jump table and decide switch strategy to use than to analyze the binary search tree code that is otherwise produced.

It was always rejected for too many IL bytes or too many basic blocks, even when I coerced Roslyn into emitting switch IL for parts of it (and other variations were considered unprofitable), so banning switches from inlining based on expected size seems to be pretty redundant.

Yeah, it's a bit arbitrary. It could be that there was (and perhaps still is) a technical limitation in the JIT importer around inserting switch basic blocks back into the inliner method. But it seems more likely that someone simply decided that switches tend to be large so it's not worth the trouble. And when AggresiveInlining was added nobody thought of going back to revisit that decision.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants