RyuJIT performance regression with bool to int conversions #4398

redknightlois · 2015-07-29T01:28:47Z

It looks like RyuJIT is almost 30% less efficient than legacy in bool-to-int conversions. All 3 methods show more or less the same issue.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=Current

Method	Jit	AvrTime	StdDev	op/s
Framework	LegacyJit	1.16 us	28.20 ns	858517.79
IfThenElse	LegacyJit	1.08 us	44.58 ns	929458.93
UnsafeConvert	LegacyJit	595.83 ns	5.79 ns	1678333.61
Framework	RyuJit	1.31 us	18.64 ns	765539.56
IfThenElse	RyuJit	1.31 us	16.05 ns	761023.73
UnsafeConvert	RyuJit	896.80 ns	6.27 ns	1115078.04

Code in question: https://gist.github.com/redknightlois/c1ae5ddc6f73c2e53c9b

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium

mikedn · 2015-07-29T10:54:40Z

The fact that all 3 methods display the same issue would imply that the issue isn't related to conversion. The conversion that happens in the UnsafeConvert case is very different from the conversion that happens in the other 2 cases.

In the particular case of IfThenElse (which is probably the most common) there's no regression in the actual conversion, both JIT64 and RyuJIT generate identical code. The actual issue seems to be related to some low level and processor specific factors such as code alignment and size.

The funny thing is that if you write IfThenElse "properly", by avoiding the use of the static field

public static int IfThenElse(bool[] bits) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += bits[i] ? 1 : 0;
    return sum;
}

then the results are reversed, the code generated by RyuJIT is faster than the code generated by JIT64. Well, at least on my machine...

AndreyAkinshin · 2015-07-29T11:06:15Z

@redknightlois, @mikedn, I think, we should try to avoid work with arrays in this benchmark. Otherwise, we measure conversion time + access to array elements time + CPU cache efficiency.
.NET JIT can generate different version of the array element access code for static arrays, instance arrays, parameter arrays, and local arrays.

mikedn · 2015-07-29T11:54:34Z

@AndreyAkinshin

I think, we should try to avoid work with arrays in this benchmark. Otherwise, we measure conversion time + access to array elements time + CPU cache efficiency.

The size of that array is too small for CPU data caching to impact the benchmark. Besides, if you get rid of the array there's not much left for a proper benchmark.

NET JIT can generate different version of the array element access code for static arrays, instance arrays, parameter arrays, and local arrays.

Actually the only difference is in the case of static arrays, in that case the compiler insists on reloading the array reference every iteration. But that didn't prevent JIT64 from getting a good result.

AndreyAkinshin · 2015-07-29T12:18:18Z

@mikedn,

Besides, if you get rid of the array there's not much left for a proper benchmark.

BenchmarkDotNet assumes that you don't use arrays, if you don't want to measure some operations with arrays. You can measure just one or two instructions. For example, this benchmark works fine.

mikedn · 2015-07-29T12:38:23Z

@AndreyAkinshin I don't understand what you're trying to say. If you want to measure how long it takes to sum the elements of an array then you do just that, sum the elements of the array and measure the time. You don't go and measure how long an individual addition operation takes because the result won't tell you anything about how long the sum operation takes.

AndreyAkinshin · 2015-07-29T12:46:25Z

@mikedn, but the issue is about the bool to int conversions, not about a sum calculation.

mikedn · 2015-07-29T13:45:26Z

I already stated in the my first post that the conversion code generated by RyuJIT and JIT64 for the IfThenElse case is identical. While the conversion code could be improved the regression isn't caused by that code but by surrounding, array/loop specific code. Code that you seem to suggest that is should be removed. The sum stuff was a simple analogy, much like the benchmark you linked to even if it's not related in any way to this issue.

There is a difference in conversion code generated for UnsafeConvert but I didn't bothered too much with that case as it's kind of ugly.

redknightlois · 2015-07-29T13:49:29Z

@mikedn I discovered the issue because I had a very tight routine that last time I checked used to take 2:03 minutes, now it takes 3:07 minutes. There was nothing different but then I remember that I had installed Visual Studio 2015 in that machine and reduced the issue to its core. Then I just added the other alternatives to see how it fared. Up to 1-3% difference is acceptable, Up to 5-7% I can live with but 55% worse is a performance regression. I probably could have gone the extra mile and find out the code generated by both is the same, but even if that would be the case, the perf regression in this example does exist.

@AndreyAkinshin Probably the name is not the best one, as you say there are many different things playing in this beside the conversion itself.

So I just went and got rid of the array and used 2 instance variables... definitely not much better.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=Current

Method	Jit	AvrTime	StdDev	op/s
Framework	LegacyJit	1.47 ns	0.0191 ns	678168087.72
IfThenElse	LegacyJit	1.48 ns	0.0249 ns	676316149.82
UnsafeConvert	LegacyJit	0.299 ns	0.00806 ns	3340169067.7
Framework	RyuJit	2.97 ns	0.0216 ns	336993897.01
IfThenElse	RyuJit	2.96 ns	0.0141 ns	338053580.07
UnsafeConvert	RyuJit	8.59 ns	0.0329 ns	116442151.41

Repro: https://gist.github.com/redknightlois/045d3022c2c28e1498e2

mikedn · 2015-07-29T14:51:07Z

@redknightlois

I discovered the issue because I had a very tight routine that last time I checked used to take 2:03 minutes, now it takes 3:07 minutes

Which one of the 2 benchmarks more accurately represent your actual code? In your latest benchmark the issue is easy to see and easy to fix (at the cost of increased code size). RyuJIT doesn't align certain branch targets like JIT64 did:

JIT 64 code for IfThenElse:

movzx       eax,byte ptr [rcx+8]  
test        eax,eax  
jne         00007FFCB3538160  
xor         edx,edx  
jmp         00007FFCB3538165  
nop         dword ptr [rax]  
mov         edx,1  
movzx       eax,byte ptr [rcx+9]  
test        eax,eax  
jne         00007FFCB3538171  
xor         eax,eax  
jmp         00007FFCB3538176  
mov         eax,1  
add         eax,edx  
ret

RyuJIT code for IfThenElse:

movzx       eax,byte ptr [rcx+8]  
test        eax,eax  
jne         00007FFCB347056C  
xor         eax,eax  
jmp         00007FFCB3470571  
mov         eax,1  
movzx       edx,byte ptr [rcx+9]  
test        edx,edx  
jne         00007FFCB347057D  
xor         edx,edx  
jmp         00007FFCB3470582  
mov         edx,1  
add         eax,edx  
ret

The JIT64 version has a nop dword [rax] somewhere in the middle, that's intended to align the target of a previous branch to a 16 bytes boundary. The conversion code could and should be improved but the actual regression is caused by code alignment.

The array version of the benchmark has the same alignment issue but attempting to fix alignment in that case doesn't appear to help. There's another issue in that case, one which I don't see and apparently JIT64 avoided by accident since some variations of that code actually run slower on JIT64.

redknightlois · 2015-07-29T15:00:44Z

The first benchmark with unsafe conversions is effectively the case it looks more like the code in question. We rely on unsafe code a lot, if it is of interest we can measure other places and issue repros if time permit.

There's another issue in that case, one which I don't see and apparently JIT64 avoided by accident since some variations of that code actually run slower on JIT64.

Love when that happens :D

mikedn · 2015-07-29T15:36:56Z

The code generated by RyuJIT for UnsafeConvert is worse, it cannot eliminate the local variable because its address has been taken and that means that there's an additional memory load/store which could be responsible for the degraded performance.

But that code has a bug, it casts to int* when it should cast to byte*. Because of that JIT64 produces an incorrect result and if the code is fixed then the performance difference disappears. At least in my own tests...

redknightlois · 2015-07-29T15:49:21Z

@mikedn Yes, that's true it can read garbage in this version (the original code is working on packed arrays of booleans). I will rerun that one and tell you what it looks like on this end.

I am testing a bunch of other routines (isolated) and have some other results. Do you want me to open new issues or I post them here?

mikedn · 2015-07-29T15:54:53Z

I am testing a bunch of other routines (isolated) and have some other results. Do you want me to open new issues or I poste them here?

I'm not in a position to want anything from you, it's not like I work for Microsoft 😄. But I think that it would make sense to create different issues if the code is you have problem with is different.

redknightlois · 2015-07-29T16:23:22Z

@mikedn LOL I was under the impression you were (you fake it quite well :D)

Well with the fixed UnsafeConvert there is still a huge difference.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit	AvrTime	StdDev	op/s
LegacyJit	0.00227 ns	0.00656 ns	450599340891.6
RyuJit	2.08 ns	0.0171 ns	479917468.58

Either the LegacyJit is doing some extra optimization that would skew the results and I am not being able to negate, or the difference in huge.

I have updated the gist code with the correct code.

mikedn · 2015-07-29T18:03:04Z

Here are some funny results:

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit	AvrTime	StdDev	op/s
LegacyJit	0.000000 ns	0.0155 ns	17554479418886200
RyuJit	1.58 ns	0.00509 ns	634902061.76

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit	AvrTime	StdDev	op/s
LegacyJit	1.68 ns	0.0619 ns	595674355.39
RyuJit	1.55 ns	0.0143 ns	643122192.73

The first result set is from your unchanged code, the second is after I added [MethodImpl(MethodImplOptions.NoInlining)] to the UnsafeConvert method.

The real problem isn't the conversion itself (even though the code isn't great as I stated above), the problem is that RyuJIT doesn't inline that method but JIT64 does.

redknightlois · 2015-07-29T18:10:50Z

@AndreyAkinshin I know it is supposed to be called through a NoInlining call. Any idea why it appears to not being honoring the NoInlining in the first case?

AndreyAkinshin · 2015-07-30T03:31:15Z

@mikedn, @redknightlois, it is definitely a BenchmarkDotNet bug. Inlining is a huge area of troubles for benchmarking. But I know how to fix it, I will publish BenchmarkDotNet v0.7.7 today or tomorrow. Just for now, you can use MethodImplOptions.NoInlining for all of the target methods, it should help.

AndreyAkinshin · 2015-07-30T09:26:11Z

@mikedn, @redknightlois, I have published new version of BenchmarkDotNet: https://www.nuget.org/packages/BenchmarkDotNet/0.7.7

The bug with inlining has been fixed.
Now you can easily run gists with help of the BenchmarkRunner.RunUrl method or via cmd + BenchmarkDotNet.Samples.exe:

BenchmarkDotNet.Samples.exe https://gist.githubusercontent.com/redknightlois/c1ae5ddc6f73c2e53c9b/raw/38ea89e8766
2fa7054f8464b743a4a616a3172b0/Jit_BoolToInt.cs

AndreyAkinshin · 2015-07-30T09:44:33Z

@mikedn, @redknightlois, I have also made some improvements for Jit_BoolToInt benchmark. UnsafeConvert worked too fast and it is hard to measure the time correctly with small StdDev. Now I make 6 convert operations. Moreover, the sum operation has a significant effect on the results. So, I replaced it with saving results to the fields: https://github.com/PerfDotNet/BenchmarkDotNet/blob/master/BenchmarkDotNet.Samples/Jit_BoolToInt.cs What do you think?

Here is my new results:
// BenchmarkDotNet=v0.7.7.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
// Host CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=HostFramework

Method	Jit	AvrTime	StdDev	op/s
Framework	LegacyJit	0.535 ns	0.0112 ns	1867970565.31
IfThenElse	LegacyJit	0.476 ns	0.0301 ns	2100057728.09
UnsafeConvert	LegacyJit	0.186 ns	0.0139 ns	5374099628.64
Framework	RyuJit	0.497 ns	0.0553 ns	2011678561.91
IfThenElse	RyuJit	0.734 ns	0.0450 ns	1363137087.31
UnsafeConvert	RyuJit	1.84 ns	0.0123 ns	542931614.3

mikedn · 2015-07-30T10:41:55Z

So, to try to sum this up:

Framework/IfThenElse cases are likely affected by a lack of branch target alignment in the code generated by RyuJIT
UnsafeConvert is affected by RyuJIT's inability to eliminate redundant loads/stores of local variables that have their address taken.

redknightlois · 2015-07-30T14:00:08Z

@AndreyAkinshin Looks pretty good, small enough for the JIT guys to look into it. 👍

mmitche · 2015-07-30T22:21:09Z

Thanks guys, this is great stuff!

EgorBo · 2020-03-04T23:37:29Z

Related: #32716

BruceForstall · 2020-12-01T23:37:41Z

Branch target alignment is being worked on.

It doesn't seem like there is much here otherwise that is actionable. I'm going to close this.

msftgits transferred this issue from dotnet/coreclr Jan 30, 2020

msftgits added this to the Future milestone Jan 30, 2020

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 1, 2020

BruceForstall closed this as completed Dec 1, 2020

BruceForstall changed the title ~~RyuJIT performance regresion with bool to int conversions~~ RyuJIT performance regression with bool to int conversions Dec 1, 2020

ghost locked as resolved and limited conversation to collaborators Jan 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RyuJIT performance regression with bool to int conversions #4398

RyuJIT performance regression with bool to int conversions #4398

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

AndreyAkinshin commented Jul 30, 2015

AndreyAkinshin commented Jul 30, 2015

AndreyAkinshin commented Jul 30, 2015

mikedn commented Jul 30, 2015

redknightlois commented Jul 30, 2015

mmitche commented Jul 30, 2015

EgorBo commented Mar 4, 2020

BruceForstall commented Dec 1, 2020

RyuJIT performance regression with bool to int conversions #4398

RyuJIT performance regression with bool to int conversions #4398

Comments

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

AndreyAkinshin commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

mikedn commented Jul 29, 2015

redknightlois commented Jul 29, 2015

AndreyAkinshin commented Jul 30, 2015

AndreyAkinshin commented Jul 30, 2015

AndreyAkinshin commented Jul 30, 2015

mikedn commented Jul 30, 2015

redknightlois commented Jul 30, 2015

mmitche commented Jul 30, 2015

EgorBo commented Mar 4, 2020

BruceForstall commented Dec 1, 2020