Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT performance regression with bool to int conversions #4398

Closed
redknightlois opened this issue Jul 29, 2015 · 24 comments
Closed

RyuJIT performance regression with bool to int conversions #4398

redknightlois opened this issue Jul 29, 2015 · 24 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@redknightlois
Copy link

It looks like RyuJIT is almost 30% less efficient than legacy in bool-to-int conversions. All 3 methods show more or less the same issue.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=Current

Method Jit AvrTime StdDev op/s
Framework LegacyJit 1.16 us 28.20 ns 858517.79
IfThenElse LegacyJit 1.08 us 44.58 ns 929458.93
UnsafeConvert LegacyJit 595.83 ns 5.79 ns 1678333.61
Framework RyuJit 1.31 us 18.64 ns 765539.56
IfThenElse RyuJit 1.31 us 16.05 ns 761023.73
UnsafeConvert RyuJit 896.80 ns 6.27 ns 1115078.04

Code in question: https://gist.github.com/redknightlois/c1ae5ddc6f73c2e53c9b

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

The fact that all 3 methods display the same issue would imply that the issue isn't related to conversion. The conversion that happens in the UnsafeConvert case is very different from the conversion that happens in the other 2 cases.

In the particular case of IfThenElse (which is probably the most common) there's no regression in the actual conversion, both JIT64 and RyuJIT generate identical code. The actual issue seems to be related to some low level and processor specific factors such as code alignment and size.

The funny thing is that if you write IfThenElse "properly", by avoiding the use of the static field

public static int IfThenElse(bool[] bits) {
    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += bits[i] ? 1 : 0;
    return sum;
}

then the results are reversed, the code generated by RyuJIT is faster than the code generated by JIT64. Well, at least on my machine...

@AndreyAkinshin
Copy link
Member

@redknightlois, @mikedn, I think, we should try to avoid work with arrays in this benchmark. Otherwise, we measure conversion time + access to array elements time + CPU cache efficiency.
.NET JIT can generate different version of the array element access code for static arrays, instance arrays, parameter arrays, and local arrays.

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

@AndreyAkinshin

I think, we should try to avoid work with arrays in this benchmark. Otherwise, we measure conversion time + access to array elements time + CPU cache efficiency.

The size of that array is too small for CPU data caching to impact the benchmark. Besides, if you get rid of the array there's not much left for a proper benchmark.

NET JIT can generate different version of the array element access code for static arrays, instance arrays, parameter arrays, and local arrays.

Actually the only difference is in the case of static arrays, in that case the compiler insists on reloading the array reference every iteration. But that didn't prevent JIT64 from getting a good result.

@AndreyAkinshin
Copy link
Member

@mikedn,

Besides, if you get rid of the array there's not much left for a proper benchmark.

BenchmarkDotNet assumes that you don't use arrays, if you don't want to measure some operations with arrays. You can measure just one or two instructions. For example, this benchmark works fine.

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

@AndreyAkinshin I don't understand what you're trying to say. If you want to measure how long it takes to sum the elements of an array then you do just that, sum the elements of the array and measure the time. You don't go and measure how long an individual addition operation takes because the result won't tell you anything about how long the sum operation takes.

@AndreyAkinshin
Copy link
Member

@mikedn, but the issue is about the bool to int conversions, not about a sum calculation.

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

I already stated in the my first post that the conversion code generated by RyuJIT and JIT64 for the IfThenElse case is identical. While the conversion code could be improved the regression isn't caused by that code but by surrounding, array/loop specific code. Code that you seem to suggest that is should be removed. The sum stuff was a simple analogy, much like the benchmark you linked to even if it's not related in any way to this issue.

There is a difference in conversion code generated for UnsafeConvert but I didn't bothered too much with that case as it's kind of ugly.

@redknightlois
Copy link
Author

@mikedn I discovered the issue because I had a very tight routine that last time I checked used to take 2:03 minutes, now it takes 3:07 minutes. There was nothing different but then I remember that I had installed Visual Studio 2015 in that machine and reduced the issue to its core. Then I just added the other alternatives to see how it fared. Up to 1-3% difference is acceptable, Up to 5-7% I can live with but 55% worse is a performance regression. I probably could have gone the extra mile and find out the code generated by both is the same, but even if that would be the case, the perf regression in this example does exist.

@AndreyAkinshin Probably the name is not the best one, as you say there are many different things playing in this beside the conversion itself.

So I just went and got rid of the array and used 2 instance variables... definitely not much better.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=Current

Method Jit AvrTime StdDev op/s
Framework LegacyJit 1.47 ns 0.0191 ns 678168087.72
IfThenElse LegacyJit 1.48 ns 0.0249 ns 676316149.82
UnsafeConvert LegacyJit 0.299 ns 0.00806 ns 3340169067.7
Framework RyuJit 2.97 ns 0.0216 ns 336993897.01
IfThenElse RyuJit 2.96 ns 0.0141 ns 338053580.07
UnsafeConvert RyuJit 8.59 ns 0.0329 ns 116442151.41

Repro: https://gist.github.com/redknightlois/045d3022c2c28e1498e2

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

@redknightlois

I discovered the issue because I had a very tight routine that last time I checked used to take 2:03 minutes, now it takes 3:07 minutes

Which one of the 2 benchmarks more accurately represent your actual code? In your latest benchmark the issue is easy to see and easy to fix (at the cost of increased code size). RyuJIT doesn't align certain branch targets like JIT64 did:

JIT 64 code for IfThenElse:

movzx       eax,byte ptr [rcx+8]  
test        eax,eax  
jne         00007FFCB3538160  
xor         edx,edx  
jmp         00007FFCB3538165  
nop         dword ptr [rax]  
mov         edx,1  
movzx       eax,byte ptr [rcx+9]  
test        eax,eax  
jne         00007FFCB3538171  
xor         eax,eax  
jmp         00007FFCB3538176  
mov         eax,1  
add         eax,edx  
ret  

RyuJIT code for IfThenElse:

movzx       eax,byte ptr [rcx+8]  
test        eax,eax  
jne         00007FFCB347056C  
xor         eax,eax  
jmp         00007FFCB3470571  
mov         eax,1  
movzx       edx,byte ptr [rcx+9]  
test        edx,edx  
jne         00007FFCB347057D  
xor         edx,edx  
jmp         00007FFCB3470582  
mov         edx,1  
add         eax,edx  
ret

The JIT64 version has a nop dword [rax] somewhere in the middle, that's intended to align the target of a previous branch to a 16 bytes boundary. The conversion code could and should be improved but the actual regression is caused by code alignment.

The array version of the benchmark has the same alignment issue but attempting to fix alignment in that case doesn't appear to help. There's another issue in that case, one which I don't see and apparently JIT64 avoided by accident since some variations of that code actually run slower on JIT64.

@redknightlois
Copy link
Author

The first benchmark with unsafe conversions is effectively the case it looks more like the code in question. We rely on unsafe code a lot, if it is of interest we can measure other places and issue repros if time permit.

There's another issue in that case, one which I don't see and apparently JIT64 avoided by accident since some variations of that code actually run slower on JIT64.

Love when that happens :D

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

The code generated by RyuJIT for UnsafeConvert is worse, it cannot eliminate the local variable because its address has been taken and that means that there's an additional memory load/store which could be responsible for the degraded performance.

But that code has a bug, it casts to int* when it should cast to byte*. Because of that JIT64 produces an incorrect result and if the code is fixed then the performance difference disappears. At least in my own tests...

@redknightlois
Copy link
Author

@mikedn Yes, that's true it can read garbage in this version (the original code is working on packed arrays of booleans). I will rerun that one and tell you what it looks like on this end.

I am testing a bunch of other routines (isolated) and have some other results. Do you want me to open new issues or I post them here?

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

I am testing a bunch of other routines (isolated) and have some other results. Do you want me to open new issues or I poste them here?

I'm not in a position to want anything from you, it's not like I work for Microsoft 😄. But I think that it would make sense to create different issues if the code is you have problem with is different.

@redknightlois
Copy link
Author

@mikedn LOL I was under the impression you were (you fake it quite well :D)

Well with the fixed UnsafeConvert there is still a huge difference.

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit AvrTime StdDev op/s
LegacyJit 0.00227 ns 0.00656 ns 450599340891.6
RyuJit 2.08 ns 0.0171 ns 479917468.58

Either the LegacyJit is doing some extra optimization that would skew the results and I am not being able to negate, or the difference in huge.

I have updated the gist code with the correct code.

@mikedn
Copy link
Contributor

mikedn commented Jul 29, 2015

Here are some funny results:

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit AvrTime StdDev op/s
LegacyJit 0.000000 ns 0.0155 ns 17554479418886200
RyuJit 1.58 ns 0.00509 ns 634902061.76

// BenchmarkDotNet=v0.7.6.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz, ProcessorCount=4
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Method=UnsafeConvert Mode=Throughput Platform=X64 .NET=Current

Jit AvrTime StdDev op/s
LegacyJit 1.68 ns 0.0619 ns 595674355.39
RyuJit 1.55 ns 0.0143 ns 643122192.73

The first result set is from your unchanged code, the second is after I added [MethodImpl(MethodImplOptions.NoInlining)] to the UnsafeConvert method.

The real problem isn't the conversion itself (even though the code isn't great as I stated above), the problem is that RyuJIT doesn't inline that method but JIT64 does.

@redknightlois
Copy link
Author

@AndreyAkinshin I know it is supposed to be called through a NoInlining call. Any idea why it appears to not being honoring the NoInlining in the first case?

@AndreyAkinshin
Copy link
Member

@mikedn, @redknightlois, it is definitely a BenchmarkDotNet bug. Inlining is a huge area of troubles for benchmarking. But I know how to fix it, I will publish BenchmarkDotNet v0.7.7 today or tomorrow. Just for now, you can use MethodImplOptions.NoInlining for all of the target methods, it should help.

@AndreyAkinshin
Copy link
Member

@mikedn, @redknightlois, I have published new version of BenchmarkDotNet: https://www.nuget.org/packages/BenchmarkDotNet/0.7.7

  1. The bug with inlining has been fixed.
  2. Now you can easily run gists with help of the BenchmarkRunner.RunUrl method or via cmd + BenchmarkDotNet.Samples.exe:
BenchmarkDotNet.Samples.exe https://gist.githubusercontent.com/redknightlois/c1ae5ddc6f73c2e53c9b/raw/38ea89e8766
2fa7054f8464b743a4a616a3172b0/Jit_BoolToInt.cs

@AndreyAkinshin
Copy link
Member

@mikedn, @redknightlois, I have also made some improvements for Jit_BoolToInt benchmark. UnsafeConvert worked too fast and it is hard to measure the time correctly with small StdDev. Now I make 6 convert operations. Moreover, the sum operation has a significant effect on the results. So, I replaced it with saving results to the fields: https://github.com/PerfDotNet/BenchmarkDotNet/blob/master/BenchmarkDotNet.Samples/Jit_BoolToInt.cs What do you think?

Here is my new results:
// BenchmarkDotNet=v0.7.7.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
// Host CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Common: Type=Jit_BoolToInt Mode=Throughput Platform=X64 .NET=HostFramework

Method Jit AvrTime StdDev op/s
Framework LegacyJit 0.535 ns 0.0112 ns 1867970565.31
IfThenElse LegacyJit 0.476 ns 0.0301 ns 2100057728.09
UnsafeConvert LegacyJit 0.186 ns 0.0139 ns 5374099628.64
Framework RyuJit 0.497 ns 0.0553 ns 2011678561.91
IfThenElse RyuJit 0.734 ns 0.0450 ns 1363137087.31
UnsafeConvert RyuJit 1.84 ns 0.0123 ns 542931614.3

@mikedn
Copy link
Contributor

mikedn commented Jul 30, 2015

So, to try to sum this up:

  • Framework/IfThenElse cases are likely affected by a lack of branch target alignment in the code generated by RyuJIT
  • UnsafeConvert is affected by RyuJIT's inability to eliminate redundant loads/stores of local variables that have their address taken.

@redknightlois
Copy link
Author

@AndreyAkinshin Looks pretty good, small enough for the JIT guys to look into it. 👍

@mmitche
Copy link
Member

mmitche commented Jul 30, 2015

Thanks guys, this is great stuff!

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@EgorBo
Copy link
Member

EgorBo commented Mar 4, 2020

Related: #32716

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Dec 1, 2020
@BruceForstall
Copy link
Member

Branch target alignment is being worked on.

It doesn't seem like there is much here otherwise that is actionable. I'm going to close this.

@BruceForstall BruceForstall changed the title RyuJIT performance regresion with bool to int conversions RyuJIT performance regression with bool to int conversions Dec 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

7 participants