33% performance degradation in .NET 5 #44457

Closed
koszeggy opened this issue Nov 10, 2020 · 11 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), regression-from-last-release, tenet-performance (Performance related issue)

@koszeggy

koszeggy commented Nov 10, 2020

Description

Note: I don't know whether the root cause of my issue is related to #36907, so I am reporting an easily reproducible scenario here.

I have a high-performance core library with some multidimensional span-like types, such as the Array2D and Array3D structs, which are affected by the performance degradation: on .NET Core 3, accessing elements of these types is faster than accessing elements of a regular multidimensional array, but on .NET 5 it is not.
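
For readers unfamiliar with the pattern, a span-like 2D type of this kind can be sketched as a struct that maps [y, x] onto a flat one-dimensional buffer. The sketch below is only an illustration: the name Grid2D<T> and its members are hypothetical, and it is not the Array2D<T> implementation from the library (which, per this thread, wraps an ArraySection and can use pooled storage).

```csharp
using System;

// Hypothetical illustration only; NOT the KGySoft Array2D<T> implementation.
public readonly struct Grid2D<T>
{
    private readonly T[] _buffer;

    public int Height { get; }
    public int Width { get; }

    public Grid2D(int height, int width)
    {
        if (height < 0) throw new ArgumentOutOfRangeException(nameof(height));
        if (width < 0) throw new ArgumentOutOfRangeException(nameof(width));

        // One flat allocation; rows are stored one after another.
        _buffer = new T[checked(height * width)];
        Height = height;
        Width = width;
    }

    // Row-major mapping: element (y, x) lives at flat index y * Width + x.
    // Only the flat index is range-checked (by the underlying array access),
    // not y and x separately.
    public ref T this[int y, int x] => ref _buffer[y * Width + x];
}
```

With this sketch, `grid[y, x] = value` compiles down to one multiplication, one addition and a single bounds check on the flat array, which is the reason such a wrapper can beat a regular int[y, x] access that checks each dimension.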

Reproduction:

  • Fork or download this repository
  • The library targets many frameworks. To observe the degradation, it is enough to target only .NET Core 3.x and .NET 5.0.
  • From the KGySoft.CoreLibraries.PerformanceTest project execute the Array2DPerformanceTest.AccessTest against both .NET Core 3 and .NET 5
  • Observe the results in the console output

Online live version: I also created an online example. As of 11/10/2020, it runs the performance test on .NET Core 3.1 (this is a somewhat shortened test so that it does not time out). Targeting .NET 5 is not yet possible on .NET Fiddle.

Configuration

  • dotnet --version: 5.0.100-rc.2.20479.15
  • Windows 10 [Version 10.0.19042.572] 64 bit version

Regression?

The regression can be observed between .NET Core 3.0/3.1 and .NET 5.0

Data

  • .NET Core 3.0 results on my machine:
==[AccessTest (.NET Core 3.0.0) Results]================================================
Iterations: 10,000
Warming up: Yes
Test cases: 3
Repeats: 5
Calling GC.Collect: Yes
Forced CPU Affinity: 2
Cases are sorted by time (quickest first)
--------------------------------------------------
1. int[y][x] = value: average time: 389.46 ms
  #1         389.59 ms
  #2         389.14 ms
  #3         388.61 ms	 <---- Best
  #4         389.39 ms
  #5         390.58 ms	 <---- Worst
  Worst-Best difference: 1.98 ms (0.51%)
2. Array2D<int>[y, x] = value: average time: 642.20 ms (+252.74 ms / 164.89%)
  #1         641.34 ms
  #2         642.98 ms
  #3         642.63 ms
  #4         643.43 ms	 <---- Worst
  #5         640.61 ms	 <---- Best
  Worst-Best difference: 2.83 ms (0.44%)
3. int[y, x] = value: average time: 701.46 ms (+312.00 ms / 180.11%)
  #1         702.69 ms
  #2         704.56 ms	 <---- Worst
  #3         700.29 ms
  #4         701.06 ms
  #5         698.72 ms	 <---- Best
  Worst-Best difference: 5.84 ms (0.84%)
  • .NET 5.0 RC2 results on my machine:
    The Array2D case shows a 33% performance degradation (860 ms vs. 642 ms), while the regular 2D array and jagged array performance remained essentially unchanged.
==[AccessTest (.NET Core 5.0.0-rc.2.20475.5) Results]================================================
Iterations: 10,000
Warming up: Yes
Test cases: 3
Repeats: 5
Calling GC.Collect: Yes
Forced CPU Affinity: 2
Cases are sorted by time (quickest first)
--------------------------------------------------
1. int[y][x] = value: average time: 395.04 ms
  #1         395.01 ms
  #2         393.34 ms	 <---- Best
  #3         397.91 ms	 <---- Worst
  #4         394.28 ms
  #5         394.68 ms
  Worst-Best difference: 4.57 ms (1.16%)
2. int[y, x] = value: average time: 704.27 ms (+309.23 ms / 178.28%)
  #1         702.28 ms
  #2         702.97 ms
  #3         703.96 ms
  #4         700.14 ms	 <---- Best
  #5         712.02 ms	 <---- Worst
  Worst-Best difference: 11.89 ms (1.70%)
3. Array2D<int>[y, x] = value: average time: 860.47 ms (+465.42 ms / 217.82%)
  #1         848.51 ms	 <---- Best
  #2         870.71 ms	 <---- Worst
  #3         853.88 ms
  #4         866.94 ms
  #5         862.30 ms
  Worst-Best difference: 22.20 ms (2.62%)
  • .NET Core 3.1 online results (.NET Fiddle): https://dotnetfiddle.net/02BdPF
    Note: This test is reduced (both in time and in the number of cases) so that it does not time out on .NET Fiddle. As it is not yet possible to run .NET 5 code online, it is only good for demonstrating that Array2D access is faster than regular 2D array access.
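
The size of the regression can be checked from the averages above with a small helper. This is just a worked calculation; the class and method names are made up for illustration.

```csharp
using System;

public static class Slowdown
{
    // Relative increase of afterMs over beforeMs as a fraction
    // (0.25 means "25% slower").
    public static double Relative(double beforeMs, double afterMs)
    {
        if (beforeMs <= 0) throw new ArgumentOutOfRangeException(nameof(beforeMs));
        return (afterMs - beforeMs) / beforeMs;
    }
}
```

For the Array2D case, Slowdown.Relative(642.20, 860.47) is about 0.34, in line with the roughly one-third slowdown in the title. The same formula gives about 0.004 for int[y, x] (701.46 ms vs. 704.27 ms) and about 0.014 for int[y][x] (389.46 ms vs. 395.04 ms), both close to the run-to-run differences reported above.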

Analysis

I'm not sure whether I have identified the hot spot correctly. Array2D uses ArraySection internally, and I could not observe any significant degradation in the ArraySection performance test (feel free to set Repeat = 5 to get more reliable results, just like above), so I suspect that the issue lies in accessing the wrapped ArraySection struct inside the Array2D struct here. However, I could not find anything suspicious in the IL code, and I could not check the JITted machine code of the .NET 5 version.

@koszeggy koszeggy added the tenet-performance Performance related issue label Nov 10, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 10, 2020
@Dotnet-GitSync-Bot
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@SingleAccretion
Contributor

I have run the benchmarks on my machine (Ivy Bridge) using your test from the repository, and did not see a regression from 3.1.8 to 5.0.0 (what I did see however was the results tended down with each run):

==[AccessTest (.NET Core 5.0.0-rc.2.20475.5) Results]==
1. int[y][x] = value: average time: 545,16 ms
  Worst-Best difference: 5,10 ms (0,94 %)
2. Array2D<int>[y, x] = value: average time: 818,02 ms (+272,86 ms / 150,05 %)
  Worst-Best difference: 127,29 ms (16,49 %)
3. int[y, x] = value: average time: 908,14 ms (+362,98 ms / 166,58 %)
  Worst-Best difference: 4,54 ms (0,50 %)
  
==[AccessTest (.NET Core 3.1.8) Results]==
1. int[y][x] = value: average time: 550,16 ms
  Worst-Best difference: 3,37 ms (0,61%)
2. Array2D<int>[y, x] = value: average time: 828,48 ms (+278,32 ms / 150,59%)
  Worst-Best difference: 9,24 ms (1,12%)
3. int[y, x] = value: average time: 1 002,67 ms (+452,51 ms / 182,25%)
  Worst-Best difference: 10,39 ms (1,04%)

I lost the ability to run your tests after that for some reason and now have replicated your setup with Benchmark.NET.

[SimpleJob(RuntimeMoniker.NetCoreApp31)]
[SimpleJob(RuntimeMoniker.NetCoreApp30)]
[SimpleJob(RuntimeMoniker.NetCoreApp50)]
public class Benchmarks
{
    public const int Width = 320;
    public const int Height = 200;

    [Benchmark]
    public void AssignArray2DMD()
    {
        var array2d = new Array2D<int>(Height, Width);
        int i = 0;
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
                array2d[y, x] = ++i;
        }
    }
}

Using this setup I too was not able to replicate the regression:

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1139 (1909/November2018Update/19H2)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20529.5
  [Host]        : .NET Core 3.0.3 (CoreCLR 4.700.20.6603, CoreFX 4.700.20.6701), X64 RyuJIT
  .NET Core 3.0 : .NET Core 3.0.3 (CoreCLR 4.700.20.6603, CoreFX 4.700.20.6701), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT


|          Method |           Job |       Runtime |     Mean |   Error |  StdDev |
|---------------- |-------------- |-------------- |---------:|--------:|--------:|
| AssignArray2DMD | .NET Core 3.0 | .NET Core 3.0 | 168.1 us | 3.09 us | 5.17 us |
| AssignArray2DMD | .NET Core 3.1 | .NET Core 3.1 | 166.5 us | 2.60 us | 2.44 us |
| AssignArray2DMD | .NET Core 5.0 | .NET Core 5.0 | 165.4 us | 3.30 us | 3.24 us |

@koszeggy
Author

koszeggy commented Nov 10, 2020

what I did see however was the results tended down with each run

I'm not sure how you tested it but the self allocating constructor uses ArrayPool internally on targets where it is available. Are you sure you disposed the array2d instance in the end? In your Benchmark.NET tests it is not disposed, at least.
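
Why disposal matters for pooled storage can be sketched with System.Buffers.ArrayPool<T> directly. The helper below is a hypothetical stand-in, not the library's constructor or Dispose logic: it only shows the rent/return pairing that a Dispose call would be expected to complete.

```csharp
using System;
using System.Buffers;

public static class PooledBufferDemo
{
    // Rents a buffer, uses the first `length` elements, and always returns it.
    public static long FillAndSum(int length)
    {
        // Rent may hand back an array LONGER than requested,
        // so only the first `length` elements are used here.
        int[] buffer = ArrayPool<int>.Shared.Rent(length);
        try
        {
            long sum = 0;
            for (int i = 0; i < length; i++)
            {
                buffer[i] = i + 1;
                sum += buffer[i];
            }
            return sum;
        }
        finally
        {
            // Without this call the array is never handed back to the pool,
            // so later Rent calls may have to allocate fresh arrays.
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}
```

A benchmark that rents on every invocation but never returns would measure allocation and GC behaviour in addition to the indexer, which may explain results drifting from run to run.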

Using Benchmark.NET

Unfortunately the overhead of Benchmark.NET is a bit too large for very quick test cases so you need to "magnify" the actual payload of the test.

If you modify the test method like this:

[Benchmark]
public void AssignArray2DMD()
{
    var array2d = new Array2D<int>(Height, Width);

    for (int iter = 0; iter < 10_000; iter++)
    {
        int i = 0;
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
                array2d[y, x] = ++i;
        }
    }

    array2d.Dispose();
}

then the difference will be clear (at least on my machine):

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i5-8300H CPU 2.30GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]        : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  .NET Core 3.0 : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT


|          Method |           Job |       Runtime |     Mean |   Error |   StdDev |
|---------------- |-------------- |-------------- |---------:|--------:|---------:|
| AssignArray2DMD | .NET Core 3.0 | .NET Core 3.0 | 560.4 ms | 3.12 ms |  2.92 ms |
| AssignArray2DMD | .NET Core 3.1 | .NET Core 3.1 | 568.8 ms | 9.79 ms | 13.73 ms |
| AssignArray2DMD | .NET Core 5.0 | .NET Core 5.0 | 848.4 ms | 3.87 ms |  3.62 ms |

@SingleAccretion
Contributor

SingleAccretion commented Nov 10, 2020

I'm not sure how you tested it but the self allocating constructor uses ArrayPool internally on targets where it is available. Are you sure you disposed the array2d instance in the end? In your Benchmark.NET tests it is not disposed, at least.

I see, I did not think that would matter because I assumed the test was about the quality of indexer codegen. I have now run the corrected test you provided and there is still no difference:

[Benchmark]
public void AssignArray2D()
{
    var array2d = new Array2D<int>(Height, Width);

    for (int iter = 0; iter < 10_000; iter++)
    {
        int i = 0;
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
                array2d[y, x] = ++i;
        }
    }

    array2d.Dispose();
}
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1139 (1909/November2018Update/19H2)
Intel Core i7-4820K CPU 3.70GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=6.0.100-alpha.1.20529.5
  [Host]        : .NET Core 3.0.3 (CoreCLR 4.700.20.6603, CoreFX 4.700.20.6701), X64 RyuJIT
  .NET Core 3.0 : .NET Core 3.0.3 (CoreCLR 4.700.20.6603, CoreFX 4.700.20.6701), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT


|        Method |           Job |       Runtime |     Mean |    Error |   StdDev |
|-------------- |-------------- |-------------- |---------:|---------:|---------:|
| AssignArray2D | .NET Core 3.0 | .NET Core 3.0 | 757.3 ms | 15.09 ms | 18.53 ms |
| AssignArray2D | .NET Core 3.1 | .NET Core 3.1 | 755.6 ms | 14.66 ms | 16.88 ms |
| AssignArray2D | .NET Core 5.0 | .NET Core 5.0 | 760.1 ms | 15.07 ms | 23.01 ms |

BTW, it could be a good idea to get the disassembly for these methods with [DisassemblyDiagnoser(maxDepth: 5)].

@koszeggy
Author

Hmm, it must be a platform-dependent thing, then. :(

Thanks for the tip, I will provide a disassembly when I have time again.

@AndyAyersMS
Member

Could be you have an alignment-sensitive loop? cc @kunalspathak
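
One rough way to reason about alignment sensitivity is to count how many aligned code windows a loop body touches at a given start address. The helper below is a sketch under assumptions: the 32-byte window is a typical figure for x64 front ends rather than something stated in this thread, and the name WindowsTouched is invented for illustration.

```csharp
using System;

public static class LoopAlignment
{
    // Counts the aligned windows of `windowSize` bytes that the byte range
    // [start, start + length) overlaps. The default of 32 is an assumption.
    public static int WindowsTouched(ulong start, int length, int windowSize = 32)
    {
        if (length <= 0) throw new ArgumentOutOfRangeException(nameof(length));
        if (windowSize <= 0) throw new ArgumentOutOfRangeException(nameof(windowSize));

        ulong window = (ulong)windowSize;
        ulong first = start / window;                       // window holding the first byte
        ulong last = (start + (ulong)length - 1) / window;  // window holding the last byte
        return (int)(last - first + 1);
    }
}
```

For example, a 59-byte loop body starting at address 0x10 touches three 32-byte windows, while the same loop starting at 0x20 touches only two. Identical machine code can therefore behave differently depending on where the JIT happens to place the method.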

@koszeggy
Author

I was playing with DisassemblyDiagnoser a bit and I made the following observations:

  • There is about a 50% chance that there is no difference between .NET Core 3.x and 5.0. If this is an alignment issue, maybe the JITter can have a lucky day and place the code at a well-aligned address? Remark: I could not observe such a "coin toss" when using my performance test (see the original post), as the difference always emerged there.
  • If the difference emerges, there is another 50% chance that Benchmark.NET is unable to dump any results for .NET 5.0 (Code Size will be 40 bytes and the dump contains only a short BenchmarkDotNet.Autogenerated.Runnable_0.__ForDisassemblyDiagnoser__() method).
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i5-8300H CPU 2.30GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]        : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  .NET Core 3.0 : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), X64 RyuJIT
  .NET Core 3.1 : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  .NET Core 5.0 : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT


|          Method |           Job |       Runtime |     Mean |   Error |  StdDev | Code Size |
|---------------- |-------------- |-------------- |---------:|--------:|--------:|----------:|
| AssignArray2DMD | .NET Core 3.0 | .NET Core 3.0 | 562.3 ms | 2.60 ms | 2.30 ms |    1082 B |
| AssignArray2DMD | .NET Core 3.1 | .NET Core 3.1 | 560.4 ms | 2.63 ms | 2.20 ms |    1082 B |
| AssignArray2DMD | .NET Core 5.0 | .NET Core 5.0 | 852.3 ms | 4.38 ms | 4.10 ms |    1089 B |

And here are the assembly dumps. As the relevant code is all inlined, I copy-pasted only the disassembled AssignArray2DMD method. The rest (ctor, Release and deeper calls) is not relevant here.

.NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

; ConsoleApp1.Benchmarks.AssignArray2DMD()
       push      rdi
       push      rsi
       sub       rsp,58
       vzeroupper
       mov       rsi,rcx
       lea       rdi,[rsp+20]
       mov       ecx,0E
       xor       eax,eax
       rep stosd
       mov       rcx,rsi
       mov       dword ptr [rsp+3C],0C8
       mov       dword ptr [rsp+38],140
       xor       ecx,ecx
       lea       rdx,[rsp+20]
       vxorps    xmm0,xmm0,xmm0
       vmovdqu   xmmword ptr [rdx],xmm0
       mov       [rdx+10],rcx
       lea       rcx,[rsp+20]
       mov       edx,0FA00
       mov       r8d,1
       call      KGySoft.Collections.ArraySection`1[[System.Int32, System.Private.CoreLib]]..ctor(Int32, Boolean)
       vmovdqu   xmm0,xmmword ptr [rsp+20]
       vmovdqu   xmmword ptr [rsp+40],xmm0
       mov       rcx,[rsp+30]
       mov       [rsp+50],rcx
       xor       ecx,ecx
M00_L00:
       xor       eax,eax
       xor       edx,edx
M00_L01:
       xor       r8d,r8d
M00_L02:
       inc       eax
       mov       r9d,edx
       imul      r9d,[rsp+38]
       add       r9d,r8d
       cmp       qword ptr [rsp+40],0
       je        short M00_L03
       mov       r10,[rsp+40]
       add       r9d,[rsp+48]
       cmp       r9d,[r10+8]
       jae       short M00_L04
       movsxd    r9,r9d
       mov       [r10+r9*4+10],eax
       inc       r8d
       cmp       r8d,140
       jl        short M00_L02
       inc       edx
       cmp       edx,0C8
       jl        short M00_L01
       inc       ecx
       cmp       ecx,2710
       jl        short M00_L00
       lea       rcx,[rsp+40]
       call      KGySoft.Collections.ArraySection`1[[System.Int32, System.Private.CoreLib]].Release()
       lea       rax,[rsp+38]
       vxorps    xmm0,xmm0,xmm0
       vmovdqu   xmmword ptr [rax],xmm0
       vmovdqu   xmmword ptr [rax+10],xmm0
       add       rsp,58
       pop       rsi
       pop       rdi
       ret
M00_L03:
       call      KGySoft.Throw.IndexOutOfRangeException()
       int       3
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 241

.NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT

; ConsoleApp1.Benchmarks.AssignArray2DMD()
       sub       rsp,58
       vzeroupper
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+20],xmm4
       vmovdqa   xmmword ptr [rsp+30],xmm4
       vmovdqa   xmmword ptr [rsp+40],xmm4
       xor       eax,eax
       mov       [rsp+50],rax
       mov       dword ptr [rsp+3C],0C8
       mov       dword ptr [rsp+38],140
       lea       rcx,[rsp+20]
       mov       edx,0FA00
       mov       r8d,1
       call      KGySoft.Collections.ArraySection`1[[System.Int32, System.Private.CoreLib]]..ctor(Int32, Boolean)
       vmovdqu   xmm0,xmmword ptr [rsp+20]
       vmovdqu   xmmword ptr [rsp+40],xmm0
       mov       rcx,[rsp+30]
       mov       [rsp+50],rcx
       xor       ecx,ecx
M00_L00:
       xor       eax,eax
       xor       edx,edx
M00_L01:
       xor       r8d,r8d
M00_L02:
       inc       eax
       mov       r9d,edx
       imul      r9d,[rsp+38]
       add       r9d,r8d
       cmp       qword ptr [rsp+40],0
       je        short M00_L03
       mov       r10,[rsp+40]
       add       r9d,[rsp+48]
       cmp       r9d,[r10+8]
       jae       short M00_L04
       movsxd    r9,r9d
       mov       [r10+r9*4+10],eax
       inc       r8d
       cmp       r8d,140
       jl        short M00_L02
       inc       edx
       cmp       edx,0C8
       jl        short M00_L01
       inc       ecx
       cmp       ecx,2710
       jl        short M00_L00
       lea       rcx,[rsp+40]
       call      KGySoft.Collections.ArraySection`1[[System.Int32, System.Private.CoreLib]].Release()
       vxorps    xmm0,xmm0,xmm0
       vmovdqu   xmmword ptr [rsp+38],xmm0
       vmovdqu   xmmword ptr [rsp+48],xmm0
       add       rsp,58
       ret
M00_L03:
       call      KGySoft.Throw.IndexOutOfRangeException()
       int       3
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 225

Bonus content: .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT - when it is unable to dump the results:

; BenchmarkDotNet.Autogenerated.Runnable_0.__ForDisassemblyDiagnoser__()
       push      rbp
       sub       rsp,20
       lea       rbp,[rsp+20]
       mov       [rbp+10],rcx
       mov       rcx,[rbp+10]
       cmp       dword ptr [rcx+38],0B
       jne       short M00_L00
       mov       rcx,[rbp+10]
       call      00007FF8797F2578
M00_L00:
       nop
       lea       rsp,[rbp]
       pop       rbp
       ret
; Total bytes of code 40

@danmoseley
Member

@kunalspathak thoughts about where investigation should start?

@kunalspathak
Member

Below is the assembly code for .NET Core 3.1 and .NET 5.0. The test cases map to the following methods:

  • int[y, x] : <>c__DisplayClass1_0:b__0()
  • int[y][x] : <>c__DisplayClass1_0:b__1()
  • Array2D<int>[y, x] : <>c__DisplayClass1_0:b__2()
.NET 3.1
; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__0():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T05] (  3, 18   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T03] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T04] (  5, 29   )     int  ->  rdx        
;  V03 loc2         [V03,T01] (  5, 68   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T02] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 rat0         [V06,T00] (  6,192   )     ref  ->   r9         "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 32

G_M55748_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32

G_M55748_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx

G_M55748_IG03:
       4533C0               xor      r8d, r8d

G_M55748_IG04:
       FFC0                 inc      eax
       4C8B4908             mov      r9, gword ptr [rcx+8]
       448BD2               mov      r10d, edx
       452B5118             sub      r10d, dword ptr [r9+24]
       453B5110             cmp      r10d, dword ptr [r9+16]
       733C                 jae      SHORT G_M55748_IG07
       458BD8               mov      r11d, r8d
       452B591C             sub      r11d, dword ptr [r9+28]
       453B5914             cmp      r11d, dword ptr [r9+20]
       732F                 jae      SHORT G_M55748_IG07
       418B7114             mov      esi, dword ptr [r9+20]
       490FAFF2             imul     rsi, r10
       4D8BD3               mov      r10, r11
       4C03D6               add      r10, rsi
       4389449120           mov      dword ptr [r9+4*r10+32], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CC1                 jl       SHORT G_M55748_IG04

G_M55748_IG05:
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CB4                 jl       SHORT G_M55748_IG03

G_M55748_IG06:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      

G_M55748_IG07:
       E8E0D1DE5D           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 97, prolog size 5 for method <>c__DisplayClass1_0:<AccessTest>b__0():this
; ============================================================

; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__1():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T08] (  3,  6   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T04] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T05] (  6, 33   )     int  ->  rdx        
;  V03 loc2         [V03,T02] (  6, 84   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T03] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 tmp2         [V06,T00] (  3, 96   )     ref  ->  r11         "arr expr"
;  V07 tmp3         [V07,T01] (  3, 96   )     ref  ->  r11         "arr expr"
;  V08 cse0         [V08,T06] (  2, 20   )     ref  ->   r9         "ValNumCSE"
;  V09 cse1         [V09,T07] (  2, 20   )    long  ->  r10         "ValNumCSE"
;
; Lcl frame size = 32

G_M55749_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32

G_M55749_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx

G_M55749_IG03:
       4533C0               xor      r8d, r8d
       4C8B4910             mov      r9, gword ptr [rcx+16]
       4C63D2               movsxd   r10, edx

G_M55749_IG04:
       FFC0                 inc      eax
       4D8BD9               mov      r11, r9
       413B5308             cmp      edx, dword ptr [r11+8]
       732F                 jae      SHORT G_M55749_IG07
       4F8B5CD310           mov      r11, gword ptr [r11+8*r10+16]
       453B4308             cmp      r8d, dword ptr [r11+8]
       7324                 jae      SHORT G_M55749_IG07
       4963F0               movsxd   rsi, r8d
       418944B310           mov      dword ptr [r11+4*rsi+16], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CD6                 jl       SHORT G_M55749_IG04

G_M55749_IG05:
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CC2                 jl       SHORT G_M55749_IG03

G_M55749_IG06:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      

G_M55749_IG07:
       E85ECCDE5D           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 83, prolog size 5 for method <>c__DisplayClass1_0:<AccessTest>b__1():this
; ============================================================
; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__2():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T11] (  3,  6   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T07] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T09] (  5, 29   )     int  ->  rdx        
;  V03 loc2         [V03,T04] (  5, 68   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T05] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 tmp2         [V06,T00] (  4,128   )   byref  ->  r10         "Inlining Arg"
;  V07 tmp3         [V07,T01] (  3, 96   )   byref  ->  r11         "Inlining Arg"
;  V08 tmp4         [V08,T06] (  2, 64   )     int  ->  rsi         "Inlining Arg"
;  V09 tmp5         [V09,T02] (  3, 96   )     ref  ->  r10         "arr expr"
;  V10 tmp6         [V10,T03] (  3, 96   )     int  ->  rsi         "arr expr"
;  V11 cse0         [V11,T10] (  2, 20   )   byref  ->   r9         "ValNumCSE"
;  V12 cse1         [V12,T08] (  3, 48   )     ref  ->  r10         "ValNumCSE"
;
; Lcl frame size = 32

G_M55753_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32

G_M55753_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx

G_M55753_IG03:
       4533C0               xor      r8d, r8d
       4C8D4918             lea      r9, bword ptr [rcx+24]

G_M55753_IG04:
       FFC0                 inc      eax
       4D8BD1               mov      r10, r9
       453912               cmp      dword ptr [r10], r10d
       4D8D5A08             lea      r11, bword ptr [r10+8]
       8BF2                 mov      esi, edx
       410FAF32             imul     esi, dword ptr [r10]
       4103F0               add      esi, r8d
       4D8B13               mov      r10, gword ptr [r11]
       4D85D2               test     r10, r10
       742E                 je       SHORT G_M55753_IG08

G_M55753_IG05:
       41037308             add      esi, dword ptr [r11+8]
       413B7208             cmp      esi, dword ptr [r10+8]
       732A                 jae      SHORT G_M55753_IG09
       4C63DE               movsxd   r11, esi
       4389449A10           mov      dword ptr [r10+4*r11+16], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CC5                 jl       SHORT G_M55753_IG04

G_M55753_IG06:
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CB4                 jl       SHORT G_M55753_IG03

G_M55753_IG07:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      

G_M55753_IG08:
       E8B897A6FF           call     KGySoft.Throw:IndexOutOfRangeException()
       CC                   int3     

G_M55753_IG09:
       E8DACBDE5D           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 103, prolog size 5 for method <>c__DisplayClass1_0:<AccessTest>b__2():this
; ============================================================
.NET 5.0
; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__0():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T05] (  3, 18   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T03] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T04] (  5, 29   )     int  ->  rdx        
;  V03 loc2         [V03,T01] (  5, 68   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T02] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 rat0         [V06,T00] (  6,192   )     ref  ->   r9         "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 32

G_M34868_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
						;; bbWeight=1    PerfScore 1.25
G_M34868_IG02:              ;; offset=0005H
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
						;; bbWeight=1    PerfScore 0.50
G_M34868_IG03:              ;; offset=0009H
       4533C0               xor      r8d, r8d
						;; bbWeight=4    PerfScore 1.00
G_M34868_IG04:              ;; offset=000CH
       FFC0                 inc      eax
       4C8B4908             mov      r9, gword ptr [rcx+8]
       448BD2               mov      r10d, edx
       452B5118             sub      r10d, dword ptr [r9+24]
       453B5110             cmp      r10d, dword ptr [r9+16]
       733C                 jae      SHORT G_M34868_IG07
       458BD8               mov      r11d, r8d
       452B591C             sub      r11d, dword ptr [r9+28]
       453B5914             cmp      r11d, dword ptr [r9+20]
       732F                 jae      SHORT G_M34868_IG07
       418B7114             mov      esi, dword ptr [r9+20]
       490FAFF2             imul     rsi, r10
       4D8BD3               mov      r10, r11
       4C03D6               add      r10, rsi
       4389449120           mov      dword ptr [r9+4*r10+32], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CC1                 jl       SHORT G_M34868_IG04
						;; bbWeight=16    PerfScore 316.00
G_M34868_IG05:              ;; offset=004BH
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CB4                 jl       SHORT G_M34868_IG03
						;; bbWeight=4    PerfScore 6.00
G_M34868_IG06:              ;; offset=0055H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M34868_IG07:              ;; offset=005BH
       E840F7B05E           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 97, prolog size 5, PerfScore 336.20, instruction count 31, allocated bytes for code 97 (MethodHash=e4a377cb) for method <>c__DisplayClass1_0:<AccessTest>b__0():this
; ============================================================

; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__1():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T08] (  3,  6   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T04] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T05] (  6, 33   )     int  ->  rdx        
;  V03 loc2         [V03,T02] (  6, 84   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T03] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 tmp2         [V06,T00] (  3, 96   )     ref  ->  r11         "arr expr"
;  V07 tmp3         [V07,T01] (  3, 96   )     ref  ->  r11         "arr expr"
;  V08 cse0         [V08,T06] (  2, 20   )     ref  ->   r9         "CSE - aggressive"
;  V09 cse1         [V09,T07] (  2, 20   )    long  ->  r10         "CSE - aggressive"
;
; Lcl frame size = 32

G_M58453_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
						;; bbWeight=1    PerfScore 1.25
G_M58453_IG02:              ;; offset=0005H
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
						;; bbWeight=1    PerfScore 0.50
G_M58453_IG03:              ;; offset=0009H
       4533C0               xor      r8d, r8d
       4C8B4910             mov      r9, gword ptr [rcx+16]
       4C63D2               movsxd   r10, edx
						;; bbWeight=4    PerfScore 10.00
G_M58453_IG04:              ;; offset=0013H
       FFC0                 inc      eax
       4D8BD9               mov      r11, r9
       413B5308             cmp      edx, dword ptr [r11+8]
       732F                 jae      SHORT G_M58453_IG07
       4F8B5CD310           mov      r11, gword ptr [r11+8*r10+16]
       453B4308             cmp      r8d, dword ptr [r11+8]
       7324                 jae      SHORT G_M58453_IG07
       4963F0               movsxd   rsi, r8d
       418944B310           mov      dword ptr [r11+4*rsi+16], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CD6                 jl       SHORT G_M58453_IG04
						;; bbWeight=16    PerfScore 180.00
G_M58453_IG05:              ;; offset=003DH
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CC2                 jl       SHORT G_M58453_IG03
						;; bbWeight=4    PerfScore 6.00
G_M58453_IG06:              ;; offset=0047H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M58453_IG07:              ;; offset=004DH
       E8AEF4B05E           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 83, prolog size 5, PerfScore 207.80, instruction count 27, allocated bytes for code 83 (MethodHash=06451baa) for method <>c__DisplayClass1_0:<AccessTest>b__1():this
; ============================================================

; Assembly listing for method <>c__DisplayClass1_0:<AccessTest>b__2():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T11] (  3,  6   )     ref  ->  rcx         this class-hnd
;  V01 loc0         [V01,T07] (  4, 49   )     int  ->  rax        
;  V02 loc1         [V02,T09] (  5, 29   )     int  ->  rdx        
;  V03 loc2         [V03,T04] (  5, 68   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T05] (  2, 64   )     int  ->  rax         "dup spill"
;  V06 tmp2         [V06,T00] (  3, 96   )   byref  ->  r10         "Inlining Arg"
;  V07 tmp3         [V07,T01] (  3, 96   )   byref  ->  r11         "Inlining Arg"
;  V08 tmp4         [V08,T06] (  2, 64   )     int  ->  rsi         "Inlining Arg"
;  V09 tmp5         [V09,T02] (  3, 96   )     ref  ->  r10         "arr expr"
;  V10 tmp6         [V10,T03] (  3, 96   )     int  ->  rsi         "index expr"
;  V11 cse0         [V11,T10] (  2, 20   )   byref  ->   r9         "CSE - aggressive"
;  V12 cse1         [V12,T08] (  3, 48   )     ref  ->  r10         "CSE - aggressive"
;
; Lcl frame size = 32

G_M45046_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
						;; bbWeight=1    PerfScore 1.25
G_M45046_IG02:              ;; offset=0005H
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
						;; bbWeight=1    PerfScore 0.50
G_M45046_IG03:              ;; offset=0009H
       4533C0               xor      r8d, r8d
       4C8D4918             lea      r9, bword ptr [rcx+24]
						;; bbWeight=4    PerfScore 3.00
G_M45046_IG04:              ;; offset=0010H
       FFC0                 inc      eax
       4D8BD1               mov      r10, r9
       4D8D5A08             lea      r11, bword ptr [r10+8]
       8BF2                 mov      esi, edx
       410FAF32             imul     esi, dword ptr [r10]
       4103F0               add      esi, r8d
       4D8B13               mov      r10, gword ptr [r11]
       4D85D2               test     r10, r10
       742E                 je       SHORT G_M45046_IG08
						;; bbWeight=16    PerfScore 108.00
G_M45046_IG05:              ;; offset=002AH
       41037308             add      esi, dword ptr [r11+8]
       413B7208             cmp      esi, dword ptr [r10+8]
       732A                 jae      SHORT G_M45046_IG09
       4C63DE               movsxd   r11, esi
       4389449A10           mov      dword ptr [r10+4*r11+16], eax
       41FFC0               inc      r8d
       4181F840010000       cmp      r8d, 320
       7CC8                 jl       SHORT G_M45046_IG04
						;; bbWeight=16    PerfScore 124.00
G_M45046_IG06:              ;; offset=0048H
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CB7                 jl       SHORT G_M45046_IG03
						;; bbWeight=4    PerfScore 6.00
G_M45046_IG07:              ;; offset=0052H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
						;; bbWeight=1    PerfScore 1.75
G_M45046_IG08:              ;; offset=0058H
       E81BA3A4FF           call     KGySoft.Throw:IndexOutOfRangeException()
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00
G_M45046_IG09:              ;; offset=005EH
       E81DF4B05E           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 100, prolog size 5, PerfScore 254.50, instruction count 33, allocated bytes for code 100 (MethodHash=0c6a5009) for method <>c__DisplayClass1_0:<AccessTest>b__2():this
; ============================================================

Here are my observations:

  1. There is no code difference between 3.1 and 5.0 for the int[y, x] and int[y][x] benchmarks under AccessTest.

  2. For Array2D<int>[y, x], .NET 5.0 improved by eliminating one extra cmp instruction, thanks to JIT: Don't emit some unnecessary tests #38586.

  3. It is interesting that we do not eliminate range checks (.NET 3.1 / .NET 5) in any of those methods. I first suspected this was because the methods are lambdas, even though we propagate the constant values of width and height. However, when I tested the same loop outside a lambda, it turned out that we do not eliminate the range check for a multi-dimensional array there either. I opened Range check elimination for multi-dimensional array #35056 a while back to track this. Notably, we don't even eliminate the range check for the outer loop.

public static void RangeCheck()
{
    int i = 0;
    const int width = 320;
    const int height = 200;
    var array = new int[height, width];
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            array[y, x] = ++i;
    }
}

And here is the loop code:

G_M23248_IG03:              ;; offset=0039H
       33C9                 xor      ecx, ecx

G_M23248_IG04:              ;; offset=003BH
       FFC6                 inc      esi
       448BC2               mov      r8d, edx
       442B4018             sub      r8d, dword ptr [rax+24]
       443B4010             cmp      r8d, dword ptr [rax+16]
       733A                 jae      SHORT G_M23248_IG07
       448BC9               mov      r9d, ecx
       442B481C             sub      r9d, dword ptr [rax+28]
       443B4814             cmp      r9d, dword ptr [rax+20]
       732D                 jae      SHORT G_M23248_IG07
       448B5014             mov      r10d, dword ptr [rax+20]
       4D0FAFD0             imul     r10, r8
       4D8BC1               mov      r8, r9
       4D03C2               add      r8, r10
       4289748020           mov      dword ptr [rax+4*r8+32], esi
       FFC1                 inc      ecx
       81F940010000         cmp      ecx, 320
       7CC7                 jl       SHORT G_M23248_IG04

G_M23248_IG05:              ;; offset=0074H
       FFC2                 inc      edx
       81FAC8000000         cmp      edx, 200
       7CBB                 jl       SHORT G_M23248_IG03
; ..
; ..
G_M23248_IG07:              ;; offset=0084H
       E84709775F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3
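
For comparison, a pattern the JIT is generally able to handle is a single-dimensional array indexed by a loop variable that is bounded by that same array's Length. The sketch below applies that idea to the jagged case. FillJagged and grid are hypothetical names, not part of the benchmark, and whether the checks are actually removed on a given runtime should be confirmed in the disassembly.

```csharp
using System;

// Hypothetical helper: fills a jagged array row by row.
// Each loop is bounded by the Length of the array it indexes, which is
// the shape the JIT's range check elimination usually recognizes.
static int FillJagged(int[][] jagged)
{
    int i = 0;
    for (int y = 0; y < jagged.Length; y++)
    {
        int[] row = jagged[y];               // load the row once per outer iteration
        for (int x = 0; x < row.Length; x++)
            row[x] = ++i;                    // x < row.Length is provable from the loop bound
    }
    return i;
}

// Same dimensions as the benchmark above (height 200, width 320).
var grid = new int[200][];
for (int r = 0; r < grid.Length; r++)
    grid[r] = new int[320];

Console.WriteLine(FillJagged(grid));         // prints 64000
```

This does not help the int[y, x] case, which is what #35056 tracks.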

cc: @briansull , @AndyAyersMS

That leads to two suspicions: code alignment or memory alignment.

Code alignment

I verified the code alignment part and here are my observations:

  1. In .NET 3.1 and .NET 5, the inner loops of all 3 methods need at least 3 blocks of 32B to fit (because of the range-checking code inside them). In .NET 6 we started aligning methods at a 32B boundary, but that doesn't change anything here: the innermost loop itself is so big that instruction caching, etc. won't help.

  2. I also checked whether my loop alignment changes help any of them. For int[y, x] we do align the loop at a 16B boundary, but I see only a marginal improvement, and it is hard to attribute given other factors.

Before loop alignment:

G_M34868_IG04:              ;; offset=0010H
 00007ffb`9cb6a710        FFC0                 inc      eax
 00007ffb`9cb6a712        4C8B4908             mov      r9, gword ptr [rcx+8]
 00007ffb`9cb6a716        448BD2               mov      r10d, edx
 00007ffb`9cb6a719        452B5118             sub      r10d, dword ptr [r9+24]
 00007ffb`9cb6a71d        453B5110             cmp      r10d, dword ptr [r9+16]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1) 32B boundary ...............................
 00007ffb`9cb6a721        733C                 jae      SHORT G_M34868_IG07
 00007ffb`9cb6a723        458BD8               mov      r11d, r8d
 00007ffb`9cb6a726        452B591C             sub      r11d, dword ptr [r9+28]
 00007ffb`9cb6a72a        453B5914             cmp      r11d, dword ptr [r9+20]
 00007ffb`9cb6a72e        732F                 jae      SHORT G_M34868_IG07
 00007ffb`9cb6a730        418B7114             mov      esi, dword ptr [r9+20]
 00007ffb`9cb6a734        490FAFF2             imul     rsi, r10
 00007ffb`9cb6a738        4D8BD3               mov      r10, r11
 00007ffb`9cb6a73b        4C03D6               add      r10, rsi
 00007ffb`9cb6a73e        4389449120           mov      dword ptr [r9+4*r10+32], eax
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 3) 32B boundary ...............................
 00007ffb`9cb6a743        41FFC0               inc      r8d
 00007ffb`9cb6a746        4181F840010000       cmp      r8d, 320
 00007ffb`9cb6a74d        7CC1                 jl       SHORT G_M34868_IG04

After loop alignment:

		;; Add alignment: 'Padding= 4, AlignmentBoundary= 16B.' in (<>c__DisplayClass1_0:<AccessTest>b__0():this)
 00007ffb`9cb6a70c        0F1F4000             align    
						;; bbWeight=4    PerfScore 2.00
G_M34868_IG04:              ;; offset=0010H
 00007ffb`9cb6a710        FFC0                 inc      eax
 00007ffb`9cb6a712        4C8B4908             mov      r9, gword ptr [rcx+8]
 00007ffb`9cb6a716        448BD2               mov      r10d, edx
 00007ffb`9cb6a719        452B5118             sub      r10d, dword ptr [r9+24]
 00007ffb`9cb6a71d        453B5110             cmp      r10d, dword ptr [r9+16]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1) 32B boundary ...............................
 00007ffb`9cb6a721        733C                 jae      SHORT G_M34868_IG07
 00007ffb`9cb6a723        458BD8               mov      r11d, r8d
 00007ffb`9cb6a726        452B591C             sub      r11d, dword ptr [r9+28]
 00007ffb`9cb6a72a        453B5914             cmp      r11d, dword ptr [r9+20]
 00007ffb`9cb6a72e        732F                 jae      SHORT G_M34868_IG07
 00007ffb`9cb6a730        418B7114             mov      esi, dword ptr [r9+20]
 00007ffb`9cb6a734        490FAFF2             imul     rsi, r10
 00007ffb`9cb6a738        4D8BD3               mov      r10, r11
 00007ffb`9cb6a73b        4C03D6               add      r10, rsi
 00007ffb`9cb6a73e        4389449120           mov      dword ptr [r9+4*r10+32], eax
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 3) 32B boundary ...............................
 00007ffb`9cb6a743        41FFC0               inc      r8d
 00007ffb`9cb6a746        4181F840010000       cmp      r8d, 320
 00007ffb`9cb6a74d        7CC1                 jl       SHORT G_M34868_IG04
						;; bbWeight=16    PerfScore 316.00
G_M34868_IG05:              ;; offset=004FH
 00007ffb`9cb6a74f        FFC2                 inc      edx
 00007ffb`9cb6a751        81FAC8000000         cmp      edx, 200
 00007ffb`9cb6a757        7CB0                 jl       SHORT G_M34868_IG03
  3. For int[y][x], we skipped adding alignment because it needed more padding and didn't meet the threshold against the loop size.
00007ffb`9cb6a9d0        4C63D2               movsxd   r10, edx
		;; Skip alignment: 'PaddingNeeded= 13, MaxPadding= 8, LoopSize= 42, AlignmentBoundary= 16B.' in (<>c__DisplayClass1_0:<AccessTest>b__1():this)
 00007ffb`9cb6a9d3                             align    
						;; bbWeight=4    PerfScore 11.00
G_M58453_IG04:              ;; offset=0013H
 00007ffb`9cb6a9d3        FFC0                 inc      eax
 00007ffb`9cb6a9d5        4D8BD9               mov      r11, r9
 00007ffb`9cb6a9d8        413B5308             cmp      edx, dword ptr [r11+8]
 00007ffb`9cb6a9dc        732F                 jae      SHORT G_M58453_IG07
 00007ffb`9cb6a9de        4F8B5CD310           mov      r11, gword ptr [r11+8*r10+16]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 3) 32B boundary ...............................
 00007ffb`9cb6a9e3        453B4308             cmp      r8d, dword ptr [r11+8]
 00007ffb`9cb6a9e7        7324                 jae      SHORT G_M58453_IG07
 00007ffb`9cb6a9e9        4963F0               movsxd   rsi, r8d
 00007ffb`9cb6a9ec        418944B310           mov      dword ptr [r11+4*rsi+16], eax
 00007ffb`9cb6a9f1        41FFC0               inc      r8d
 00007ffb`9cb6a9f4        4181F840010000       cmp      r8d, 320
 00007ffb`9cb6a9fb        7CD6                 jl       SHORT G_M58453_IG04
						;; bbWeight=16    PerfScore 180.00
  4. Lastly, for Array2D, the loop was already aligned at a 16B boundary, so we didn't try aligning it further.
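
The padding and block counts quoted in the listings can be reproduced from the instruction addresses alone. The following is a minimal sketch of that arithmetic, not the JIT's actual implementation. PaddingNeeded and BlocksSpanned are hypothetical helper names, and the addresses are copied from the listings above.

```csharp
using System;

// Bytes of padding needed to move `address` up to the next multiple of `boundary`
// (0 if it is already aligned).
static int PaddingNeeded(ulong address, int boundary)
    => (int)(((ulong)boundary - address % (ulong)boundary) % (ulong)boundary);

// Number of `blockSize`-byte blocks touched by the half-open byte range [start, end).
static int BlocksSpanned(ulong start, ulong end, int blockSize)
    => (int)((end - 1) / (ulong)blockSize - start / (ulong)blockSize) + 1;

// int[y, x]: the align instruction sits at ...a70c, the loop head lands at ...a710.
Console.WriteLine(PaddingNeeded(0x00007ffb9cb6a70cUL, 16));   // 4, as in "Padding= 4"

// int[y][x]: the loop head is at ...a9d3 and the loop ends at ...a9fd.
Console.WriteLine(PaddingNeeded(0x00007ffb9cb6a9d3UL, 16));   // 13, as in "PaddingNeeded= 13"
Console.WriteLine(0x00007ffb9cb6a9fdUL - 0x00007ffb9cb6a9d3UL); // 42, as in "LoopSize= 42"

// 13 bytes exceeds the reported "MaxPadding= 8", so the alignment is skipped.
Console.WriteLine(PaddingNeeded(0x00007ffb9cb6a9d3UL, 16) > 8); // True

// int[y, x] inner loop occupies [...a710, ...a74f) and touches three 32B blocks.
Console.WriteLine(BlocksSpanned(0x00007ffb9cb6a710UL, 0x00007ffb9cb6a74fUL, 32)); // 3
```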

Memory alignment

Given that no code changes happened between .NET 3.1 and .NET 5, and that code alignment doesn't play much of a role because of the loop size, I think memory alignment could be the reason for the inconsistent behavior, although I am not sure why it would regress for .NET 5 in particular. I don't know whether any GC heuristics have changed that would align memory differently for allocations.

To conclude, if we can eliminate the range checks, we should be able to fit such loops in a single cache line and hence get better performance. We should also investigate whether anything around memory alignment has changed between .NET 3.1 and .NET 5. I also noticed some jmp instructions that cross the 32B boundary and might be affected by the JCC Erratum, but it is hard to measure or say anything definite. We can revisit those factors once we address the other issues.

Hope that helps!

@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Meta labels Nov 17, 2020
@danmoseley

Setting area to codegen since it seems next action is there.

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Nov 17, 2020
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 17, 2020
@kunalspathak

I spent some more time today comparing the performance of .NET 3.1 vs. .NET 6 for AccessTest. As mentioned earlier, there is no asm difference between .NET 3.1 and .NET 6 except for the "int[y, x] = value" case:

image

Here are the fresh numbers on my Windows x64 machine:

.NET6
1. 14945.6867 msecs.
2. 14368.9511 msecs.
3. 14363.1785 msecs.
4. 14378.0881 msecs.
5. 14360.765 msecs.

.NET3.1

1. 14737.616 msecs.
2. 14010.8173 msecs.
3. 13899.7869 msecs.
4. 13911.288 msecs.
5. 13891.437 msecs.

I then went ahead and measured individual array access numbers and you can see them here:

Individual array access benchmark numbers
Just "int[y, x] = value"

.NET6
1. 6413.1212 msecs.
2. 5869.7881 msecs.
3. 5843.3927 msecs.
4. 5863.7196 msecs.
5. 5933.0814 msecs.

w/o loopalign
1. 5734.7591 msecs.
2. 5195.5625 msecs.
3. 5165.2113 msecs.
4. 5171.7055 msecs.
5. 5163.7163 msecs.

.NET3.1

1. 5963.9592 msecs.
2. 5201.961 msecs.
3. 5188.9293 msecs.
4. 5183.6542 msecs.
5. 5204.5842 msecs.
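
To put the "int[y, x] = value" runs above on one scale, here is a small sketch that compares medians, which keeps the slower first (warm-up) run of each series from skewing the result. Median and the variable names are hypothetical; the timings are copied from the runs above.

```csharp
using System;
using System.Linq;

// Median of a series of timings.
static double Median(double[] xs)
{
    double[] sorted = xs.OrderBy(v => v).ToArray();
    int mid = sorted.Length / 2;
    return sorted.Length % 2 == 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2.0;
}

// Timings in msecs for "int[y, x] = value".
double[] net6        = { 6413.1212, 5869.7881, 5843.3927, 5863.7196, 5933.0814 };
double[] net6NoAlign = { 5734.7591, 5195.5625, 5165.2113, 5171.7055, 5163.7163 };
double[] net31       = { 5963.9592, 5201.961, 5188.9293, 5183.6542, 5204.5842 };

// Percent difference of each .NET 6 configuration relative to .NET 3.1.
double withAlign    = (Median(net6) / Median(net31) - 1.0) * 100.0;
double withoutAlign = (Median(net6NoAlign) / Median(net31) - 1.0) * 100.0;

Console.WriteLine($".NET 6 vs .NET 3.1: {withAlign:F1}%");
Console.WriteLine($".NET 6 w/o loopalign vs .NET 3.1: {withoutAlign:F1}%");
```

On these numbers .NET 6 with loop alignment comes out roughly 13% slower than .NET 3.1, while the run without loop alignment is within about 1% of it.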

Just "int[y][x] = value"

.NET6

1. 4350.1877 msecs.
2. 3786.0665 msecs.
3. 3780.6481 msecs.
4. 3780.0093 msecs.
5. 3791.7436 msecs.

.NET3.1

1. 4522.0884 msecs.
2. 3799.1696 msecs.
3. 3769.9459 msecs.
4. 3769.3436 msecs.
5. 3769.9706 msecs.

Just "Array2D<int>[y, x] = value"

.NET6

1. 5314.0604 msecs.
2. 4768.0922 msecs.
3. 4739.6753 msecs.
4. 4736.2614 msecs.
5. 4741.582 msecs.

.NET3.1

1. 5642.1112 msecs.
2. 4967.6614 msecs.
3. 4937.4925 msecs.
4. 4933.6635 msecs.
5. 4935.228 msecs.

As you can see, the only benchmark that is slower is "int[y, x] = value", and the reason is the loop alignment padding combined with the way the benchmark is run. The 4-byte padding sits just before the innermost loop, so it gets executed Iterations * height times, which amplifies the slowness. If I make the following change to the benchmark, the slowness disappears:

image

.NET6

1. 11140.8957 msecs.
2. 10542.3173 msecs.
3. 10530.8079 msecs.
4. 10535.2744 msecs.
5. 10560.2627 msecs.

.NET3.1

1. 11512.2896 msecs.
2. 10840.3542 msecs.
3. 10741.4343 msecs.
4. 10747.7232 msecs.
5. 10768.5058 msecs.

We already have issue #43227, which captures the work item to place padding at an appropriate location where it would not affect performance adversely. At this point, I don't see any other actionable items for this issue, so I will go ahead and close it. Feel free to comment / reopen if you have any other questions.

Thanks for reporting!

@ghost ghost locked as resolved and limited conversation to collaborators May 28, 2021