
[ARM64] Performance regression: Sorting arrays of primitive types #41741

Closed
adamsitnik opened this issue Sep 2, 2020 · 28 comments

Labels: arch-arm64, area-CodeGen-coreclr, needs-further-triage, tenet-performance


adamsitnik commented Sep 2, 2020

After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to sorting.

It can be related to the fact that we have moved the sorting implementation from native to managed code (cc @stephentoub).

@DrewScoggins is there any way to see the full historical data for ARM64?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Collections.Sort<Int32>.Array' 'System.Collections.Sort<Int32>.List'

System.Collections.Sort<Int32>.Array(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|--------|------|------|-------|-------------|----------|------------------|-----|----------------|--------|--------|
| Same | 4500.17 | 4250.27 | 1.06 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4129.79 | 3573.10 | 1.16 | +0 | several? | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 4197.36 | 3672.08 | 1.14 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5247.31 | 4572.02 | 1.15 | +0 | multimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 6560.47 | 6483.01 | 1.01 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3315.74 | 3404.51 | 0.97 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4821.34 | 8602.70 | 0.56 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6214.65 | 3992.76 | 1.56 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6441.58 | 4187.34 | 1.54 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5831.01 | 5518.00 | 1.06 | +0 | bimodal | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5602.64 | 5486.66 | 1.02 | +0 | bimodal | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6178.62 | 5918.08 | 1.04 | +0 | bimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16288.81 | 27067.03 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16150.43 | 27030.75 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16139.02 | 26145.59 | 0.62 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 14011.07 | 17551.48 | 0.80 | +0 | | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7375.30 | 4817.14 | 1.53 | +0 | several? | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7244.05 | 7548.09 | 0.96 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7932.56 | 5120.47 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7060.74 | 4554.61 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7145.76 | 4761.73 | 1.50 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |

System.Collections.Sort<Int32>.List(Size: 512)

| Result | Base | Diff | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base V | Diff V |
|--------|------|------|-------|-------------|----------|------------------|-----|----------------|--------|--------|
| Faster | 4520.64 | 4071.51 | 1.11 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4332.89 | 3567.66 | 1.21 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6967.61 | 3518.82 | 1.98 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5015.13 | 4448.40 | 1.13 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7293.88 | 7863.56 | 0.93 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3339.08 | 3267.09 | 1.02 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4806.28 | 8450.61 | 0.57 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6222.44 | 3938.96 | 1.58 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6146.14 | 4254.57 | 1.44 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5834.00 | 5493.04 | 1.06 | +0 | | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5571.07 | 6275.42 | 0.89 | +0 | several? | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6217.53 | 6206.80 | 1.00 | +0 | multimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16281.08 | 27217.40 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16329.85 | 27111.99 | 0.60 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16395.71 | 26771.27 | 0.61 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 7791.41 | 10397.92 | 0.75 | +0 | several? | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7371.80 | 4824.34 | 1.53 | +0 | | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7002.96 | 7065.54 | 0.99 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7976.12 | 5098.72 | 1.56 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7114.58 | 4576.65 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7157.62 | 4735.35 | 1.51 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |

/cc @JulieLeeMSFT

category:cq
theme:needs-triage
skill-level:expert
cost:large

adamsitnik added this to the 5.0.0 milestone Sep 2, 2020
Dotnet-GitSync-Bot added the area-CodeGen-coreclr and untriaged labels Sep 2, 2020
adamsitnik changed the title from "[ARM] Performance regression: Sorting arrays of primitive types" to "[ARM64] Performance regression: Sorting arrays of primitive types" Sep 2, 2020

AndyAyersMS commented Sep 2, 2020

What's the difference between the two tables? Is it just two different sets of runs?

Never mind -- List vs Array.

JulieLeeMSFT removed the untriaged label Sep 2, 2020
@AndyAyersMS

I'll investigate.

AndyAyersMS self-assigned this Sep 2, 2020
@AndyAyersMS

I can repro this on an Rpi4:

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Unknown processor
.NET Core SDK=5.0.100-rc.1.20422.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT
  Job-HPEYTJ : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), Arm64 RyuJIT
  Job-LNVBZK : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  InvocationCount=5000  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
UnrollFactor=1  WarmupCount=1  

| Method |        Job |       Runtime |     Toolchain | Size |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |----------- |-------------- |-------------- |----- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|------:|------:|------:|----------:|
|  Array | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 26.97 us | 0.352 us | 0.329 us | 26.86 us | 26.43 us | 27.56 us |  1.00 |    0.00 |     - |     - |     - |         - |
|  Array | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 35.61 us | 0.196 us | 0.184 us | 35.62 us | 35.29 us | 35.90 us |  1.32 |    0.02 |     - |     - |     - |         - |
|        |            |               |               |      |          |          |          |          |          |          |       |         |       |       |       |           |
|   List | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 28.54 us | 1.530 us | 1.762 us | 27.54 us | 27.12 us | 31.89 us |  1.00 |    0.00 |     - |     - |     - |         - |
|   List | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 36.15 us | 0.417 us | 0.390 us | 36.20 us | 34.91 us | 36.51 us |  1.25 |    0.08 |     - |     - |     - |       1 B |

@AndyAyersMS

Here's a bit more data for the array case (from an extracted test, via perf, doing the same amount of work for each runtime). As you can see, the cycles moved from native methods to jitted methods, so comparative analysis will be a bit trickier than usual.
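
(For context, an extracted test of roughly the following shape is enough for this kind of profiling; the class name, iteration count, seed, and output format below are illustrative assumptions, not the exact test used here.)

// Standalone sketch: repeatedly Array.Sort a 512-element int[],
// restoring the unsorted input before each sort so that every
// iteration does the same amount of work.
using System;
using System.Diagnostics;

class R41741
{
    static void Main()
    {
        const int Size = 512;
        const int Iterations = 200_000;   // arbitrary; long enough to profile

        var rng = new Random(42);
        int[] template = new int[Size];
        for (int i = 0; i < Size; i++) template[i] = rng.Next();

        int[] work = new int[Size];
        var sw = Stopwatch.StartNew();
        for (int iter = 0; iter < Iterations; iter++)
        {
            Array.Copy(template, work, Size);   // restore unsorted data
            Array.Sort(work);                   // operation under test
        }
        sw.Stop();
        Console.WriteLine($"Took {sw.ElapsedMilliseconds}ms");
    }
}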

;; 3.1

Samples: 8K of event 'cycles', Event count (approx.): 3065502893
Overhead  Command  Shared Object               Symbol
  45.12%  dotnet   libcoreclr.so               [.] ArrayHelpers<int>::PickPivotAndPartition
  19.97%  dotnet   libcoreclr.so               [.] ArrayHelpers<int>::IntroSort
   4.41%  dotnet   libcoreclr.so               [.] ArrayNative::GetLowerBound
   2.49%  dotnet   [unknown]                   [k] 0xffffc6a925388e04
   1.61%  dotnet   [unknown]                   [k] 0xffffc6a924af2194

;; 5.0

Samples: 11K of event 'cycles', Event count (approx.): 4235451510
Overhead  Command          Shared Object               Symbol
  41.31%  dotnet           [JIT] tid 196901            [.] int32 [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]::PickPivotAndPartition(valuetype System.Span`1<!0>)[OptimizedTier1]
  22.38%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]::InsertionSort(valuetype System.Span`1<!0>)[OptimizedTier1]
   3.30%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Buffer::Memmove(uint8&,uint8&,native uint)[OptimizedTier1]
   2.47%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]:
   2.24%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Array::Copy(class System.Array,class System.Array,int32)[Op
   1.66%  dotnet           [unknown]                   [k] 0xffffc6a925388e04
   1.08%  dotnet           [unknown]                   [k] 0xffffc6a924af2194

The total cycle difference per perf is about 1.17b, or around 30%; this compares well with the BDN ratios above.

It seems likely the top two methods in each case perform similar functions, so we'll compare them.

| Method | Version | Cycles |
|--------|---------|--------|
| PickPivotAndPartition | 3.1 | 1.38 b |
| PickPivotAndPartition | 5.0 | 1.75 b |
| IntroSort | 3.1 | 0.61 b |
| IntroSort | 5.0 | 0.95 b |

Both methods are slower in 5.0; the slowdown in IntroSort is relatively more pronounced; collectively they account for about 0.71b or around 2/3 of the perf difference we observe. The remaining 0.46b cycles diff is harder to pin down.

Let's focus first on IntroSort... (details forthcoming).


AndyAyersMS commented Sep 2, 2020

[edit: pasted in the right version of the C# InsertionSort; see #35297 for how we ended up with this]

Here's the codegen for the innermost loop of InsertionSort, which is the hottest part of the respective methods (for native codegen it gets inlined into IntroSort).

;; InsertionSort, innermost loop

;; 3.1 (via perf; Clang codegen)

;; 0x1e4
        
 17.05  40cd90:   ldr   w13, [x19, w12, sxtw #2]
        40cd94:   cmp   w11, w13    
        40cd98: ? b.ge  40cdb0 <ArrayHelpers<int>::IntroSort(int*, int*, int, int, int)+0x204>  // b.tcont

 26.26  40cd9c:   add   w14, w12, #0x1
        40cda0:   sub   w12, w12, #0x1
        40cda4:   cmp   w12, w21    
  3.47  40cda8:   str   w13, [x19, w14, sxtw #2]
        40cdac: ? b.ge  40cd90 <ArrayHelpers<int>::IntroSort(int*, int*, int, int, int)+0x1e4>  // b.tcont

;; 5.0 (via jit disasm)

						;; bbWeight=4    PerfScore 38.00
G_M6103_IG04:
        910043A3          add     x3, fp, #16	// [V00 arg0]    ;; hoistable
        F9400063          ldr     x3, [x3]                       ;; hoistable
        11000405          add     w5, w0, #1                     ;; address mode
        93407CA5          sxtw    x5, w5                         ;;  + iv widen
        D37EF4A5          lsl     x5, x5, #2                     ;; ...
        8B0300A3          add     x3, x5, x3                     ;; ...
        910043A5          add     x5, fp, #16	// [V00 arg0]    ;; redundant
        F94000A5          ldr     x5, [x5]                       ;; redundant
        B86468A4          ldr     w4, [x5, x4]                   ;; redundant around backedge           
        B9000064          str     w4, [x3]
        51000400          sub     w0, w0, #1
						;; bbWeight=8    PerfScore 112.00
G_M6103_IG05:
        7100001F          cmp     w0, #0
        5400010B          blt     G_M6103_IG07
						;; bbWeight=32    PerfScore 48.00
G_M6103_IG06:
        910043A3          add     x3, fp, #16	// [V00 arg0]    ;; redundant
        F9400063          ldr     x3, [x3]                       ;; redundant
        93407C04          sxtw    x4, w0                         ;; address mode
        D37EF484          lsl     x4, x4, #2                     ;;  + iv widen
        B8646863          ldr     w3, [x3, x4]
        6B01007F          cmp     w3, w1
        54FFFDAC          bgt     G_M6103_IG04

We see the jitted version emits quite a few more instructions, and a number of things stand out. For reference, here's the C# source:

private static void InsertionSort(Span<T> keys)
{
    for (int i = 0; i < keys.Length - 1; i++)
    {
        T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);

        int j = i;
        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
        {
            Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
            j--;
        }

        Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t;
    }
}

and here's the native source (simplified since items is known to be NULL):

static void InsertionSort(KIND keys[], KIND items[], int lo, int hi)
{
    int i, j;
    KIND t, ti = NULL;
    for (i = lo; i < hi; i++)
    {
        j = i;
        t = keys[i + 1];
        // if(items != NULL)
        //    ti = items[i + 1];
        while (j >= lo && t < keys[j])
        {
            keys[j + 1] = keys[j];
            // if(items != NULL)
            //    items[j + 1] = items[j];
            j--;
        }
        keys[j + 1] = t;
        // if(items != NULL)
        //    items[j + 1] = ti;
    }
}

Analysis of jitted code

  • the jit fails to hoist and common a number of redundant loads. Each of the keys[...] expressions re-fetches the _pointer field from the span. This field is loop invariant, but the jit can't prove it. I haven't fully root-caused this, but it could be the simplistic aliasing model: the span's data vector is written in the loop, and the jit may think this write can interfere with the span's fields. Native code has one less level of indirection here, so no alias analysis is needed to see that the array base address is invariant. Possibly related to Span-perf for copy via loop much slower than array-version #9977.

  • the jit does not IV widen (hence the sxtws). See RyuJIT: Index Variable Widening optimization for array accesses #7312.

  • the jit does a poor job taking advantage of ARM64 address modes; these are well-suited for array computations. Similar issues noted in Code inefficiencies in loop array indexing #35618.

  • the jit does not recognize and optimize a "backedge CSE"; note that iteration i writes keys[i+1], and this forms a CSE with iteration i+1's read of keys[i]. This is a bit more esoteric and I don't think we have an issue open for it.

  • (likely minor) the jit does not schedule, so the compares are emitted just before the branches

As a result the jitted code does 3 loads in the upper block, compared to 1 for native code, and 1 load in the lower block, compared to none for native code.

So for the most part we see known issues and unfortunately none of this is actionable for 5.0. We are considering enhancing the loop optimizer for .NET 6.0, and this loop might make a good study candidate.

cc @dotnet/jit-contrib

Next I'll look at PickPivotAndPartition ...

@AndyAyersMS

Here's the codegen for the inner loops of PickPivotAndPartition:

;; 3.1 (via perf; Clang)

;; 0x1a0

  2.20  40d118:   cmp   w11, w12    
        40d11c: ? b.ge  40d16c <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1f4>  // b.tcont
  3.58  40d120:   sbfiz x13, x11, #2, #32
        40d124:   sbfiz x14, x12, #2, #32
  4.48  40d128:   ldr   w15, [x0, x14]
  1.67  40d12c:   ldr   w16, [x0, x13]
  0.62  40d130:   str   w15, [x0, x13]
  1.76  40d134:   str   w16, [x0, x14]

;; 0x1c0

 12.01  40d138:   cmp   w11, w9     
        40d13c: ? b.ge  40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8>  // b.tcont
  3.77  40d140:   add   w11, w11, #0x1
  7.81  40d144:   ldr   w13, [x0, w11, sxtw #2]
        40d148:   cmp   w13, w10    
        40d14c: ? b.lt  40d138 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1c0>  // b.tstop

;; 0x1d8

 23.09  40d150:   cmp   w12, w2     
        40d154: ? b.le  40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>
  6.31  40d158:   sub   w12, w12, #0x1
 14.98  40d15c:   ldr   w13, [x0, w12, sxtw #2]
        40d160:   cmp   w10, w13    
        40d164: ? b.lt  40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8>  // b.tstop

 10.99  40d168: ? b     40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>

;; 5.0

G_M55484_IG09:
        EB01005F          cmp     x2, x1
        540000A2          bhs     G_M55484_IG10
        91001042          add     x2, x2, #4
        B9400045          ldr     w5, [x2]
        6B05009F          cmp     w4, w5
        54FFFF6C          bgt     G_M55484_IG09
						;; bbWeight=0.50 PerfScore 3.25
G_M55484_IG10:
        EB00007F          cmp     x3, x0
        540000A9          bls     G_M55484_IG11
        D1001063          sub     x3, x3, #4
        B9400065          ldr     w5, [x3]
        6B05009F          cmp     w4, w5
        54FFFF6B          blt     G_M55484_IG10
						;; bbWeight=4    PerfScore 26.00
G_M55484_IG11:
        EB03005F          cmp     x2, x3
        540000E2          bhs     G_M55484_IG13
						;; bbWeight=4    PerfScore 6.00
G_M55484_IG12:
        B9400045          ldr     w5, [x2]
        B9400066          ldr     w6, [x3]
        B9000046          str     w6, [x2]
        B9000065          str     w5, [x3]
        EB03005F          cmp     x2, x3
        54FFFDA3          blo     G_M55484_IG09

Here the codegen is pretty comparable. It's not obvious why the jitted version appears to be slower. It could be an artifact of running perf over the entire process rather than isolating the part where the process is executing the Tier1 codegen. Repeating the perf measurements with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0 seems to bear this out.

@AndyAyersMS

So the likely suspect is codegen in InsertionSort. There's still a fair chunk of cycles unaccounted for beyond the delta for this method -- could also be that my repro sample was too short and I'm seeing impacts of time in the jit, etc.

I don't see anything we can do here for 5.0 in the jit. We might be able to rewrite the C# for InsertionSort to apply some of the above loop optimizations manually, but since this seemingly doesn't impact xArch as badly, perhaps we can hold off on that.
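
For illustration, a manual rewrite along those lines could hoist the invariant base reference once per call. This is only a sketch (it substitutes CompareTo for the internal LessThan helper and assumes T : IComparable<T>); it is not the change that was ultimately tried, and the workaround explored later copies the span to a local instead.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class ManualHoistSketch
{
    // Sketch: fetch MemoryMarshal.GetReference(keys) once, so the span's
    // pointer field is not re-read on every element access inside the loops.
    public static void InsertionSort<T>(Span<T> keys) where T : IComparable<T>
    {
        ref T b = ref MemoryMarshal.GetReference(keys);   // invariant base ref

        for (int i = 0; i < keys.Length - 1; i++)
        {
            T t = Unsafe.Add(ref b, i + 1);

            int j = i;
            while (j >= 0 && (t == null || t.CompareTo(Unsafe.Add(ref b, j)) < 0))
            {
                Unsafe.Add(ref b, j + 1) = Unsafe.Add(ref b, j);
                j--;
            }

            Unsafe.Add(ref b, j + 1) = t;
        }
    }
}

Whether hoisting alone recovers the regression on ARM64 would still need to be measured; it mainly addresses the redundant re-loads called out above.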

@AndyAyersMS

Thinking about this a bit more, it seems odd that we can't promote the span argument to InsertionSort; doing so would also solve the aliasing problem. Let's see why....

*************** In fgPromoteStructs()
;  V00 arg0           struct <System.Span`1[Int32], 16> ld-addr-op
Not promoting promotable struct local V00, because lvIsParam is true and #fields = 2.

This is a jit limitation. For the ARM64 ABI the struct is passed by value, and the first thing we end up doing is spilling it to the stack:

IN002e: 000008                    str     x0, [fp,#16]	// [V00 arg0]
IN002f: 00000C                    str     x1, [fp,#24]	// [V00 arg0+0x08]

So this likely explains why we didn't see this regression on xArch; for those machines this structure is passed by reference and gets promoted, and so the code becomes more optimizable.

A potential fix / workaround is to manually promote the span. Let me try this.

@CarolEidt

> This is a jit limitation. For the ARM64 ABI the struct is passed by value, and the first thing we end up doing is spilling it to the stack

This was something I was hoping to fix for 5.0 (#39326) but was unable to complete in time.

@DrewScoggins

@adamsitnik here is a link to an index page with links to every test's full history.

@AndyAyersMS

Looking at the history for the array case, the lone point in Nov is the 3.1 baseline; we regressed and then recovered most of it. I can't readily pin down what caused the recovery, as the historical data uses SDK build versions.

Current runs show fairly extreme bimodality. There may be some aspect of the ARM64 microarchitecture that we need to understand better.

[chart: Sort<Int32>.Array benchmark history]

The List case does not appear to have a 3.1-equivalent data point, but shows similar bimodality.

[chart: Sort<Int32>.List benchmark history]

@AndyAyersMS

Possible workaround:

@@ -541,18 +543,20 @@ private static void DownHeap(Span<T> keys, int i, int n, int lo)
 
         private static void InsertionSort(Span<T> keys)
         {
-            for (int i = 0; i < keys.Length - 1; i++)
+            // Workaround for 41741. Copy span to local to allow promotion on ARM64
+            Span<T> _keys = keys;
+            for (int i = 0; i < _keys.Length - 1; i++)
             {
-                T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);
+                T t = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), i + 1);
 
                 int j = i;
-                while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
+                while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j))))
                 {
-                    Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
+                    Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j);
                     j--;
                 }
 
-                Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t!;
+                Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = t!;
             }
         }

Inner loop codegen for ARM64 becomes:

G_M6103_IG04:
        11000446          add     w6, w2, #1
        93407CC6          sxtw    x6, w6
        D37EF4C6          lsl     x6, x6, #2
        8B0100C6          add     x6, x6, x1
        B90000C5          str     w5, [x6]
        51000442          sub     w2, w2, #1
                                                ;; bbWeight=8    PerfScore 32.00
G_M6103_IG05:
        7100005F          cmp     w2, #0
        540000CB          blt     G_M6103_IG07
                                                ;; bbWeight=32    PerfScore 48.00
G_M6103_IG06:
        93407C45          sxtw    x5, w2
        D37EF4A5          lsl     x5, x5, #2
        B8656825          ldr     w5, [x1, x5]
        6B0400BF          cmp     w5, w4
        54FFFE8C          bgt     G_M6103_IG04

Which, modulo IV widening and address-mode formation, is not too far off from native.

On a local standalone test perf is now:

| Build | Time (ms) |
|-------|-----------|
| 3.1 | 2662 |
| 5.0 RC1 | 3930 |
| Fix | 3438 |

The RC1 runs are not very stable, though, and sometimes drop down to around 3400 ms, so there are still things here that are not well understood.

@AndyAyersMS

BDN reports that the workaround above gets back about half the perf, which dovetails pretty well with my standalone data.

Next steps:

  • profile this version vs 3.1 to see what it says -- maybe we can pin down where the other half of the perf goes
  • look at impact of above change on xArch (hopefully none)
| Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|-----|---------|-----------|------|------|-------|--------|--------|-----|-----|-------|-------|-------|-------|-----------|
| Array | Job-HKFDOX | .NET Core 3.1 | netcoreapp3.1 | 512 | 26.72 us | 0.262 us | 0.205 us | 26.75 us | 26.29 us | 27.04 us | 0.69 | - | - | - | - |
| Array | Job-WBTPMT | .NET Core 5.0 | netcoreapp5.0 | 512 | 38.91 us | 0.470 us | 0.440 us | 39.11 us | 37.74 us | 39.25 us | 1.00 | - | - | - | - |
| Array | Corerun | custom | net.50 | 512 | 33.03 us | 0.120 us | 0.100 us | 33.01 us | 32.92 us | 33.21 us | 0.85 | - | - | - | - |

@AndyAyersMS

perf stat sees this as a pathlength issue; IPC is pretty even (note this is the entire standalone benchmark, so it includes jitting, etc.)

andy@ubuntu:~/bugs/r41741$ perf stat dotnet exec /home/andy/bugs/r41741/bin/Release/net5.0/r41741.dll
Took 3906ms

 Performance counter stats for 'dotnet exec /home/andy/bugs/r41741/bin/Release/net5.0/r41741.dll':

           5220.00 msec task-clock                #    1.014 CPUs utilized          
               493      context-switches          #    0.094 K/sec                  
                 3      cpu-migrations            #    0.001 K/sec                  
             54074      page-faults               #    0.010 M/sec                  
        7784582680      cycles                    #    1.491 GHz                    
        7419443217      instructions              #    0.95  insn per cycle         
   <not supported>      branches                                                    
         164150759      branch-misses                                               

       5.146804442 seconds time elapsed

       4.846692000 seconds user
       0.373440000 seconds sys


andy@ubuntu:~/bugs/r41741$ perf stat dotnet exec /home/andy/bugs/r41741/bin/Release/netcoreapp3.1/r41741.dll
Took 2869ms

 Performance counter stats for 'dotnet exec /home/andy/bugs/r41741/bin/Release/netcoreapp3.1/r41741.dll':

           4061.60 msec task-clock                #    1.004 CPUs utilized          
               402      context-switches          #    0.099 K/sec                  
                 8      cpu-migrations            #    0.002 K/sec                  
             53856      page-faults               #    0.013 M/sec                  
        6050121690      cycles                    #    1.490 GHz                    
        5198855530      instructions              #    0.86  insn per cycle         
   <not supported>      branches                                                    
         118657198      branch-misses                                               

       4.047158027 seconds time elapsed

       3.762541000 seconds user
       0.296200000 seconds sys

andy@ubuntu:~/bugs/r41741$ ./repro.sh
Took 3469ms
 Performance counter stats for 'corerun /home/andy/bugs/r41741/r41741.dll':

           4926.49 msec task-clock                #    0.970 CPUs utilized          
              1146      context-switches          #    0.233 K/sec                  
                10      cpu-migrations            #    0.002 K/sec                  
             54949      page-faults               #    0.011 M/sec                  
        7307046870      cycles                    #    1.483 GHz                    
        6407127469      instructions              #    0.88  insn per cycle         
   <not supported>      branches                                                    
         147762325      branch-misses                                               

       5.076936778 seconds time elapsed

       4.498607000 seconds user
       0.433794000 seconds sys

So you would think differential profiling should be able to pin down the hotspots.

Branch misses are also up but don't seem to impact IPC. Not sure what to make of this.

@AndyAyersMS

I have been looking at the impact of method and loop alignment. Here's a chart showing what happens as we vary the mod-32 alignment of the top of the Tier1 InsertionSort inner loop from 0 to 28, on a couple of different processors, for a standalone test (so profiling includes startup and tier0; data gathered via perf stat -r 5; data relative to 0 mod 32). The green line is time, yellow is cycles, the grey dotted line is instructions executed, and blue is branch mispredicts.

[charts: perf vs. inner-loop alignment on two processors]

The slight upslope in instructions (grey line) is the impact of the extra nop padding (here placed just after the prolog); as we pad more we execute more instructions overall, but the total instruction count stays pretty well constant.

The cycles vary more widely, and the fluctuation in cycles is moderately well correlated with branch misses; the impact of bad alignment varies. Note this isn't a completely pure experiment, as modifying the alignment of one method may well impact the alignment of others. But we can see that the fluctuations here are in the 3-8% range.

@AndyAyersMS

For the remainder of the perf loss, I suspect time may be going into one of the unmapped stubs?

Per dotnet/coreclr#26897, I am going to build perf so I can look at annotations on jitted code.

@AndyAyersMS

Haven't had any luck yet with a locally built version of perf.

@AndyAyersMS

Marking as blocking release, at least until I can get a clearer picture of where the perf is going in 5.0. Currently having trouble getting accurate profiles.

Using perf inject --jit to pull in jitdump data should help, but this is not yet working reliably for me.

@AndyAyersMS

Took another look now that perf annotate can annotate jitted code (needed to build perf from recent Linux sources). It seems like on the Qualcomm machine all of the lost perf is in InsertionSort; here are the relevant annotations (cycles):

;; 3.1 "insertion sort" inner loop (native)

   94328597 ¦40cc90:   ldr   w13, [x19,w12,sxtw #2]
   49260763 ¦40cc94:   cmp   w11, w13
   57762715 ¦40cc98: ? b.ge  40ccb0
   77756830 ¦40cc9c:   add   w14, w12, #0x1
   47159165 ¦40cca0:   sub   w12, w12, #0x1
   44848974 ¦40cca4:   cmp   w12, w21
   25660062 ¦40cca8:   str   w13, [x19,w14,sxtw #2]
   17110607 ¦40ccac: ? b.ge  40cc90
  442575592 ¦40ccb0:   add   w12, w12, #0x1    // branch latency skid?

;; 5.0 "insertion sort" inner loop (jitted)

  497165872  80:   add  x3, x29, #0x10
   32905197  84:   ldr  x3, [x3]      
             88:   add  w5, w0, #0x1  
   23609205  8c:   sxtw x5, w5        
   67146280  90:   lsl  x5, x5, #2    
   67408728  94:   add  x3, x5, x3    
             98:   add  x5, x29, #0x10
   30088023  9c:   ldr  x5, [x5]      
  133792275  a0:   ldr  w4, [x5,x4]   
             a4:   str  w4, [x3]      
   67686545  a8:   sub  w0, w0, #0x1  
   51829740  ac:   cmp  w0, #0x0      
   63269723  b0: ? b.lt d0
   25938857  b4:   add  x3, x29, #0x10
  116122893  b8:   ldr  x3, [x3]      
   58047416  bc:   sxtw x4, w0        
    4154869  c0:   lsl  x4, x4, #2    
   35269291  c4:   ldr  w3, [x3,x4]   
  162843745  c8:   cmp  w3, w1        
   11409506  cc: ? b.gt 80
 1566344914  d0:   add  x3, x29, #0x10    // branch latency skid?

Direct comparison is a bit tricky as the loops are rotated differently: each is basically two parts, and the top of the native version matches the bottom of the jitted version, and vice versa.

Short of the source hack above to copy the span, there is not much we can do for 5.0. Going to move this to 6.0.

AndyAyersMS modified the milestones: 5.0.0, 6.0.0 Sep 23, 2020
@JulieLeeMSFT

Does this happen mainly on Linux?
Windows results show Faster.

@AndyAyersMS

Linux arm64. Probably Windows arm64 too, but we don't track that.

BruceForstall added the JitUntriaged label Oct 28, 2020
BruceForstall removed the JitUntriaged label Nov 10, 2020
JulieLeeMSFT added the needs-further-triage label Mar 23, 2021
@AndyAyersMS

It looks like we regained the missing perf with #43870.

6.0 data now shows about 16 us
[chart: Sort<Int32>.Array benchmark history, 6.0]

which is roughly where we were in the 3.1 timeframe.

The List case shows similar improvements

[chart: Sort<Int32>.List benchmark history, 6.0]

Going to close this one as fixed.

@danmoseley

Those graphs also appear to show two distinct improvements in noise reduction - presumably the loop alignment work? If so, that's nice to see.

@AndyAyersMS

This is ARM64 -- I don't think we do anything special for loops on this platform yet. Not really sure why the noise level dropped here.

@danmoseley

I see. @adamsitnik could it be BDN or test changes to randomize data alignment perhaps?
@kunalspathak


kunalspathak commented Mar 30, 2021

We have the "randomize data alignment" ability but haven't enabled it for any benchmark yet. Also as Andy pointed out, loop alignment techniques are only done for xarch.

@danmoseley

Interesting, unexplained then. Thanks.

@kunalspathak

Once I have Arm64 historical data, I will try to dig and narrow down changes.

ghost locked as resolved and limited conversation to collaborators Apr 29, 2021