Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List Add is way slower (almost 3 times) in net9.0 preview 3 than with net8.0 #101437

Closed
linkdotnet opened this issue Apr 23, 2024 · 50 comments · Fixed by #105695
Closed

List Add is way slower (almost 3 times) in net9.0 preview 3 than with net8.0 #101437

linkdotnet opened this issue Apr 23, 2024 · 50 comments · Fixed by #105695
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Milestone

Comments

@linkdotnet
Copy link

With the latest preview 3 (net9) there seems to be a major performance regression on MacOS 14.4 (M2 Pro) with lists. Given that they are one of the most used types, it should receive special treatment.

Benchmark

[Benchmark]
public List<int> ListAdd10000()
{
    var list = new List<int>();
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

[Benchmark]
public List<int> ListAdd10000PreAlloc()
{
    var list = new List<int>(10_000);
    for (var i = 0; i < 10_000; i++)
    {
        list.Add(i);
    }

    return list;
}

Results:

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Pro, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD


| Method               | Runtime  | Mean      | Error     | StdDev    | Ratio | 
|--------------------- |--------- |----------:|----------:|----------:|------:|-
| ListAdd10000         | .NET 8.0 |  9.739 us | 0.0236 us | 0.0209 us |  1.00 | 
| ListAdd10000         | .NET 9.0 | 25.646 us | 0.1531 us | 0.1432 us |  2.63 | 
|                      |          |           |           |           |       | 
| ListAdd10000PreAlloc | .NET 8.0 |  6.701 us | 0.0175 us | 0.0146 us |  1.00 | 
| ListAdd10000PreAlloc | .NET 9.0 | 22.562 us | 0.1429 us | 0.1336 us |  3.37 |

Interestingly even the pre-allocated list is comparably slow.

dotnet --info

Output dotnet --info .NET SDK: Version: 9.0.100-preview.3.24204.13 Commit: 81f61d8290 Workload version: 9.0.100-manifests.77bb7ba9 MSBuild version: 17.11.0-preview-24178-16+7ca3c98fa

Runtime Environment:
OS Name: Mac OS X
OS Version: 14.4
OS Platform: Darwin
RID: osx-arm64
Base Path: /usr/local/share/dotnet/sdk/9.0.100-preview.3.24204.13/

.NET workloads installed:
There are no installed workloads to display.

Host:
Version: 9.0.0-preview.3.24172.9
Architecture: arm64
Commit: 9e6ba1f

.NET SDKs installed:
6.0.417 [/usr/local/share/dotnet/sdk]
7.0.306 [/usr/local/share/dotnet/sdk]
7.0.404 [/usr/local/share/dotnet/sdk]
8.0.100 [/usr/local/share/dotnet/sdk]
8.0.101 [/usr/local/share/dotnet/sdk]
8.0.200 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24121.2 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24157.14 [/usr/local/share/dotnet/sdk]
9.0.100-preview.3.24204.13 [/usr/local/share/dotnet/sdk]

.NET runtimes installed:
Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24120.6 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24128.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24120.11 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24128.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

Other architectures found:
x64 [/usr/local/share/dotnet/x64]

Environment variables:
Not set

global.json file:
Not found

Learn more:
https://aka.ms/dotnet/info

Download .NET:
https://aka.ms/dotnet/download

@linkdotnet linkdotnet added the tenet-performance Performance related issue label Apr 23, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 23, 2024
@vcsjones vcsjones added area-System.Collections and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Apr 23, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

Doesn't repro for me on x64-windows:


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3447/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method               | Job      | Runtime  | Mean     | Error     | StdDev    |
|--------------------- |--------- |--------- |---------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 | 7.339 us | 0.0598 us | 0.0559 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 | 5.002 us | 0.0988 us | 0.1508 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 7.472 us | 0.1401 us | 0.2597 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 4.992 us | 0.0987 us | 0.1730 us |

don't have an macos-arm64 machine to test

@vcsjones
Copy link
Member

@EgorBo I can repro on my M1.

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M1 Ultra, 1 CPU, 20 logical and 20 physical cores
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-OISOJI : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
Job-YEXJFM : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

Method Runtime Mean Error StdDev Ratio
ListAdd10000 .NET 8.0 10.488 us 0.0102 us 0.0080 us 1.00
ListAdd10000 .NET 9.0 28.429 us 0.0501 us 0.0469 us 2.71
ListAdd10000PreAlloc .NET 8.0 7.137 us 0.0147 us 0.0138 us 1.00
ListAdd10000PreAlloc .NET 9.0 24.054 us 0.0877 us 0.0732 us 3.37

@linkdotnet
Copy link
Author

The same applies to 9.0.100-preview.2.24157.14 as well - so the regression might be since preview.1 or preview.2.

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

Interesting! Reproduces for me on M2 Max as well (just found it 🙂):

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
.NET SDK 9.0.100-preview.3.24204.13
  [Host]   : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
  .NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD


| Method               | Job      | Runtime  | Mean      | Error     | StdDev    |
|--------------------- |--------- |--------- |----------:|----------:|----------:|
| ListAdd10000         | .NET 8.0 | .NET 8.0 |  9.633 us | 0.0906 us | 0.0848 us |
| ListAdd10000PreAlloc | .NET 8.0 | .NET 8.0 |  6.688 us | 0.0389 us | 0.0364 us |
| ListAdd10000         | .NET 9.0 | .NET 9.0 | 24.315 us | 0.1284 us | 0.1201 us |
| ListAdd10000PreAlloc | .NET 9.0 | .NET 9.0 | 21.207 us | 0.0854 us | 0.0757 us |

@tannergooding
Copy link
Member

Didn't we change Arm64 to start using the newer Arm64 barrier intstructions for the GC if they existed around that point?

IIRC, @kunalspathak did some of that work.

@tannergooding
Copy link
Member

Not finding the barrier PR I was remembering, but there was #97953 too

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

Doesn't repro on Linux-arm64

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

codegen diff: https://www.diffchecker.com/6tDgYQ8w/ - the only difference is ldp on net9 vs 2 ldr on net8. Hard to imagine it being a culprit 🤔

@tannergooding
Copy link
Member

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

Could be something more subtle has changed and is impacting the general measurements or iteration count?

ldr and ldp are both supposed to be 3 cycle, 3 latency. But alternatively perhaps there's some weird data dependency or other issue here causing the delay?

I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately.

I've just checked it locally - I've built a runtime that exactly matches 9.0-preview3 and removed the Ldr->Ldp optimization - it has fixed the perf 😐

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

Here is the asm diff for ReplaceLdrStrWithPairInstr being there (left) and removed (right) on the same runtime: https://www.diffchecker.com/8C38Yoor/

@vcsjones vcsjones added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Collections labels Apr 23, 2024
@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

After some brainstorming in our Community Discord:

  1. The loop alignment is the same in both cases - we even tried to put an nop after ldp so the overall codegen's (and loop) size matches -- no change
  2. The issue was introduced in JIT: Reorder indirs on arm64 to make them amenable to ldp optimization #92768 in .NET 9.0. That PR allows pre-existing ReplaceLdrStrWithPairInstr optimization to recognize more pairs of loads as ldp.
  3. Linux-arm64 Ampere is not affected it seems (if I correctly measured) so for now it's macOS arm64 specific. Perhaps someone can try it on some other non-mac arm64 machine? The benchmark is:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[SimpleJob(runtimeMoniker: RuntimeMoniker.Net80)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net90)]
public class MyClass
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(MyClass).Assembly).Run(args);
    }

    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

(with <TargetFrameworks>net8.0;net9.0</TargetFrameworks> in the csproj)

@tannergooding
Copy link
Member

It could potentially be related to behavioral changes that exist for LDP when LSE2 is supported.

In particular

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that load or store two 64-bit registers are single-copy
atomic when all of the following conditions are true:
• The overall memory access is aligned to 16 bytes.
• Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory.

If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access fewer than 16 bytes are single-copy
atomic when all of the following conditions are true:
• All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
• Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memor

The actual operation (or a snippet of it) is then described loosely as:

if HaveLSE2Ext() && !signed then
    bits(2*datasize) full_data;
    full_data = Mem[address, 2*dbytes, accdesc, TRUE];
    if BigEndian(accdesc.acctype) then
        data2 = full_data<(datasize-1):0>;
        data1 = full_data<(2*datasize-1):datasize>;
    else
        data1 = full_data<(datasize-1):0>;
        data2 = full_data<(2*datasize-1):datasize>;
else
    data1 = Mem[address, dbytes, accdesc];
    data2 = Mem[address+dbytes, dbytes, accdesc];

So it may be beneficial to test this on a Linux machine with LSE2 to see if that makes a difference or not.

@vcsjones
Copy link
Member

vcsjones commented Apr 23, 2024

Graviton 3:

Method Job Runtime Mean Error StdDev
ListAdd10000PreAlloc .NET 8.0 .NET 8.0 23.84 us 0.422 us 0.414 us
ListAdd10000PreAlloc .NET 9.0 .NET 9.0 23.75 us 0.270 us 0.252 us

To @tannergooding's point though, either the hardware or the kernel I am using do not support LSE2 based on dmesg output.

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

image

this change fixes the perf too. the loads we're looking at are _size and _version inside List<T> so we're likely dealing with some sort of stall here probably caused by LSE2 that Tanner mentioned

@tannergooding
Copy link
Member

My guess, given the above, is then that since LSE2 makes ldp atomic for loads that exist within a 16-byte window (and we know these will be since the they'll be 8-byte aligned), that the write to _version is placing a stall against the read for _size, which wouldn't exist for two independent ldr.

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

I like that guess. Although, the fact that on Graviton 3 we're seeing 23 us for both cases (and I am seeing similar numbers on Ampere-linux-arm64) while x64 and apple m1 show 7us for the best case, might be hinting something else

@EgorBo
Copy link
Member

EgorBo commented Apr 23, 2024

a dummy field between _size and _version in List<> fixes the issue as well (since it breaks the ldp optimization too)

@EgorBo
Copy link
Member

EgorBo commented Apr 24, 2024

Minimal repro:

using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MyClass
{
    static void Main()
    {
        var mc = new MyClass();
        Stopwatch sw = Stopwatch.StartNew();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10000; i++)
                mc.Test();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }

    int field1;
    int field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            field1++;
            field2++;
        }
    }
}

Codegen diff for Test(): https://www.diffchecker.com/72No1VJ6/
Left - Main, ~191 ms per iteration
Right - LDP optimization disabled, ~67 ms per iteration (~3x faster)

@EgorBo
Copy link
Member

EgorBo commented Apr 24, 2024

cc @a74nh @TamarChristinaArm Perhaps, you know why we could see such a terrible perf hit from ldp on Apple arm64 CPUs

@EgorBo
Copy link
Member

EgorBo commented Apr 24, 2024

The fun part that this benchmark is 3x faster under x64 emulation (Rosetta), e.g.:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;

BenchmarkRunner.Run<MyClass>();

[InProcess]
public class MyClass
{
    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}

Then:

dotnet publish --sc -f net9.0 -r osx-x64 
dotnet publish --sc -f net9.0 -r osx-arm64

Native (arm64):

Apple M2 Max, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

| Method               | Mean     | Error    | StdDev   |
|--------------------- |---------:|---------:|---------:|
| ListAdd10000PreAlloc | 22.07 us | 0.258 us | 0.229 us |

Rosetta (x64) on the same hw

Apple M2 Max 2.40GHz, 1 CPU, 12 logical and 12 physical cores
  [Host] : .NET 9.0.0 (9.0.24.17209), X64 RyuJIT SSE4.2

| Method               | Mean     | Error     | StdDev    |
|--------------------- |---------:|----------:|----------:|
| ListAdd10000PreAlloc | 7.724 us | 0.0862 us | 0.0764 us |

@linkdotnet
Copy link
Author

linkdotnet commented Apr 24, 2024

On an arm64 RPi 4B there is no regression as well:

BenchmarkDotNet v0.13.12, Debian GNU/Linux 12 (bookworm)
Unknown processor
.NET SDK 9.0.100-preview.3.24204.13
[Host] : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD
.NET 8.0 : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD
.NET 9.0 : .NET 9.0.0 (9.0.24.17209), Arm64 RyuJIT AdvSIMD

Method Job Runtime Mean Error StdDev
ListAdd10000PreAlloc .NET 8.0 .NET 8.0 74.73 us 0.164 us 0.128 us
ListAdd10000PreAlloc .NET 9.0 .NET 9.0 64.74 us 0.468 us 0.415 us

@Clockwork-Muse

This comment was marked as off-topic.

@tannergooding

This comment was marked as off-topic.

@mqudsi

This comment was marked as off-topic.

@tannergooding

This comment was marked as off-topic.

@jkotas

This comment was marked as off-topic.

@mqudsi

This comment was marked as off-topic.

@EgorBo EgorBo self-assigned this Apr 27, 2024
@EgorBo EgorBo added this to the Future milestone Apr 27, 2024
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Apr 27, 2024
@EgorBo
Copy link
Member

EgorBo commented Apr 27, 2024

So we have a good guess (discussed internally) what's going on here, but it's not clear how to detect bad patterns and disable ldr->ldp optimization. The PRs which added ldr->ldp optimization only have a very few improvement reports attached to them (except ldp for SIMD regs), so we can, probably, just conservatively disable it for Apple platforms (we've confirmed that this problem is Apple specific at this point).

I think the first thing we have to do is to add macOS-arm64 platform to our dotnet/performance runs (it's currently Linux and Windows only) before we make any changes.

@marcan
Copy link

marcan commented Apr 28, 2024

Reminder that Linux on Apple Silicon is a thing, both virtualized and bare metal, and virtual Windows ARM64 on Apple Silicon is also a thing. Please do not gate this on macOS only.

Given the LSE2 theory, I think it makes sense to treat this as a performance issue potentially affecting more CPU platforms (especially future ones), given that there is a plausible explanation for the mechanism and why the code is, for at least some logical implementations, actually sub-optimal. If the gains of this peephole opt are otherwise minimal, and it's not easy to detect the problem code sequences (the ones where it triggers extra stalls) then I think the logical thing to do is just disable it across the board. Alternatively, if you think this is rare enough and the List.Add case is an outlier, then just work around it there (e.g. with the version change).

But please don't use OS platform as a gate for this, since that makes no sense given multiple OSes exist (you'd have to use CPU implementer instead, and I assume that's not practical given ahead-of-time compilation?).

@neon-sunset
Copy link
Contributor

neon-sunset commented Apr 28, 2024

It looks like part of the discussion was marked off-topic.

Just wanted to re-iterate that this particular regression is still solvable by removing _version when users target Release altogether, and it might be a good opportunity to finally explore this option as described in #81523

@linkdotnet
Copy link
Author

It looks like part of the discussion was marked off-topic.

Just wanted to re-iterate that this particular regression is still solvable by removing _version when users target Release altogether, and it might be a good opportunity to finally explore this option as described in #81523

Wouldn’t that only solve the problem for List?
Leaving an area for potential other regressions now and the future!?

@linkdotnet
Copy link
Author

Don’t get me wrong - one can still lead this discussion but I do feel it should be handled separately (and not only because of performance characteristics)

@EgorBo
Copy link
Member

EgorBo commented Apr 28, 2024

Minimal repro for the Apple issue using Clang and inline asm: https://gist.github.com/EgorBo/88196a218559ec93a197a7d1d5600548

ldp makes this "benchmark" ~4-5x slower. An interesting thing is that if we have ldp and stp - it's still slow (we thought that the problem reproduces only when we mix str with ldp and stp with ldp would be fine)

@mqudsi
Copy link

mqudsi commented Apr 29, 2024

Wouldn't just moving the _version write as @EgorBo demonstrated just be the least controversial change/fix here (fixes this particular performance regression, avoids arguing about the bigger questions involved with removing version checks altogether, keeps version checks in place even for List.Add())?

The performance regressions can be tackled by monitoring macOS benchmark results (the point about Apple Silicon != macOS is well taken, but also somewhat besides the point since Linux (OS) performance is tracked separately and monitoring macOS (OS + hw) performance would have caught this for both newer Macs (OS + hw) and Linux-on-Apple-silicon — you do not need "idealized" performance monitoring where only one factor changes between each monitored configuration) and figuring out where other notable regressions occur?

You probably don't want to blanket disable ldr to ldp optimizations, as this was a pathological case where there was a false data dependency on two adjacent fields but there likely are cases where compiler optimizations can take advantage of the single copy atomicity improvements to net performance gains.

(But I must confess I am confused why this would problem would not manifest on non-Apple silicon aarch64 w/ FEAT_LSE2 present, as reported earlier in this thread, unless the cpu wasn't taking advantage of the available instruction-level parallelism before so there is no regression now?)

@Developer-Ecosystem-Engineering

We are reviewing this issue thanks for raising it!

@Developer-Ecosystem-Engineering

The use of LDP instructions is generally encouraged. In this specific case, the structure of the loop (that includes a memory dependence between stores and loads), coupled with the use of LDP instructions prevents certain hardware optimizations from engaging. The small and tight nature of the loop also exacerbates the impact.

Please also see section 4.6.11 of the Apple Silicon CPU optimization guide.

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Jul 30, 2024
…dr -> ldp

Very targeted fix for dotnet#93401 and dotnet#101437: before reordering two
indirections, check if there is a potential store in the same loop that
looks like it could end up being a candidate for store-to-load
forwarding into one of those indirections. Some hardware does not handle
store-to-load forwarding with the same fidelity when `stp`/`ldp` is
involved compared to multiple `str`/`ldr`.

If we detect the situation then avoid doing the reordering.
@jakobbotsch jakobbotsch assigned jakobbotsch and unassigned EgorBo Jul 30, 2024
@jakobbotsch jakobbotsch modified the milestones: Future, 9.0.0 Jul 30, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Aug 1, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.