List Add is way slower (almost 3 times) in net9.0 preview 3 than with net8.0 #101437
Comments
Tagging subscribers to this area: @dotnet/area-system-collections |
Doesn't repro for me on x64-windows:
I don't have a macos-arm64 machine to test |
@EgorBo I can repro on my M1. BenchmarkDotNet v0.13.12, macOS Sonoma 14.4.1 (23E224) [Darwin 23.4.0]
|
The same applies to |
Interesting! Reproduces for me on M2 Max as well (just found it 🙂):
|
Didn't we change Arm64 to start using the newer Arm64 barrier instructions for the GC, if they exist, around that point? IIRC, @kunalspathak did some of that work. |
Not finding the barrier PR I was remembering, but there was #97953 too |
Doesn't repro on Linux-arm64 |
codegen diff: https://www.diffchecker.com/6tDgYQ8w/ - the only difference is |
Could be something more subtle has changed and is impacting the general measurements or iteration count?
I don't think Apple or Arm64 in general has anything like Intel VTune or AMD uProf, so seeing where the stalls are happening isn't as easy, unfortunately. |
I've just checked it locally - I've built a runtime that exactly matches 9.0-preview3 and removed the Ldr->Ldp optimization - it has fixed the perf 😐 |
Here is the asm diff for |
After some brainstorming in our Community Discord:
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Runs the same benchmark on .NET 8 and .NET 9 for a side-by-side comparison.
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net80)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.Net90)]
public class MyClass
{
    static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(MyClass).Assembly).Run(args);
    }

    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        // Pre-allocated capacity, so no resize happens inside the loop.
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}
(with |
It could potentially be related to behavioral changes that exist for In particular
The actual operation (or a snippet of it) is then described loosely as:
So it may be beneficial to test this on a Linux machine with LSE2 to see if that makes a difference or not. |
Graviton 3:
To @tannergooding's point though, either the hardware or the kernel I am using do not support LSE2 based on dmesg output. |
My guess, given the above, is then that since |
I like that guess. However, on Graviton 3 we're seeing 23 us for both cases (and I am seeing similar numbers on Ampere-linux-arm64), while x64 and Apple M1 show 7 us for the best case, which might hint at something else |
a dummy field between _size and _version in |
Minimal repro:
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public class MyClass
{
    static void Main()
    {
        var mc = new MyClass();
        Stopwatch sw = Stopwatch.StartNew();
        // Loops forever, printing the elapsed time of each batch of 10000 calls.
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10000; i++)
                mc.Test();
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }

    // Two adjacent int fields that are loaded and stored on every iteration.
    int field1;
    int field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            field1++;
            field2++;
        }
    }
}
Codegen diff for |
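The dummy-field experiment mentioned earlier in the thread can be tried directly against this repro. The following is only a sketch: the class names `AdjacentFields`, `PaddedFields`, and `LayoutExperiment` are hypothetical, the warm-up pass is a crude stand-in for what BenchmarkDotNet does properly, and it assumes the runtime keeps the fields in declaration order, which is not guaranteed for classes.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Hypothetical variant of the repro: two adjacent int fields.
public class AdjacentFields
{
    public int Field1;
    public int Field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            Field1++;
            Field2++;
        }
    }
}

// Same loop, but with a dummy field separating the two counters.
public class PaddedFields
{
    public int Field1;
    public int Padding; // dummy field; assumes declaration order is preserved
    public int Field2;

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Test()
    {
        for (int i = 0; i < 10000; i++)
        {
            Field1++;
            Field2++;
        }
    }
}

public static class LayoutExperiment
{
    // Runs the given action repeatedly and returns the elapsed milliseconds.
    public static long TimeMs(Action body, int iterations)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            body();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        var adjacent = new AdjacentFields();
        var padded = new PaddedFields();

        // Warm-up pass so that both methods have a chance to be fully optimized.
        TimeMs(adjacent.Test, 10000);
        TimeMs(padded.Test, 10000);

        Console.WriteLine($"adjacent: {TimeMs(adjacent.Test, 10000)} ms");
        Console.WriteLine($"padded:   {TimeMs(padded.Test, 10000)} ms");
    }
}
```

If the theory in this thread holds, the two timings should differ noticeably on affected Apple Silicon hardware and be close to each other elsewhere; the sketch itself makes no claim about the numbers.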
cc @a74nh @TamarChristinaArm Perhaps, you know why we could see such a terrible perf hit from |
The fun part is that this benchmark is 3x faster under x64 emulation (Rosetta), e.g.:
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System.Runtime.Intrinsics;

BenchmarkRunner.Run<MyClass>();

// InProcess keeps the benchmark in the published (self-contained) process.
[InProcess]
public class MyClass
{
    [Benchmark]
    public List<int> ListAdd10000PreAlloc()
    {
        var list = new List<int>(10_000);
        for (var i = 0; i < 10_000; i++)
            list.Add(i);
        return list;
    }
}
Then:
dotnet publish --sc -f net9.0 -r osx-x64
dotnet publish --sc -f net9.0 -r osx-arm64
Native (arm64):
Rosetta (x64) on the same hw
|
On an BenchmarkDotNet v0.13.12, Debian GNU/Linux 12 (bookworm)
|
This comment was marked as off-topic.
So we have a good guess (discussed internally) what's going on here, but it's not clear how to detect bad patterns and disable ldr->ldp optimization. The PRs which added I think the first thing we have to do is to add macOS-arm64 platform to our dotnet/performance runs (it's currently Linux and Windows only) before we make any changes. |
Reminder that Linux on Apple Silicon is a thing, both virtualized and bare metal, and virtual Windows ARM64 on Apple Silicon is also a thing. Please do not gate this on macOS only. Given the LSE2 theory, I think it makes sense to treat this as a performance issue potentially affecting more CPU platforms (especially future ones), given that there is a plausible explanation for the mechanism and why the code is, for at least some logical implementations, actually sub-optimal. If the gains of this peephole opt are otherwise minimal, and it's not easy to detect the problem code sequences (the ones where it triggers extra stalls) then I think the logical thing to do is just disable it across the board. Alternatively, if you think this is rare enough and the List.Add case is an outlier, then just work around it there (e.g. with the But please don't use OS platform as a gate for this, since that makes no sense given multiple OSes exist (you'd have to use CPU implementer instead, and I assume that's not practical given ahead-of-time compilation?). |
It looks like part of the discussion was marked off-topic. Just wanted to re-iterate that this particular regression is still solvable by removing |
Wouldn’t that only solve the problem for List? |
Don’t get me wrong - one can still lead this discussion but I do feel it should be handled separately (and not only because of performance characteristics) |
Minimal repro for the Apple issue using Clang and inline asm: https://gist.github.com/EgorBo/88196a218559ec93a197a7d1d5600548
|
Wouldn't just moving the
The performance regressions can be tackled by monitoring macOS benchmark results and figuring out where other notable regressions occur. (The point about Apple Silicon != macOS is well taken, but also somewhat beside the point: Linux (OS) performance is tracked separately, and monitoring macOS (OS + hw) performance would have caught this for both newer Macs (OS + hw) and Linux-on-Apple-silicon; you do not need "idealized" performance monitoring where only one factor changes between each monitored configuration.) You probably don't want to blanket disable ldr to ldp optimizations, as this was a pathological case where there was a false data dependency on two adjacent fields, but there likely are cases where compiler optimizations can take advantage of the single copy atomicity improvements for net performance gains. (But I must confess I am confused why this problem would not manifest on non-Apple silicon aarch64 w/ FEAT_LSE2 present, as reported earlier in this thread, unless the cpu wasn't taking advantage of the available instruction-level parallelism before, so there is no regression now?) |
We are reviewing this issue. Thanks for raising it! |
The use of LDP instructions is generally encouraged. In this specific case, the structure of the loop (which includes a memory dependence between stores and loads), coupled with the use of LDP instructions, prevents certain hardware optimizations from engaging. The small and tight nature of the loop also exacerbates the impact. Please also see section 4.6.11 of the Apple Silicon CPU optimization guide. |
…dr -> ldp Very targeted fix for dotnet#93401 and dotnet#101437: before reordering two indirections, check if there is a potential store in the same loop that looks like it could end up being a candidate for store-to-load forwarding into one of those indirections. Some hardware does not handle store-to-load forwarding with the same fidelity when `stp`/`ldp` is involved compared to multiple `str`/`ldr`. If we detect the situation then avoid doing the reordering.
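The check described in that commit message can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the JIT's actual code: `MemAccess`, `Op`, `LdpHeuristic`, and `ShouldPair` are hypothetical names, memory accesses are reduced to a base name plus an offset and a size, and "could be a store-to-load forwarding candidate" is approximated as "a store in the same loop overlaps one of the two loads".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: these types are hypothetical and are not the JIT's IR.
public enum Op { Load, Store }

public readonly record struct MemAccess(Op Kind, string Base, int Offset, int Size);

public static class LdpHeuristic
{
    // True when the two accesses touch at least one common byte.
    static bool Overlaps(MemAccess a, MemAccess b) =>
        a.Base == b.Base &&
        a.Offset < b.Offset + b.Size &&
        b.Offset < a.Offset + a.Size;

    // Two loads can form a pair when they share a base and a size
    // and sit directly next to each other in memory.
    public static bool AreAdjacentLoads(MemAccess a, MemAccess b) =>
        a.Kind == Op.Load && b.Kind == Op.Load &&
        a.Base == b.Base && a.Size == b.Size &&
        Math.Abs(a.Offset - b.Offset) == a.Size;

    // Pair the loads only if no store in the same loop could forward into them.
    public static bool ShouldPair(MemAccess first, MemAccess second,
                                  IEnumerable<MemAccess> loopBody)
    {
        if (!AreAdjacentLoads(first, second))
            return false;

        return !loopBody.Any(m => m.Kind == Op.Store &&
                                  (Overlaps(m, first) || Overlaps(m, second)));
    }
}

public static class Demo
{
    public static void Main()
    {
        // Models the repro loop: both fields are loaded and then stored back.
        var load1 = new MemAccess(Op.Load, "this", 8, 4);
        var load2 = new MemAccess(Op.Load, "this", 12, 4);
        var store1 = new MemAccess(Op.Store, "this", 8, 4);
        var store2 = new MemAccess(Op.Store, "this", 12, 4);

        var readWriteLoop = new[] { load1, load2, store1, store2 };
        var readOnlyLoop = new[] { load1, load2 };

        Console.WriteLine(LdpHeuristic.ShouldPair(load1, load2, readWriteLoop)); // False
        Console.WriteLine(LdpHeuristic.ShouldPair(load1, load2, readOnlyLoop));  // True
    }
}
```

The design choice mirrored here is that the pairing stays enabled by default and is skipped only when a store in the same loop makes forwarding plausible, which matches the "very targeted" wording of the commit message rather than a blanket disable.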
With the latest preview 3 (net9) there seems to be a major performance regression on macOS 14.4 (M2 Pro) with lists. Given that they are one of the most used types, this should receive special treatment.
Benchmark
Results:
Interestingly even the pre-allocated list is comparably slow.
dotnet --info
Output
dotnet --info
.NET SDK:
Version: 9.0.100-preview.3.24204.13
Commit: 81f61d8290
Workload version: 9.0.100-manifests.77bb7ba9
MSBuild version: 17.11.0-preview-24178-16+7ca3c98fa
Runtime Environment:
OS Name: Mac OS X
OS Version: 14.4
OS Platform: Darwin
RID: osx-arm64
Base Path: /usr/local/share/dotnet/sdk/9.0.100-preview.3.24204.13/
.NET workloads installed:
There are no installed workloads to display.
Host:
Version: 9.0.0-preview.3.24172.9
Architecture: arm64
Commit: 9e6ba1f
.NET SDKs installed:
6.0.417 [/usr/local/share/dotnet/sdk]
7.0.306 [/usr/local/share/dotnet/sdk]
7.0.404 [/usr/local/share/dotnet/sdk]
8.0.100 [/usr/local/share/dotnet/sdk]
8.0.101 [/usr/local/share/dotnet/sdk]
8.0.200 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24121.2 [/usr/local/share/dotnet/sdk]
9.0.100-preview.2.24157.14 [/usr/local/share/dotnet/sdk]
9.0.100-preview.3.24204.13 [/usr/local/share/dotnet/sdk]
.NET runtimes installed:
Microsoft.AspNetCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24120.6 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.2.24128.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 9.0.0-preview.3.24172.13 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
Microsoft.NETCore.App 6.0.25 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 7.0.14 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.0 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.1 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 8.0.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24120.11 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.2.24128.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Microsoft.NETCore.App 9.0.0-preview.3.24172.9 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
Other architectures found:
x64 [/usr/local/share/dotnet/x64]
Environment variables:
Not set
global.json file:
Not found
Learn more:
https://aka.ms/dotnet/info
Download .NET:
https://aka.ms/dotnet/download