Add a proposal which suggests updating the xarch baseline target #272
Conversation
Looks reasonable in general, though I'd like to touch on one specific point:
Why not go for the best-of-both-worlds approach? Build and ship IL, and have AOT compilation occur on the destination machine at installation time, rather than JITting at runtime or AOTing at build time? This is essentially what Android does and it works quite well there.

This works when the runtime is part of the OS (or part of a large app with a complex installer) and the OS can manage the app lifecycle. It does not work well for runtimes that ship independently of the OS, like what the .NET runtime is today.

One consideration is that Android owns the OS and so is able to guarantee the tools required to do that are available. They also don't support any concepts like "xcopy" deployment of apps and centralize acquisition via the App Store. I think doing the same for .NET would be pretty awesome, but it also comes with considerations like a larger deployment mechanism and other potential negative side effects. Crossgen2 for a higher baseline is pretty much like this already, but without many of the drawbacks.
A few angles to consider:
## Alternatives
We could maintain the `x86-64-v1` baseline for the JIT (optionally removing pre-`v2` SIMD acceleration) while changing the default for AOT. This could emit a diagnostic by default explaining to users that the output won't support older hardware and indicating how they could explicitly retarget to `x86-64-v1` where that is important for their domain.
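As an illustration only: for NativeAOT, the explicit retarget could plausibly reuse the existing `IlcInstructionSet` publish property. The property name is real, but the value shown here is hypothetical and would be whatever identifier the proposal ultimately defines.

```xml
<PropertyGroup>
  <!-- Hypothetical opt-out: pin the AOT-compiled output back to the x86-64-v1
       baseline for deployments that must still run on pre-SSE4 hardware. -->
  <IlcInstructionSet>x86-64-v1</IlcInstructionSet>
</PropertyGroup>
```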
Nit: Emitting a diagnostic like this by default is just output clutter. If we were to do this, it would just be mentioned in the docs.
Sure, NativeAOT isn't built into the OS. How much work would it be to integrate it with a standard installer-generator system like MSI, though? It would never be The Standard, but it would at least be available for developers in the know.

It depends on what your requirements are. You can certainly do some variant of it on your own. I do not expect we (the Microsoft .NET team) will provide or recommend a solution like this. It would not pass our security signoff.

Huh. That's not the objection I'd have expected to see. What are the security concerns here?

For example, the binaries cannot be signed.

Wasn't signing eliminated from Core a few versions ago anyway? I remember that one pretty clearly because there were breaking changes in .NET 6 that broke my compiler, and when I complained about it the team refused to make even the most inconsequential of changes to alleviate the compatibility break.

I am not talking about strong name signing. I am talking about Microsoft Authenticode, Apple app code signing, and similar types of signatures.

All right. So how does Android handle it?

I do not know the details of how Android handles this. I can tell you what it involved to make this scheme work with .NET Framework: the NGen service process was recognized as a special process by the Windows OS that was allowed to vouch for the authenticity of its output. It involved hardening like disallowing debugger attach to the NGen service process (another special arrangement provided by the Windows OS) so that you cannot tamper with its execution.

Yeah, that makes sense. The AOT compiler has to be in a position of high trust for a scheme like that to work. Joe Duffy said something very similar about the Midori architecture.
Is this related to #173?
Would this also cover how we compile the native parts of the (non-AOT) runtimes (GC, CoreCLR VM, etc.)? My main concern would be the user experience for the minority of users that don't meet this requirement - I'd like to avoid the user experience being
Do we have any motivating scenarios that we expect to meaningfully improve? I tried the TechEmpower JSON benchmark, but I'm seeing some very confusing results (
That's weird, because SSE3 specifically doesn't bring any value (except, maybe, HADD for floats, but that's unlikely to be touched in TE benchmarks). For shuffle, it's SSSE3 that is interesting because it provides the overloads we need.
Wait, does COMPlus_EnableSSE3=0 only disable SSE3? I thought it worked similarly to how we do detection in codeman.cpp - not detecting SSE3 means we also consider SSSE3/SSE4.1/SSE4.2/AVX etc. unavailable. Or do I need to set COMPlus_EnableXXX=0 for everything one by one to get the measurement I wanted?
It's similar, but this is about upgrading the baseline for AOT and therefore directly impacts all consumers of .NET. #173 impacts the default for crossgen, which only results in worse startup performance on older hardware.
That would likely be up for debate. MSVC only provides
We have a similar message raised by the VM today as well; it's just unlikely to ever be encountered since it's only checking for SSE2.
Most codepaths that use

Notably:

For SSSE3, the most important is

For SSE4.1, the most important are

Not having these means the codegen for many core algorithms, especially in the
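To make the shape of the affected code concrete, here is a minimal, self-contained sketch of the kind of `Vector128`-based search loop these helper APIs accelerate. This is not the actual dotnet/runtime implementation; the helpers used are real .NET 7 `Vector128` APIs, but the algorithm structure and names are illustrative only.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class SearchSketch
{
    // Illustrative vectorized IndexOf over bytes. Equals + ExtractMostSignificantBits
    // lowers to cheap SSE2 instructions, but many related helpers (widening, shuffles,
    // element extraction, etc.) only get efficient single-instruction codegen once
    // SSSE3/SSE4.1 (i.e. the x86-64-v2 baseline) can be assumed.
    public static int IndexOf(ReadOnlySpan<byte> span, byte value)
    {
        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<byte>.Count)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector128<byte> target = Vector128.Create(value);
            nuint lastVectorStart = (nuint)(span.Length - Vector128<byte>.Count);

            for (nuint i = 0; i <= lastVectorStart; i += (nuint)Vector128<byte>.Count)
            {
                uint mask = Vector128.ExtractMostSignificantBits(
                    Vector128.Equals(Vector128.LoadUnsafe(ref start, i), target));

                if (mask != 0)
                {
                    return (int)i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar fallback; for simplicity this rescans from the start rather than
        // handling only the trailing remainder the way tuned library code would.
        for (int j = 0; j < span.Length; j++)
        {
            if (span[j] == value)
            {
                return j;
            }
        }

        return -1;
    }
}
```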
They are hierarchical, and so

Could you provide more concrete numbers and possibly codegen? This sounds unexpected and doesn't match what I've seen in past benchmarking comparisons. We could certainly get more concrete numbers by running all of
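For reference, these are the switches used later in this thread to approximate each baseline on current hardware; per the hierarchical behavior described above, disabling a lower ISA also removes everything that builds on it. The mapping to the `x86-64-vN` levels is the approximation quoted in the result headings below, not an exact equivalence.

```
COMPlus_EnableAVX2=0         # roughly x86-64-v2, but still allowing VEX encoding
COMPlus_EnableAVX=0          # roughly x86-64-v2 (SSE4.2 and below, no VEX)
COMPlus_EnableSSE3=0         # roughly x86-64-v1 (SSE/SSE2 baseline only)
COMPlus_EnableHWIntrinsic=0  # disable SIMD acceleration entirely
```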
I'd rather this decision wasn't made solely on microbenchmarks. I have no doubts it helps microbenchmarks. They're good supporting evidence, but something that impacts the users is better as the main evidence. That's why I'm trying TechEmpower (it's an E2E number we care about).
I can give you what I did, but not much more than that. Hopefully it's enough to find what I'm doing wrong:
And then just run the above crank command with/without the

Without the EnableSSE3=0 argument:

With the EnableSSE3=0 argument:
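For readers unfamiliar with crank, the invocation being described likely has roughly the following shape. The scenario, target machine, and environment variable come from this thread; the config URL and exact flag spellings are recalled from the aspnet/Benchmarks documentation and may need adjusting, so treat this as a sketch rather than the verbatim command:

```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/json.benchmarks.yml \
      --scenario json \
      --profile aspnet-citrine-lin \
      --application.framework net7.0 \
      --application.environmentVariables COMPlus_EnableSSE3=0
```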
I agree that it shouldn't be made solely based on microbenchmarks. However, we also know well how frequently span/string APIs are used and the cost of branches in hot loops, even predicted branches. We likewise know the importance of these operations in scenarios like ML, image processing, games, etc. I imagine coming up with real-world benchmarks showing improvements won't be difficult.

With that being said, we also really should not restrict ourselves to a 20-year-old baseline regardless. Such hardware is all officially out of support and discontinued by the respective hardware manufacturers. Holding out for such a small minority of hardware that likely isn't even running a supported OS is ultimately pretty silly (and I expect such users aren't likely to be using new versions of .NET anyway).

At some point, we need to have the freedom/flexibility to tell users that newer versions of .NET won't support hardware that old (at the very least "officially"; Jan's suggestion of leaving the support in but making it community supported is reasonable, as would be simply making it not the default).
Does this work on Windows, or is it Linux only? Do you also need

On Windows, I see
Likewise, how much variance is there here run-to-run (that is, across separate attempts to profile using the same command line)?
I'd like us to have such an E2E number - we're discussing making .NET-produced executables FailFast on 1 out of 100 machines in the wild by default - "
I ran it on Windows. You need the VPN on because asp-citrine-lin is a corpnet machine. AFAIK crank-agent is needed on the machine where you run the test (which is asp-citrine-lin in this case, so no need to worry about it).
I made 2 runs each and there was some noise, but the difference in the two runs looked conclusive. I think crank does a warmup, but it's really an ASP.NET team tool that I don't have much experience with (only to the extent that we track it and it's part of our release criteria, and therefore looks relevant).
I expect it's much less than this in practice, especially when taking into account enterprise/cloud hardware, the users that are likely to be running on a supported OS (especially if you consider officially supported hardware, of which only Linux supports hardware this old), and those that are likely to be running/using the latest versions of .NET. It's worth noting I opened dotnet/sdk#28055 so that we can, longer term, get more definitive information on this and other important hardware characteristics.
👍. The below is the median of 5 results for each. I didn't notice any obvious outliers. I notably ran both JSON and Plaintext to get two different comparisons. There is a clear disambiguator when SIMD is disabled entirely, and there is a small but measurable difference between

The default (which should be

Json

.NET 7 - Default

.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
load | |
---|---|
CPU Usage (%) | 66 |
Cores usage (%) | 1,847 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 113 |
Requests/sec | 990,885 |
Requests | 14,962,330 |
Mean latency (ms) | 0.46 |
Max latency (ms) | 39.10 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 143.64 |
Latency 50th (ms) | 0.23 |
Latency 75th (ms) | 0.27 |
Latency 90th (ms) | 0.32 |
Latency 99th (ms) | 8.18 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
load | |
---|---|
CPU Usage (%) | 66 |
Cores usage (%) | 1,835 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 116 |
Requests/sec | 980,136 |
Requests | 14,799,763 |
Mean latency (ms) | 0.49 |
Max latency (ms) | 41.72 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 142.08 |
Latency 50th (ms) | 0.24 |
Latency 75th (ms) | 0.27 |
Latency 90th (ms) | 0.32 |
Latency 99th (ms) | 8.72 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
load | |
---|---|
CPU Usage (%) | 64 |
Cores usage (%) | 1,783 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 211 |
Requests/sec | 944,005 |
Requests | 14,253,837 |
Mean latency (ms) | 0.50 |
Max latency (ms) | 48.26 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 136.84 |
Latency 50th (ms) | 0.25 |
Latency 75th (ms) | 0.29 |
Latency 90th (ms) | 0.34 |
Latency 99th (ms) | 9.27 |
Plaintext
.NET 7 - Default
load | |
---|---|
CPU Usage (%) | 44 |
Cores usage (%) | 1,221 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 90 |
Requests/sec | 4,625,118 |
Requests | 69,838,073 |
Mean latency (ms) | 0.60 |
Max latency (ms) | 29.17 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 582.23 |
Latency 50th (ms) | 0.52 |
Latency 75th (ms) | 0.76 |
Latency 90th (ms) | 1.05 |
Latency 99th (ms) | 0.00 |
.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
load | |
---|---|
CPU Usage (%) | 44 |
Cores usage (%) | 1,232 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 93 |
Requests/sec | 4,679,347 |
Requests | 70,655,667 |
Mean latency (ms) | 0.58 |
Max latency (ms) | 35.71 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 589.06 |
Latency 50th (ms) | 0.51 |
Latency 75th (ms) | 0.75 |
Latency 90th (ms) | 1.03 |
Latency 99th (ms) | 0.00 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
load | |
---|---|
CPU Usage (%) | 44 |
Cores usage (%) | 1,225 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 91 |
Requests/sec | 4,635,911 |
Requests | 69,999,632 |
Mean latency (ms) | 0.59 |
Max latency (ms) | 32.99 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 583.59 |
Latency 50th (ms) | 0.53 |
Latency 75th (ms) | 0.76 |
Latency 90th (ms) | 1.09 |
Latency 99th (ms) | 0.00 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
load | |
---|---|
CPU Usage (%) | 42 |
Cores usage (%) | 1,178 |
Working Set (MB) | 38 |
Private Memory (MB) | 358 |
Start Time (ms) | 0 |
First Request (ms) | 158 |
Requests/sec | 4,370,389 |
Requests | 65,991,281 |
Mean latency (ms) | 0.63 |
Max latency (ms) | 32.96 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 550.17 |
Latency 50th (ms) | 0.55 |
Latency 75th (ms) | 0.80 |
Latency 90th (ms) | 1.10 |
Latency 99th (ms) | 0.00 |
I'm actually trying to do exactly this kind of real-world codebase benchmarking, to see if the .NET 7 performance benefits touted in the blog posts make a measurable difference in some performance-sensitive code. Unfortunately, I've been stymied by the inability to actually get anything to run in .NET 7. Any help would be welcome, and I promise to report back with relevant numbers once I have some to share.
I'm not sure if that one would help - it's the hardware the .NET developers use, not the hardware where .NET code runs. Developers are more likely to skew towards the latest and greatest. Users are the "secretary's machine" and the "school computer". The Windows org is more likely to have that kind of telemetry.
What command line arguments did you use for crank? The numbers for JSON are all a bit lower than I would expect (compare with mine above).
Ah, you know what I ran
Which again, doesn't really matter when you consider that most operating systems don't support hardware that old.

In the case of macOS, it looks to be impossible for any OS we currently support to be running on pre-AVX2 hardware.

In the case of Windows, 8.1 is the oldest client SKU we still support. For 8.1, Windows themselves updated the baseline CPU required for x64 (it must have CMPXCHG16B and LAHF/SAHF). Various articles quote a comment stating "the number of affected processors are extremely small since this instruction has been supported for greater than 10 years.". For 7, it's only supported with an ESU subscription, in which case other factors like the Windows Processor Requirements list come into play, and they are all post

Linux is really the only interesting case, where the kernel still officially supports running on an 80386 (older than we support) and where many distros intentionally keep their specs "low". This is also a case where many recommend using alternative GUIs or specialized distro builds for such low-spec computers to help. Ubuntu's docs go so far as to describe 10 and 15 year old systems and the scenarios that will likely prevent their usage in a default configuration, the biggest of which is typically that they don't support and have no way of supporting an SSD.

In short, such hardware is simply too old to be meaningful and, given our official OS support matrix, is already unlikely to have a good experience with the latest versions of .NET.
@tannergooding Agreed. "Not supported" loses all meaning if no changes can be made based on disregarding the existence of things officially not supported. |
Notes about TechEmpower benchmarks: they operate on extremely small data. For example, here is what the JSON benchmark tests: https://github.com/aspnet/Benchmarks/blob/e3095f4021fef7171bb3ae86616b9156df39b7bd/src/Benchmarks/Middleware/JsonMiddleware.cs#L51 - its string representation probably doesn't even fit into an AVX vector. And here is the Plaintext one - https://github.com/aspnet/Benchmarks/blob/e3095f4021fef7171bb3ae86616b9156df39b7bd/src/Benchmarks/Middleware/PlaintextMiddleware.cs#L16-L41 - even smaller. It's nowhere near being a "real" workload.

I mean, TE benchmarks are great to spot obvious regressions, and they already helped us a lot to spot problems in things like GC regions, crossgen, etc., and to measure internal aspnet overhead and threadpool scaling, but they're definitely not something we can use to make decisions around vector width IMO.

Same regarding HTTP headers; in the TechEmpower benchmarks they're:

UTF8, so only Accept is "probably" worth using AVX for. While normally your browser sends 16 or more headers.
It's worth noting it's not just about the size of the data; it's primarily about the amount of data that must be processed before the algorithm can exit. Even for very large inputs, if you're just finding the first index of a common character, then the number of iterations you execute is small and the overhead of the required checks slows things down. Whereas if it's an uncommon character, or if you have to process the whole input, then you can get a 1.5x or more perf improvement as the payoff.
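A small self-contained illustration of that point (a sketch for intuition, not one of the benchmarks above): searching for a byte that occurs near the start exits after a handful of iterations, so the SIMD baseline barely matters, while searching for a byte that never occurs forces the whole input through the vectorized loop, which is where the higher baseline pays off.

```csharp
using System;
using System.Diagnostics;

class EarlyExitSketch
{
    static void Main()
    {
        byte[] data = new byte[64 * 1024];
        Array.Fill(data, (byte)'a');
        data[8] = (byte)',';            // a "common" byte, found almost immediately

        Measure("common byte (early exit)", data, (byte)',');
        Measure("absent byte (full scan)", data, (byte)'#');
    }

    static void Measure(string label, byte[] data, byte value)
    {
        ReadOnlySpan<byte> span = data;
        var sw = Stopwatch.StartNew();
        int index = 0;

        for (int i = 0; i < 100_000; i++)
        {
            index = span.IndexOf(value);   // the vectorized BCL search path
        }

        sw.Stop();
        Console.WriteLine($"{label}: index={index}, elapsed={sw.ElapsedMilliseconds} ms");
    }
}
```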
Reran ensuring I used

Json

.NET 7 - Default

.NET 7 - EnableAVX2=0 (effectively target x86-64-v2 but allowing VEX encoding)
load | |
---|---|
CPU Usage (%) | 80 |
Cores usage (%) | 2,229 |
Working Set (MB) | 38 |
Private Memory (MB) | 363 |
Start Time (ms) | 0 |
First Request (ms) | 85 |
Requests/sec | 1,215,177 |
Requests | 18,348,339 |
Mean latency (ms) | 0.69 |
Max latency (ms) | 48.18 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 169.20 |
Latency 50th (ms) | 0.38 |
Latency 75th (ms) | 0.44 |
Latency 90th (ms) | 0.55 |
Latency 99th (ms) | 9.71 |
.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
load | |
---|---|
CPU Usage (%) | 80 |
Cores usage (%) | 2,238 |
Working Set (MB) | 38 |
Private Memory (MB) | 363 |
Start Time (ms) | 0 |
First Request (ms) | 78 |
Requests/sec | 1,221,743 |
Requests | 18,447,377 |
Mean latency (ms) | 0.66 |
Max latency (ms) | 36.48 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 170.11 |
Latency 50th (ms) | 0.38 |
Latency 75th (ms) | 0.44 |
Latency 90th (ms) | 0.54 |
Latency 99th (ms) | 9.17 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
load | |
---|---|
CPU Usage (%) | 79 |
Cores usage (%) | 2,206 |
Working Set (MB) | 38 |
Private Memory (MB) | 363 |
Start Time (ms) | 0 |
First Request (ms) | 79 |
Requests/sec | 1,205,550 |
Requests | 18,203,349 |
Mean latency (ms) | 0.75 |
Max latency (ms) | 46.42 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 167.86 |
Latency 50th (ms) | 0.38 |
Latency 75th (ms) | 0.45 |
Latency 90th (ms) | 0.60 |
Latency 99th (ms) | 10.20 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
load | |
---|---|
CPU Usage (%) | 78 |
Cores usage (%) | 2,198 |
Working Set (MB) | 38 |
Private Memory (MB) | 363 |
Start Time (ms) | 0 |
First Request (ms) | 99 |
Requests/sec | 1,206,129 |
Requests | 18,211,020 |
Mean latency (ms) | 0.77 |
Max latency (ms) | 41.32 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 167.94 |
Latency 50th (ms) | 0.38 |
Latency 75th (ms) | 0.45 |
Latency 90th (ms) | 0.57 |
Latency 99th (ms) | 10.92 |
Plaintext
.NET 7 - Default
load | |
---|---|
CPU Usage (%) | 93 |
Cores usage (%) | 2,602 |
Working Set (MB) | 38 |
Private Memory (MB) | 370 |
Start Time (ms) | 0 |
First Request (ms) | 78 |
Requests/sec | 10,914,328 |
Requests | 164,806,480 |
Mean latency (ms) | 1.28 |
Max latency (ms) | 61.41 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 1,310.72 |
Latency 50th (ms) | 0.76 |
Latency 75th (ms) | 1.14 |
Latency 90th (ms) | 1.88 |
Latency 99th (ms) | 14.39 |
.NET 7 - EnableAVX2=0 (effectively target x86-64-v2 but allowing VEX encoding)
load | |
---|---|
CPU Usage (%) | 93 |
Cores usage (%) | 2,611 |
Working Set (MB) | 38 |
Private Memory (MB) | 370 |
Start Time (ms) | 0 |
First Request (ms) | 74 |
Requests/sec | 10,979,623 |
Requests | 165,786,328 |
Mean latency (ms) | 1.27 |
Max latency (ms) | 54.10 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 1,320.96 |
Latency 50th (ms) | 0.75 |
Latency 75th (ms) | 1.13 |
Latency 90th (ms) | 1.84 |
Latency 99th (ms) | 14.13 |
.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
load | |
---|---|
CPU Usage (%) | 94 |
Cores usage (%) | 2,637 |
Working Set (MB) | 38 |
Private Memory (MB) | 370 |
Start Time (ms) | 0 |
First Request (ms) | 66 |
Requests/sec | 10,994,770 |
Requests | 165,985,771 |
Mean latency (ms) | 1.35 |
Max latency (ms) | 55.85 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 1,320.96 |
Latency 50th (ms) | 0.75 |
Latency 75th (ms) | 1.14 |
Latency 90th (ms) | 1.98 |
Latency 99th (ms) | 15.22 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
load | |
---|---|
CPU Usage (%) | 92 |
Cores usage (%) | 2,585 |
Working Set (MB) | 38 |
Private Memory (MB) | 370 |
Start Time (ms) | 0 |
First Request (ms) | 71 |
Requests/sec | 10,916,742 |
Requests | 164,843,707 |
Mean latency (ms) | 1.20 |
Max latency (ms) | 51.97 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 1,310.72 |
Latency 50th (ms) | 0.76 |
Latency 75th (ms) | 1.13 |
Latency 90th (ms) | 1.78 |
Latency 99th (ms) | 13.05 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
load | |
---|---|
CPU Usage (%) | 84 |
Cores usage (%) | 2,359 |
Working Set (MB) | 38 |
Private Memory (MB) | 370 |
Start Time (ms) | 0 |
First Request (ms) | 109 |
Requests/sec | 9,972,152 |
Requests | 150,576,179 |
Mean latency (ms) | 1.25 |
Max latency (ms) | 76.28 |
Bad responses | 0 |
Socket errors | 0 |
Read throughput (MB/s) | 1,198.08 |
Latency 50th (ms) | 0.86 |
Latency 75th (ms) | 1.27 |
Latency 90th (ms) | 1.74 |
Latency 99th (ms) | 13.06 |
As per the doc, I propose the minimum required hardware for x86/x64 on .NET should be changed from `x86-64-v1` to `x86-64-v2`.