
Performance degradation in .Net 9 #108058

Open
iSazonov opened this issue Sep 20, 2024 · 23 comments
@iSazonov
Contributor

Description

I ran a performance comparison of my simple project and was surprised that the code runs noticeably slower on .NET 9.0 than on .NET 8.0 (35 ns vs 28 ns on my notebook).
Moreover, the code (not mine) that I used as a baseline (BinaryTrieBase) performs almost the same on both runtimes, but my code is slower.
The project:
BinaryTrie.zip
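For readers who don't open the attachment: a plausible shape for the benchmarked lookup is a tight, branchy walk over the 32 bits of an IPv4 address. The sketch below is purely illustrative, with hypothetical type and member names; the real implementation is in BinaryTrie.zip.

```csharp
// Hypothetical sketch of a longest-prefix-match lookup over a binary trie.
// Names (TrieNode, IPBinaryTrie, Lookup) are placeholders, not the attached code.
public sealed class TrieNode
{
    public TrieNode? Zero;     // child for a 0 bit
    public TrieNode? One;      // child for a 1 bit
    public bool IsTerminal;    // true if a stored prefix ends at this node
}

public sealed class IPBinaryTrie
{
    private readonly TrieNode _root = new();

    public bool Lookup(uint address)
    {
        TrieNode? node = _root;
        bool match = false;

        // Consume one bit per level, most significant bit first.
        for (int bit = 31; bit >= 0 && node is not null; bit--)
        {
            match |= node.IsTerminal;
            node = ((address >> bit) & 1) == 0 ? node.Zero : node.One;
        }

        return match || (node is not null && node.IsTerminal);
    }
}
```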

Regression?

Yes.

Data

BenchmarkDotNet v0.14.0, Windows 10 (10.0.17763.6054/1809/October2018Update/Redstone5)
Intel Core i5-7400 CPU 3.00GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-AXRZBO : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-TZZJOW : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2

IterationCount=32  

| Method              | Runtime  | Mean     | Error    | StdDev   | Ratio | Allocated | Alloc Ratio |
|-------------------- |--------- |---------:|---------:|---------:|------:|----------:|------------:|
| BinaryTrieBool      | .NET 8.0 | 27.94 ns | 0.136 ns | 0.211 ns |  0.40 |         - |          NA |
| BinaryTrieIPAddress | .NET 8.0 | 28.31 ns | 0.169 ns | 0.252 ns |  0.41 |         - |          NA |
| BinaryTrieBase      | .NET 8.0 | 69.29 ns | 0.392 ns | 0.599 ns |  1.00 |         - |          NA |
| BinaryTrieBool      | .NET 9.0 | 35.24 ns | 0.173 ns | 0.269 ns |  0.51 |         - |          NA |
| BinaryTrieIPAddress | .NET 9.0 | 35.36 ns | 0.132 ns | 0.202 ns |  0.51 |         - |          NA |
| BinaryTrieBase      | .NET 9.0 | 70.04 ns | 0.172 ns | 0.236 ns |  1.01 |         - |          NA |
@iSazonov iSazonov added the tenet-performance Performance related issue label Sep 20, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 20, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 20, 2024
@vcsjones vcsjones added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 20, 2024
@EgorBo

This comment was marked as resolved.

@iSazonov
Contributor Author

Looking at the bot results, I guess it is something specific to my CPU, so I am sharing the disasm results.
results.zip

@JulieLeeMSFT
Member

@EgorBo, please review the disasm from @iSazonov.

Looking at the bot results, I guess it is something specific to my CPU, so I am sharing the disasm results. results.zip

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Sep 23, 2024
@JulieLeeMSFT
Member

@amanasifkhalid PTAL.

@amanasifkhalid
Member

In the optimized .NET 9 codegen for IPBinaryTrie:Lookup, we get pretty unlucky with hitting the JCC erratum mitigation. There are two problematic jumps:

G_M12859_IG04:  ;; offset=0x003C
       cmp      r10d, 16
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 0 ; jcc erratum) 32B boundary ...............................
       ja       G_M12859_IG21
       cmp      gword ptr [rdx+0x08], 0
       jne      G_M12859_IG18
G_M12859_IG10:  ;; offset=0x007F
       test     rdx, rdx
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (test: 2 ; jcc erratum) 32B boundary ...............................
       je       SHORT G_M12859_IG14

The latter is particularly painful, since it's inside a loop body. For .NET 8, we're comparatively lucky -- I don't see any potential for hitting the JCC erratum mitigation.

@iSazonov I believe the JCC erratum affects Kaby Lake, which is why you're seeing a performance hit locally but not with EgorBot. Could you please try re-running the benchmark with the environment variable DOTNET_JitDoReversePostOrderLayout=0 set, and report back here? This will force the .NET 9 JIT to use the same code layout strategy as .NET 8.
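If it's easier, the flag can also be set per job from BenchmarkDotNet instead of globally in the shell. The sketch below is a minimal config assuming BenchmarkDotNet 0.14 APIs; the benchmark class name is a placeholder for whatever the attached project uses.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

public class LayoutComparisonConfig : ManualConfig
{
    public LayoutComparisonConfig()
    {
        // .NET 8 baseline.
        AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).AsBaseline());

        // .NET 9 with the default (new) block layout.
        AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithId("net9"));

        // .NET 9 forced back to the .NET 8 layout strategy; the environment
        // variable applies only to this job's child benchmark processes.
        AddJob(Job.Default
            .WithRuntime(CoreRuntime.Core90)
            .WithEnvironmentVariable("DOTNET_JitDoReversePostOrderLayout", "0")
            .WithId("net9-old-layout"));
    }
}

[Config(typeof(LayoutComparisonConfig))]
public class TrieBenchmarks   // placeholder name
{
    // [Benchmark] methods from the BinaryTrie project would go here.
}
```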

@iSazonov
Contributor Author

iSazonov commented Oct 1, 2024

@amanasifkhalid Thanks! It seems DOTNET_JitDoReversePostOrderLayout=0 has no effect.
results.zip

@amanasifkhalid
Member

@iSazonov thanks for trying that out. Looking at the codegen with DOTNET_JitDoReversePostOrderLayout=0, we're still hitting the JCC erratum penalty in IPBinaryTrie:Lookup, though the problematic jumps are no longer in the loop body, which sounds like an improvement we'd expect to notice if the JCC erratum is the culprit. Then again, we're only looking at one method -- disabling the new code layout strategy might affect codegen for all other methods with nontrivial control flow, so we might have added and removed other instances of JCC erratum penalties, hence the lack of improvement you saw.

The JIT does have a mechanism for mitigating JCC erratum penalties in loops by adding additional loop padding. Currently, this is only available in Debug/Checked builds of the JIT, and it requires us to disable adaptive loop alignment for all methods, which might muddle improvements. But if this mitigation seems to help, I can tweak the loop alignment configuration to allow us to enable the mitigation on a per-method basis. @iSazonov if I share a Checked build of the .NET 9 RC1 JIT with you, would you be willing to try out the mitigation on your machine? I'm afraid I cannot repro the JCC erratum penalty locally. Thanks!

@iSazonov
Contributor Author

iSazonov commented Oct 2, 2024

if I share a Checked build of the .NET 9 RC1 JIT with you, would you be willing to try out the mitigation on your machine?

Yes, I can. I hope this helps the team improve the JIT.

I found that many processor families are affected by this problem. I feel the number of such processors in operation worldwide is simply huge. So I would like to get a solution, if not in the release, then in 9.0.1.

Moreover, in my test the degradation is ~20%. How much is the gain from the new optimizations? 1%? 5%? In other words, don't we lose more than we gain in real scenarios? It may be worth making these optimizations explicitly opt-in until there is a better, complete solution.

Perhaps related:
#74771
#35730

@amanasifkhalid
Member

amanasifkhalid commented Oct 2, 2024

Moreover, in my test the degradation is ~20%. How much is the gain from the new optimizations? 1%? 5%? In other words, don't we lose more than we gain in real scenarios? It may be worth making these optimizations explicitly opt-in until there is a better, complete solution.

The potential performance benefit from the new block layout strategy depends on the specific example; for the benchmark you shared, I looked at the old and new layouts, and the hottest paths are identical, so if we do see improvement for this example, it'll be from other .NET 9 enhancements.

As far as I can tell, the new block layout isn't any more prone to triggering the JCC erratum penalty than the old one, and vice versa. Neither one tries to explicitly avoid JCC erratum conditions when reordering blocks, because it's too early in the JIT to know the actual code offsets of the blocks. The tricky thing is that even if the block layout of a method doesn't change, increases/reductions in code size from unrelated optimizations can push/pull a jump to a cache boundary, thus triggering the erratum mitigation. And regardless of the block layout strategy, changes in profile data from run to run can result in different block layouts, which may also trigger the erratum mitigation. Part of the reason we've seen an uptick in JCC erratum-related regressions is that the new layout strategy has churned users' code enough to create new JCC erratum mitigation sites -- this churn has likely also removed many instances of JCC erratum penalties, but I wouldn't expect customers to report those improvements.

Because of these factors, I don't see justification for disabling the new block layout on the basis of avoiding JCC erratum penalties. However, I think we are justified in trying to enable a compiler-level mitigation in product builds, similar to what MSVC has done. The JIT team has previously considered productizing this (#93243), though one of the tricky parts is determining if the user's CPU is affected by the JCC erratum, and as far as I know, we don't have the ability to detect the user's CPU model during jitting; since a compiler-level mitigation would increase code size, we'd prefer to not enable this for all x64 codegen scenarios.

I apologize for the inconvenience the JCC erratum has caused you and other users. Internally, we've faced plenty of similar headaches when triaging benchmark regressions on Intel machines, and since we're seeing users' code affected by it too, I think we're justified in exposing some JIT configurations to mitigate this. cc @dotnet/jit-contrib for other opinions.

Yes, I can. I hope this helps the team improve the JIT.

Thanks! I've attached a zip of the Checked JIT build below. You can patch this into your RC1 installation by replacing its current clrjit.dll with the one below; for me, my install location is C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.0-rc.1.24431.7\clrjit.dll. I recommend you save the current clrjit.dll somewhere else, so you can restore it after. Once you've done this, set the environment variable DOTNET_JitAlignLoopForJcc="Lookup"; this will enable the compiler-level mitigation in loop bodies for IPBinaryTrie:Lookup. Note that setting this will disable adaptive loop alignment, causing us to fall back to a less performant loop alignment implementation, so I recommend you only enable this for methods known to be impacted for now. For me, the mitigation slightly regresses performance, but my machine isn't impacted by the JCC erratum. Looking at the disassembly, I can verify the JCC erratum site in the loop body is gone.

clrjit.zip

Please let me know if you'd like any clarification.

@jkotas
Member

jkotas commented Oct 2, 2024

we don't have the ability to detect the user's CPU model during jitting

That's not correct. We pass a lot of information about the current CPU model over the JIT/EE interface today, so passing one more bit for the JCC erratum would not be a big deal.

@EgorBo
Member

EgorBo commented Oct 2, 2024

A good example is how we detect CPUs with "bad" AVX-512: https://github.com/dotnet/runtime/blob/main/src/coreclr/vm/codeman.cpp#L1613
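As an illustration only: in managed code, an equivalent family/model check could look roughly like the sketch below. This is not what the runtime does (the real check would live in the native code linked above), and the model list is a partial, assumed subset of the parts listed in Intel's JCC erratum document.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class JccErratumCheck
{
    // Partial, illustrative list of Intel family-6 model numbers from the JCC erratum
    // document; 0x8E/0x9E cover Kaby Lake parts such as the i5-7400 in this issue.
    private static readonly int[] AffectedModels = { 0x4E, 0x5E, 0x55, 0x8E, 0x9E };

    public static bool IsLikelyAffected()
    {
        if (!X86Base.IsSupported)
            return false;

        // CPUID leaf 1: EAX carries stepping/model/family plus extended fields.
        // (A real check would also verify the vendor string is "GenuineIntel".)
        (int eax, _, _, _) = X86Base.CpuId(1, 0);

        int family = (eax >> 8) & 0xF;
        int model = (eax >> 4) & 0xF;
        if (family == 0x6 || family == 0xF)
            model |= ((eax >> 16) & 0xF) << 4;   // extended model bits
        if (family == 0xF)
            family += (eax >> 20) & 0xFF;        // extended family bits

        return family == 0x6 && Array.IndexOf(AffectedModels, model) >= 0;
    }
}
```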

@amanasifkhalid
Member

I see, thanks for clarifying. So we can easily determine if the target CPU is affected, but the current mitigation is less sophisticated than ideal: If adaptive loop alignment is disabled and DOTNET_JitAlignLoopForJcc is set, then for all loops in the method, we will tweak their alignment regardless of whether there's a jump in the loop that might trigger the erratum. It would be nice to have some detection mechanism, though it seems unlikely that this would make it into .NET 9 at this point. For the time being, I think it would be reasonable to make DOTNET_JitAlignLoopForJcc available in Release builds so we can assist impacted customers. We can then pursue an automatic mitigation for .NET 10.

@kunalspathak
Member

@amanasifkhalid - was the asm diff between .NET 8 and .NET 9 only JCC-related, or were there more diffs that could have contributed to the regression? JCC is not a problem on newer Intel hardware and is not a problem on AMD. @iSazonov - If you don't mind sharing, what hardware are you seeing this on?

@amanasifkhalid
Member

@kunalspathak @iSazonov ran the benchmark on Kaby Lake, which I believe is susceptible to the JCC erratum. For IPBinaryTrie::Lookup, there was a slight diff in block layout between .NET 8 and 9 that is unlikely to overshadow the JCC erratum penalty in the loop; in short, the .NET 9 layout strategy didn't move some blocks with near-zero execution counts all the way to the end of the method because the counts just exceeded the JIT's threshold for "cold," though these blocks were already off the hot path, so I doubt they're the main driver of this regression.

@kunalspathak
Member

kunalspathak commented Oct 2, 2024

Kaby Lake, which I believe is susceptible to the JCC erratum

Ah, just noticed that in the PR description. This jcc doc from Intel confirms that it is affected.

there was a slight diff in block layout between .NET 8 and 9

Ok, so no other diffs other than block layout related changes?

Edit: Just a word of caution, DOTNET_JitAlignLoopForJcc and non-adaptive loop alignment are currently untested and were there for experimental purposes.

@amanasifkhalid
Member

Ok, so no other diffs other than block layout related changes?

Nothing else stuck out to me from the diffs in the benchmark report (zip).

@AndyAyersMS
Member

Edit: Just a word of caution, DOTNET_JitAlignLoopForJcc and non-adaptive loop alignment are currently untested and were there for experimental purposes.

It's not clear to me we actually have an effective mitigation strategy for JCC errata. And as Aman says, the impact of it is somewhat "random".

Ok, so no other diffs other than block layout related changes?

Nothing else stuck out to me from the diffs in the benchmark report (zip).

It might be worth investigating with VTune. I can try this but it may take a few days.

@kunalspathak
Member

It's not clear to me we actually have an effective mitigation strategy for JCC errata.

That's my point. We do not, and DOTNET_JitAlignLoopForJcc was added a while back with the intention of mitigating the JCC erratum, but it was never implemented completely and hence is still experimental.

@iSazonov
Contributor Author

iSazonov commented Oct 3, 2024

@amanasifkhalid Here are the new results for the Checked JIT build. I don't see any change in performance.
results2.zip

@iSazonov
Contributor Author

iSazonov commented Oct 3, 2024

It's not clear to me we actually have an effective mitigation strategy for JCC errata.

I remember that back in the 90s, Intel CPUs were already crashing operating systems due to unforeseen "features" of those processors. Even then, compilers had to compensate for these problems.
So if we talk about strategy, I wouldn't limit it to the JCC errata. The JIT engine should have something more universal/smart to take the "features" of existing and future processors into account.

@iSazonov
Contributor Author

iSazonov commented Oct 7, 2024

@amanasifkhalid
Member

amanasifkhalid commented Oct 11, 2024

@iSazonov thanks for giving the Checked build a try -- I'm sorry to hear it didn't help. It looks like the JIT needs a more sophisticated mitigation to make a difference. Unfortunately, we're at a point in the product cycle where additional feature work won't be included in .NET 9. I'm going to keep this issue open, and re-target for .NET 10; based on feedback from you and other users, I think we're justified in pursuing a mitigation.

So if we talk about strategy, then I wouldn't just talk about JCC errata. JIT engine should have something more universal/smart to take into account the "features" of existing and future processors.

I hear you. To paraphrase @AndyAyersMS, the JIT is quite sophisticated in some places, and immature in others -- users with compiler backgrounds are sometimes surprised to discover the JIT lacks a particular feature. Historically, we've prioritized optimizations that would most benefit .NET customers. Thus, optimizations that reduce the cost of commonly-used language features would typically take precedence over more standard compiler features. We've recently made an effort to catch up in the latter domain (especially in .NET 9, considering the emphasis placed on loop optimizations and code layout), though clearly we still have some holes to patch. When compared to features that benefit all users, processor-specific features typically take lower priority, though the frequency of JCC errata-related regressions impacting our ability to triage changes to block layout incentivizes us to implement a mitigation.

@amanasifkhalid amanasifkhalid modified the milestones: 9.0.0, 10.0.0 Oct 11, 2024
@iSazonov
Contributor Author

In the code I shared above, I manually unrolled the internal loop and got great perf results (see the unrolling sketch after the table below):


BenchmarkDotNet v0.14.0, Windows 10 (10.0.17763.6054/1809/October2018Update/Redstone5)
Intel Core i5-7400 CPU 3.00GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET SDK 9.0.100-rc.2.24474.11
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-OZMANY : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX2
  Job-FKJJVW : .NET 9.0.0 (9.0.24.47305), X64 RyuJIT AVX2

IterationCount=32  

| Method              | Runtime  | Mean     | Error    | StdDev   | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|-------------------- |--------- |---------:|---------:|---------:|------:|--------:|----------:|----------:|------------:|
| BinaryTrieBool      | .NET 8.0 | 19.38 ns | 0.151 ns | 0.235 ns |  0.28 |    0.00 |     213 B |         - |          NA |
| BinaryTrieIPAddress | .NET 8.0 | 18.42 ns | 0.158 ns | 0.247 ns |  0.26 |    0.00 |     232 B |         - |          NA |
| BinaryTrieBase      | .NET 8.0 | 70.32 ns | 0.622 ns | 0.969 ns |  1.00 |    0.02 |     338 B |         - |          NA |
| BinaryTrieBool      | .NET 9.0 | 17.11 ns | 0.181 ns | 0.282 ns |  0.24 |    0.01 |     201 B |         - |          NA |
| BinaryTrieIPAddress | .NET 9.0 | 17.58 ns | 0.165 ns | 0.256 ns |  0.25 |    0.00 |     214 B |         - |          NA |
| BinaryTrieBase      | .NET 9.0 | 68.21 ns | 0.572 ns | 0.891 ns |  0.97 |    0.02 |     333 B |         - |          NA |
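For illustration only, here is what a 4-way manual unrolling of the hypothetical lookup loop sketched near the top of the issue could look like; it is not the actual change, which is in the attached project.

```csharp
// Same hypothetical TrieNode/IPBinaryTrie shapes as in the earlier sketch;
// the loop body is repeated four times so fewer loop back-edges are taken.
public bool LookupUnrolled(uint address)
{
    TrieNode? node = _root;
    bool match = false;

    for (int bit = 31; bit >= 3 && node is not null; bit -= 4)
    {
        match |= node.IsTerminal;
        node = ((address >> bit) & 1) == 0 ? node.Zero : node.One;
        if (node is null) break;

        match |= node.IsTerminal;
        node = ((address >> (bit - 1)) & 1) == 0 ? node.Zero : node.One;
        if (node is null) break;

        match |= node.IsTerminal;
        node = ((address >> (bit - 2)) & 1) == 0 ? node.Zero : node.One;
        if (node is null) break;

        match |= node.IsTerminal;
        node = ((address >> (bit - 3)) & 1) == 0 ? node.Zero : node.One;
    }

    return match || (node is not null && node.IsTerminal);
}
```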
