Improving ARM64 Performance in .NET 7.0 #64820

Closed
20 of 32 tasks
Tracked by #70527
kunalspathak opened this issue Feb 4, 2022 · 10 comments
@kunalspathak
Member

kunalspathak commented Feb 4, 2022

In .NET 7.0, we will continue our efforts to improve Arm64 code quality and close the performance gap with x64. As we did for .NET 5 in #35853, we will track all the Arm64 issues in a single top-level issue.

Moved to Future Work

@kunalspathak added the User Story label Feb 4, 2022
@dotnet-issue-labeler bot added the area-CodeGen-coreclr and untriaged labels Feb 4, 2022
@ghost

ghost commented Feb 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@kunalspathak
Member Author

@dotnet/jit-contrib

@JulieLeeMSFT changed the title from "Improving ARM64 Performance in .NET 7.0 – Closing the gap with x64" to "Improving ARM64 Performance in .NET 7.0" Feb 4, 2022
@JulieLeeMSFT removed the untriaged label Feb 5, 2022
@JulieLeeMSFT added this to the 7.0.0 milestone Feb 5, 2022
@adamsitnik
Member

Based on #67339 I think it would be good to add #62302 to this list.

@kunalspathak
Member Author

> Based on #67339 I think it would be good to add #62302 to this list.

Done. Thanks for preparing the report.

@kunalspathak
Member Author

kunalspathak commented Apr 8, 2022

.NET 7 items:

| Issue | Owner | ETA | Doable in .NET 7 |
| --- | --- | --- | --- |
| Arm64: Use 8.1 atomics #67824 | @kpathak | June '22 | Yes |
| Arm64: Align methods containing loops to 32B #59828 | @kpathak | Done | |
| Loop Alignment support for Arm64 #60135 | @kpathak | Done | |
| Hide 'align' instruction behind jmp #60787 | @kpathak | Done | |
| Arm64: Better addressing mode for float/double array access #64819 | @EgorBo | Done | |
| Correctly get the last level cache size used by GC #60166 | @mangod9 | Done | |
| The thread pool's global queue doesn't scale well on machines with a large processor count #67845 | @mangod9 | Done | Yes |
| Equivalent thread pool change in Kestrel #67845 | @sebastienros | Done | |
| Arm64: Environment.ProcessorCount returns wrong value on higher core machine #67180 (WIP: #68639) | @mangod9 | June '22 | ? |
| Arm64: Revisit the heuristics for IO completion poller threads #67266 | @mangod9 | June '22 | ? |
| Optimize jump stubs on arm64 #62302 | @EgorBo | June '22 | ? |
| ARM64: Optimize a % b operation #34937 | @TIHan | Done | |
| Double constants usage in a loop can be CSEed #35257 (WIP) | @TIHan | June '22 | ? |
| x64 vs ARM64 Microbenchmarks Performance Study Report #67339 | @EgorBo, @kpathak | June '22 | Yes |
| Arm64: Better addressing mode for array access whose elements are accessed byref #67981 | @EgorBo | June '22 | Yes |
| Arm64: Forward memset/memcpy to CRT implementation #67326 | @kpathak | Done | |
| Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers #68085 | @kpathak | Done | |
| Hoisting the invariant out of multi-level nested loops #61420 | @kpathak, @BruceForstall | Done | |
| Arm64: Generate conditional comparison and selection instructions #55364 | @a74nh | WIP: #67894 | Yes |
| Optimize System.Text.ASCIIUtility for arm64 using cross-platform intrinsics #41292 | @a74nh | Done | |
| Optimize System.Buffers for arm64 using cross-platform intrinsics #35033 | @a74nh | Done | |

Stretch goals:

| Issue | Owner | ETA |
| --- | --- | --- |
| [Arm64] Peephole optimization opportunities #55365 | @a74nh | TBD |
| [LSRA] Add support for allocating consecutive registers #39457 | @kpathak | Future |
| Enable multi-register intrinsics support for Arm64 #64921 | @BruceForstall | Future |
| API Proposal : Arm TableVectorLookup and TableVectorExtension intrinsics #1277 | @a74nh | Future |
| jitdump output not accepted by ARM streamline #62456 | @RobertHenry6bev | Future |
| Optimize set_brick code in GC | @Maoni0 | TBD |
| [ARM64] Performance regression: Utf8Encoding #41699 | @a74nh | TBD |
| [ARM64/Linux] Inefficient conditionals branching #12735 | @a74nh | TBD |
| JIT: Redundant fmov's on arm64 for a simple function #58954 | @a74nh | TBD |
| Arm64: Consider using "DC ZVA" instruction #67244 | @kpathak, @a74nh | TBD |
| Review the multi-op instruction usage for Arm64 #68028 | @TIHan | Future |
| Arm64: Evaluate if it is possible to combine subsequent field loads in a single load #64815 (lower priority) | TBD | TBD |
| Arm64: In mod operation happening inside the loop, if divisor is an invariant, hoist the divisor checks #64795 | @TIHan | .NET 8 |

@JulieLeeMSFT
Member

#68028

@kunalspathak
Member Author

> #68028

Included in the table above.

@a74nh
Contributor

a74nh commented Jul 13, 2022

Optimize System.Text.ASCIIUtility for arm64 using cross-platform intrinsics

- Issue: #41292
- PRs: #70080 and #71637
- Approach taken:
  - The existing Sse2/Sse41 implementation was moved to the Vector128 API.
  - Arm64 was switched from the generic vector version to the Vector128 implementation.
  - Where required for performance, the Sse2/Sse41/AdvSimd APIs were used.
- Impact:
  - Small improvement in the relevant microbenchmarks on Arm64.
  - The changes were not significant enough to be picked up by the performance autofiler after the merge.
  - The gain was small because the generic vector version was already simple and performed well.
  - Using the Vector128 API helps reduce code debt.
- Follow-on work: None.

Optimize System.Buffers for arm64 using cross-platform intrinsics

- Issue: #35033
- PRs: #70654 and dotnet/performance#2479
- Approach taken:
  - Fixed an issue where the microbenchmarks used invalid data.
  - The existing Ssse3 implementation was moved to the Vector128 API.
  - Arm64 was switched from the non-vectorised version to the Vector128 implementation.
  - Where required for performance, the Sse3/AdvSimd APIs were used.
- Impact:
  - Large improvement in the relevant microbenchmarks on Arm64.
  - Performance improvements detected by the performance autofiler:
    - dotnet/perf-autofiling-issues#6346
    - dotnet/perf-autofiling-issues#6334
    - dotnet/perf-autofiling-issues#6340
    - dotnet/perf-autofiling-issues#6328
    - dotnet/perf-autofiling-issues#6327
    - dotnet/perf-autofiling-issues#6326
    - dotnet/perf-autofiling-issues#6321
  - Using the Vector128 API helps reduce code debt.
- Follow-on work:
  - Once multi-register instructions (such as LD4) are supported in Vector128, further improvements may be possible by switching to an implementation based on the NEON version of the Aklomp base64 algorithm.

@kunalspathak
Member Author

The only item remaining on the list is #55364, and it is already set for .NET 7. I will move this to .NET 8 to track the remaining work.

@kunalspathak
Member Author

Replaced with #77010

@ghost locked as resolved and limited conversation to collaborators Nov 12, 2022