Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers #68085
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details: Do not restrict SIMD registers to only memory that is 16-byte aligned. Motivation: https://godbolt.org/z/eb53xPvYT
@@ -2680,7 +2680,7 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
// The following condition prevents using 16-byte stores when dstRegAddrAlignment is:
// 1) unknown (i.e. dstReg is neither FP nor SP) or
// 2) non-zero (i.e. dstRegAddr is not 16-byte aligned).
const bool hasAvailableSimdReg = isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES);
You will need to adjust LSRA. Otherwise, it won't allocate SIMD register(s) when src/dst doesn't use sp/fp as base register.
Thanks @EgorChesakov. I believe that might be the reason the SPC compilation is failing, although, from what I understand, this is just a heuristic and we may not need to adjust LSRA.
You need to adjust the following logic for InitBlock
runtime/src/coreclr/jit/lsraarmarch.cpp
Lines 632 to 636 in 2d4f2d0
if (isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES))
{
    // For larger block sizes CodeGen can choose to use 16-byte SIMD instructions.
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}
and for CopyBlock
runtime/src/coreclr/jit/lsraarmarch.cpp
Lines 711 to 720 in 2d4f2d0
bool canUse16ByteWideInstrs = isSrcAddrLocal && isDstAddrLocal && (size >= 2 * FP_REGSIZE_BYTES);

// Note that the SIMD registers allocation is speculative - LSRA doesn't know at this point
// whether CodeGen will use SIMD registers (i.e. if such instruction sequence will be more optimal).
// Therefore, it must allocate an additional integer register anyway.
if (canUse16ByteWideInstrs)
{
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}
Force-pushed from b1d2ff0 to 0866d29: Do not restrict SIMD registers to only memory that is 16-byte aligned. Motivation: https://godbolt.org/z/eb53xPvYT
Force-pushed from 0866d29 to c96bcf6.
The (code size) diffs look very promising. I am inclined to take this PR and monitor how the MicroBenchmarks perform. If there are just a handful of cases that are negatively impacted, we can reconsider the heuristics.

Summary of Code Size diffs (total bytes of base, overridden on cmd):
benchmarks.run.windows.arm64.checked.mch: 11793820
coreclr_tests.pmi.windows.arm64.checked.mch: 121897680
libraries.crossgen2.windows.arm64.checked.mch: 48036464
libraries.pmi.windows.arm64.checked.mch: 60496980
libraries_tests.pmi.windows.arm64.checked.mch: 136917816

Full report: https://dev.azure.com/dnceng/public/_build/results?buildId=1724373&view=ms.vss-build-web.run-extensions-tab
@dotnet/jit-contrib
ping.
Improvements: dotnet/perf-autofiling-issues#4992
Do not restrict SIMD registers to only memory that is 16-byte aligned.
This is an experiment to see how many cases we miss out on with today's restriction.
Motivation: https://godbolt.org/z/eb53xPvYT
Related discussion: #67326 (comment)