Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers #68085
Conversation
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details: Do not restrict SIMD registers to only memory that is 16-byte aligned. Motivation: https://godbolt.org/z/eb53xPvYT
@@ -2680,7 +2680,7 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
// The following condition prevents using 16-byte stores when dstRegAddrAlignment is:
// 1) unknown (i.e. dstReg is neither FP nor SP) or
// 2) non-zero (i.e. dstRegAddr is not 16-byte aligned).
const bool hasAvailableSimdReg = isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES);
You will need to adjust LSRA. Otherwise, it won't allocate SIMD register(s) when src/dst doesn't use sp/fp as base register.
Thanks @EgorChesakov. I believe that might be the reason the SPC compilation is failing, although, from what I understand, this is just a heuristic and we may not need to adjust LSRA.
You need to adjust the following logic for InitBlock
runtime/src/coreclr/jit/lsraarmarch.cpp
Lines 632 to 636 in 2d4f2d0
if (isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES))
{
    // For larger block sizes CodeGen can choose to use 16-byte SIMD instructions.
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}
and for CopyBlock
runtime/src/coreclr/jit/lsraarmarch.cpp
Lines 711 to 720 in 2d4f2d0
bool canUse16ByteWideInstrs = isSrcAddrLocal && isDstAddrLocal && (size >= 2 * FP_REGSIZE_BYTES);

// Note that the SIMD registers allocation is speculative - LSRA doesn't know at this point
// whether CodeGen will use SIMD registers (i.e. if such instruction sequence will be more optimal).
// Therefore, it must allocate an additional integer register anyway.
if (canUse16ByteWideInstrs)
{
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
    buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}
Force-pushed from b1d2ff0 to 0866d29: Do not restrict SIMD registers to only memory that is 16-byte aligned. Motivation: https://godbolt.org/z/eb53xPvYT
Force-pushed from 0866d29 to c96bcf6.
The (code size) diffs look very promising. I am inclined to take this PR and monitor how the MicroBenchmarks perform. If there are just a handful of cases that are negatively impacted, we can reconsider the heuristics.

Summary of Code Size diffs (total bytes of base, overridden on cmd):
benchmarks.run.windows.arm64.checked.mch: 11793820
coreclr_tests.pmi.windows.arm64.checked.mch: 121897680
libraries.crossgen2.windows.arm64.checked.mch: 48036464
libraries.pmi.windows.arm64.checked.mch: 60496980
libraries_tests.pmi.windows.arm64.checked.mch: 136917816

Full report: https://dev.azure.com/dnceng/public/_build/results?buildId=1724373&view=ms.vss-build-web.run-extensions-tab
@dotnet/jit-contrib
ping.
Improvements: dotnet/perf-autofiling-issues#4992
Do not restrict SIMD registers to only memory that is 16-byte aligned.
This is an experiment to see how many cases we miss out on with today's restriction.
Motivation: https://godbolt.org/z/eb53xPvYT
Related discussion: #67326 (comment)