Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

tannergooding · 2023-07-17T23:36:03Z

Various pessimizations around the kmask handling were found which were causing suboptimal performance numbers for various algorithms updated to support V512.

To address this, the PR ensures that kmask registers are properly supported in the register allocator and that the core patterns necessary are recognized around them to get efficient codegen.

The asmdiffs, particularly for Windows x64 are very positive.

The TP impact, however, is not great and comes from LSRA needing additional checks in many paths to handle the new register kind. I don't think there is a trivial fix for this as an additional register file simply requires more checks/handling.

The Arm64 TP impact comes from the fix to calleeSaveRegs/callerSaveRegs called out here: https://github.com/dotnet/runtime/pull/89059/files#r1269918053. Namely they were doing an IsType rather than a UsesReg check and were technically returning the wrong CALLEE_SAVED set for TYP_STRUCT. Its unclear why this is viewed as more expensive given that the two checks are doing similar checks (IsType does ((varTypeClassification[TypeGet(vt)] & (VTF_INT)) != 0);, while UsesReg does varTypeRegister[TypeGet(vt)] == VTR_INT. This is really just the difference between test; jcc vs cmp; jcc

…en around them

tannergooding · 2023-07-20T20:07:27Z

src/coreclr/jit/lsrabuild.cpp

+#if defined(TARGET_XARCH)
+    // xarch has mask registers available when EVEX is supported
+
+    if (compiler->canUseEvexEncoding())
+    {
+        for (unsigned int i = 0; i < lsraRegOrderMskSize; i++)
+        {
+            regNumber  reg  = lsraRegOrderMsk[i];
+            RegRecord* curr = &physRegs[reg];
+            curr->regOrder  = (unsigned char)i;
+        }
+    }
+#endif // TARGET_XARCH


Much of the TP regression comes from this. LSRA is very sensitive to the additional cost from building these extra 8 registers.

tannergooding · 2023-07-20T20:10:28Z

src/coreclr/jit/lsrabuild.cpp

+#if defined(TARGET_XARCH)
+        killMask &= ~RBM_MSK_CALLEE_TRASH;
+#endif // TARGET_XARCH


Perhaps interestingly, having this vs not having it makes a difference to the integer registers allocated.

That, in general, seems odd and one would expect that the kill mask including or excluding floating/mask registers shouldn't impact how integer registers are allocated (particularly when there are no floating-point or mask registers actually in use).

src/coreclr/jit/lsra.h

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

tannergooding · 2023-07-20T20:18:02Z

src/coreclr/jit/targetamd64.h

@@ -92,7 +92,9 @@
  #define REG_MASK_FIRST           REG_K0
  #define REG_MASK_LAST            REG_K7

-  #define RBM_ALLMASK              RBM_K1
+  #define RBM_ALLMASK_INIT         (0)
+  #define RBM_ALLMASK_EVEX         (RBM_K1 | RBM_K2 | RBM_K3 | RBM_K4 | RBM_K5 | RBM_K6 | RBM_K7)


Notably this doesn't include K0 because it is a bit special.

In particular, while K0 can be used as an input or an output of many of the kmask instructions, it cannot be used as a predicate to many of the SIMD instructions because it instead represents "no predication".

We should likely ensure K0 can still be used longer term, but for this PR it's unnecessary.

tannergooding · 2023-07-20T20:19:06Z

src/coreclr/jit/morph.cpp

+        case NI_AVX512F_Add:
+        case NI_AVX512BW_Add:
+        case NI_AVX512F_And:
+        case NI_AVX512DQ_And:
+        case NI_AVX512F_AndNot:
+        case NI_AVX512DQ_AndNot:
+        case NI_AVX512F_Or:
+        case NI_AVX512DQ_Or:
+        case NI_AVX512F_Xor:
+        case NI_AVX512DQ_Xor:


This covers the core bitwise patterns around kmask registers. It notably doesn't factor in knot, kshift, or kxnor but those are somewhat less common and can be handled in the future.

tannergooding · 2023-07-20T20:19:50Z

src/coreclr/jit/valuenum.cpp

+            // We want to ensure that we get a TYP_MASK local to
+            // ensure the relevant optimizations can kick in


This ensures we don't typically end up with a local hiding the ConvertMaskToVector and results in overall better codegen since most cases allow it to be elided.

And this is not needed for ConvertVectorToMask?

Right. We just care about seeing the TYP_MASK node since it needs special handling. TYP_SIMD nodes are already well handled throughout.

…n't happen for floating-point

tannergooding · 2023-07-22T18:40:41Z

Do you mind sending a separate PR for this change? It is unrelated to the kmask register change and it will be easier to review and spot that asmdiff/TP is not impacted for arm64 because of this change.

I extracted it to an isolated PR, but its worth noting it is not completely unrelated to the kmask register change. We need to check and handle if the register type is TYP_MASK in these methods.

kunalspathak

Just to confirm, the TP regression is coming from creating the register mask registers (as you pointed out), but its impact should be similar on Minopts vs. Regular. Do we know why MinOpts shows higher TP regression? I am also don't understand why replacing the varTypeIsFloating with varTypeUsesIntReg would cause it. Did you find out the assembly difference between the two?

(numbers are similar for linux/x64):

Also, do we know why asmdiffs for linux/x64 is way less (858 bytes) than windows/x64 (11K bytes)?

Could you confirm why clang shows TP regression on arm64?

kunalspathak · 2023-07-24T02:22:26Z

src/coreclr/jit/typelist.h

-DEF_TP(MASK     ,"mask"   , TYP_MASK,     8, 8,  8,   2, 8, VTR_MASK,  availableMaskRegs,   VTF_ANY)
+DEF_TP(SIMD32   ,"simd32" , TYP_SIMD32,  32,32, 32,   8,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED,    RBM_FLT_CALLEE_TRASH,    VTF_S|VTF_VEC)
+DEF_TP(SIMD64   ,"simd64" , TYP_SIMD64,  64,64, 64,  16,16, VTR_FLOAT, availableDoubleRegs, RBM_FLT_CALLEE_SAVED,    RBM_FLT_CALLEE_TRASH,    VTF_S|VTF_VEC)
+DEF_TP(MASK     ,"mask"   , TYP_MASK,     8, 8,  8,   2, 8, VTR_MASK,  availableMaskRegs,   RBM_MSK_CALLEE_SAVED,    RBM_MSK_CALLEE_TRASH,    VTF_S)


Earlier, was this mistakenly marked as VTF_ANY instead of VTF_S?

Yes. It is strictly a struct and should be involved in struct like copying and other optimizations.

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

kunalspathak · 2023-07-24T02:51:57Z

src/coreclr/jit/lowerxarch.cpp

@@ -1927,11 +2002,16 @@ GenTree* Lowering::LowerHWIntrinsicCmpOp(GenTreeHWIntrinsic* node, genTreeOps cm

                        default:
                        {
-                            unreached();
+                            maskIntrinsicId = NI_AVX512F_NotMask;


can you elaborate how all the default cases sets NI_AVX512F_NotMask? Can we add some assert to make sure that we are setting this for correct intrinsicid?

There's already a couple lengthy comment above this code elaborating on the overall needs and the needs of this path in particular.

For the cases where we have a partial mask, we need to invert the comparison. This is because matches become 1 and bits are otherwise 0, including any bits that are unused in the comparison.

Consider for example if we had Vector128<int> and so we have 4-bits in the mask, but the mask instruction always checks at least 8-bits. If we did CompareEqual and all elements matched, we'd still get 0b0000_1111 and so we couldn't easily check "are all matching". However, if we invert the comparison and do CompareNotEqual we get 0b0000_0000 for all matches and can then trivially check "are all matching".

This is meant to handle the general case of "other mask intrinsic" where we don't have a well known inverse. Notably I think I forgot a bit of code here, we need to either invert just the lower n bits or always set the upper 4-bits. Will fix to ensure just that is happening

src/coreclr/jit/lowerxarch.cpp

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Packed.cs

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs

src/coreclr/jit/lsrabuild.cpp

kunalspathak · 2023-07-24T03:30:58Z

src/coreclr/jit/lsrabuild.cpp

@@ -3968,6 +3998,12 @@ int LinearScan::BuildReturn(GenTree* tree)
                            {
                                buildInternalIntRegisterDefForNode(tree, dstRegMask);
                            }
+#if defined(TARGET_XARCH) && defined(FEATURE_SIMD)
+                            else if (varTypeUsesMaskReg(dstType))


Should this check be guarded for AVX512 platforms only to reduce the impact of TP on non-AVX512 xarch platforms?

varTypeUsesMaskReg is basically doing a dstType == TYP_MASK check (and we could actually optimize it to that given we only have the 1 mask type today).

Checking comp->canUseEvexEncoding would just add an additional branch and not really save us much (we'd be trading one comparison/branch for another). It may even and potentially slow down the actual execution due to more branches being in a small code window which can pessimize the branch predictor.

sure, that's what I thought, because it is still a runtime check and will cost us. In LSRA, I think this check would have increased the TP regression.

Right. It will increase the cost to get a floating-point internal register. But, that cost basically "has" to exist; we're either going to have else if (comp->canUseEvexEncoding && varTypeUsesMaskReg(dstType)) -or- else if (varTypeUsesMaskReg(dstType)) -or- else if (varTypeUsesFloatReg(dstType))

The first one is a memory access, branch, comparison, and branch

The second one is currently just a comparison and branch.

The third is currently a memory access, comparison, and branch

So what we have currently is the cheapest we can make it.

We could potentially have a small lookup table that maps type to a fnptr representing the buildInternalReg call. However, that's potentially more expensive in terms of actual execution cost.

I'd think there are better ways to win the TP back instead.

tannergooding · 2023-07-24T15:18:03Z

Just to confirm, the TP regression is coming from creating the register mask registers (as you pointed out), but its impact should be similar on Minopts vs. Regular. Do we know why MinOpts shows higher TP regression? I am also don't understand why replacing the varTypeIsFloating with varTypeUsesIntReg would cause it. Did you find out the assembly difference between the two?

They won't and shouldn't be similar as the number of total instructions executed for minopts vs fullopts has never been close. fullopts is itself executing a lot more overall code to do optimizations, CSE, folding, inlining, etc. So the number of "
"new" instructions introduced to LSRA represents an overall smaller percentage of the work done.

You can see the following, for example, where fullopts is executing 183.66x the number of instructions:

Context	Collection	Base # instructions	Diff # instructions	PDIFF
minopts	libraries.pmi.windows.x64.checked.mch	1,446,462,124	1,465,926,185	+1.35%
fullopts	libraries.pmi.windows.x64.checked.mch	265,659,020,750	266,361,349,309	+0.26%

tannergooding · 2023-07-24T15:28:50Z

Also, do we know why asmdiffs for linux/x64 is way less (858 bytes) than windows/x64 (11K bytes)?

Linux x64 doesn't include aspnet while Windows x64 does. For Windows this represents -10,803 bytes of asm.

Could you confirm why clang shows TP regression on arm64?

TP is a measurement of instructions executed, not strictly a measurement of perf. Clang prefers to emit more instructions by default, generating code that it believes will take less time to execute and so it will almost always show a worse TP regression than MSVC. In practice the code will take similar amounts of time to execute end to end. There are a great number of scenarios where longer codegen (more instructions) are better than shorter codegen (less instructions), as well as many scenarios where a change may not be on a hot path, may get removed via PGO, may actually execute more code but result in faster execution time, etc. The only way to know for sure if a change like this is good or bad is to actually measure the average execution time for various minopts vs fullopts scenarios.

kunalspathak · 2023-07-24T18:04:48Z

src/coreclr/jit/lsra.h

    }

+    // Not all of the callee trash values are constant, so don't declare this as a method local static


not for this PR, but should we refactor this at other places as well, if similar situation exists?

The other places are currently all constant, from what I saw at least. If we have any other non-constant places introduced in this PR then we should update them

kunalspathak · 2023-07-24T18:44:30Z

so it will almost always show a worse TP regression than MSVC

I meant that for arm64, this should have shown improvement regardless of msvc or clang, and wanted to see the diff (if possible) to understand why arm64 has regression only on clang.

tannergooding · 2023-07-24T18:52:43Z

I meant that for arm64, this should have shown improvement regardless of msvc or clang, and wanted to see the diff (if possible) to understand why arm64 has regression only on clang.

That's making an assumption that Clang will generate the same code as MSVC, when they can frequently differ. In this case it was because the varTypeCalleeTrashRegs wasn't constant due to RBM_FLT_CALLEE_TRASH and RBM_MSK_CALLEE_TRASH weren't constant and the number of instructions required to support that is significantly more when using Clang.

This should be resolved now that I've changed it to be an array that's initialized when the LSRA is created instead.

kunalspathak · 2023-07-25T12:21:23Z

Seems there is a new test failure:

Assert failure(PID 4303 [0x000010cf], Thread: 4380 [0x111c]): Assertion failed '(reinterpret_cast<size_t>(addr) & (byteSize - 1)) == 0' in 'System.Linq.Enumerable:Sum[long,long](System.Collections.Generic.IEnumerable`1[long]):long' during 'Emit code' (IL size 92; hash 0x698aae1b; Tier1)

    File: /__w/1/s/src/coreclr/jit/emitxarch.cpp Line: 14085
    Image: /datadisks/disk1/work/A3DE08FB/p/dotnet

tannergooding · 2023-07-25T15:18:50Z

Seems there is a new test failure:

Yes, I'm looking into it, but I'm not convinced its related. It also showed up on an previously passing commit and there hasn't been any change to the constant handling for the code bit in question (the only change was from using vpand to using vptest, both of which expect a 32-byte aligned constant).

tannergooding · 2023-07-25T16:42:41Z

Found the issue. There was an edge case under DPGO where we were getting a op_equality(and(x, cns), zero) and where the and(x, cns) had been transformed into and(x, broadcast(cns)) to take advantage of embedded broadcast. In that scenario, we shouldn't transform to ptest(x, broadcast(cns)) but should rather keep it is tmp = and(x, cns); ptest(tmp, tmp) (as that is better than tmp = broadcast(cns); ptest(x, tmp))

…cast

kunalspathak

LGTM

Allow multiple kmask registers to be allocated and cleanup some codeg…

f81815f

…en around them

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 17, 2023

ghost assigned tannergooding Jul 17, 2023

tannergooding added 2 commits July 18, 2023 08:59

Apply formatting patch

2452f49

Fix an assert to include TYP_STRUCT

28dda4a

build-analysis bot mentioned this pull request Jul 18, 2023

LibraryImportGenerator.Unit.Tests crashed in CI #87951

Closed

tannergooding added 2 commits July 18, 2023 19:48

Merge remote-tracking branch 'dotnet/main' into kmask

692c8b4

Ensure kmask registers aren't in the default killset

47827c4

build-analysis bot mentioned this pull request Jul 19, 2023

Test build fails with "libSystem.IO.Ports.Native.so.dbg' to be packed was not found on disk." #89146

Closed

Apply formatting patch

442e26d

build-analysis bot mentioned this pull request Jul 19, 2023

[mono] Build failure: System.IO.Ports.Native.so.dbg not found on disk #89201

Closed

tannergooding added 2 commits July 19, 2023 19:57

Move the kmask optimizations up to morph

d1ef0ea

Merge remote-tracking branch 'dotnet/main' into kmask

f78a16e

build-analysis bot mentioned this pull request Jul 20, 2023

Tracking issue for CI build timeouts #76454

Closed

tannergooding added 3 commits July 20, 2023 09:01

Ensure unique VN for ConvertMaskToVector

80a565f

Ensure some basic other handling for kmask testing is handled

4b73427

Improve the implementation for some managed Vector512 code paths

1e45fdd

tannergooding force-pushed the kmask branch from d052631 to 1e45fdd Compare July 20, 2023 19:21

Apply formatting patch

31b0893

tannergooding commented Jul 20, 2023

View reviewed changes

src/coreclr/jit/lsra.h Show resolved Hide resolved

tannergooding commented Jul 20, 2023

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs Outdated Show resolved Hide resolved

tannergooding commented Jul 20, 2023

View reviewed changes

tannergooding added 3 commits July 20, 2023 13:53

Ensure that the knot intrinsic is inserted into the IR

dfe3e31

Apply formatting patch

75d4d25

Ensure the conversion of CompareEqualMask(x, zero) to Test(x, x) does…

07bfc1f

…n't happen for floating-point

tannergooding marked this pull request as ready for review July 21, 2023 20:09

Have callee/callerSaveRegs() use an array based lookup

136e898

kunalspathak requested changes Jul 24, 2023

View reviewed changes

ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 24, 2023

kunalspathak reviewed Jul 24, 2023

View reviewed changes

tannergooding force-pushed the kmask branch 2 times, most recently from a89d145 to 14f269b Compare July 25, 2023 01:39

Respond to PR feedback and try to reduce TP regression more

8a0c9a3

tannergooding force-pushed the kmask branch from 14f269b to 8a0c9a3 Compare July 25, 2023 02:20

Merge remote-tracking branch 'dotnet/main' into kmask

370e95b

tannergooding force-pushed the kmask branch from dc16632 to e8b8301 Compare July 25, 2023 17:29

Ensure PTEST doesn't try to handle something utilizing embedded broad…

43ad9a0

…cast

tannergooding force-pushed the kmask branch from e8b8301 to 43ad9a0 Compare July 25, 2023 17:30

kunalspathak approved these changes Jul 25, 2023

View reviewed changes

tannergooding merged commit 90df613 into dotnet:main Jul 25, 2023
197 checks passed

tannergooding deleted the kmask branch July 25, 2023 20:14

build-analysis bot mentioned this pull request Jul 25, 2023

Test failure profiler/gc/nongcheap/nongcheap.sh #88507

Closed

BruceForstall mentioned this pull request Jul 28, 2023

Add opmask (k) registers to the register allocator #80823

Closed

cincuranet mentioned this pull request Aug 1, 2023

[Perf] Linux/x64: 3 Regressions on 7/26/2023 12:02:55 AM #89796

Closed

AndyAyersMS mentioned this pull request Aug 1, 2023

[Perf] Linux/x64: 2 Improvements on 7/26/2023 12:02:55 AM dotnet/perf-autofiling-issues#20134

Closed

cincuranet mentioned this pull request Aug 8, 2023

[Perf] Linux/x64: 2 Regressions on 7/26/2023 12:02:55 AM dotnet/perf-autofiling-issues#20438

Closed

ghost locked as resolved and limited conversation to collaborators Aug 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

tannergooding commented Jul 17, 2023 •

edited

Loading

tannergooding Jul 20, 2023

tannergooding Jul 20, 2023

tannergooding Jul 20, 2023

tannergooding Jul 20, 2023

tannergooding Jul 20, 2023 •

edited

Loading

kunalspathak Jul 22, 2023

tannergooding Jul 22, 2023

tannergooding commented Jul 22, 2023

kunalspathak left a comment

kunalspathak Jul 24, 2023

tannergooding Jul 24, 2023

kunalspathak Jul 24, 2023

tannergooding Jul 24, 2023

kunalspathak Jul 24, 2023

tannergooding Jul 24, 2023

kunalspathak Jul 24, 2023

tannergooding Jul 24, 2023

tannergooding Jul 24, 2023

tannergooding commented Jul 24, 2023

tannergooding commented Jul 24, 2023

kunalspathak Jul 24, 2023 •

edited

Loading

tannergooding Jul 24, 2023

kunalspathak commented Jul 24, 2023

tannergooding commented Jul 24, 2023

kunalspathak commented Jul 25, 2023

tannergooding commented Jul 25, 2023

tannergooding commented Jul 25, 2023

kunalspathak left a comment

		// We want to ensure that we get a TYP_MASK local to
		// ensure the relevant optimizations can kick in

		}

		// Not all of the callee trash values are constant, so don't declare this as a method local static

Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

Allow multiple kmask registers to be allocated and cleanup some codegen around them #89059

Conversation

tannergooding commented Jul 17, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding Jul 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding commented Jul 22, 2023

kunalspathak left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding commented Jul 24, 2023

tannergooding commented Jul 24, 2023

kunalspathak Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kunalspathak commented Jul 24, 2023

tannergooding commented Jul 24, 2023

kunalspathak commented Jul 25, 2023

tannergooding commented Jul 25, 2023

tannergooding commented Jul 25, 2023

kunalspathak left a comment

Choose a reason for hiding this comment

tannergooding commented Jul 17, 2023 •

edited

Loading

tannergooding Jul 20, 2023 •

edited

Loading

kunalspathak Jul 24, 2023 •

edited

Loading