Implement the last of the approved cross platform hardware intrinsics, except shuffle #63414

tannergooding · 2022-01-05T21:41:10Z

This resolves #20510 and everything except Shuffle from #63331

Shuffle is going to be a separate PR because:

- It is not immediately needed for most of the BCL logic
- The logic to enable it is quite a bit more complicated, as it needs to recognize and specially handle various vector constants for each platform


dotnet-issue-labeler · 2022-01-05T21:41:15Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2022-01-05T21:41:19Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.


Author:	tannergooding
Assignees:	-
Labels:	`area-System.Runtime.Intrinsics`, `new-api-needs-documentation`
Milestone:	-

tannergooding · 2022-01-06T04:12:34Z

This is ready for review.

src/libraries/System.Runtime/ref/System.Runtime.cs

src/coreclr/jit/gentree.cpp

echesakov · 2022-01-12T06:20:27Z

src/coreclr/jit/hwintrinsicarm64.cpp

+            {
+                if (!varTypeIsLong(simdBaseType))
+                {
+                    op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType,


Same comment regarding using AddAcross versus many AddPairwise

This is roughly the same scenario as #63414 (comment). We need a proper sum here that fits in 32 bits.

There are some other potential optimization opportunities here, but they would require handling in lowering and aren't "critical" to getting the feature out, so it's likely better to handle them in a separate PR.

These would involve specially handling the input, based on whether or not it's an intrinsic that is known to produce an all-bits-set vs. no-bits-set per-element result. In that scenario, we can elide the shift in favor of masking the right bits directly and then use two AddPairwise, which may be better.

The same applies to certain codegen for x86/x64 (but not ExtractMostSignificantBits, since that's just MoveMask there), where we could instead optimize to something like Blend.
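To make the thread above concrete, here is a minimal scalar sketch of what an ExtractMostSignificantBits operation computes, and why a single horizontal add is enough once each element's top bit has been shifted into its own position. This is not the JIT's implementation: `MsbSketch` and its method are hypothetical names, and a span of bytes stands in for a vector.

```csharp
using System;

public static class MsbSketch
{
    // Hypothetical scalar reference: bit i of the result is the most
    // significant bit of element i.
    public static uint ExtractMostSignificantBits(ReadOnlySpan<byte> elements)
    {
        if (elements.Length > 32)
        {
            throw new ArgumentException("At most 32 elements fit in a 32-bit mask.", nameof(elements));
        }

        uint sum = 0;

        for (int index = 0; index < elements.Length; index++)
        {
            // Isolate the top bit (0 or 1), then move it to bit position `index`.
            uint positioned = (uint)(elements[index] >> 7) << index;

            // Every term occupies a distinct bit position, so the addition
            // never carries and behaves like OR-ing the terms together.
            sum += positioned;
        }

        return sum;
    }
}
```

For example, `MsbSketch.ExtractMostSignificantBits(new byte[] { 0x80, 0x00, 0xFF, 0x7F })` yields `0b0101`, since only elements 0 and 2 have their top bit set. The per-element shift followed by one sum mirrors the shift-then-add-across shape discussed above. The optimization mentioned for all-bits-set inputs would replace the shift with a mask.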

echesakov · 2022-01-12T06:29:30Z

src/coreclr/jit/hwintrinsicarm64.cpp

+                op2 = gtNewOperNode(GT_ADD, op1->TypeGet(), op1, op2);
+            }
+
+            retNode = gtNewSimdHWIntrinsicNode(retType, op2, op1, NI_AdvSimd_Store, simdBaseJitType, simdSize);


I think this could be implemented more efficiently if we were using the Store SIMD&FP register (register offset) str instruction:

str Dt, op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))
str Qt, op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))

Should we introduce an intrinsic under AdvSimd that exposes this instruction?

Or instead introduce an optimization in lower that "contains" such an address expression and handles the case in CodeGen?

This should be a general thing done as part of Lowering so that any SIMD store can be optimized accordingly, just as we already do for x64.

src/coreclr/jit/hwintrinsicarm64.cpp

echesakov

The JIT changes LGTM.

stephentoub · 2022-01-12T19:02:56Z

src/libraries/System.Private.CoreLib/src/System/Numerics/Vector.cs

+
+            for (int index = 0; index < Vector<byte>.Count; index++)
+            {
+                var element = Scalar<byte>.ShiftLeft(value.GetElementUnsafe(index), shiftCount);


Nit: can we avoid using var here and elsewhere? This is a good example of where I don't actually know what type element is.

I can do a separate PR to fix it up for all the vector code. It has (and has for a long time) used var fairly extensively to help make copying the algorithms around easier.

The "type" really doesn't matter; the action being done does, and that's generally guaranteed by the generic context (e.g. Scalar<byte> means it's operating on byte).

> I can do a separate PR to fix it up for all the vector code

Thanks

stephentoub · 2022-01-12T19:05:38Z

src/libraries/System.Private.CoreLib/src/System/Numerics/Vector.cs

+            return result;
+        }
+
+        /// <summary>Shifts each element of a vector right by the specified amount.</summary>


The docs for ShiftRightArithmetic and ShiftRightLogical are identical. Is that intentional?

Partially. I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.

> I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.

"Arithmetic shifts each element" and "Logical shifts each element"?
or
"Shifts (arithmetic) each element" and "Shifts (logical) each element"?
or
"Shifts (signed) each element" and "Shifts (unsigned) each element"?
or something along those lines? It'd be nice to differentiate them.

I think "Shifts (signed)" is probably my favorite out of those. If you don't have any pushback, I'll do that in the follow-up PR I mentioned above (to avoid churning CI just for a docs change and to unblock the downstream work dependent on this PR).
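As a small illustration of the distinction the doc summaries need to capture, the following scalar sketch contrasts the two right shifts on a 32-bit value. `ShiftSketch` is a hypothetical helper written for this example and is not the `Scalar<T>` code from the PR.

```csharp
public static class ShiftSketch
{
    // Arithmetic (signed) shift: the sign bit is copied into the vacated high bits.
    public static int ShiftRightArithmetic(int value, int shiftCount)
    {
        return value >> shiftCount;
    }

    // Logical (unsigned) shift: zeros are shifted into the vacated high bits.
    // The value is reinterpreted as uint so that C# performs an unsigned shift.
    public static int ShiftRightLogical(int value, int shiftCount)
    {
        return unchecked((int)((uint)value >> shiftCount));
    }
}
```

With these definitions, `ShiftRightArithmetic(-8, 1)` is `-4`, while `ShiftRightLogical(-8, 1)` is `0x7FFFFFFC`. For non-negative inputs the two agree.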

stephentoub · 2022-01-12T19:08:12Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Scalar.cs

+            }
+            else if (typeof(T) == typeof(nint))
+            {
+                if (Environment.Is64BitProcess)


When are we choosing to use `#if TARGET_64BIT` vs `if (Environment.Is64BitProcess)` for this code in corelib?

The code in here largely uses Environment because it wasn't always in corelib and so it continues using it for consistency.

This is fallback logic and isn't perf critical (it basically only runs on x86 Unix and Arm32), so I don't think it's super important to minimize the IL even more.

> This is fallback logic and isn't perf critical (it basically only runs on x86 Unix and Arm32), so I don't think it's super important to minimize the IL even more.

That leaves me wondering:

- We're using things like Unsafe.SkipInit in code that's not perf critical?
- Does this fallback logic get trimmed out on platforms where it's not used?

It still gets used in indirect calls (delegates/fnptrs/reflection) and so we still need an implementation. This just happens to be the easiest way to do it (mostly because the backing fields are ulong, not T).

These are all generics over structs and so all irrelevant code paths are trimmed for each scenario. That is, byte won't have paths for sbyte, short, ushort, int, uint, long, ulong, float, double, nint, or nuint. Likewise, Environment.Is64BitProcess is constant folded by both the JIT and AOT, so there isn't an actual "cost" here (other than an extra branch preserved in IL).

> These are all generics over structs and so all irrelevant code paths are trimmed for each scenario

You're talking about the asm. I'm wondering about the IL and the linker/trimmer, i.e. will System.Private.CoreLib.dll be identical in size whether the ifdef is used or the Is64BitProcess check is used (which would require the linker to recognize it and eliminate the dead code based on it).

That I don't know. I know the linker has some understanding of Is64BitProcess, but I'm not sure under what contexts it's used vs not.
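The two styles being compared in this thread can be sketched side by side. This is an illustrative example only: `PointerSizeSketch` and its methods are hypothetical, and `TARGET_64BIT` is assumed to be a symbol supplied by the build (as the thread implies for corelib). If the symbol is not defined, the `#else` branch is compiled regardless of the actual process bitness.

```csharp
using System;

public static class PointerSizeSketch
{
    // Runtime check: both branches are kept in the IL. Per the discussion,
    // the JIT/AOT compiler folds the condition to a constant per target.
    public static uint MostSignificantBitViaRuntimeCheck(nint value)
    {
        if (Environment.Is64BitProcess)
        {
            return unchecked((uint)((ulong)(long)value >> 63));
        }
        else
        {
            return unchecked((uint)(int)value >> 31);
        }
    }

    // Compile-time check: only one branch is ever emitted into the IL.
    // ASSUMPTION: TARGET_64BIT is defined by the build for 64-bit targets.
    // Without it, this method always takes the 32-bit path.
    public static uint MostSignificantBitViaIfdef(nint value)
    {
#if TARGET_64BIT
        return unchecked((uint)((ulong)(long)value >> 63));
#else
        return unchecked((uint)(int)value >> 31);
#endif
    }
}
```

The trade-off raised above is visible here: the first form produces a single assembly that is correct on any bitness but carries the unused branch in IL, whereas the second form yields smaller IL at the cost of depending on the build configuration.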

stephentoub · 2022-01-12T19:22:19Z

src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector128Tests.cs

+            long* value = null;
+
+            try
+            {
+                value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);


Any reason not to condense it to:

Suggested change

-            long* value = null;
-
-            try
-            {
-                value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
+            long* value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
+
+            try
+            {

?

I have a habit of putting the thing that is impacted by the catch in the try to differentiate on what failed and avoid issues with SkipLocalsInit, etc.
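For reference, the pattern the author describes might look like the sketch below. It assumes .NET 7 or later (for `Vector128.LoadAligned` and `Vector128.Sum`) and a project that allows unsafe code. `AlignedLoadSketch` and `SumOfTwoAlignedLongs` are hypothetical names, and this is not the actual test from Vector128Tests.cs.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static unsafe class AlignedLoadSketch
{
    public static long SumOfTwoAlignedLongs(long first, long second)
    {
        // Declared outside the try so the finally block can see it; assigned
        // inside so that the allocation itself is covered by the try block.
        long* value = null;

        try
        {
            value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);

            value[0] = first;
            value[1] = second;

            // The 16-byte alignment requested above satisfies LoadAligned.
            Vector128<long> vector = Vector128.LoadAligned(value);
            return Vector128.Sum(vector);
        }
        finally
        {
            // AlignedFree is a no-op for null, so this is safe even if the
            // allocation threw before `value` was assigned.
            NativeMemory.AlignedFree(value);
        }
    }
}
```

Calling `AlignedLoadSketch.SumOfTwoAlignedLongs(1, 2)` returns `3`. The condensed form suggested by the reviewer behaves the same when the allocation succeeds. The difference lies only in whether an allocation failure is raised inside or outside the try block.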

stephentoub

I skimmed the .cs files. Other than my comments, LGTM.

tannergooding · 2022-01-12T21:26:29Z

Logged #63704 to track the managed code cleanup. I plan on doing that basically as soon as this is merged, so it won't stay on some "indefinite" backlog.

EgorBo · 2022-01-25T16:55:28Z

Improvement on Windows-x64: dotnet/perf-autofiling-issues#2944

tannergooding added 14 commits January 5, 2022 08:25

- Exposing Sum<T> for Vector64/128/256<T> (7a96ec8)
- Adding support for ShiftLeft, ShiftRightArithmetic, and ShiftRightLogical to Vector<T> and Vector64/128/256<T> (9500135)
- Adding support for Load, LoadAligned, LoadAlignedNonTemporal, and LoadUnsafe to Vector64/128/256<T> (f640916)
- Adding support for Store, StoreAligned, StoreAlignedNonTemporal, and StoreUnsafe to Vector64/128/256<T> (578c8f8)
- Adding support for ExtractMostSignificantBits to Vector64/128/256<T> (6e6af9e)
- Adding tests covering Vector64/128/256<T>.Sum (aad4550)
- Adding tests covering Vector64/128/256<T>.ShiftLeft, ShiftRightArithmetic, and ShiftRightLogical (0d545e9)
- Moving System.Runtime.InteropServices.UnmanagedType down to System.Runtime so the `unmanaged` constraint can be used (0880f9b)
- Adding tests covering Vector64/128/256<T>.Load, LoadAligned, LoadAlignedNonTemporal, and LoadUnsafe (c5686a0)
- Fixing a few issues in the source and tests to ensure the right paths are being taken (ae36f2b)
- Adding tests covering Vector64/128/256<T>.Store, StoreAligned, StoreAlignedNonTemporal, and StoreUnsafe (541e833)
- Adding tests covering Vector64/128/256<T>.ExtractMostSignificantBits (f00ec56)
- Ensure AlignedAlloc is matched by AlignedFree (70154eb)
- Fixing a couple test issues and the handling of Scalar.ExtractMostSignificantBit for nint/nuint (418f999)

dotnet-issue-labeler bot added the area-System.Runtime.Intrinsics label Jan 5, 2022

dotnet-issue-labeler bot added the new-api-needs-documentation label Jan 5, 2022

ghost assigned tannergooding Jan 5, 2022

tannergooding requested a review from echesakov January 5, 2022 21:41

tannergooding mentioned this pull request Jan 5, 2022

Begin using the xplat hardware intrinsics to simplify the SIMD code in the libraries #63416

Closed

tannergooding added 4 commits January 5, 2022 14:29

- Applying formatting patch (9931a84)
- Ensure gtNewOperNode uses TYP_INT when dealing with the shiftCount (0fcb27b)
- Fixing a couple ARM64 node types (185c83f)
- Ensure the shift intrinsics use impPopStack().val on ARM64 (b1f6b19)

EgorBo mentioned this pull request Jan 9, 2022

Faster IndexOf for substrings #63285

Merged

tannergooding commented Jan 10, 2022

src/libraries/System.Runtime/ref/System.Runtime.cs (resolved)

echesakov reviewed Jan 12, 2022

- Responding to PR feedback (2af1849)

echesakov approved these changes Jan 12, 2022

stephentoub reviewed Jan 12, 2022

stephentoub approved these changes Jan 12, 2022

tannergooding mentioned this pull request Jan 12, 2022

Code cleanup of Vector/64/128/256 #63704

Open

tannergooding merged commit cfe5e98 into dotnet:main Jan 13, 2022

ociaw mentioned this pull request Jan 26, 2022

Investigate new .NET 6 APIs and determine if they're relevant ociaw/RandN#16

Closed

ghost locked as resolved and limited conversation to collaborators Feb 25, 2022
