-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement the last of the approved cross platform hardware intrinsics, except shuffle #63414
Conversation
…ical to Vector<T> and Vector64/128/256<T>
…dUnsafe to Vector64/128/256<T>
…StoreUnsafe to Vector64/128/256<T>
…etic, and ShiftRightLogical
…ntime so the `unmanaged` constraint can be used
…nedNonTemporal, and LoadUnsafe
…lignedNonTemporal, and StoreUnsafe
…nificantBit for nint/nuint
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics Issue DetailsThis resolves #20510 and everything except
|
This is ready for review. |
{ | ||
if (!varTypeIsLong(simdBaseType)) | ||
{ | ||
op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment regarding using AddAcross
versus many AddPairwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is roughly the same scenario as #63414 (comment). We need a proper sum here that fits in 32-bits
.
There are some other potential optimization opportunities here, but they would require handling in lowering and aren't "critical" to getting the feature out, so it's likely better to handle them in a separate PR.
- These would involve specially handling the input and whether or not its an intrinsic that is known to produce an all-bits set vs no-bits set per-element result. In that scenario, we can elide the
shift
in favor of masking the right bits directly and then use twoAddPairwise
, which may be better. - The same applies to certain codegen for
x86/x64
(but notExtractMostSignificantBits
, since that's justMoveMask
there) where we could optimize to just things likeBlend
instead
op2 = gtNewOperNode(GT_ADD, op1->TypeGet(), op1, op2); | ||
} | ||
|
||
retNode = gtNewSimdHWIntrinsicNode(retType, op2, op1, NI_AdvSimd_Store, simdBaseJitType, simdSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be implemented more efficiently if we were using Store SIMD&FP register (register offset). str
instruction:
str Dt, op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))
str Qt, op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))
Should we introduce an intrinsic under AdvSimd
that exposes this instruction?
Or instead introduce an optimization in lower that "contains" such address expression and handle the case in CodeGen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a general thing done as part of Lowering
so that any SIMD store can be optimized accordingly, just as we already do for x64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JIT changes LGTM.
|
||
for (int index = 0; index < Vector<byte>.Count; index++) | ||
{ | ||
var element = Scalar<byte>.ShiftLeft(value.GetElementUnsafe(index), shiftCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we avoid using var here and elsewhere? This is a good example of where I don't actually know what type element is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do a separate PR to fix it up for all the vector code. It has (and has for a long time) used var
fairly extensively to help make copying the algorithms around easier.
The "type" really doesn't matter, the action being done does and that's generally guaranteed by the generic context (e.g. Scalar<byte>
means its operating on byte
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do a separate PR to fix it up for all the vector code
Thanks
return result; | ||
} | ||
|
||
/// <summary>Shifts each element of a vector right by the specified amount.</summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs for ShiftRightArithmetic and ShiftRightLogical are identical. Is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partially. I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.
"Arithmetic shifts each element" and "Logical shifts each element"?
or
"Shift (arithmetic) each element" and "Shifts (logical) each element"?
or
"Shifts (signed) each element" and "Shifts (unsigned) each element"?
or something along those lines? It'd be nice to differentiate them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Shift (signed)
is probably my favorite out of those. If you don't have any push back, I'll do that in the follow up PR I mentioned above (to avoid churning CI just for a docs change and to unblock the downstream work dependent on this PR).
} | ||
else if (typeof(T) == typeof(nint)) | ||
{ | ||
if (Environment.Is64BitProcess) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When are we choosing to use
#if TARGET_64BIT
vs
if (Environment.Is64BitProcess)
for this code in corelib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in here largely uses Environment
because it wasn't always in corelib and so it continues using it for consistency.
This is fallback logic and isn't perf critical (it basically only runs on x86 Unix
and Arm32
) so I don't think its super important to minimize the IL even more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fallback logic and isn't perf critical (it basically only runs on x86 Unix and Arm32) so I don't think its super important to minimize the IL even more.
That leaves me wondering:
- We're using things like Unsafe.SkipInit in code that's not perf critical?
- Does this fallback logic get trimmed out on platforms where it's not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still gets used in in indirect calls (delegates/fnptrs/reflection) and so we still need an implementation. This just happens to be the easiest way to do it (mostly because the backing fields are ulong
, not T
).
These are all generics over structs and so all irrelevant code paths are trimmed for each scenario. That is, byte
won't have paths for sbyte
, short
, ushort
, int
, uint
, long
, ulong
, float
, double
, nint
, or nuint
. Likewise, Environment.Is64BitProcess
is also constant folded by both the JIT and AOT, there isn't actual "cost" here (other than an extra branch preserved in IL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all generics over structs and so all irrelevant code paths are trimmed for each scenario
You're talking about the asm. I'm wondering about the IL and the linker/trimmer, i.e. will System.Private.Corelib.dll be identical in size whether the ifdef is used or the Is64BitProcess check is used (which would require the linker to recognize it and eliminate the dead code based on it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That I don't know. I know the linker has some understanding of Is64BitProcess
but I'm not sure under what contexts its used vs not.
long* value = null; | ||
|
||
try | ||
{ | ||
value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to condense it to:
long* value = null; | |
try | |
{ | |
value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16); | |
long* value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16); | |
try | |
{ |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a habit of putting the thing that is impacted by the catch
in the try
to differentiate on what failed and avoid issues with SkipLocalsInit
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I skimmed the .cs files. Other than my comments, LGTM.
Logged #63704 to track the managed code cleanup. I plan on doing that basically as soon as this is merged, so it won't stay on some "indefinite" backlog. |
Improvement on Windows-x64 dotnet/perf-autofiling-issues#2944 |
This resolves #20510 and everything except
Shuffle
from #63331Shuffle
is going to be a separate PR as its: