Implement Vector.AddSaturate/SubtractSaturate #107193
base: main
Tagging subscribers to this area: @dotnet/area-system-numerics
Could the existing internal runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Line 3904 in 4c10eff
I removed the existing methods that I could find in this PR, but perhaps there are additional methods I have missed.
if (AdvSimd.IsSupported)
{
    if (typeof(T) == typeof(byte))
    {
        return AdvSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
    }
    if (typeof(T) == typeof(sbyte))
    {
        return AdvSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
    }
    if (typeof(T) == typeof(short))
    {
        return AdvSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
    }
    if (typeof(T) == typeof(ushort))
    {
        return AdvSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
    }
    if (typeof(T) == typeof(int))
    {
        return AdvSimd.AddSaturate(left.AsInt32(), right.AsInt32()).As<int, T>();
    }
    if (typeof(T) == typeof(uint))
    {
        return AdvSimd.AddSaturate(left.AsUInt32(), right.AsUInt32()).As<uint, T>();
    }
    if (typeof(T) == typeof(long))
    {
        return AdvSimd.AddSaturate(left.AsInt64(), right.AsInt64()).As<long, T>();
    }
    if (typeof(T) == typeof(ulong))
    {
        return AdvSimd.AddSaturate(left.AsUInt64(), right.AsUInt64()).As<ulong, T>();
    }
}

if (Sse2.IsSupported)
{
    if (typeof(T) == typeof(byte))
    {
        return Sse2.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
    }
    if (typeof(T) == typeof(sbyte))
    {
        return Sse2.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
    }
    if (typeof(T) == typeof(short))
    {
        return Sse2.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
    }
    if (typeof(T) == typeof(ushort))
    {
        return Sse2.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
    }
}

if (PackedSimd.IsSupported)
{
    if (typeof(T) == typeof(byte))
    {
        return PackedSimd.AddSaturate(left.AsByte(), right.AsByte()).As<byte, T>();
    }
    if (typeof(T) == typeof(sbyte))
    {
        return PackedSimd.AddSaturate(left.AsSByte(), right.AsSByte()).As<sbyte, T>();
    }
    if (typeof(T) == typeof(short))
    {
        return PackedSimd.AddSaturate(left.AsInt16(), right.AsInt16()).As<short, T>();
    }
    if (typeof(T) == typeof(ushort))
    {
        return PackedSimd.AddSaturate(left.AsUInt16(), right.AsUInt16()).As<ushort, T>();
    }
}

if (IsHardwareAccelerated)
{
    return VectorMath.AddSaturate<Vector128<T>, T>(left, right);
}

return Create(
    Vector64.AddSaturate(left._lower, right._lower),
    Vector64.AddSaturate(left._upper, right._upper)
);
This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".
Instead, we want them to be implemented in the JIT so that they don't eat away at the inlining budget or run into other issues.
Doing this requires adding an AddSaturate entry to https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistarm64.h for the relevant vector sizes (mostly mirroring the entry for op_Addition).
You'd then add handling for that in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicxarch.cpp#L1387-L1402 and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsicarm64.cpp#L700-L710, mostly following op_Addition again; but since we don't have a general GT_* kind, you'd instead use gtNewSimdHWIntrinsicNode(retType, op1, op2, intrinsic, simdBaseJitType, simdSize), where intrinsic is NI_ISA_Name, such as NI_SSE2_AddSaturate.
For int, uint, long, and ulong on x86/x64, you'd need to implement handling as well. Unsigned is simple, as it's effectively just the following, since x + y will always be greater than or equal to either input unless it overflows:
var tmp = x + y;
return Vector.ConditionalSelect(
Vector.LessThan(tmp, x),
MaxValue,
tmp
);
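As a minimal sketch of the unsigned case, the snippet above can be written against the public Vector128 API for uint lanes. This assumes .NET 7 or later (for Vector128.LessThan, Vector128.ConditionalSelect, and the + operator); the class and method names are hypothetical and are not part of the PR or the JIT change being requested.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration only; not the PR's implementation.
public static class UnsignedSaturateSketch
{
    public static Vector128<uint> AddSaturate(Vector128<uint> x, Vector128<uint> y)
    {
        // Wrapping add: a lane that overflowed ends up smaller than either input.
        Vector128<uint> tmp = x + y;

        // Per-lane mask: AllBitsSet where tmp < x (overflow), Zero otherwise.
        Vector128<uint> overflowed = Vector128.LessThan(tmp, x);

        // Pick MaxValue (all bits set for unsigned types) in overflowed lanes.
        return Vector128.ConditionalSelect(overflowed, Vector128<uint>.AllBitsSet, tmp);
    }
}
```

For example, adding (uint.MaxValue, 1, 0, 4000000000) and (1, 2, 0, 500000000) should clamp the first and last lanes to uint.MaxValue and leave the middle lanes as the plain sums 3 and 0.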
Signed is a bit trickier, but it basically boils down to the following (there may be a more efficient way, but this is the basics):
var z = x + y;
return Vector.ConditionalSelect(
(((x ^ y) ^ SignMask) & (x ^ z)) >> (sizeof(T) * 8 - 1),
SignMask ^ (z >> (sizeof(T) * 8 - 1)),
z
);
This works because x + y for differing signs cannot overflow, while for same signs it can. In general, given two bool you can detect equality via x ^ y ^ 1 and inequality via x ^ y. Given that we want (signX == signY) && (signX != signZ), that gives us the (x ^ y ^ 1) & (x ^ z) given above to determine if overflow occurred, with SignMask taking the place of 1 so that only the sign bit is flipped. We then arithmetic right shift to propagate the bit so we get AllBitsSet (overflow occurred) or Zero (no overflow) per-element.
If overflow did occur, then we know that a negative result means it should be MaxValue while a positive result means it should be MinValue. Arithmetic shifting z gives us AllBitsSet (negative) or Zero (positive) on a per-element basis; we just need to xor with the sign mask. This gives us 0xFFFF_FFFF ^ 0x8000_0000 or 0x0000_0000 ^ 0x8000_0000, thus negative results become 0x7FFF_FFFF (MaxValue) and positive results become 0x8000_0000 (MinValue).
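The signed reasoning above can be sketched for int lanes as follows. This is only an illustration under stated assumptions: it targets .NET 7 or later (for Vector128.ShiftRightArithmetic and the ^, &, and + operators), the class, method, and local names are hypothetical, and it is not the JIT-side implementation the review asks for.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration only; not the PR's implementation.
public static class SignedSaturateSketch
{
    public static Vector128<int> AddSaturate(Vector128<int> x, Vector128<int> y)
    {
        // 0x8000_0000 in every lane.
        Vector128<int> signMask = Vector128.Create(int.MinValue);

        // Wrapping add.
        Vector128<int> z = x + y;

        // Sign bit is set when x and y share a sign and z's sign differs from x's.
        Vector128<int> overflowBit = ((x ^ y) ^ signMask) & (x ^ z);

        // Arithmetic shift spreads the sign bit: AllBitsSet on overflow, Zero otherwise.
        Vector128<int> overflowed = Vector128.ShiftRightArithmetic(overflowBit, 31);

        // Negative wrapped result -> 0x7FFF_FFFF (MaxValue),
        // positive wrapped result -> 0x8000_0000 (MinValue).
        Vector128<int> saturated = signMask ^ Vector128.ShiftRightArithmetic(z, 31);

        return Vector128.ConditionalSelect(overflowed, saturated, z);
    }
}
```

For example, adding (int.MaxValue, int.MinValue, 1, -5) and (1, -1, 2, 3) should clamp the first two lanes to int.MaxValue and int.MinValue and leave the last two as the plain sums 3 and -2.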
This is not an approach we want to take for most of the xplat APIs, which are considered "perf critical".
Understood. Thanks for the clear instructions on how to implement this in JIT instead 👍 .
For int, uint, long, and ulong on x86/x64, you'd need to implement handling as well.
There is a "fallback" algorithm in the PR already in the VectorMath class.
Should the substitution be done in the JIT for the x86/x64 case too, or does it suffice to leave it as it is for those cases? If handled in the JIT, should the fallback in VectorMath be kept?
I'll try setting this PR as draft until I have successfully made necessary changes.
@lilinus in case you didn't know, there is a patch created by the format leg: https://github.com/dotnet/runtime/actions/runs/10928828860?pr=107193 (under artifacts)
$ cd /path/to/runtime
$ unzip ~/Downloads/format.linux.patch.zip
$ git apply format.patch
$ rm format.patch
# commit and push
I should be getting to this soon, just working through the backlog of PRs now that I can start focusing on things for .NET 10.
Implement #82559