Accelerating Vector512.Sum() #87851
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details: This accelerates Vector512.Sum() using AVX512 instructions.
@dotnet/avx512-contrib
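For context, `Vector512<T>.Sum()` reduces all lanes of a 512-bit vector to a single scalar. A minimal Python sketch of the scalar semantics (lane values here are illustrative, not from the PR):

```python
# A 512-bit vector of 32-bit floats has 16 lanes.
lanes = [float(i) for i in range(16)]

def vector512_sum(lanes):
    """Scalar reference semantics: add every lane into one total."""
    total = 0.0
    for x in lanes:
        total += x
    return total

print(vector512_sum(lanes))  # 120.0
```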
Is … (from SixLabors/ImageSharp#1630 (comment)) and … (from https://stackoverflow.com/a/59326242/347870) … I know that Vector128/256 also uses …
That's a good question and something we can possibly consider. ICC, for example, generates the following: https://godbolt.org/z/ojGcK9a4W. I'd imagine this would mean optimizing the whole … I'd love to hear what @tannergooding and @BruceForstall have to add here.
Should be fine. We'd be looking at … It would be good to ensure that all 3 sizes for …
Related: #85207
You can mark 'Shuffle()' from there as done. I had a PR a few weeks ago.
if (simdSize == 64)
{
    assert(IsBaselineVector512IsaSupportedDebugOnly());
    // This is roughly the following managed code:
    //   ...
    //   simd32 tmp1 = op1.GetLower();
    //   simd32 tmp2 = op1.GetUpper();
    //   simd32 tmp3 = Isa.Add(tmp1, tmp2);
    //   ...
    // From here on we can treat this as a simd32 reduction,
    // then narrow once more and finish as a simd16 reduction.
    GenTree* op1Dup     = fgMakeMultiUse(&op1);
    GenTree* op1Lower32 = gtNewSimdGetLowerNode(TYP_SIMD32, op1, simdBaseJitType, simdSize);
    GenTree* op1Upper32 = gtNewSimdGetUpperNode(TYP_SIMD32, op1Dup, simdBaseJitType, simdSize);

    simdSize   = simdSize / 2;
    op1Lower32 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD32, op1Lower32, op1Upper32, simdBaseJitType, simdSize);
    haddCount--;

    GenTree* op1Dup32   = fgMakeMultiUse(&op1Lower32);
    GenTree* op1Lower16 = gtNewSimdGetLowerNode(TYP_SIMD16, op1Lower32, simdBaseJitType, simdSize);
    GenTree* op1Upper16 = gtNewSimdGetUpperNode(TYP_SIMD16, op1Dup32, simdBaseJitType, simdSize);
    simdSize   = simdSize / 2;
    op1 = gtNewSimdBinOpNode(GT_ADD, TYP_SIMD16, op1Lower16, op1Upper16, simdBaseJitType, simdSize);
    haddCount--;
}
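The JIT sequence above repeatedly adds the upper half of the vector onto the lower half, halving the width each time (simd64 → simd32 → simd16), and then finishes with an ordinary 128-bit reduction. A hedged Python model of that narrowing strategy (lane count and values are illustrative):

```python
def narrowing_sum(lanes):
    """Model of the simd64 -> simd32 -> simd16 reduction:
    fold the upper half onto the lower half until a 4-lane
    (128-bit float) vector remains, then reduce that."""
    while len(lanes) > 4:  # stop at the simd16-equivalent width
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[half + i] for i in range(half)]
    return sum(lanes)      # the remaining simd16 reduction

print(narrowing_sum([float(i) for i in range(16)]))  # 120.0
```

For integer element types this is exactly equivalent to a left-to-right sum; for float/double it changes the association order, which is the concern raised in the review comment below.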
This is "correct", but has an interesting side effect in that it can change the result for float/double. Since float/double addition is not associative due to rounding, summing across as [0] + [1] + [2] + [3] + ... is different from summing pairwise as (([0] + [1]) + ([2] + [3])) + ..., which is different from summing per lane and then combining the lanes, etc.

Today, we're basically doing it per lane, then combining lanes. Within each lane we're typically doing it pairwise because that's how addv (add across) works on Arm64, it's how hadd (horizontal add) works on x86/x64, and it's trivial to emulate using shufps on older hardware.

We don't really want the results to change subtly based on what the hardware supports, so we should probably try to ensure this keeps things overall consistent in how it operates.
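The ordering sensitivity described above is easy to demonstrate: a left-to-right float sum and a pairwise float sum of the same lanes can round differently. A small Python illustration (lane values chosen specifically to expose the rounding difference):

```python
# Four float "lanes" chosen so that association order affects rounding:
# 1.0 is below the rounding granularity (ulp) of 1e16, so it vanishes
# when added to 1e16 but survives when the large values cancel first.
lanes = [0.5, 0.5, 1e16, -1e16]

seq  = ((lanes[0] + lanes[1]) + lanes[2]) + lanes[3]   # [0] + [1] + [2] + [3]
pair = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])   # ([0] + [1]) + ([2] + [3])

print(seq, pair)  # 0.0 1.0 -- same lanes, different results
```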
I'm busy with something else but will get back to this once I'm done. Sorry for the delay.
No worries. This one isn't "critical" for .NET 8, and we're already generating decent (but not amazing) code that would be similar to what a user might manually write.

Marking this as .NET 9.
@DeepakRajendrakumaran Can you mark this "Draft" until it is ready to review again?
Done
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.