Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for vector constants via GenTreeVecCon #68874

Merged
merged 17 commits into from
May 31, 2022

Conversation

tannergooding
Copy link
Member

@tannergooding tannergooding commented May 4, 2022

This adds direct support for vector constant nodes via GenTreeVecCon. It involved quite a bit of cleanup to normalize the places that were touching GT_SIMD, those that were touching GT_HWINTRINSIC, and those that touched both so it grew up a bit more than I initially desired.

The result, overall, however is that vector constants are now centrally handled with less overall allocations and nodes required to represent them. They are also now base type independent which allows more CSE opportunities and which means that you can easily see the bits however the user needs to interpret them. This in turn will simplify logic for the xplat shuffle APIs where otherwise simple operations, such as .AsInt32() would break the handling and force it down the fallback path.

@ghost ghost assigned tannergooding May 4, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 4, 2022
@ghost
Copy link

ghost commented May 4, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This is a draft that adds direct support for vector constant nodes via GenTreeVecCon. It involved quite a bit of cleanup to normalize the places that were touching GT_SIMD, those that were touching GT_HWINTRINSIC, and those that touched both so it grew up a bit more than I initially desired.

The result, overall, however is that vector constants are now centrally handled with less overall allocations and nodes required to represent them. They are also now base type independent which allows more CSE opportunities and which means that you can easily see the bits however the user needs to interpret them. This in turn will simplify logic for the xplat shuffle APIs where otherwise simple operations, such as .AsInt32() would break the handling and force it down the fallback path.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Copy link
Member Author

Will re-open after I get tests passing.

@tannergooding tannergooding reopened this May 5, 2022
@tannergooding tannergooding force-pushed the vector-cns branch 5 times, most recently from 4921110 to 360ff7e Compare May 9, 2022 14:40
@tannergooding tannergooding marked this pull request as ready for review May 10, 2022 20:19
@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib

This is needed to correctly handle the Shuffle APIs and some of the more complex patterns around vector constant nodes that pop up (part of the remaining xplat hwintrinsic APIs: #63331).

It provides some throughput gains on all platforms and spot-checking the regressions they are mostly from new CSE opportunities and other optimizations that can now kick in due to knowing these are constant. There are a couple regressions where we emit additional xorps instructions since its considered cheap enough CSE doesn't always kick in (this issue is more broadly tracked by #6264).

else
{
assert(false);
return {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to define a simd8/16/32_t::Zero-like static constexpr field (or function), semantics of { } can be unclear to people not intimately familiar with the C++ initialization rules.

Copy link
Contributor

@SingleAccretion SingleAccretion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial feedback; will continue the review tomorrow.

src/coreclr/jit/rationalize.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/valuenum.cpp Show resolved Hide resolved
src/coreclr/jit/assertionprop.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/emit.cpp Show resolved Hide resolved
src/coreclr/jit/importer.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/morph.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/rationalize.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/valuenum.h Outdated Show resolved Hide resolved
src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
src/coreclr/jit/lsraarm64.cpp Show resolved Hide resolved
src/coreclr/jit/lowerarmarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lowerarmarch.cpp Show resolved Hide resolved
src/coreclr/jit/lowerloongarch64.cpp Show resolved Hide resolved
src/coreclr/jit/hwintrinsicxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/assertionprop.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
src/coreclr/jit/lowerxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lowerxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lowerxarch.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@SingleAccretion SingleAccretion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frontend/IR changes look good. I did not drill too much into the backend parts, but they look mostly mechanical (and correct).

Overall, I think this change is a good step in the right direction w.r.t. SIMDs in IR and removes a good amount of suboptimal representation that we had for constant vectors. Thank you for making it happen!

@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib. This should be ready for review.

Copy link
Contributor

@TIHan TIHan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SingleAccretion did an incredible review of this.

The changes look great. Only a minor comment regarding the CORINFO_TYPE_FLOAT.

For ARM64 diffs, the regressions seem to just appear as regressions due to new CSE opportunities - so that all looks good to me.

@mrsharm
Copy link
Member

mrsharm commented Aug 11, 2022

@tannergooding - from our analysis while creating the perf report for August, we found the following regression that seemed to line up with this PR specifically related to the Ubuntu 18.04 x64 configuration. Would you consider these regressions as "by design" as there are closed auto-filed regressions above?

image

System.Numerics.Tests.Perf_Matrix4x4.CreateShadowBenchmark

Result Ratio Alloc Delta Operating System Bit Processor Name
Same 1.00 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Same 1.01 +0 Windows 11 Arm64 Microsoft SQ1 3.0 GHz
Slower 0.29 +0 macOS Monterey 12.3 Arm64 Apple M1 Max
Same 1.01 +0 Windows 10 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Same 1.03 +0 Windows 10 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same 1.04 +0 Windows 10 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same 0.99 +0 Windows 10 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)
Same 1.02 +0 Windows 10 X64 Intel Core i9-10900K CPU 3.70GHz
Same 1.06 +0 Windows 11 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Same 1.03 +0 Windows 11 X64 AMD Ryzen 9 3950X
Same 0.98 +0 Windows 11 X64 AMD Ryzen 9 5900X
Same 0.95 +0 Windows 11 X64 AMD Ryzen 9 5950X
Same 0.94 +0 Windows 11 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Same 0.95 +0 Windows 11 X64 Intel Core i9-10900K CPU 3.70GHz
Same 0.92 +0 Windows 11 X64 11th Gen Intel Core i9-11900H 2.50GHz
Slower 0.68 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz
Slower 0.86 +0 ubuntu 18.04 X64 Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge)
Slower 0.62 +0 ubuntu 18.04 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)
Slower 0.68 +0 ubuntu 20.04 X64 AMD Ryzen 9 5900X
Slower 0.58 +0 ubuntu 20.04 X64 Intel Core i9-10900K CPU 3.70GHz
Faster 1.54 +0 Windows 10 X86 Intel Xeon CPU E5-1650 v4 3.60GHz
Same 1.00 +0 Windows 10 X86 Intel Core i7-6700 CPU 3.40GHz (Skylake)
Same 0.99 +0 Windows 11 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores
Slower 0.65 +0 macOS Big Sur 11.6.8 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)
Slower 0.64 +0 macOS Monterey 12.3.1 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell)
Slower 0.65 +0 macOS Monterey 12.4 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell)

@tannergooding
Copy link
Member Author

Not by design, would need to see the disassembly to see exactly what's being pessimized here.

This isn't important for .NET 7, however. Matrix4x4.CreateShadow is a case that isn't accelerated today and where the overall codegen is already suboptimal. The "proper" fix would be to rewrite the implementation to properly take advantage of the hardware intrinsics where possible.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants