Code generation of HW intrinsics / loading struct into vector register #31692
Comments
The generated code is pretty normal, isn't it? Probably is even better than in
Unfortunately, the JIT doesn't recognize custom vector-like structs as SIMD types, so you will see those additional movs here and there.
public struct MyVector4
{
    private Vector128<float> xyzw;
}
for better codegen (also, the JIT doesn't pass Vector128 via SIMD registers yet, does it?) |
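A minimal sketch of the wrapper suggested above, assuming .NET Core 3.0 or later. The constructor, the `Length` property, and the scalar fallback are illustrative additions and do not come from the issue or its repository.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct MyVector4
{
    // Wrapping a type the JIT already knows lets it keep the value in an XMM register.
    private Vector128<float> xyzw;

    public MyVector4(float x, float y, float z, float w)
    {
        xyzw = Vector128.Create(x, y, z, w);
    }

    public float X => xyzw.GetElement(0);
    public float Y => xyzw.GetElement(1);
    public float Z => xyzw.GetElement(2);
    public float W => xyzw.GetElement(3);

    public float Length
    {
        get
        {
            if (Sse41.IsSupported)
            {
                // Control byte 0xF1: multiply all four lanes (high nibble 0xF)
                // and write the sum to lane 0 only (low nibble 0x1).
                Vector128<float> dot = Sse41.DotProduct(xyzw, xyzw, 0xF1);
                return Sse.SqrtScalar(dot).ToScalar();
            }

            // Scalar fallback for hardware without SSE4.1 (assumption: added for portability).
            return MathF.Sqrt(X * X + Y * Y + Z * Z + W * W);
        }
    }
}
```

With this layout the property body needs no separate load from four scalar fields, which is the part the JIT was reported to handle poorly for custom structs.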
CC. @CarolEidt, @echesakovMSFT. Most of the tagged bits above look like side effects of pinning, needing to zero stack slots, or stack spilling. However, the root problem is that the user has a |
Thanks for your quick comments. One of my most curious questions: |
I think that @EgorBo is correct that you should be wrapping a vector type that's known to the JIT (either |
For .NET Core 5, |
I've now added two versions, one that uses
and one with
Both generate the code as expected. Using overlapping fields even allows the use of custom structs, but there is an overhead to pay in the constructor, as it requires initialization. .NET 5 should have a solution for that with |
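The overlapping-fields variant mentioned above might look roughly like the following. This is a sketch under assumptions: the field names, the `LengthSquared` property, and the `this = default` initialization are illustrative and not taken from the repository. The extra zeroing in the constructor is one form of the initialization overhead described.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[StructLayout(LayoutKind.Explicit)]
public struct MyVector4
{
    // Four scalar fields and one vector field share the same 16 bytes.
    [FieldOffset(0)] public float X;
    [FieldOffset(4)] public float Y;
    [FieldOffset(8)] public float Z;
    [FieldOffset(12)] public float W;
    [FieldOffset(0)] private Vector128<float> xyzw;

    public MyVector4(float x, float y, float z, float w)
    {
        // The compiler requires every field to be assigned, even though they
        // overlap, so the struct is zeroed first and then written via the vector.
        this = default;
        xyzw = Vector128.Create(x, y, z, w);
    }

    public float LengthSquared
    {
        get
        {
            if (Sse41.IsSupported)
            {
                // 0xF1: multiply all four lanes, sum into lane 0.
                return Sse41.DotProduct(xyzw, xyzw, 0xF1).ToScalar();
            }

            // Scalar fallback (assumption: added for portability).
            return X * X + Y * Y + Z * Z + W * W;
        }
    }
}
```

Callers keep the familiar `v.X`, `v.Y` field access while the intrinsic code reads the same bytes through the vector field.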
It looks like there are still redundant loads, and in the |
I believe this is due to the older SIMD code not generally supporting containment, as we discussed a couple of weeks back. |
@CarolEidt @tannergooding going to mark this as 5.0 as it seems like we're doing related work and so should see improvements here. Let me know if you disagree. |
@tannergooding @CarolEidt Is there any work left to do here? Does the issue need to remain in 5.0? Does the title accurately reflect the issue? |
The latter issue called out here (unnecessary moves in the existing codegen) has been largely fixed by the iterative work of porting the GT_SIMD nodes to be imported as GT_HWINTRINSIC nodes instead. #37882 is likely the last one we should take for .NET 5 as it handles
The codegen for these remaining cases didn't look particularly problematic so it can likely wait for .NET 6 |
Hi, I just updated the IntrinsicCodeGen repository to .NET 5. There is a tiny improvement in V1 and one
On the downside, the reference implementation is slower, and there already was a small degradation with the 3.1.9 runtime, even though the disassembly looks identical, except that the
I also tried using Overall, the |
@tannergooding Can you take a look at this issue and tell me if there are still outstanding work items here? |
Hello!
I’m trying to make use of HW intrinsics to improve the performance of a vector data type:
struct MyVector4 { public float X, Y, Z, W; }
My goal is to implement typical properties/methods such as Length, LengthSquared, DotProduct with superior performance to a naive implementation.
I started with the Length property and wanted to make use of Sse41.DotProduct. I have experimented with different implementations; however, I did not manage to get the code I was looking for.
For benchmarking and evaluation I'm using BenchmarkDotNet. The test routine calls the property in a loop like this:
Here are my 3 implementations using Sse and details on their generated code within the loop body. I've marked curious regions with ***.
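For orientation, a property of this kind on the field-based struct could be sketched as below. This is a hypothetical illustration and not one of the three versions (V1, V2, ...) discussed in the issue; it assumes `Vector128.Create` is used to gather the four fields, and the scalar fallback is an addition.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct MyVector4
{
    public float X, Y, Z, W;

    public float Length
    {
        get
        {
            if (Sse41.IsSupported)
            {
                // Gather the four fields into one vector register. This is the
                // step where the JIT tends to emit additional moves.
                Vector128<float> v = Vector128.Create(X, Y, Z, W);

                // 0xF1: multiply all four lanes and place the sum in lane 0.
                Vector128<float> dot = Sse41.DotProduct(v, v, 0xF1);
                return Sse.SqrtScalar(dot).ToScalar();
            }

            // Naive scalar implementation used as the fallback.
            return MathF.Sqrt(X * X + Y * Y + Z * Z + W * W);
        }
    }
}
```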
The benchmark results are the following:
Note: On an i7-4790K V2 performs slightly better than V1.
There is already a small performance increase with V1 and V2, but the optimal code I'm targeting is the one I get by pinning the entire array and directly using the offset pointer in LoadVector128:
The difficulty when implementing this as a property seems to be loading the data from the struct into the vector registers. There are always some additional instructions whose purpose I do not understand, but I suppose that they are leftovers from the compiler optimizing the abstraction away. I'm no expert in this field and would be grateful if you could comment on my implementation and point out why the generated code is the way it is. Maybe it is also an interesting test case and it is possible to find improvements.
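The pinned-array approach can be sketched roughly as follows. The class and method names (`PinnedLengths`, `SumLengths`) are hypothetical, the project is assumed to be compiled with unsafe code enabled (`AllowUnsafeBlocks`), and summing the lengths is only a stand-in for whatever the benchmark loop does with each result.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct MyVector4 { public float X, Y, Z, W; }

public static unsafe class PinnedLengths
{
    public static float SumLengths(MyVector4[] data)
    {
        float sum = 0f;

        if (Sse41.IsSupported)
        {
            // Pin the whole array once so element addresses stay valid in the loop.
            fixed (MyVector4* p = data)
            {
                for (int i = 0; i < data.Length; i++)
                {
                    // Unaligned 16-byte load straight from the element's address.
                    Vector128<float> v = Sse.LoadVector128((float*)(p + i));
                    sum += Sse.SqrtScalar(Sse41.DotProduct(v, v, 0xF1)).ToScalar();
                }
            }
            return sum;
        }

        // Scalar fallback (assumption: added for portability).
        for (int i = 0; i < data.Length; i++)
        {
            MyVector4 v = data[i];
            sum += MathF.Sqrt(v.X * v.X + v.Y * v.Y + v.Z * v.Z + v.W * v.W);
        }
        return sum;
    }
}
```

Because the load reads directly from memory, the struct is never copied field by field before it reaches a vector register.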
You can find the entire code in this repository: https://github.com/luithefirst/IntrinsicsCodeGen
Thanks!
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium