
Code generation of HW intrinsics / loading struct into vector register #31692

Open
luithefirst opened this issue Feb 3, 2020 · 13 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

luithefirst commented Feb 3, 2020

Hello!

I’m trying to make use of HW intrinsics to improve the performance of a vector data type:
struct MyVector4 { public float X, Y, Z, W; }
My goal is to implement typical properties/methods such as Length, LengthSquared, DotProduct with superior performance to a naive implementation.
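For reference, the scalar baseline (Vec4Length_Reference in the results below) presumably looks something like the following sketch; the exact property name is an assumption, the real code is in the repository linked at the end:

float Length_Reference
{
    get { return MathF.Sqrt(X * X + Y * Y + Z * Z + W * W); }
}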
I started with the Length property and wanted to make use of Sse41.DotProduct. I have experimented with different implementations, but I did not manage to get the code generation I was looking for.
For benchmarking and evaluation I'm using BenchmarkDotNet. The test routine calls the property in a loop like this:

[Benchmark]
public float Vec4Length_Sse_V1()
{
    var local = arr;
    var sum = 0.0f;
    for (int i = 0; i < local.Length; i++)
        sum += local[i].Length_Sse_V1;
    return sum;
}

Here are my three implementations using Sse, together with the code generated for the loop body of each. I've marked the curious regions with ***.

  1. Using fixed:
float Length_Sse_V1 {
    get {
        unsafe {
            fixed (MyVector4* pthis = &this)
            {
                var mmx = Sse.LoadVector128((float*)pthis);
                mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
                var l2 = mmx.GetElement(0);
                return MathF.Sqrt(l2);
            }
} } }
8fe8975b  movsxd  r8,ecx
8fe8975e  shl     r8,4
8fe89762  lea     r8,[rax+r8+10h]
8fe89767  xor     r9d,r9d		***
8fe8976a  mov     qword ptr [rsp],r9	***
8fe8976e  mov     qword ptr [rsp],r8	***
8fe89772  vmovups xmm1,xmmword ptr [r8]
8fe89777  vdpps   xmm1,xmm1,xmm1,0F1h
8fe8977d  vsqrtss xmm1,xmm1,xmm1
8fe89781  mov     qword ptr [rsp],r9	***
8fe89785  vaddss  xmm0,xmm0,xmm1
8fe89789  inc     ecx
8fe8978b  cmp     ecx,edx
8fe8978d  jl      00007fff`8fe8975b
  2. Using a helper function:
static unsafe float Length_Sse_V2_Helper(MyVector4 vec)
{
    var ptr = (float*)&vec;
    var mmx = Sse.LoadVector128(ptr);
    mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
    var l2 = mmx.GetElement(0);
    return MathF.Sqrt(l2);
}
float Length_Sse_V2
{
    get { return Length_Sse_V2_Helper(this); }
}
8fe89758  movsxd  r8,ecx
8fe8975b  shl     r8,4
8fe8975f  lea     r8,[rax+r8+10h]
8fe89764  vmovdqu xmm1,xmmword ptr [r8]		***
8fe89769  vmovdqu xmmword ptr [rsp+8],xmm1	***
8fe8976f  lea     r8,[rsp+8]			***
8fe89774  vmovups xmm1,xmmword ptr [r8]
8fe89779  vdpps   xmm1,xmm1,xmm1,0F1h
8fe8977f  vsqrtss xmm1,xmm1,xmm1
8fe89783  vaddss  xmm0,xmm0,xmm1
8fe89787  inc     ecx
8fe89789  cmp     ecx,edx
8fe8978b  jl      00007fff`8fe89758
  3. Helper inlined:
float Length_Sse_V3 {
    get { 
        unsafe {
            var vec = this;
            var ptr = (float*)&vec;
            var mmx = Sse.LoadVector128(ptr);
            mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
            var l2 = mmx.GetElement(0);
            return MathF.Sqrt(l2);
        }
} }
8fe69764  movsxd  r8,ecx
8fe69767  shl     r8,4
8fe6976b  lea     r8,[rax+r8+10h]
8fe69770  lea     r9,[rsp+8]			***
8fe69775  vxorps  xmm1,xmm1,xmm1		***
8fe69779  vmovdqu xmmword ptr [r9],xmm1 	***
8fe6977e  vmovdqu xmm1,xmmword ptr [r8]
8fe69783  vmovdqu xmmword ptr [rsp+8],xmm1	***
8fe69789  lea     r8,[rsp+8]			***
8fe6978e  vmovups xmm1,xmmword ptr [r8]     	***
8fe69793  vdpps   xmm1,xmm1,xmm1,0F1h
8fe69799  vsqrtss xmm1,xmm1,xmm1
8fe6979d  vaddss  xmm0,xmm0,xmm1
8fe697a1  inc     ecx
8fe697a3  cmp     ecx,edx
8fe697a5  jl      00007fff`8fe69764

The benchmark results are the following:

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18362
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.100
  [Host]     : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT
  Job-OBQONZ : .NET Core 3.1.0 (CoreCLR 4.700.19.56402, CoreFX 4.700.19.56404), X64 RyuJIT

Runtime=.NET Core 3.0  
Method                 Mean      Error      StdDev
Vec4Length_Reference   1.451 ms  0.0046 ms  0.0041 ms
Vec4Length_Sse_V1      1.191 ms  0.0052 ms  0.0049 ms
Vec4Length_Sse_V2      1.283 ms  0.0035 ms  0.0029 ms
Vec4Length_Sse_V3      1.455 ms  0.0145 ms  0.0121 ms
Vec4Length_Sse_Array   1.107 ms  0.0231 ms  0.0308 ms

Note: On an i7-4790K V2 performs slightly better than V1.

There is already a small performance increase with V1 and V2, but the optimal code I'm targeting is what I get by pinning the entire array and passing the offset pointer directly to LoadVector128:

8fe49787 4c63c1          movsxd  r8,ecx
8fe4978a 49c1e004        shl     r8,4
8fe4978e c4a178100c00    vmovups xmm1,xmmword ptr [rax+r8]
8fe49794 c4e37140c9f1    vdpps   xmm1,xmm1,xmm1,0F1h
8fe4979a c5f251c9        vsqrtss xmm1,xmm1,xmm1
8fe4979e c5fa58c1        vaddss  xmm0,xmm0,xmm1
8fe497a2 ffc1            inc     ecx
8fe497a4 3bca            cmp     ecx,edx
8fe497a6 7cdf            jl      00007fff`8fe49787
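For comparison, here is a sketch of how that pinned-array variant (Vec4Length_Sse_Array in the results above) could be written; the exact code lives in the linked repository, so treat the names and structure as assumptions:

[Benchmark]
public unsafe float Vec4Length_Sse_Array()
{
    var local = arr;
    var sum = 0.0f;
    // Pin the whole array once, so each iteration can pass an offset pointer straight to LoadVector128.
    fixed (MyVector4* pArr = local)
    {
        for (int i = 0; i < local.Length; i++)
        {
            var mmx = Sse.LoadVector128((float*)(pArr + i));
            mmx = Sse41.DotProduct(mmx, mmx, 0xF1);
            sum += MathF.Sqrt(mmx.GetElement(0));
        }
    }
    return sum;
}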

The difficulty when implementing this as a property seems to be loading the data from the struct into the vector registers. There are always some additional instructions whose purpose I do not understand; I suppose they are leftovers from the compiler optimizing the abstraction away. I'm no expert in this field and would be grateful if you could comment on my implementation and point out why the generated code is the way it is. Maybe it is also an interesting test case where improvements can be found.

You can find the entire code in this repository: https://github.com/luithefirst/IntrinsicsCodeGen
Thanks!

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

Dotnet-GitSync-Bot added the area-CodeGen-coreclr and untriaged labels on Feb 3, 2020
EgorBo (Member) commented Feb 4, 2020

The generated code is pretty normal, isn't it? It's probably even better than the code in System.Numerics.Vector4.get_Length.

Unfortunately, the JIT doesn't recognize custom vector-like structs as SIMD types, so you will see those additional movs here and there.
You can try to re-design your vector as:

public struct MyVector4
{
    private Vector128<float> xyzw;
}

for better codegen (also, the JIT doesn't pass Vector128 via SIMD registers yet, does it?)
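For illustration, a fleshed-out version of this idea might look like the sketch below; the constructor and the Length property are assumptions added for the sketch, not code from the issue:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public struct MyVector4
{
    private Vector128<float> xyzw;

    public MyVector4(float x, float y, float z, float w)
    {
        xyzw = Vector128.Create(x, y, z, w);
    }

    public float Length
    {
        get
        {
            // The JIT knows Vector128<float> is a SIMD type, so xyzw can stay in an xmm register
            // instead of being spilled to the stack and reloaded.
            var dp = Sse41.DotProduct(xyzw, xyzw, 0xF1);
            return MathF.Sqrt(dp.GetElement(0));
        }
    }
}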

tannergooding (Member) commented:

CC. @CarolEidt, @echesakovMSFT.

Most of the tagged bits above look like side effects of pinning, of needing to zero stack slots, or of stack spilling. However, the root problem is that the user has a Vector4-like struct which they want to interoperate with HWIntrinsics efficiently.

luithefirst (Author) commented Feb 4, 2020

Thanks for your quick comments.
@EgorBo I did some early experiments with Numerics.Vector4, but I did not like wrapping that. Would using Vector128<float> be any different? Maybe I could try overlapping it using an explicit struct layout?

One of my most curious questions:
Looking at the code of Length_Sse_V2_Helper, the first marked line vmovdqu xmm1,xmmword ptr [r8] moves the data into xmm1, but then it is moved back to the stack and into xmm1 again, this time using vmovups. From what I've read in the documentation, both just move the bits into the register. Is there any difference I'm missing? This looks like something that should be optimized away. Are there some other safety mechanisms for unsafe code in play?

CarolEidt (Contributor) commented:

I think that @EgorBo is correct that you should be wrapping a vector type that's known to the JIT (either Vector4 or Vector128<float>, where the former has the advantage of being more platform-independent, while the latter gives access to more operations). This has two advantages: first, it will allow the JIT to enregister it, and second, it will avoid the pessimization due to pinning. This should eliminate at least some of the moves to the stack.
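For illustration, the Vector4-based option could be as simple as the following sketch (the names here are assumptions, not code from the issue):

using System;
using System.Numerics;

public struct MyVector4
{
    private Vector4 xyzw;

    public MyVector4(float x, float y, float z, float w) { xyzw = new Vector4(x, y, z, w); }

    // Vector4 is recognized by the JIT as a SIMD type, so Dot and Sqrt lower to SIMD code
    // without pinning or stack spills.
    public float Length => MathF.Sqrt(Vector4.Dot(xyzw, xyzw));
}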

tannergooding (Member) commented:

For .NET 5, Vector4 and Vector128<float> also have helper methods for converting to/from each other that should be zero cost, so you should be able to freely use one and still have it optimized correctly and efficiently later if needed.
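A sketch of what that could look like, assuming the helpers referred to are the Vector128.AsVector128/AsVector4 extension methods that ship in .NET 5:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static float Length(Vector4 v)
{
    // Reinterpret the Vector4 as a Vector128<float>; this conversion is intended to be free.
    Vector128<float> mmx = v.AsVector128();
    var dp = Sse41.DotProduct(mmx, mmx, 0xF1);
    return MathF.Sqrt(dp.GetElement(0));
}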

luithefirst (Author) commented:

I've now added two versions, one that uses Vector128<float>:

63559c04  movsxd  r8,ecx
63559c07  shl     r8,4
63559c0b  lea     r8,[rax+r8+10h]
63559c10  vmovupd xmm1,xmmword ptr [r8]
63559c15  vdpps   xmm1,xmm1,xmmword ptr [r8],0F1h
63559c1b  vsqrtss xmm1,xmm1,xmm1
63559c1f  vaddss  xmm0,xmm0,xmm1
63559c23  inc     ecx
63559c25  cmp     ecx,edx
63559c27  jl      00007ffe`63559c04

and one with Vector4 (using DotProduct(this, this) internally):

63559c04  movsxd  r8,ecx
63559c07  shl     r8,4
63559c0b  lea     r8,[rax+r8+10h]
63559c10  vmovupd xmm1,xmmword ptr [r8]
63559c15  vmovupd xmm2,xmmword ptr [r8]
63559c1a  vdpps   xmm1,xmm1,xmm2,0F1h
63559c20  vsqrtss xmm1,xmm1,xmm1
63559c24  vaddss  xmm0,xmm0,xmm1
63559c28  inc     ecx
63559c2a  cmp     ecx,edx
63559c2c  jl      00007ffe`63559c04

Both generate the code as expected. Using overlapping fields even allows using my own struct, but there is an overhead to pay in the constructor, as it requires initialization. .NET 5 should have a solution for that with Unsafe.SkipInit.
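For illustration, a sketch of the overlapped-field approach together with Unsafe.SkipInit; the field layout and names are assumptions, not the repository's exact code:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

[StructLayout(LayoutKind.Explicit)]
public struct MyVector4
{
    [FieldOffset(0)]  public float X;
    [FieldOffset(4)]  public float Y;
    [FieldOffset(8)]  public float Z;
    [FieldOffset(12)] public float W;
    [FieldOffset(0)]  private Vector128<float> xyzw;

    public MyVector4(float x, float y, float z, float w)
    {
        // SkipInit removes the definite-assignment requirement for the overlapping
        // Vector128<float> field, which otherwise forces an extra zero-initialization
        // (the measurements later in the thread suggest some overhead still remains).
        Unsafe.SkipInit(out this);
        X = x; Y = y; Z = z; W = w;
    }

    public float Length
    {
        get
        {
            var dp = Sse41.DotProduct(xyzw, xyzw, 0xF1);
            return MathF.Sqrt(dp.GetElement(0));
        }
    }
}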

CarolEidt (Contributor) commented:

It looks like there are still redundant loads, and in the Vector4 case a failure to fold the (redundant) load into the vdpps. Would you be fine with making that the focus of this issue?

tannergooding (Member) commented:

in the Vector4 case a failure to fold the (redundant) load into the vdpps.

I believe this is due to the older SIMD code not generally supporting containment, as we discussed a couple of weeks back.

BruceForstall added the optimization label and removed the untriaged label on Feb 6, 2020
AndyAyersMS (Member) commented:

@CarolEidt @tannergooding going to mark this as 5.0 as it seems like we're doing related work and so should see improvements here. Let me know if you disagree.

AndyAyersMS added this to the 5.0 milestone on May 6, 2020
BruceForstall (Member) commented:

@tannergooding @CarolEidt Is there any work left to do here? Does the issue need to remain in 5.0? Does the title accurately reflect the issue?

tannergooding (Member) commented:

The latter issue called out here (unnecessary moves in the existing codegen) has been largely fixed by the iterative work of porting the GT_SIMD nodes to be imported as GT_HWINTRINSIC nodes instead.

#37882 is likely the last one we should take for .NET 5, as it handles DotProduct (which is the exact scenario listed here) and also fixes a perf regression in Vector2/3/4.LengthSquared. After that goes in, a few scenarios will still remain as GT_SIMD nodes, most notably:

  • Get/SetX, Y, Z, and W
  • Narrow and Widen
  • Conversion operations
  • Some remnants around SIMDIntrinsicInit, SIMDIntrinsicZero, and SIMDIntrinsicUpperSave/Restore

The codegen for these remaining cases didn't look particularly problematic, so it can likely wait for .NET 6.

BruceForstall moved this from the 5.0.0 milestone to the Future milestone on Jul 13, 2020
BruceForstall added the JitUntriaged label on Oct 28, 2020
luithefirst (Author) commented:

Hi, I just updated the IntrinsicsCodeGen repository to .NET 5. There is a tiny improvement in V1: one mov instruction is removed and the instructions are in a slightly different order:

movsxd    r8,ecx
shl       r8,4
lea       r8,[rax+r8+10]
mov       [rsp],r8
vmovups   xmm1,[r8]
vdpps     xmm1,xmm1,xmm1,0F1
vsqrtss   xmm1,xmm1,xmm1
xor       r8d,r8d
mov       [rsp],r8
vaddss    xmm0,xmm0,xmm1

On the downside, the reference implementation is slower, and there was already a small degradation with the 3.1.9 runtime, even though the disassembly looks identical except that the xmm0, xmm1, and xmm2 registers are used in a different order. The same happens with the Numerics.Vector4 implementation: still the same disassembly, but slower. Here is a combined table of the .NET 3.1.9 and .NET 5.0 benchmarks I ran today together with the numbers posted earlier:

Method                          Mean      Error      StdDev
Vec4Length_Reference(3.1.0)     1.451 ms  0.0046 ms  0.0041 ms
Vec4Length_Reference(3.1.9)     1.553 ms  0.0137 ms  0.0122 ms
Vec4Length_Reference(5.0.0)     1.620 ms  0.0039 ms  0.0035 ms
Vec4Length_Sse_V1(3.1.0)        1.191 ms  0.0052 ms  0.0049 ms
Vec4Length_Sse_V1(3.1.9)        1.210 ms  0.0161 ms  0.0150 ms
Vec4Length_Sse_V1(5.0.0)        1.170 ms  0.0061 ms  0.0051 ms
Vec4Length_NumericsVec(3.1.9)   1.113 ms  0.0040 ms  0.0038 ms
Vec4Length_NumericsVec(5.0.0)   1.454 ms  0.0077 ms  0.0072 ms

I also tried using Unsafe.SkipInit in the implementation that uses an overlapped Numerics.Vector4, but constructing a struct this way still has the same overhead.

Overall, the V1 implementation is now a bit closer to the optimum and gives a reasonable improvement over the reference implementation. Maybe someone can find an explanation for the performance degradation in the other cases. Switching the target framework in my benchmark can be done quickly.

BruceForstall (Member) commented:

@tannergooding Can you take a look at this issue and tell me if there are still outstanding work items here?

BruceForstall removed the JitUntriaged label on Oct 30, 2022