
Performance improvement for Guid.Equals #52296

Closed
bill-poole opened this issue May 5, 2021 · 39 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

@bill-poole

Description

The current Guid.Equals(Guid) method compares the 128 bits of each GUID with four 32-bit integer comparisons:

ref int rA = ref Unsafe.AsRef(in left._a);
ref int rB = ref Unsafe.AsRef(in right._a);

// Compare each element

return rA == rB
    && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
    && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
    && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);

However, platforms that support SSE2 can do this comparison more efficiently using Vector128<byte>:

var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(guid));
var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(other));

return g1.Equals(g2);

However, I've noticed that the software fallback in the Vector128<T>.Equals(Vector128<T>) implementation (shown below) compares each byte of the two Vector128<byte> instances individually, when it could test the 128 bits as four 32-bit comparisons (as the existing Guid.Equals(Guid) implementation does), or be further optimised to detect whether the platform is 64-bit and, if so, perform the comparison as two 64-bit comparisons.

static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
    for (int i = 0; i < Count; i++)
    {
        if (!((IEquatable<T>)(vector.GetElement(i))).Equals(other.GetElement(i)))
        {
            return false;
        }
    }

    return true;
}

Alternatively, the Guid.Equals(Guid) method could be implemented by comparing two Vector128<int> values (instead of Vector128<byte> values):

var g1 = Unsafe.As<Guid, Vector128<int>>(ref Unsafe.AsRef(guid));
var g2 = Unsafe.As<Guid, Vector128<int>>(ref Unsafe.AsRef(other));

return g1.Equals(g2);

This way, the vectors would be compared int-by-int, instead of byte-by-byte. That being said, the software fallback in Vector128<T>.Equals(Vector128<T>) is implemented as a loop, which is slower than the current Guid.Equals(Guid) implementation, which compares the four 32-bit integers of the two GUIDs as four separate comparison statements - i.e. no loop.

Therefore, I believe the best outcome would be achieved by implementing Guid.Equals(Guid) using Vector128<byte>.Equals(Vector128<byte>) as shown above, but also updating the software fallback of the Vector128<T>.Equals(Vector128<T>) method with a more efficient implementation that doesn't involve a loop and compares based on the native width of the platform (i.e. 32 or 64 bits) rather than the width of T.

Alternatively, the Guid.Equals(Guid) method could be updated to test whether the platform supports SSE2: if so, use the Vector128<byte> comparison; if not, fall back to the existing implementation. It could also detect whether the platform is 64-bit and, if so, compare using two 64-bit integers instead of four 32-bit integers.
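For illustration, a minimal sketch of that 64-bit variant might look like the following (EqualsCore64 is a hypothetical helper name, and it assumes the caller has already verified the process is 64-bit):

private static bool EqualsCore64(in Guid left, in Guid right)
{
    // Reinterpret each 16-byte Guid as two 64-bit integers and compare them directly.
    ref long rA = ref Unsafe.As<Guid, long>(ref Unsafe.AsRef(in left));
    ref long rB = ref Unsafe.As<Guid, long>(ref Unsafe.AsRef(in right));

    return rA == rB
        && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1);
}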

Data

The performance comparison of the existing Guid.Equals(Guid) method implementation versus the implementation shown above using Vector128<byte> on a machine that supports SSE2 is below.

| Method               | Mean      | Error     | StdDev    |
|----------------------|-----------|-----------|-----------|
| GuidEquals_Net50     | 1.7735 ns | 0.0067 ns | 0.0060 ns |
| GuidEquals_Vector128 | 0.2646 ns | 0.0024 ns | 0.0022 ns |

This is a 6.7x performance improvement.
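These numbers appear to come from a BenchmarkDotNet run; the exact harness isn't shown here, so the following is only a sketch (the field names, baseline choice and helper body are assumptions):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

public class GuidEqualsBenchmarks
{
    private readonly Guid _left = Guid.NewGuid();
    private readonly Guid _right = Guid.NewGuid();

    [Benchmark(Baseline = true)]
    public bool GuidEquals_Net50() => _left.Equals(_right);

    [Benchmark]
    public bool GuidEquals_Vector128()
    {
        // Reinterpret both Guids as Vector128<byte> and compare them as vectors.
        var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(in _left));
        var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(in _right));
        return g1.Equals(g2);
    }
}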

@bill-poole bill-poole added the tenet-performance Performance related issue label May 5, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels May 5, 2021
@jkotas
Member

jkotas commented May 5, 2021

See #35654 (comment)

@EgorBo
Member

EgorBo commented May 5, 2021

In addition to #35654, make sure this doesn't regress what is probably the most popular case - when the two GUIDs are different and it is usually enough to check, say, the first several bytes for a fast exit (but it will for sure regress that case).

@bill-poole
Author

Good point @EgorBo, I have now benchmarked with equal and unequal GUIDs:

| Method                            | Mean      | Error     | StdDev    |
|-----------------------------------|-----------|-----------|-----------|
| GuidEquals_UnequalGuids_Net50     | 1.9277 ns | 0.0056 ns | 0.0053 ns |
| GuidEquals_UnequalGuids_Vector128 | 0.2639 ns | 0.0020 ns | 0.0019 ns |
| GuidEquals_EqualGuids_Net50       | 1.9301 ns | 0.0073 ns | 0.0068 ns |
| GuidEquals_EqualGuids_Vector128   | 0.5017 ns | 0.0012 ns | 0.0010 ns |

Comparing equal GUIDs using Vector128<byte> on a platform with SSE2 is 3.8x faster, whereas my previous benchmark results were for comparing unequal GUIDs.

@bill-poole
Author

Interestingly, re-running the benchmarks for equal GUIDs, I got the following results:

| Method                          | Mean      | Error     | StdDev    |
|---------------------------------|-----------|-----------|-----------|
| GuidEquals_EqualGuids_Net50     | 1.8590 ns | 0.0048 ns | 0.0045 ns |
| GuidEquals_EqualGuids_Vector128 | 0.2577 ns | 0.0027 ns | 0.0024 ns |

This shows a 7.2x performance improvement for equal GUIDs. I suspect the difference in performance from my last measurement is due to loop alignment, which is rectified in .NET 6 as discussed here.

If I'm right, then we're looking at around 7x performance improvement for both equal and unequal GUIDs.

@NN---
Contributor

NN--- commented May 5, 2021

Wouldn't it be better to tweak the JIT to recognize the pattern of four sequential integer comparisons and produce SSE2?
This would improve not only Guid but also other places.

@bill-poole
Author

@NN--- , that's a good point, but I think that would fall under the category of "auto-vectorization", which has been raised here. It seems like that is way off in the future. This proposal achieves a performance gain that we can get now.

@SingleAccretion
Contributor

What does the codegen for both versions look like? A quarter of a nanosecond is indeed very quick; it's interesting where the gain is coming from.

@NN---
Contributor

NN--- commented May 5, 2021

@NN--- , that's a good point, but I think that would fall under the category of "auto-vectorization", which has been raised here. It seems like that is way off in the future. This proposal achieves a performance gain that we can get now.

I see the ticket is closed, so it probably won't happen.
Perhaps the problem is that the ticket is too generic.

@bill-poole
Author

bill-poole commented May 5, 2021

@SingleAccretion, based on the implementation of the Vector128<T>.Equals(Vector128<T>) method where SSE2 is supported, I'm assuming the proposed Guid.Equals(Guid) implementation where SSE2 is available loads two 128-bit registers - one with each GUID (copied from the stack) and then executes the following two SSE2 instructions:

  • __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
  • _mm_movemask_epi8 (__m128i a)

It then takes the result of the second instruction and compares it with 65,535 and if equal, returns 'true'.
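In intrinsics terms, that corresponds roughly to the following (a sketch only; GuidBitsEqualSse2 is an illustrative helper name, not the library's actual code):

static bool GuidBitsEqualSse2(Vector128<byte> g1, Vector128<byte> g2)
{
    // _mm_cmpeq_epi8: each byte lane becomes 0xFF where the bytes match, 0x00 where they don't.
    Vector128<byte> result = Sse2.CompareEqual(g1, g2);

    // _mm_movemask_epi8: gather the high bit of all 16 lanes; 0xFFFF (65,535) means every lane matched.
    return Sse2.MoveMask(result) == 0xFFFF;
}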

Using SharpLab, the existing Guid.Equals(Guid) implementation is much more complicated:

L0000: push ebp
L0001: mov ebp, esp
L0003: sub esp, 0x44
L0006: vxorps xmm4, xmm4, xmm4
L000a: vmovdqu [ebp - 0x44], xmm4
L000f: vmovdqu [ebp - 0x34], xmm4
L0014: vmovdqu [ebp - 0x24], xmm4
L0019: xor eax, eax
L001b: mov [ebp - 0x14], eax
L001e: mov [ebp - 0x10], eax
L0021: mov [ebp - 0xc], eax
L0024: mov [ebp - 4], ecx
L0027: mov [ebp - 8], edx
L002a: cmp dword ptr [0x92cc1a8], 0
L0031: je short L0038
L0033: call 0x6fc660e0
L0038: nop
L0039: mov ecx, [ebp - 4]
L003c: call System.Runtime.CompilerServices.Unsafe.AsRef[[System.Int32, System.Private.CoreLib]](Int32 ByRef)
L0041: mov [ebp - 0x18], eax
L0044: mov ecx, [ebp - 0x18]
L0047: mov [ebp - 0xc], ecx
L004a: mov ecx, [ebp - 8]
L004d: call System.Runtime.CompilerServices.Unsafe.AsRef[[System.Int32, System.Private.CoreLib]](Int32 ByRef)
L0052: mov [ebp - 0x1c], eax
L0055: mov ecx, [ebp - 0x1c]
L0058: mov [ebp - 0x10], ecx
L005b: mov ecx, [ebp - 0xc]
L005e: mov ecx, [ecx]
L0060: mov edx, [ebp - 0x10]
L0063: cmp ecx, [edx]
L0065: jne L010a
L006b: mov ecx, [ebp - 0xc]
L006e: mov edx, 1
L0073: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L0078: mov [ebp - 0x24], eax
L007b: mov ecx, [ebp - 0x24]
L007e: mov ecx, [ecx]
L0080: mov [ebp - 0x28], ecx
L0083: mov ecx, [ebp - 0x10]
L0086: mov edx, 1
L008b: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L0090: mov [ebp - 0x2c], eax
L0093: mov ecx, [ebp - 0x28]
L0096: mov edx, [ebp - 0x2c]
L0099: cmp ecx, [edx]
L009b: jne short L010a
L009d: mov ecx, [ebp - 0xc]
L00a0: mov edx, 2
L00a5: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L00aa: mov [ebp - 0x30], eax
L00ad: mov ecx, [ebp - 0x30]
L00b0: mov ecx, [ecx]
L00b2: mov [ebp - 0x34], ecx
L00b5: mov ecx, [ebp - 0x10]
L00b8: mov edx, 2
L00bd: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L00c2: mov [ebp - 0x38], eax
L00c5: mov ecx, [ebp - 0x34]
L00c8: mov edx, [ebp - 0x38]
L00cb: cmp ecx, [edx]
L00cd: jne short L010a
L00cf: mov ecx, [ebp - 0xc]
L00d2: mov edx, 3
L00d7: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L00dc: mov [ebp - 0x3c], eax
L00df: mov ecx, [ebp - 0x3c]
L00e2: mov ecx, [ecx]
L00e4: mov [ebp - 0x40], ecx
L00e7: mov ecx, [ebp - 0x10]
L00ea: mov edx, 3
L00ef: call System.Runtime.CompilerServices.Unsafe.Add[[System.Int32, System.Private.CoreLib]](Int32 ByRef, Int32)
L00f4: mov [ebp - 0x44], eax
L00f7: mov eax, [ebp - 0x40]
L00fa: mov edx, [ebp - 0x44]
L00fd: cmp eax, [edx]
L00ff: sete al
L0102: movzx eax, al
L0105: mov [ebp - 0x20], eax
L0108: jmp short L010f
L010a: xor eax, eax
L010c: mov [ebp - 0x20], eax
L010f: mov eax, [ebp - 0x20]
L0112: movzx eax, al
L0115: mov [ebp - 0x14], eax
L0118: nop
L0119: jmp short L011b
L011b: mov eax, [ebp - 0x14]
L011e: mov esp, ebp
L0120: pop ebp
L0121: ret

The above implementation seems too complicated to me for what it's doing though, so maybe I didn't use SharpLab correctly.

@SingleAccretion
Contributor

Using SharpLab, the existing Guid.Equals(Guid) implementation is much more complicated:

You've fallen into the usual trap of Sharplab defaulting to Debug x86 :).

I'm assuming the proposed Guid.Equals(Guid) implementation where SSE2 is available loads two 128-bit registers - one with each GUID (copied from the stack) and then executes the following two SSE2 instructions:

__m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
_mm_movemask_epi8 (__m128i a)

It then takes the result of the second instruction and compares it with 65,535 and if equal, returns 'true'.

I would think the same, but that wasn't really the goal of my question. There is a substantial performance difference in your benchmarks, and I was asking for the disassembly (note that BDN has a built-in DisassemblyDiagnoser attribute for this).
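For example, annotating the benchmark class as below makes BDN emit the JITted disassembly for each benchmark (a sketch; the class name is illustrative):

[DisassemblyDiagnoser]
public class GuidEqualsBenchmarks
{
    // ... [Benchmark] methods as above ...
}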

@bill-poole
Author

You've fallen into the usual trap of Sharplab defaulting to Debug x86 :).

Aha! Traps for beginners. Thanks @SingleAccretion, that tip is much appreciated. The updated code gen for the existing Guid.Equals(Guid) method is below.

L0000: push ebp
L0001: mov ebp, esp
L0003: mov eax, [ecx]
L0005: cmp eax, [edx]
L0007: jne short L0027
L0009: mov eax, [ecx+4]
L000c: cmp eax, [edx+4]
L000f: jne short L0027
L0011: mov eax, [ecx+8]
L0014: cmp eax, [edx+8]
L0017: jne short L0027
L0019: mov eax, [ecx+0xc]
L001c: cmp eax, [edx+0xc]
L001f: sete al
L0022: movzx eax, al
L0025: pop ebp
L0026: ret
L0027: xor eax, eax
L0029: pop ebp
L002a: ret

The code gen for the proposed Guid.Equals(Guid) method where there is SSE 2 support is:

L0000: vzeroupper
L0003: vmovupd xmm0, [ecx]
L0007: vmovupd xmm1, [esp+4]
L000d: vpcmpeqb xmm0, xmm0, xmm1
L0011: vpmovmskb eax, xmm0
L0015: cmp eax, 0xffff
L001a: sete al
L001d: movzx eax, al
L0020: ret 0x10

@bill-poole
Author

bill-poole commented May 5, 2021

Okay, so it looks like what's happening here is that because the proposed Guid.Equals(Guid) implementation is fewer lines of code, it is being inlined, whereas the existing implementation is not. If I decorate the method with "no inlining", then we get the following results:

| Method                            | Mean     | Error     | StdDev    |
|-----------------------------------|----------|-----------|-----------|
| GuidEquals_UnequalGuids_Net50     | 1.932 ns | 0.0098 ns | 0.0082 ns |
| GuidEquals_UnequalGuids_Vector128 | 1.797 ns | 0.0051 ns | 0.0045 ns |
| GuidEquals_EqualGuids_Net50       | 1.936 ns | 0.0085 ns | 0.0080 ns |
| GuidEquals_EqualGuids_Vector128   | 1.746 ns | 0.0065 ns | 0.0060 ns |

But if we let the JIT compiler decide for itself whether to inline, it chooses to inline and we get this 7x performance boost. Even if it isn't inlined, it's still slightly faster.
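For clarity, the "no inlining" decoration referred to above is just the standard attribute on the comparison helper (a sketch; the helper name is illustrative and the body mirrors the earlier example):

[MethodImpl(MethodImplOptions.NoInlining)]
private static bool GuidEquals_Vector128_NoInline(in Guid left, in Guid right)
{
    // Same Vector128<byte> comparison as before, but the JIT is not allowed to inline it.
    var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(in left));
    var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(in right));
    return g1.Equals(g2);
}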

@bill-poole
Author

Note also that if SSE 4.1 is available, then it is possible to use Sse41.TestZ to compare two 128-bit vectors as follows (as a further improvement to the performance of Vector128<T>.Equals(Vector128<T>)):

L0000: vzeroupper
L0003: vmovupd xmm0, [ecx]
L0007: vmovupd xmm1, [esp+4]
L000d: vpxor xmm0, xmm0, xmm1
L0011: vptest xmm0, xmm0
L0016: sete al
L0019: movzx eax, al
L001c: ret 0x10

This is one fewer CPU instruction.
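In C#, the intrinsic sequence producing the codegen above is roughly the following (a sketch; GuidBitsEqualSse41 is an illustrative helper name):

static bool GuidBitsEqualSse41(Vector128<byte> g1, Vector128<byte> g2)
{
    // XOR the two vectors; the result is all zeros only when every bit matches.
    Vector128<byte> xor = Sse2.Xor(g1, g2);

    // vptest: TestZ returns true when (xor AND xor) is all zeros, i.e. the GUIDs are equal.
    return Sse41.TestZ(xor, xor);
}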

@tannergooding
Member

That codegen should even end up being the following, as I'd expect the second load to be folded:

vzeroupper
vmovupd xmm0, [ecx]
vpxor xmm0, xmm0, [esp+4]
vptest xmm0, xmm0
sete al
movzx eax, al
ret 0x10

@bill-poole
Author

Good point. Is that a JIT issue that is already on the radar?

@AndyAyersMS
Member

@tannergooding can you shepherd this PR?

@tannergooding
Member

tannergooding commented May 10, 2021

If a PR gets placed up with corresponding numbers, certainly.

That being said, the software fallback in Vector128.Equals(Vector128) is implemented as a loop, which is slower than the current Guid.Equals(Guid) implementation, which compares the four 32-bit integers of the two GUIDs as four separate comparison statements - i.e. no loop.

Notably this software fallback is only executed on 32-bit ARM (or if the user explicitly disables hardware acceleration) and the overall perf there isn't of too much concern. Additionally, even when hardware acceleration is disabled, this gets unrolled by the JIT (Vector128<T>.Count is special and is one of the few places loop unrolling happens today).

If this were "optimal", it would use Vector128<nint> and do 2 or 4 comparisons in a loop (which would then be unrolled, etc), but Vector128<nint> is blocked until #52017 is reviewed and approved.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label May 10, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone May 10, 2021
@JulieLeeMSFT
Member

Assigned this to @tannergooding for now.

@bill-poole
Author

bill-poole commented May 11, 2021

Notably this software fallback is only executed on 32-bit ARM (or if the user explicitly disables hardware acceleration)

The Vector128<T>.Equals(Vector128<T>) method seems to use the software fallback for either 32-bit or 64-bit ARM. Or is there a path for hardware acceleration for 64-bit ARM that I've missed?

If a PR gets placed up with corresponding numbers, certainly.

Would you like me to submit a PR? If so, what form should the PR take? i.e. would the following update to the Guid.EqualsCore method be sufficient?

private static bool EqualsCore(in Guid left, in Guid right)
{
    var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
    var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));

    return g1.Equals(g2);
}

Would any changes to the Vector128<T>.Equals(Vector128<T>) method also be needed as part of the PR? What about the opportunity to improve the performance of Vector128<T>.Equals(Vector128<T>) on platforms supporting SSE 4.1 using Sse41.TestZ? Should that be raised as a separate issue?

What performance numbers would I need to provide beyond those I've provided in this issue?

This will be my first time submitting a PR to the .NET runtime repo, so any/all guidance is much appreciated.

@tannergooding
Member

The Vector128.Equals(Vector128) method seems to use the software fallback for either 32-bit or 64-bit ARM

You're right. That looks to be an oversight and shouldn't be that way. This should really have been a JIT intrinsic over operator == to avoid various issues.

Would you like me to submit a PR? If so, what form should the PR take? i.e. would the following update to the Guid.EqualsCore method be sufficient

I don't have a particular preference. The key is showing the perf numbers, preferably via the existing BDN benchmarks (https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Runtime/Perf.Guid.cs#L36-L42), that this improves everything.

Would any changes to the Vector128.Equals(Vector128) method also be needed as part of the PR?

I think that depends on if the current codegen is sufficient.

What about the opportunity to improve the performance of Vector128.Equals(Vector128) on platforms supporting SSE 4.1 using Sse41.TestZ? Should that be raised as a separate issue?

Probably, yes. Ideally Equals wouldn't be implemented in C# at all and would instead be a JIT intrinsic so we can have a better inlining and codegen experience.

@bill-poole
Author

The key is showing the perf numbers, preferably via the existing BDN benchmarks

I've only got the use of an x64 platform. Do I need to run these benchmarks on Arm32/Arm64 and/or with hardware acceleration disabled to provide corresponding performance benchmarks with the PR? Or will providing "before/after" benchmarks only for x86 with SSE2 enabled suffice?

I can't figure out how to disable hardware acceleration when running the benchmarks. Setting the "COMPlus_EnableHWIntrinsic" environment variable to "0" doesn't seem to be having any effect.

I decided to use SharpLab to get the code gen for the software fallback in the Vector128<T>.Equals(Vector128<T>) method, with T as int, to see how it stacks up against the code gen produced by the existing Guid.EqualsCore method. As can be seen below, it doesn't stack up well.

The code gen for the software fallback in Vector128<T>.Equals(Vector128<T>) is below.

    L0000: sub rsp, 0x48
    L0004: vzeroupper
    L0007: vmovupd xmm0, [rcx]
    L000b: vmovapd [rsp+0x30], xmm0
    L0011: mov eax, 4
    L0016: test eax, eax
    L0018: jbe L0138
    L001e: lea rax, [rsp+0x30]
    L0023: mov r8d, [rax]
    L0026: vmovupd xmm0, [rdx]
    L002a: vmovapd [rsp+0x20], xmm0
    L0030: mov eax, 4
    L0035: test eax, eax
    L0037: jbe L0138
    L003d: lea rax, [rsp+0x20]
    L0042: mov r9d, [rax]
    L0045: cmp r8d, r9d
    L0048: jne L0131
    L004e: vmovupd xmm0, [rcx]
    L0052: vmovapd [rsp+0x30], xmm0
    L0058: mov r8d, 4
    L005e: cmp r8d, 1
    L0062: jbe L0138
    L0068: lea rax, [rsp+0x30]
    L006d: mov r8d, [rax+4]
    L0071: vmovupd xmm0, [rdx]
    L0075: vmovapd [rsp+0x20], xmm0
    L007b: mov r9d, 4
    L0081: cmp r9d, 1
    L0085: jbe L0138
    L008b: lea rax, [rsp+0x20]
    L0090: mov r9d, [rax+4]
    L0094: cmp r8d, r9d
    L0097: jne L0131
    L009d: vmovupd xmm0, [rcx]
    L00a1: vmovapd [rsp+0x30], xmm0
    L00a7: mov r8d, 4
    L00ad: cmp r8d, 2
    L00b1: jbe L0138
    L00b7: lea rax, [rsp+0x30]
    L00bc: mov r8d, [rax+8]
    L00c0: vmovupd xmm0, [rdx]
    L00c4: vmovapd [rsp+0x20], xmm0
    L00ca: mov r9d, 4
    L00d0: cmp r9d, 2
    L00d4: jbe short L0138
    L00d6: lea rax, [rsp+0x20]
    L00db: mov r9d, [rax+8]
    L00df: cmp r8d, r9d
    L00e2: jne short L0131
    L00e4: vmovupd xmm0, [rcx]
    L00e8: vmovapd [rsp+0x30], xmm0
    L00ee: mov r8d, 4
    L00f4: cmp r8d, 3
    L00f8: jbe short L0138
    L00fa: lea rax, [rsp+0x30]
    L00ff: mov r8d, [rax+0xc]
    L0103: vmovupd xmm0, [rdx]
    L0107: vmovapd [rsp+0x20], xmm0
    L010d: mov r9d, 4
    L0113: cmp r9d, 3
    L0117: jbe short L0138
    L0119: lea rax, [rsp+0x20]
    L011e: mov r9d, [rax+0xc]
    L0122: cmp r8d, r9d
    L0125: jne short L0131
    L0127: mov eax, 1
    L012c: add rsp, 0x48
    L0130: ret
    L0131: xor eax, eax
    L0133: add rsp, 0x48
    L0137: ret
    L0138: mov ecx, 0x15
    L013d: call System.ThrowHelper.ThrowArgumentOutOfRangeException(System.ExceptionArgument)
    L0142: int3

The code gen for Guid.EqualsCore is below.

    L0000: mov eax, [rcx]
    L0002: cmp eax, [rdx]
    L0004: jne short L0023
    L0006: mov eax, [rcx+4]
    L0009: cmp eax, [rdx+4]
    L000c: jne short L0023
    L000e: mov eax, [rcx+8]
    L0011: cmp eax, [rdx+8]
    L0014: jne short L0023
    L0016: mov eax, [rcx+0xc]
    L0019: cmp eax, [rdx+0xc]
    L001c: sete al
    L001f: movzx eax, al
    L0022: ret
    L0023: xor eax, eax
    L0025: ret

This means that we can expect the performance of Guid.Equals(Guid) to regress on Arm32/Arm64 and/or where hardware acceleration is disabled. I can see two possible paths forward here:

  1. Update the software fallback in Vector128<T>.Equals(Vector128<T>) with an implementation similar to that of the current Guid.EqualsCore method.
  2. Update Guid.EqualsCore to check whether SSE2 is supported and if so, use Vector128<T>.Equals(Vector128<T>), otherwise use the existing Guid.EqualsCore implementation.

Option 1 has the benefit of improving the performance of the Vector128<T>.Equals(Vector128<T>) software fallback, while option 2 constrains the changes to the Guid.EqualsCore method.

@bill-poole
Author

bill-poole commented May 12, 2021

Perhaps the best option would be to update Guid.EqualsCore to invoke the relevant hardware intrinsics directly, rather than using Vector128<T>.Equals(Vector128<T>). This also allows us to add the SSE4.1 optimisation.

private static bool EqualsCore(in Guid left, in Guid right)
{
    if (Sse2.IsSupported)
    {
        var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
        var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));

        if (Sse41.IsSupported)
        {
            var xor = Sse2.Xor(g1, g2);
            return Sse41.TestZ(xor, xor);
        }

        var result = Sse2.CompareEqual(g1, g2);
        return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
    }
    else
    {
        ref int rA = ref Unsafe.AsRef(in left._a);
        ref int rB = ref Unsafe.AsRef(in right._a);

        // Compare each element

        return rA == rB
            && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
            && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
            && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
    }
}

We can then leave any changes/optimisations to Vector128<T>.Equals(Vector128<T>), including possibly making it a JIT intrinsic, to a separate, independent issue. This option also eliminates any risk of performance regression from using the software fallback in Vector128<T>.Equals(Vector128<T>) to compare GUIDs.

@bill-poole
Author

Strangely though, the JIT is opting not to inline using the above proposed Guid.EqualsCore implementation, whereas it does inline when Guid.EqualsCore invokes Vector128<T>.Equals(Vector128<T>).

Therefore, perhaps it is best we have Guid.EqualsCore use Vector128<T>.Equals(Vector128<T>) to perform the comparison and update Vector128<T>.Equals(Vector128<T>) as follows.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool Equals(Vector128<T> other)
{
    ThrowHelper.ThrowForUnsupportedIntrinsicsVectorBaseType<T>();

    if (Sse.IsSupported && (typeof(T) == typeof(float)))
    {
        var result = Sse.CompareEqual(this.AsSingle(), other.AsSingle());
        return Sse.MoveMask(result) == 0b1111; // We have one bit per element
    }

    if (Sse41.IsSupported)
    {
        var xor = Sse2.Xor(this.AsByte(), other.AsByte());
        return Sse41.TestZ(xor, xor);
    }

    if (Sse2.IsSupported)
    {
        if (typeof(T) == typeof(double))
        {
            var result = Sse2.CompareEqual(this.AsDouble(), other.AsDouble());
            return Sse2.MoveMask(result) == 0b11; // We have one bit per element
        }
        else
        {
            // Unlike float/double, there are no special values to consider
            // for integral types and we can just do a comparison that all
            // bytes are exactly the same.

            Debug.Assert((typeof(T) != typeof(float)) && (typeof(T) != typeof(double)));
            var result = Sse2.CompareEqual(this.AsByte(), other.AsByte());
            return Sse2.MoveMask(result) == 0b1111_1111_1111_1111; // We have one bit per element
        }
    }

    return SoftwareFallback(in this, other);

    static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
    {
        ref int rA = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(vector));
        ref int rB = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(other));

        return rA == rB
            && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
            && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
            && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
    }
}

This adds the optimised implementation for SSE 4.1 and also improves the performance of the software fallback to use the same implementation as the current Guid.EqualsCore method, so there should be no performance regression on Arm32/Arm64 platforms or where hardware acceleration is disabled.

Guid.EqualsCore would then be updated as previously proposed:

private static bool EqualsCore(in Guid left, in Guid right)
{
    var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
    var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));

    return g1.Equals(g2);
}

This solution has the advantage of improving the performance of Vector128<T>.Equals(Vector128<T>) by using SSE 4.1 where available and also improving the performance of the software fallback on Arm32/Arm64. It also results in the JIT inlining Guid.Equals(Guid) where SSE 2/4.1 is supported.

@bill-poole
Author

We can, however, coax the JIT into inlining Guid.Equals(Guid) when SSE2 or SSE 4.1 is supported (though not when using the software fallback) by implementing Guid.EqualsCore with the same approach taken by Vector128<T>.Equals(Vector128<T>): decorate Guid.EqualsCore with "aggressive inlining" and place the software fallback in a local function.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool EqualsCore(in Guid left, in Guid right)
{
    if (Sse2.IsSupported)
    {
        var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
        var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));

        if (Sse41.IsSupported)
        {
            var xor = Sse2.Xor(g1, g2);
            return Sse41.TestZ(xor, xor);
        }

        var result = Sse2.CompareEqual(g1, g2);
        return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
    }

    return SoftwareFallback(left, right);

    static bool SoftwareFallback(in Guid left, in Guid right)
    {
        ref int rA = ref Unsafe.AsRef(in left._a);
        ref int rB = ref Unsafe.AsRef(in right._a);

        // Compare each element

        return rA == rB
            && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
            && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
            && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
    }
}

This is the lowest risk, easiest option because it delivers the maximum performance without having to update Vector128<T>.Equals(Vector128<T>).

@KalleOlaviNiemitalo

Do those vector intrinsics have alignment requirements that Guid might not satisfy?

@bill-poole
Author

I suspect there's a chance of that, because the Guid struct will likely only be 32-bit aligned, given that's the width of its largest field. I'm not 100% sure of the rules on that though.

But if the performance is better (and verified as such) with this change, does that matter?

@KalleOlaviNiemitalo

If incorrect alignment causes the intrinsics to throw exceptions or return incorrect results at run time, then it matters. I don't know whether these intrinsics could do that.

@bill-poole
Author

If there’s any misalignment, then it may impact performance, but won’t cause an incorrect result to be returned or an exception to be thrown.

@tannergooding
Member

static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
    ref int rA = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(vector));
    ref int rB = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(other));

    return rA == rB
        && Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
        && Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
        && Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
}

This software fallback is only valid for integral types, it will break for floating-point types.

If alignment is a concern, the correct thing to do is to use Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in guid))).
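Applied to Guid.EqualsCore, that unaligned-read variant would look roughly like this (a sketch of the suggestion, not a final implementation):

private static bool EqualsCore(in Guid left, in Guid right)
{
    // ReadUnaligned copies the 16 bytes into a Vector128<byte> without assuming 16-byte alignment.
    var g1 = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
    var g2 = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));

    return g1.Equals(g2);
}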

It may simply be better to wait for #49397 to be approved at which point Vector128 == Vector128 will be available and I'll have the time to correctly optimize it on all platforms.

@gfoidl
Member

gfoidl commented May 12, 2021

@tannergooding should the other also be passed with in?
If not, then for rB the Unsafe.AsRef can be omitted.

@tannergooding
Member

tannergooding commented May 12, 2021

In practice, it would be good to not pass either by reference. Parameters passed by reference are considered address taken and that can hurt certain optimizations, particularly with enregistration.

However, Equals is an instance method on a value type and so this is implicitly a byref, and it can't be a static extension method like the other functions.

Edit: Notably, I also only copied the code verbatim from above. I didn't check it for correctness other than to call out that it's broken for floating-point types.

@bill-poole
Author

should the other also be passed with in? If not, then for rB the Unsafe.AsRef can be omitted.

Good point. This resulted from copying and pasting from the Vector128<T>.Equals(Vector128<T>) implementation, where the other parameter is not passed with in (as shown below), and then just blindly copying that into the proposed SoftwareFallback implementation without thinking about it. Apologies.

static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
    // ...
}

This software fallback is only valid for integral types, it will break for floating-point types.

Given that is the case, then shall we just update Guid.EqualsCore as suggested here? That way, we're not messing with Vector128<T>.Equals(Vector128<T>) and we can leave those optimisations to future work already on the roadmap.

@bill-poole
Author

In practice, it would be good to not pass either by reference. Parameters passed by reference are considered address taken and that can hurt certain optimizations, particularly with enregistration.

Note that the existing method signature of Guid.EqualsCore passes the two Guid parameters by reference. I've just maintained consistency with the existing method signature. That being said, I've tended to find in these performance benchmarks that passing Guid parameters by reference improves performance over passing by value.

I would have thought that the JIT should keep the Guid parameters in registers, such that passing by reference either makes no difference to performance or possibly may hurt performance as you suggest. But that hasn't proven to be the case with .NET 5. Maybe that is different with .NET 6?

Maybe the JIT has difficulty keeping Guid parameters in registers because they are implemented with 11 fields?

@tannergooding
Member

Maybe the JIT has difficulty keeping Guid parameters in registers because they are implemented with 11 fields?

That would be part of it. There are a number of considerations that come into play, including the layout and size of the type as well as whether or not inlining is going to occur.

Given that is the case, then shall we just update Guid.EqualsCore as suggested here?

That might be suitable, but this is also only shaving off 0.2ns, which is roughly equivalent to saving a single CPU cycle and so it may not be worth significant investment over waiting for Vector128 to be updated (which someone else is free to move to be a JIT intrinsic, it just requires touching C++ code).

@bill-poole
Author

That might be suitable, but this is also only shaving off 0.2ns

The performance gain I'm seeing is reducing the time approximately from 1.8 ns to 0.3 ns (on my laptop) when the JIT inlines the method. The JIT inlines Guid.Equals when implemented using Vector128<T>.Equals(Vector128<T>) presumably because the method implementation is sufficiently smaller than the existing implementation of Guid.EqualsCore.

I've found that the JIT inlines the method if either Vector128<T>.Equals(Vector128<T>) is used as the implementation of Guid.EqualsCore or by using the implementation here, which only inlines Guid.EqualsCore on platforms supporting SSE 2 or SSE 4.1.

It seems that the bulk of the CPU time spent invoking Guid.Equals is the overhead associated with a non-inlined method invocation with a by-value Guid parameter (approximately 1.3 ns on my laptop). If we consider the CPU time involved only in the method implementation itself (i.e. without the overhead of the non-inlined method invocation), then the 0.2 ns gain is much more substantial.

But as you said, that performance gain is still only 0.2 ns, which is not hugely significant if the non-inlined method invocation overhead of 1.3 ns is unavoidable. However, by keeping the method's codegen small enough for the JIT to inline it, we avoid the 1.3 ns invocation overhead and achieve a reduction of close to 1.5 ns - i.e. from 1.8 ns to 0.3 ns (about a 6x performance gain).

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 20, 2021
@BruceForstall
Member

@tannergooding Should this really be marked area-CodeGen-coreclr? And milestone:6.0.0?

It seems like this should be marked in some libraries area (and pushed to .NET 7), and if there is a specific codegen ask, open a more targeted issue covering just the codegen improvements that should be considered.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 23, 2021
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.0, 7.0.0 Jul 26, 2021
@JulieLeeMSFT
Member

Moved to .NET 7 since it is an optimization.

@AndyAyersMS
Member

Is this covered by #66889 ?

@tannergooding
Member

Should be, yes.

@ghost ghost locked as resolved and limited conversation to collaborators May 29, 2022