Performance improvement for Guid.Equals #52296
Comments
See #35654 (comment) |
In addition to #35654, make sure it doesn't regress what is probably the most popular case - when two GUIDs are different, it is most likely enough to check, say, the first several bytes for a fast exit. (But it will almost certainly regress that case.) |
Good point @EgorBo, I have now benchmarked with equal and unequal GUIDs:
Comparing equal GUIDs using |
Interestingly, re-running the benchmarks for equal GUIDs, I got the following results:
This shows a 7.2x performance improvement for equal GUIDs. I suspect the difference in performance from my last measurement is due to loop alignment, which is rectified in .NET 6 as discussed here. If I'm right, then we're looking at around 7x performance improvement for both equal and unequal GUIDs. |
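For context, a minimal BenchmarkDotNet harness for these two cases might look like the following (the class and method names are illustrative, not the actual benchmark code behind the numbers above):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class GuidEqualsBenchmarks
{
    private readonly Guid _left = Guid.NewGuid();
    private readonly Guid _sameAsLeft;
    private readonly Guid _different = Guid.NewGuid();

    public GuidEqualsBenchmarks() => _sameAsLeft = _left;

    [Benchmark]
    public bool EqualGuids() => _left.Equals(_sameAsLeft);

    [Benchmark]
    public bool UnequalGuids() => _left.Equals(_different);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<GuidEqualsBenchmarks>();
}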
Wouldn't it be better to tweak the JIT to recognize the pattern of 4 sequential integer comparisons and produce SSE2? |
What does the codegen for both versions look like? A quarter of a nanosecond is indeed very quick; it is interesting where the gain is coming from. |
I see the ticket is closed, so probably won't happen. |
@SingleAccretion, based on the implementation of the
It then takes the result of the second instruction, compares it with 65,535 and, if equal, returns `true`. Using SharpLab, the existing
The above implementation seems too complicated to me for what it's doing though, so maybe I didn't use SharpLab correctly. |
You've fallen into the usual trap of Sharplab defaulting to Debug x86 :).
I would think the same, but that wasn't really the goal of my question. There is a substantial performance difference in your benchmarks, and I was asking for the disassembly (note that BDN has a built-in |
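For anyone following along, the built-in support referred to here is presumably the disassembly diagnoser, which is a one-attribute change; a sketch, reusing the illustrative benchmark class from above:

using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser] // captures the JIT-generated assembly for each [Benchmark] method
public class GuidEqualsBenchmarks
{
    // ... [Benchmark] methods as in the earlier sketch ...
}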
Aha! Traps for beginners. Thanks @SingleAccretion, that tip is much appreciated. The updated code gen for the existing
The code gen for the proposed
|
Okay, so it looks like what's happening here is that because the proposed
But if we let the JIT compiler decide for itself whether to inline, it chooses to inline and we get this 7x performance boost. Even if it isn't inlined, it's still slightly faster. |
Note also that if SSE 4.1 is available, then it is possible to use
This is one fewer CPU instruction. |
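To make the comparison concrete, here is a small sketch of the two flavours side by side (assuming SSE2/SSE4.1 support has already been checked by the caller; `GuidCompareSketch` is just an illustrative holder, not runtime code):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class GuidCompareSketch
{
    // SSE2: pcmpeqb yields 0xFF per matching byte, pmovmskb packs those bits into a
    // 16-bit mask, and the mask equals 0xFFFF (65,535) only when all 16 bytes match.
    public static bool EqualsSse2(Vector128<byte> g1, Vector128<byte> g2)
        => Sse2.MoveMask(Sse2.CompareEqual(g1, g2)) == 0xFFFF;

    // SSE4.1: the XOR of two vectors is all-zero only when they are identical, and
    // ptest (TestZ) checks that directly - one fewer instruction than the SSE2 sequence.
    public static bool EqualsSse41(Vector128<byte> g1, Vector128<byte> g2)
    {
        var xor = Sse2.Xor(g1, g2);
        return Sse41.TestZ(xor, xor);
    }
}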
That codegen should even end up being the following, as I'd expect the second load to be folded:
|
Good point. Is that a JIT issue that is already on the radar? |
@tannergooding can you shepherd this PR? |
If a PR gets placed up with corresponding numbers, certainly.
Notably this software fallback is only executed on 32-bit ARM (or if the user explicitly disables hardware acceleration) and the overall perf there isn't of too much concern. Additionally, even when hardware acceleration is disabled, this gets unrolled by the JIT. If this were "optimal", it would use |
Assigned this to @tannergooding for now. |
The
Would you like me to submit a PR? If so, what form should the PR take? i.e. would the following update to the `EqualsCore` method be sufficient?

private static bool EqualsCore(in Guid left, in Guid right)
{
var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));
return g1.Equals(g2);
}

Would any changes to the `Vector128<T>.Equals` software fallback also be required? What performance numbers would I need to provide beyond those I've provided in this issue? This will be my first time submitting a PR to the .NET runtime repo, so any/all guidance is much appreciated. |
You're right. That looks to be an oversight and shouldn't be that way. This should really have been a JIT intrinsic over
I don't have a particular preference. The key is showing the perf numbers, preferably via the existing BDN benchmarks (https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Runtime/Perf.Guid.cs#L36-L42) that this improves everything.
I think that depends on if the current codegen is sufficient.
Probably, yes. Ideally |
I've only got the use of an x64 platform. Do I need to run these benchmarks on Arm32/Arm64 and/or with hardware acceleration disabled to provide corresponding performance benchmarks with the PR? Or will providing "before/after" benchmarks only for x86 with SSE2 enabled suffice? I can't figure out how to disable hardware acceleration when running the benchmarks. Setting the "COMPlus_EnableHWIntrinsic" environment variable to "0" doesn't seem to be having any effect. I decided to use SharpLab to get the code gen for the software fallback instead. The code gen for the software fallback in `Vector128<T>.Equals(Vector128<T>)` is below.
The code gen for
This means that we can expect the performance of
Option 1 has the benefit of improving the performance of the |
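On the earlier question about disabling hardware acceleration: one approach that may work is to set the environment variable on the BenchmarkDotNet job itself, so it is guaranteed to reach the benchmark host process (a sketch; the config class name is illustrative and `WithEnvironmentVariable` is assumed to be available in the BDN version used):

using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class NoHwIntrinsicsConfig : ManualConfig
{
    public NoHwIntrinsicsConfig()
    {
        // One job with defaults and one with hardware intrinsics disabled, so the
        // SSE2 path and the software fallback can be compared in a single run.
        AddJob(Job.Default.WithId("Default"));
        AddJob(Job.Default
            .WithEnvironmentVariable("COMPlus_EnableHWIntrinsic", "0")
            .WithId("NoHWIntrinsics"));
    }
}

// Usage: BenchmarkRunner.Run<GuidEqualsBenchmarks>(new NoHwIntrinsicsConfig());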
Perhaps the best option would be to update `EqualsCore` as follows:

private static bool EqualsCore(in Guid left, in Guid right)
{
if (Sse2.IsSupported)
{
var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));
if (Sse41.IsSupported)
{
var xor = Sse2.Xor(g1, g2);
return Sse41.TestZ(xor, xor);
}
var result = Sse2.CompareEqual(g1, g2);
return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
}
else
{
ref int rA = ref Unsafe.AsRef(in left._a);
ref int rB = ref Unsafe.AsRef(in right._a);
// Compare each element
return rA == rB
&& Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
&& Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
}
}

We can then leave any changes/optimisations to `Vector128<T>.Equals` for a separate change. |
Strangely though, the JIT is opting not to inline `EqualsCore` with the above proposed implementation. Therefore, perhaps it is best we have `Vector128<T>.Equals` updated along these lines:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool Equals(Vector128<T> other)
{
ThrowHelper.ThrowForUnsupportedIntrinsicsVectorBaseType<T>();
if (Sse.IsSupported && (typeof(T) == typeof(float)))
{
var result = Sse.CompareEqual(this.AsSingle(), other.AsSingle());
return Sse.MoveMask(result) == 0b1111; // We have one bit per element
}
if (Sse41.IsSupported)
{
var xor = Sse2.Xor(this.AsByte(), other.AsByte());
return Sse41.TestZ(xor, xor);
}
if (Sse2.IsSupported)
{
if (typeof(T) == typeof(double))
{
var result = Sse2.CompareEqual(this.AsDouble(), other.AsDouble());
return Sse2.MoveMask(result) == 0b11; // We have one bit per element
}
else
{
// Unlike float/double, there are no special values to consider
// for integral types and we can just do a comparison that all
// bytes are exactly the same.
Debug.Assert((typeof(T) != typeof(float)) && (typeof(T) != typeof(double)));
var result = Sse2.CompareEqual(this.AsByte(), other.AsByte());
return Sse2.MoveMask(result) == 0b1111_1111_1111_1111; // We have one bit per element
}
}
return SoftwareFallback(in this, other);
static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
ref int rA = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(vector));
ref int rB = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(other));
return rA == rB
&& Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
&& Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
}
}

This adds the optimised implementation for SSE 4.1 and also improves the performance of the software fallback to use the same implementation as the current `Guid.EqualsCore`. `EqualsCore` can then simply become:
private static bool EqualsCore(in Guid left, in Guid right)
{
var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));
return g1.Equals(g2);
}

This solution has the advantage of improving the performance of `Vector128<T>.Equals` itself, not just `Guid.Equals`. |
Although, we can coax the JIT to inline `EqualsCore` by marking it with `MethodImplOptions.AggressiveInlining`:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool EqualsCore(in Guid left, in Guid right)
{
if (Sse2.IsSupported)
{
var g1 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(left));
var g2 = Unsafe.As<Guid, Vector128<byte>>(ref Unsafe.AsRef(right));
if (Sse41.IsSupported)
{
var xor = Sse2.Xor(g1, g2);
return Sse41.TestZ(xor, xor);
}
var result = Sse2.CompareEqual(g1, g2);
return Sse2.MoveMask(result) == 0b1111_1111_1111_1111;
}
return SoftwareFallback(left, right);
static bool SoftwareFallback(in Guid left, in Guid right)
{
ref int rA = ref Unsafe.AsRef(in left._a);
ref int rB = ref Unsafe.AsRef(in right._a);
// Compare each element
return rA == rB
&& Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
&& Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
}
}

This is the lowest risk, easiest option because it delivers the maximum performance without having to update `Vector128<T>.Equals`. |
Do those vector intrinsics have alignment requirements that Guid might not satisfy? |
I suspect there's a chance of that, because the `Guid` struct will possibly only be 32-bit aligned given that's the width of its largest field. I'm not 100% sure of the rules on that though. But if the performance is better (and verified as such) with this change, does that matter? |
If incorrect alignment causes the intrinsics to throw exceptions or return incorrect results at run time, then it matters. I don't know whether these intrinsics could do that. |
If there’s any misalignment, then it may impact performance, but won’t cause an incorrect result to be returned or an exception to be thrown. |
static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
ref int rA = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(vector));
ref int rB = ref Unsafe.As<Vector128<T>, int>(ref Unsafe.AsRef(other));
return rA == rB
&& Unsafe.Add(ref rA, 1) == Unsafe.Add(ref rB, 1)
&& Unsafe.Add(ref rA, 2) == Unsafe.Add(ref rB, 2)
&& Unsafe.Add(ref rA, 3) == Unsafe.Add(ref rB, 3);
}

This software fallback is only valid for integral types; it will break for floating-point types. If alignment is a concern, the correct thing to do is to use an explicitly unaligned load. It may simply be better to wait for #49397 to be approved, at which point |
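If the explicitly-unaligned route were taken for the `Guid` case, a sketch might look like the following (the holder type and method name are illustrative; `Unsafe.ReadUnaligned` makes no alignment assumption):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class GuidUnalignedCompareSketch
{
    public static bool GuidEquals(in Guid left, in Guid right)
    {
        // ReadUnaligned never assumes natural alignment, so this stays correct even
        // if a Guid happens to be only 4-byte aligned.
        var g1 = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left)));
        var g2 = Unsafe.ReadUnaligned<Vector128<byte>>(ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right)));
        return g1.Equals(g2);
    }
}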
@tannergooding should the |
In practice, it would be good to not pass either by reference. Parameters passed by reference are considered address taken and that can hurt certain optimizations, particularly with enregistration. However,

Edit: Notably, I also only copied the code verbatim from above. I didn't check it for correctness other than to call out that it's broken for floating-point types. |
Good point. This has resulted from me copying and pasting from the existing implementation:

static bool SoftwareFallback(in Vector128<T> vector, Vector128<T> other)
{
/// ...
}
Given that is the case, then shall we just update |
Note that the existing method signature of `EqualsCore` already passes both `Guid`s by reference (`in`). I would have thought that the JIT should keep the values enregistered regardless. Maybe the JIT has difficulty keeping |
That would be part of it. There are a number of considerations that come into play, including the layout and size of the type as well as whether or not inlining is going to occur.
That might be suitable, but this is also only shaving off 0.2ns, which is roughly equivalent to saving a single CPU cycle and so it may not be worth significant investment over waiting for Vector128 to be updated (which someone else is free to move to be a JIT intrinsic, it just requires touching C++ code). |
The performance gain I'm seeing is a reduction from approximately 1.8 ns to 0.3 ns (on my laptop) when the JIT inlines the method.

I've found that the JIT inlines the method if either

It seems that the bulk of the CPU time spent invoking

But as you said, that performance gain is still only 0.2 ns, which is not hugely significant if the non-inlined method invocation overhead of 1.3 ns is unavoidable. But by making the code gen for the method implementation smaller, we can get the method inlined, avoiding the 1.3 ns invocation overhead and achieving a reduction of close to 1.5 ns - i.e. 1.8 ns to 0.3 ns (about a 6x performance gain). |
@tannergooding Should this really be marked the way it currently is? It seems like this should be marked in some libraries area (and pushed to .NET 7), and if there is a specific codegen ask, a more targeted issue should be opened covering just the codegen improvements that should be considered. |
Moved to .NET 7 since it is an optimization. |
Is this covered by #66889 ? |
Should be, yes. |
Description
The current `Guid.Equals(Guid)` method compares the 128 bits of each GUID with four 32-bit integer comparisons. However, platforms that support SSE2 can do this comparison more efficiently using `Vector128<byte>`.

However, I've noticed that the `Vector128<T>.Equals(Vector128<T>)` implementation's software fallback (shown below) compares each byte of the two `Vector128<byte>` instances individually, when it could actually test the 128 bits as four 32-bit comparisons (as per the existing `Guid.Equals(Guid)` method implementation), or could be even further optimised to detect whether the platform is 64-bit and, if so, perform the comparison as two 64-bit comparisons.

Alternatively, the `Guid.Equals(Guid)` method could be implemented by comparing two `Vector128<int>` values (instead of `Vector128<byte>` values). This way, the vectors would be compared int-by-int, instead of byte-by-byte. That being said, the software fallback in `Vector128<T>.Equals(Vector128<T>)` is implemented as a loop, which is slower than the current `Guid.Equals(Guid)` implementation, which compares the four 32-bit integers of the two GUIDs as four separate comparison statements - i.e. no loop.

Therefore, I believe the best outcome would be achieved by implementing `Guid.Equals(Guid)` using `Vector128<byte>.Equals(Vector128<byte>)` as shown above, but also updating the software fallback of the `Vector128<T>.Equals(Vector128<T>)` method with a more efficient implementation that doesn't involve a loop and compares based on the native width of the platform (i.e. 32 bits or 64 bits) as opposed to the width of T.

Alternatively, the `Guid.Equals(Guid)` method could be updated to test whether the platform supports SSE2 and, if so, use the `Vector128<byte>` comparison and, if not, fall back to the existing implementation. Or possibly, it could also detect if the platform is 64-bit and, if so, compare using two 64-bit integers instead of four 32-bit integers.

Data

The performance comparison of the existing `Guid.Equals(Guid)` method implementation versus the implementation shown above using `Vector128<byte>` on a machine that supports SSE2 is below.

This is a 6.7x performance improvement.
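For reference, here is a sketch of the two-64-bit-comparison variant mentioned above (the `EqualsCore64` name and holder type are illustrative, not existing runtime APIs):

using System;
using System.Runtime.CompilerServices;

static class GuidEquals64Sketch
{
    public static bool EqualsCore64(in Guid left, in Guid right)
    {
        // Reinterpret each Guid as two 64-bit halves; ReadUnaligned avoids assuming
        // 8-byte alignment, since Guid's largest field is only 32 bits wide.
        ref byte a = ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in left));
        ref byte b = ref Unsafe.As<Guid, byte>(ref Unsafe.AsRef(in right));
        return Unsafe.ReadUnaligned<long>(ref a) == Unsafe.ReadUnaligned<long>(ref b)
            && Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref a, 8)) == Unsafe.ReadUnaligned<long>(ref Unsafe.Add(ref b, 8));
    }
}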