Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential performance improvement for Guid.CompareTo #53981

Open
bill-poole opened this issue Jun 10, 2021 · 4 comments
Open

Potential performance improvement for Guid.CompareTo #53981

bill-poole opened this issue Jun 10, 2021 · 4 comments
Labels
Milestone

Comments

@bill-poole
Copy link

bill-poole commented Jun 10, 2021

The Guid.CompareTo(Guid) method currently compares each field of the given Guid in turn. If the given Guid is not equal, then this is quite fast, because typically the first field returns not equal. However, if the given Guid is equal, then it ends up having to check all 11 fields one-by-one, which is quite slow.

There is an opportunity to use SSE 3 to do this much faster by first re-ordering the bytes of the two Guids to be big-endian, then comparing them using SSE 3. The approach is inspired by the SSE-based UUID comparison in the C++ boost library (which represents UUIDs as big endian, meaning they don't have to shuffle the bytes first).

I don't suspect there will be much appetite for this, given the comments/feedback on #52296, but it's an optimisation we use internally with great success, so I figured I'd contribute it in case someone finds it useful.

We find that for equal Guids, the SSE 3 implementation is 7x faster and for unequal Guids, it is over 2x faster.

Method Mean Error StdDev
FastCompareToSame 0.7268 ns 0.0163 ns 0.0152 ns
FastCompareToDifferent 0.7263 ns 0.0155 ns 0.0145 ns
CompareToSame 5.2880 ns 0.0527 ns 0.0467 ns
CompareToDifferent 1.7628 ns 0.0134 ns 0.0119 ns

The catch is that the integer return value is always -1, 0 or 1 with Guid.CompareTo(Guid), but with the SSE implementation, it may be any negative number when the given Guid is less than, and any positive number when the Guid is greater than.

I leave it to your consideration.

/// <summary>
/// Shuffle mask for rearranging the bytes of a little-endian <see cref="Guid"/> so it is big-endian, but
/// with its 16 bytes in reverse order.
/// </summary>
private static readonly Vector128<sbyte> _shuffle_mask 
    = Vector128.Create(15, 14, 13, 12, 11, 10, 9, 8, 6, 7, 4, 5, 0, 1, 2, 3);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int FastCompareTo(this in Guid guid, in Guid other)
{
    if (Ssse3.IsSupported)
    {
        // Shuffle the byte order of the two UUIDs so they are big-endian, but with their 16 bytes
        // in reverse order. The byte-wise comparison between the two UUIDs must be performed from
        // most significant byte to least for each UUID component to produce the same result as that
        // produced by the Guid.CompareTo method.
        var v1 = Ssse3.Shuffle(Unsafe.As<Guid, Vector128<sbyte>>(ref Unsafe.AsRef(guid)), _shuffle_mask);
        var v2 = Ssse3.Shuffle(Unsafe.As<Guid, Vector128<sbyte>>(ref Unsafe.AsRef(other)), _shuffle_mask);

        // Compare whether v1 is greater than v2 and vice versa. We must perform the comparison in both
        // directions so we can establish for each byte in v1 whether it is greater than, less than or
        // equal to its corresponding byte in v2. If a byte in v1_gt_v2 and v2_gt_v1 is zero, then the
        // corresponding bytes in v1 and v2 are equal. If a byte in v1_gt_v2 is -1, then the corresponding
        // byte in v1 is greater than the corresponding byte in v2. If a byte in v2_gt_v1 is -1, then the
        // corresponding byte in v2 is creater than the corresponding byte in v1.
        var v1_gt_v2 = Sse2.CompareGreaterThan(v1, v2);
        var v2_gt_v1 = Sse2.CompareGreaterThan(v2, v1);

        // SSE "compare greater than" operation only has a signed byte instruction (for comparing signed
        // byte vectors). There is no instruction for comparing unsigned byte vectors. However, to produce
        // the same comparison result as Guid.CompareTo, we must piecewise compare the unsigned bytes (in
        // reverse order) of v1 and v2. Therefore, the comparison results in v1_gt_v2 and v2_gt_v1 must
        // be adjusted such that if the sign (i.e. the most signficant bit) of any byte in v1 differs from
        // that of the corresponding byte in v2, then the corresponding result in v1_gt_v2 and v2_gt_v1 is
        // negated - i.e. 'true' (represented with a -1 value) becomes 'false' (represented with a zero value)
        // and vice versa. We therefore XOR v1 and v2 such that the most significant bit of each byte in
        // the result of the XOR operation indicates whether the corresponding result in v1_gt_v2 and v2_gt_v1
        // must be negated.
        var signs_mask = Sse2.Xor(v1, v2);

        // All the information in v1_gt_v2 and v2_gt_v1 is in the most signficant bit of each byte, which
        // will be written (in reverse order) to an unsigned 32-bit integer for each of v1_gt_v2 and v2_gt_v1.
        // Therefore, we can invert/negate the 'true'/'false' result in each of v1_gt_v2 and v2_gt_v1 by
        // XOR'ing each of v1_gt_v2 and v2_gt_v1 with the "signs mask" produced above.
        v1_gt_v2 = Sse2.Xor(v1_gt_v2, signs_mask);
        v2_gt_v1 = Sse2.Xor(v2_gt_v1, signs_mask);

        // Write the most signficiant bit of each byte in v1_gt_v2 and v2_gt_v1 into v1_gt_v2_bits and
        // v2_gt_v1_bits respectively.
        var v1_gt_v2_bits = Sse2.MoveMask(v1_gt_v2);
        var v2_gt_v1_bits = Sse2.MoveMask(v2_gt_v1);

        // Comparing v1_gt_v2_bits and v2_gt_v1_bits performs a bit-wise comparison between each such
        // that if the first bit that is different in v1_gt_v2_bits and v2_gt_v1_bits is zero in 
        // v1_gt_v2_bits and one in v2_gt_v1_bits, then v1_gt_v2_bits is deemed less than v2_gt_v1_bits.
        // Conversely if the first different bit is one in v1_gt_v2_bits and zero in v2_gt_v1_bits, then
        // v1_gt_v2_bits is deemed greater than v2_gt_v1_bits. The maximum value of v1_gt_v2_bits and
        // v2_gt_v1_bits is 65,535, so we can safely subtract v2_gt_v1_bits from v1_gt_v2_bits without
        // having to worry about wrapping occurring due to large numbers.
        return v1_gt_v2_bits - v2_gt_v1_bits;
    }
    else
    {
        // Fall back to the standard Guid.CompareTo method.
        return guid.CompareTo(other);
    }
}

The codegen is:

L0000: vzeroupper
L0003: vmovupd xmm0, [rdx]
L0007: vpshufb xmm0, xmm0, [rcx+8]
L000d: vmovupd xmm1, [r8]
L0012: vpshufb xmm1, xmm1, [rcx+8]
L0018: vpcmpgtb xmm2, xmm1, xmm0
L001c: vpxor xmm3, xmm0, xmm1
L0020: vpxor xmm2, xmm2, xmm3
L0024: vpmovmskb eax, xmm2
L0028: vpcmpgtb xmm0, xmm0, xmm1
L002c: vpxor xmm0, xmm0, xmm3
L0030: vpmovmskb edx, xmm0
L0034: sub edx, eax
L0036: mov eax, edx
L0038: ret
@bill-poole bill-poole added the tenet-performance Performance related issue label Jun 10, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jun 10, 2021
@GrabYourPitchforks
Copy link
Member

Note to area owners: the sample code contained in this issue is a derivative of the code at https://github.com/boostorg/uuid/blob/ca0185f9f2f1c8d58af8c54ce24579e1c6ebf348/include/boost/uuid/detail/uuid_x86.ipp#L98-L130, which carries its own license. Be aware of any potential impact to .NET's third-party-notices file.

@GrabYourPitchforks GrabYourPitchforks added area-System.Runtime and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jun 10, 2021
@ghost
Copy link

ghost commented Jun 10, 2021

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Issue Details

The Guid.CompareTo(Guid) method currently compares each field of the given Guid in turn. If the given Guid is not equal, then this is quite fast, because typically the first field returns not equal. However, if the given Guid is equal, then it ends up having to check all 11 fields one-by-one, which is quite slow.

There is an opportunity to use SSE 3 to do this much faster by first re-ordering the bytes of the two Guids to be big-endian, then comparing them using SSE 3. The approach is inspired by the SSE-based UUID comparison in the C++ boost library (which represents UUIDs as big endian, meaning they don't have to shuffle the bytes first).

I don't suspect there will be much appetite for this, given the comments/feedback on #52296, but it's an optimisation we use internally with great success, so I figured I'd contribute it in case someone finds it useful.

We find that for equal Guids, the SSE 3 implementation is 7x faster and for unequal Guids, it is over 2x faster.

Method Mean Error StdDev
FastCompareToSame 0.7268 ns 0.0163 ns 0.0152 ns
FastCompareToDifferent 0.7263 ns 0.0155 ns 0.0145 ns
CompareToSame 5.2880 ns 0.0527 ns 0.0467 ns
CompareToDifferent 1.7628 ns 0.0134 ns 0.0119 ns

The catch is that the integer return value is always -1, 0 or 1 with Guid.CompareTo(Guid), but with the SSE implementation, it may be any negative number when the given Guid is less than, and any positive number when the Guid is greater than.

I leave it to your consideration.

/// <summary>
/// Shuffle mask for rearranging the bytes of a little-endian <see cref="Guid"/> so it is big-endian, but
/// with its 16 bytes in reverse order.
/// </summary>
private static readonly Vector128<sbyte> _shuffle_mask 
    = Vector128.Create(15, 14, 13, 12, 11, 10, 9, 8, 6, 7, 4, 5, 0, 1, 2, 3);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int FastCompareTo(this in Guid guid, in Guid other)
{
    if (Ssse3.IsSupported)
    {
        // Shuffle the byte order of the two UUIDs so they are big-endian, but with their 16 bytes
        // in reverse order. The byte-wise comparison between the two UUIDs must be performed from
        // most significant byte to least for each UUID component to produce the same result as that
        // produced by the Guid.CompareTo method.
        var v1 = Ssse3.Shuffle(Unsafe.As<Guid, Vector128<sbyte>>(ref Unsafe.AsRef(guid)), _shuffle_mask);
        var v2 = Ssse3.Shuffle(Unsafe.As<Guid, Vector128<sbyte>>(ref Unsafe.AsRef(other)), _shuffle_mask);

        // Compare whether v1 is greater than v2 and vice versa. We must perform the comparison in both
        // directions so we can establish for each byte in v1 whether it is greater than, less than or
        // equal to its corresponding byte in v2. If a byte in v1_gt_v2 and v2_gt_v1 is zero, then the
        // corresponding bytes in v1 and v2 are equal. If a byte in v1_gt_v2 is -1, then the corresponding
        // byte in v1 is greater than the corresponding byte in v2. If a byte in v2_gt_v1 is -1, then the
        // corresponding byte in v2 is creater than the corresponding byte in v1.
        var v1_gt_v2 = Sse2.CompareGreaterThan(v1, v2);
        var v2_gt_v1 = Sse2.CompareGreaterThan(v2, v1);

        // SSE "compare greater than" operation only has a signed byte instruction (for comparing signed
        // byte vectors). There is no instruction for comparing unsigned byte vectors. However, to produce
        // the same comparison result as Guid.CompareTo, we must piecewise compare the unsigned bytes (in
        // reverse order) of v1 and v2. Therefore, the comparison results in v1_gt_v2 and v2_gt_v1 must
        // be adjusted such that if the sign (i.e. the most signficant bit) of any byte in v1 differs from
        // that of the corresponding byte in v2, then the corresponding result in v1_gt_v2 and v2_gt_v1 is
        // negated - i.e. 'true' (represented with a -1 value) becomes 'false' (represented with a zero value)
        // and vice versa. We therefore XOR v1 and v2 such that the most significant bit of each byte in
        // the result of the XOR operation indicates whether the corresponding result in v1_gt_v2 and v2_gt_v1
        // must be negated.
        var signs_mask = Sse2.Xor(v1, v2);

        // All the information in v1_gt_v2 and v2_gt_v1 is in the most signficant bit of each byte, which
        // will be written (in reverse order) to an unsigned 32-bit integer for each of v1_gt_v2 and v2_gt_v1.
        // Therefore, we can invert/negate the 'true'/'false' result in each of v1_gt_v2 and v2_gt_v1 by
        // XOR'ing each of v1_gt_v2 and v2_gt_v1 with the "signs mask" produced above.
        v1_gt_v2 = Sse2.Xor(v1_gt_v2, signs_mask);
        v2_gt_v1 = Sse2.Xor(v2_gt_v1, signs_mask);

        // Write the most signficiant bit of each byte in v1_gt_v2 and v2_gt_v1 into v1_gt_v2_bits and
        // v2_gt_v1_bits respectively.
        var v1_gt_v2_bits = Sse2.MoveMask(v1_gt_v2);
        var v2_gt_v1_bits = Sse2.MoveMask(v2_gt_v1);

        // Comparing v1_gt_v2_bits and v2_gt_v1_bits performs a bit-wise comparison between each such
        // that if the first bit that is different in v1_gt_v2_bits and v2_gt_v1_bits is zero in 
        // v1_gt_v2_bits and one in v2_gt_v1_bits, then v1_gt_v2_bits is deemed less than v2_gt_v1_bits.
        // Conversely if the first different bit is one in v1_gt_v2_bits and zero in v2_gt_v1_bits, then
        // v1_gt_v2_bits is deemed greater than v2_gt_v1_bits. The maximum value of v1_gt_v2_bits and
        // v2_gt_v1_bits is 65,535, so we can safely subtract v2_gt_v1_bits from v1_gt_v2_bits without
        // having to worry about wrapping occurring due to large numbers.
        return v1_gt_v2_bits - v2_gt_v1_bits;
    }
    else
    {
        // Fall back to the standard Guid.CompareTo method.
        return guid.CompareTo(other);
    }
}

The codegen is:

L0000: vzeroupper
L0003: vmovupd xmm0, [rdx]
L0007: vpshufb xmm0, xmm0, [rcx+8]
L000d: vmovupd xmm1, [r8]
L0012: vpshufb xmm1, xmm1, [rcx+8]
L0018: vpcmpgtb xmm2, xmm1, xmm0
L001c: vpxor xmm3, xmm0, xmm1
L0020: vpxor xmm2, xmm2, xmm3
L0024: vpmovmskb eax, xmm2
L0028: vpcmpgtb xmm0, xmm0, xmm1
L002c: vpxor xmm0, xmm0, xmm3
L0030: vpmovmskb edx, xmm0
L0034: sub edx, eax
L0036: mov eax, edx
L0038: ret
Author: billpoole-mi
Assignees: -
Labels:

area-System.Runtime, tenet-performance, untriaged

Milestone: -

@tannergooding
Copy link
Member

This falls into the same category as Guid.Equals and should likely wait for #53450 to go in so it can simply create two Vector128 and use the existing optimized/intrinsic functionality

@bill-poole
Copy link
Author

What sequence of existing/proposed Vector128 operations will achieve the proposed logic?

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Jul 12, 2021
@tannergooding tannergooding added this to the Future milestone Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants