Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AVX2 support to IndexOfAnyValues #78863

Merged
merged 1 commit into from
Jan 18, 2023

Conversation

MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Nov 26, 2022

Easier to review while ignoring whitespace

This change makes IndexOfAnyValues noticeably faster than a regular IndexOf(char) without #78861 :)

Method Toolchain Length Mean Error Ratio
IndexOfAnyAsciiChar main 1 2.270 ns 0.0145 ns 1.00
IndexOfAnyAsciiChar pr 1 2.279 ns 0.0204 ns 1.00
IndexOfAnyAsciiByte main 1 1.787 ns 0.0053 ns 1.00
IndexOfAnyAsciiByte pr 1 1.788 ns 0.0083 ns 1.00
IndexOfAnyByte main 1 1.789 ns 0.0050 ns 1.00
IndexOfAnyByte pr 1 1.790 ns 0.0051 ns 1.00
IndexOfAnyAsciiChar main 16 2.784 ns 0.0247 ns 1.00
IndexOfAnyAsciiChar pr 16 2.782 ns 0.0252 ns 1.00
IndexOfAnyAsciiByte main 16 2.696 ns 0.0117 ns 1.00
IndexOfAnyAsciiByte pr 16 2.364 ns 0.0084 ns 0.88
IndexOfAnyByte main 16 3.887 ns 0.0196 ns 1.00
IndexOfAnyByte pr 16 3.488 ns 0.0214 ns 0.90
IndexOfAnyAsciiChar main 17 3.348 ns 0.0125 ns 1.00
IndexOfAnyAsciiChar pr 17 3.103 ns 0.0639 ns 0.94
IndexOfAnyAsciiByte main 17 3.122 ns 0.0109 ns 1.00
IndexOfAnyAsciiByte pr 17 3.061 ns 0.0124 ns 0.98
IndexOfAnyByte main 17 5.465 ns 0.0762 ns 1.00
IndexOfAnyByte pr 17 4.102 ns 0.0258 ns 0.75
IndexOfAnyAsciiChar main 32 3.381 ns 0.0098 ns 1.00
IndexOfAnyAsciiChar pr 32 3.042 ns 0.0095 ns 0.90
IndexOfAnyAsciiByte main 32 3.261 ns 0.0063 ns 1.00
IndexOfAnyAsciiByte pr 32 3.074 ns 0.0138 ns 0.94
IndexOfAnyByte main 32 5.538 ns 0.0131 ns 1.00
IndexOfAnyByte pr 32 4.149 ns 0.0197 ns 0.75
IndexOfAnyAsciiChar main 1000 60.429 ns 0.2208 ns 1.00
IndexOfAnyAsciiChar pr 1000 34.636 ns 0.1453 ns 0.57
IndexOfAnyAsciiByte main 1000 49.191 ns 0.1576 ns 1.00
IndexOfAnyAsciiByte pr 1000 34.826 ns 0.4630 ns 0.71
IndexOfAnyByte main 1000 83.976 ns 0.2974 ns 1.00
IndexOfAnyByte pr 1000 45.647 ns 0.1477 ns 0.54
IndexOfAnyAsciiChar main 100000 5,865.958 ns 23.1353 ns 1.00
IndexOfAnyAsciiChar pr 100000 3,880.814 ns 35.5076 ns 0.66
IndexOfAnyAsciiByte main 100000 4,753.754 ns 15.7070 ns 1.00
IndexOfAnyAsciiByte pr 100000 3,215.513 ns 10.5250 ns 0.68
IndexOfAnyByte main 100000 7,691.448 ns 38.5081 ns 1.00
IndexOfAnyByte pr 100000 4,488.097 ns 41.8751 ns 0.58

Chars

Method Length Mean Error StdDev
IndexOfAnyAsciiChar 100000 3.936 us 0.0525 us 0.0491 us
IndexOf 100000 4.925 us 0.0186 us 0.0174 us
IndexOfAny2Values 100000 5.754 us 0.0196 us 0.0174 us
IndexOfAny3Values 100000 6.161 us 0.0257 us 0.0228 us
IndexOfAny4Values 100000 7.011 us 0.0162 us 0.0135 us
IndexOfAny5Values 100000 7.743 us 0.0363 us 0.0303 us
IndexOfAnyInRange_Char 100000 5.196 us 0.0143 us 0.0120 us

Bytes

Method Length Mean Error StdDev
IndexOfAnyAsciiByte 100000 3.189 us 0.0171 us 0.0151 us
IndexOfAnyByte 100000 4.484 us 0.0259 us 0.0229 us
IndexOf 100000 1.971 us 0.0186 us 0.0174 us
IndexOfAny2Values 100000 2.233 us 0.0093 us 0.0078 us
IndexOfAny3Values 100000 2.628 us 0.0081 us 0.0076 us
IndexOfAny4Values 100000 2.929 us 0.0155 us 0.0145 us
IndexOfAny5Values 100000 3.270 us 0.0102 us 0.0090 us
IndexOfAnyInRange_Byte 100000 2.187 us 0.0090 us 0.0084 us

@MihaZupan MihaZupan added area-System.Memory tenet-performance Performance related issue labels Nov 26, 2022
@MihaZupan MihaZupan added this to the 8.0.0 milestone Nov 26, 2022
@MihaZupan MihaZupan self-assigned this Nov 26, 2022
@ghost
Copy link

ghost commented Nov 26, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Easier to review while ignoring whitespace

This change makes IndexOfAnyValues noticeably faster than a regular IndexOf(char) without #78861 :)

Method Toolchain Length Mean Error Ratio
IndexOfAnyAsciiChar main 1 2.270 ns 0.0145 ns 1.00
IndexOfAnyAsciiChar pr 1 2.279 ns 0.0204 ns 1.00
IndexOfAnyAsciiByte main 1 1.787 ns 0.0053 ns 1.00
IndexOfAnyAsciiByte pr 1 1.788 ns 0.0083 ns 1.00
IndexOfAnyByte main 1 1.789 ns 0.0050 ns 1.00
IndexOfAnyByte pr 1 1.790 ns 0.0051 ns 1.00
IndexOfAnyAsciiChar main 16 2.784 ns 0.0247 ns 1.00
IndexOfAnyAsciiChar pr 16 2.782 ns 0.0252 ns 1.00
IndexOfAnyAsciiByte main 16 2.696 ns 0.0117 ns 1.00
IndexOfAnyAsciiByte pr 16 2.364 ns 0.0084 ns 0.88
IndexOfAnyByte main 16 3.887 ns 0.0196 ns 1.00
IndexOfAnyByte pr 16 3.488 ns 0.0214 ns 0.90
IndexOfAnyAsciiChar main 17 3.348 ns 0.0125 ns 1.00
IndexOfAnyAsciiChar pr 17 3.103 ns 0.0639 ns 0.94
IndexOfAnyAsciiByte main 17 3.122 ns 0.0109 ns 1.00
IndexOfAnyAsciiByte pr 17 3.061 ns 0.0124 ns 0.98
IndexOfAnyByte main 17 5.465 ns 0.0762 ns 1.00
IndexOfAnyByte pr 17 4.102 ns 0.0258 ns 0.75
IndexOfAnyAsciiChar main 32 3.381 ns 0.0098 ns 1.00
IndexOfAnyAsciiChar pr 32 3.042 ns 0.0095 ns 0.90
IndexOfAnyAsciiByte main 32 3.261 ns 0.0063 ns 1.00
IndexOfAnyAsciiByte pr 32 3.074 ns 0.0138 ns 0.94
IndexOfAnyByte main 32 5.538 ns 0.0131 ns 1.00
IndexOfAnyByte pr 32 4.149 ns 0.0197 ns 0.75
IndexOfAnyAsciiChar main 1000 60.429 ns 0.2208 ns 1.00
IndexOfAnyAsciiChar pr 1000 34.636 ns 0.1453 ns 0.57
IndexOfAnyAsciiByte main 1000 49.191 ns 0.1576 ns 1.00
IndexOfAnyAsciiByte pr 1000 34.826 ns 0.4630 ns 0.71
IndexOfAnyByte main 1000 83.976 ns 0.2974 ns 1.00
IndexOfAnyByte pr 1000 45.647 ns 0.1477 ns 0.54
IndexOfAnyAsciiChar main 100000 5,865.958 ns 23.1353 ns 1.00
IndexOfAnyAsciiChar pr 100000 3,880.814 ns 35.5076 ns 0.66
IndexOfAnyAsciiByte main 100000 4,753.754 ns 15.7070 ns 1.00
IndexOfAnyAsciiByte pr 100000 3,215.513 ns 10.5250 ns 0.68
IndexOfAnyByte main 100000 7,691.448 ns 38.5081 ns 1.00
IndexOfAnyByte pr 100000 4,488.097 ns 41.8751 ns 0.58

Chars

Method Length Mean Error StdDev
IndexOfAnyAsciiChar 100000 3.936 us 0.0525 us 0.0491 us
IndexOf 100000 4.925 us 0.0186 us 0.0174 us
IndexOfAny2Values 100000 5.754 us 0.0196 us 0.0174 us
IndexOfAny3Values 100000 6.161 us 0.0257 us 0.0228 us
IndexOfAny4Values 100000 7.011 us 0.0162 us 0.0135 us
IndexOfAny5Values 100000 7.743 us 0.0363 us 0.0303 us
IndexOfAnyInRange_Char 100000 5.196 us 0.0143 us 0.0120 us

Bytes

Method Length Mean Error StdDev
IndexOfAnyAsciiByte 100000 3.189 us 0.0171 us 0.0151 us
IndexOfAnyByte 100000 4.484 us 0.0259 us 0.0229 us
IndexOf 100000 1.971 us 0.0186 us 0.0174 us
IndexOfAny2Values 100000 2.233 us 0.0093 us 0.0078 us
IndexOfAny3Values 100000 2.628 us 0.0081 us 0.0076 us
IndexOfAny4Values 100000 2.929 us 0.0155 us 0.0145 us
IndexOfAny5Values 100000 3.270 us 0.0102 us 0.0090 us
IndexOfAnyInRange_Byte 100000 2.187 us 0.0090 us 0.0084 us
Author: MihaZupan
Assignees: MihaZupan
Labels:

area-System.Memory, tenet-performance

Milestone: 8.0.0

@dakersnar
Copy link
Contributor

This change makes IndexOfAnyValues noticeably faster than a regular IndexOf(char) without #78861 :)

For context, does this PR have any relation to the changes made in #78861, or is this a completely independent optimization?

@MihaZupan
Copy link
Member Author

Completely independent, I just happened to have worked on both at a similar point in time.

@MihaZupan
Copy link
Member Author

Any thoughts on this PR @dotnet/area-system-memory?

@dakersnar
Copy link
Contributor

Sorry, this has been on my TODO list. Can you provide a similar explanation to what you did here: #78861 (comment)?

@MihaZupan
Copy link
Member Author

This file (IndexOfAnyAsciiSearcher.cs) already contains a bunch of vectorization code for Ssse3 and Arm64 through Vector128.

This PR effectively copy-pastes the existing Vector128 logic to add Vector256 paths. Everything is pretty much the same, just processing 2x chars/bytes per iteration.

Taking IndexOfAnyVectorized as an example. Currently it's written as

if (searchSpaceLength > 16)
{
    // Process the input with 2 Vector128s (16 chars) at a time.
}

// Process the last 1-16 chars with 2 overlapped Vector128s.

return NotFound;

and after this PR, it's written as

if (searchSpaceLength > 16)
{
    if (Avx2.IsSupported)
    {
        if (searchSpaceLength > 32)
        {
            // Process the input with 2 Vector256s (32 chars) at a time.
        }

        // Process the last 1-32 chars with 2 overlapped Vector256s.
        
        return NotFound;
    }
    else
    {
        // Process the input with 2 Vector128s (16 chars) at a time.
    }
}

// Process the last 1-16 chars with 2 overlapped Vector128s.

return NotFound;

Copy link
Contributor

@dakersnar dakersnar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and thank you for the explanation. I assume all the helper methods are just copy pasted from the Vector128 implementations?

@@ -360,7 +360,7 @@ static int LastIndexOfAnyExceptReferenceImpl(ReadOnlySpan<char> searchSpace, Rea
private static class IndexOfAnyValuesTestHelper
{
private const int MaxNeedleLength = 10;
private const int MaxHaystackLength = 40;
private const int MaxHaystackLength = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this number come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are used by the test when running a few million random inputs through the implementation to check random edge cases.
This number should be large enough so that we test all sorts of code paths - e.g. 100 is enough for

// Process the input with 2 Vector256s (32 chars) at a time.
+
// Process the input with 2 Vector256s (32 chars) at a time.
+
// Process the last 1-32 chars with 2 overlapped Vector256s.

so if the "2 Vector256s at a time" step left something broken, we'd catch it.

On the other hand it doesn't make sense to increase it too much as we'd just end up testing fewer variants of short inputs - and there's nothing in the code that would behave fundamentally differently for inputs of 1k vs 10k characters.

@MihaZupan
Copy link
Member Author

MihaZupan commented Jan 18, 2023

I assume all the helper methods are just copy pasted from the Vector128 implementations?

Yes*, with the exception that we call this FixUpPackedVector256Mask helper method that accounts for how AVX2 does packing (it's more like 2x 128bit instead of 1x 256bit). We do the same thing in PackedSpanHelpers.

@MihaZupan
Copy link
Member Author

The build failure is known according to Build Analysis

@MihaZupan MihaZupan merged commit 4e0195e into dotnet:main Jan 18, 2023
mdh1418 pushed a commit to mdh1418/runtime that referenced this pull request Jan 24, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Feb 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants