Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

Open
stephentoub opened this issue Oct 9, 2023 · 12 comments · Fixed by #103710
Open

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

stephentoub opened this issue Oct 9, 2023 · 12 comments · Fixed by #103710
Labels
area-System.Runtime avx512 Related to the AVX-512 architecture
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Oct 9, 2023

The implementation that backs many IndexOfAny searches with SearchValues today only leverages Vector128/256:
https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs
We should teach it about Vector512.

(@MihaZupan has a prototype.)

@stephentoub stephentoub added this to the 9.0.0 milestone Oct 9, 2023
@ghost
Copy link

ghost commented Oct 9, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

The implementation that backs many IndexOfAny searches with SearchValues today only leverages Vector128/256:
https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs
We should teach it about Vector512.

Author: stephentoub
Assignees: -
Labels:

area-System.Runtime

Milestone: 9.0.0

@stephentoub stephentoub changed the title Ligh-up IndexOfAnyAsciiSearcher for AVX512 Light-up IndexOfAnyAsciiSearcher for AVX512 Oct 9, 2023
@stephentoub stephentoub changed the title Light-up IndexOfAnyAsciiSearcher for AVX512 Light up IndexOfAnyAsciiSearcher for AVX512 Oct 9, 2023
@MihaZupan
Copy link
Member

For reference, the implementation could look something like this: MihaZupan/runtime@searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

@Ruihan-Yin
Copy link
Contributor

Hi @stephentoub @MihaZupan
This is @Ruihan-Yin, from the .net team at Intel, would like to follow up on this issue.

From previous conversation,

For reference, the implementation could look something like this: MihaZupan/[email protected]:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

As there is already a prototype, I would like to know if there is any plan to submit a PR based on the existing works. Or, if no, will it be okay for us to take the prototype and continue the work to form a PR?

@MihaZupan
Copy link
Member

MihaZupan commented Jan 22, 2024

Hey @Ruihan-Yin, I've pushed an updated version of my change that compiles again to MihaZupan/runtime@main...MihaZupan:runtime:searchvalues-ascii-avx512.

Benchmark results
BenchmarkDotNet v0.13.11, Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-SVIKBC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-FCGAUB : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

IterationCount=100  LaunchCount=3  WarmupCount=15
Method Toolchain Length MatchAtStart Mean Error Ratio
IndexOfAny main 33 False 4.351 ns 0.0173 ns 1.00
IndexOfAny pr 33 False 3.695 ns 0.0002 ns 0.85
IndexOfAnyExcept main 33 False 4.033 ns 0.0303 ns 1.00
IndexOfAnyExcept pr 33 False 3.928 ns 0.0113 ns 0.98
IndexOfAny main 33 True 4.313 ns 0.2266 ns 1.00
IndexOfAny pr 33 True 6.948 ns 0.8691 ns 1.77
IndexOfAnyExcept main 33 True 2.774 ns 0.0114 ns 1.00
IndexOfAnyExcept pr 33 True 4.419 ns 0.0026 ns 1.59
IndexOfAny main 64 False 4.326 ns 0.0072 ns 1.00
IndexOfAny pr 64 False 4.385 ns 0.2552 ns 1.01
IndexOfAnyExcept main 64 False 3.830 ns 0.0040 ns 1.00
IndexOfAnyExcept pr 64 False 3.894 ns 0.0023 ns 1.02
IndexOfAny main 64 True 3.611 ns 0.0019 ns 1.00
IndexOfAny pr 64 True 4.370 ns 0.0140 ns 1.21
IndexOfAnyExcept main 64 True 2.741 ns 0.0008 ns 1.00
IndexOfAnyExcept pr 64 True 4.412 ns 0.0017 ns 1.61
IndexOfAny main 128 False 7.623 ns 0.2039 ns 1.00
IndexOfAny pr 128 False 7.769 ns 0.7353 ns 0.99
IndexOfAnyExcept main 128 False 6.232 ns 0.0022 ns 1.00
IndexOfAnyExcept pr 128 False 5.690 ns 0.1762 ns 0.90
IndexOfAny main 128 True 3.729 ns 0.0382 ns 1.00
IndexOfAny pr 128 True 6.264 ns 0.8090 ns 1.63
IndexOfAnyExcept main 128 True 2.736 ns 0.0009 ns 1.00
IndexOfAnyExcept pr 128 True 4.086 ns 0.1197 ns 1.48
IndexOfAny main 1000 False 37.310 ns 0.0296 ns 1.00
IndexOfAny pr 1000 False 23.893 ns 0.0971 ns 0.64
IndexOfAnyExcept main 1000 False 40.196 ns 0.0707 ns 1.00
IndexOfAnyExcept pr 1000 False 27.997 ns 0.0461 ns 0.70
IndexOfAny main 1000 True 3.871 ns 0.0502 ns 1.00
IndexOfAny pr 1000 True 3.980 ns 0.0115 ns 1.03
IndexOfAnyExcept main 1000 True 2.740 ns 0.0011 ns 1.00
IndexOfAnyExcept pr 1000 True 4.197 ns 0.1730 ns 1.47

I avoided opening a PR as the regressions for early matches are quite noticeable.
If you have any ideas on how to improve on this, please feel free to continue the work.

@Ruihan-Yin
Copy link
Contributor

Hi @MihaZupan , thanks for the update!
Just a few more questions to understand the data better:

  1. In the presented numbers, what is the MatchAtStart/Early Match referring to?
  2. Do we know any possible reason for the regression, like any red flag shown in profiler?

@MihaZupan
Copy link
Member

The benchmark numbers above are for this:

public class IndexOfAnyAsciiBenchmark
{
    private static readonly SearchValues<char> s_chars = SearchValues.Create("abcABC123");
    private string _charBuffer;
    private string _charBufferExcept;

    [Params(8, 16, 32, 64, 128, 1000, 10_000)]
    public int Length;

    [Params(true, false)]
    public bool MatchAtStart;

    [GlobalSetup]
    public void Setup()
    {
        var charBuffer = new char[Length];
        var charBufferExcept = new char[Length];
        charBuffer.AsSpan().Fill('\n');
        charBufferExcept.AsSpan().Fill('b');

        if (MatchAtStart)
        {
            charBuffer[0] = 'b';
            charBufferExcept[0] = '\n';
        }

        _charBuffer = new string(charBuffer);
        _charBufferExcept = new string(charBufferExcept);
    }

    [Benchmark]
    public int IndexOfAny() => _charBuffer.AsSpan().IndexOfAny(s_chars);

    [Benchmark]
    public int IndexOfAnyExcept() => _charBufferExcept.AsSpan().IndexOfAnyExcept(s_chars);
}

Do we know any possible reason for the regression, like any red flag shown in profiler?

I haven't looked at this under a profiler, but I can get the codegen for you if that'd help.
The set of operations performed in the early match case is essentially the same, just with a wider vector.

@Ruihan-Yin
Copy link
Contributor

Thanks for sharing!

@MihaZupan
Copy link
Member

MihaZupan commented Jan 24, 2024

Example diffs for the Length = 64, MatchAtStart = true case: https://www.diffchecker.com/fSyCBb1H/
A bit nicer codegen if I force the helpers to be inlined even in the cold block: https://www.diffchecker.com/tu3uIa4f/

Method Toolchain Length MatchAtStart Mean Error Ratio
IndexOfAnyExcept main 64 True 2.741 ns 0.0008 ns 1.00
IndexOfAnyExcept pr 64 True 4.412 ns 0.0017 ns 1.61

Part of the issue here is how the search logic is structured.

Essentially it looks like this in main

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        // Process the input in chunks of 32 characters (Vector256)
    }

    // Process the last [1-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

and like this in pr with AVX512

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        if (length > 64)
        {
            // Process the input in chunks of 64 characters (Vector512)
        }

        // Process the last [1-64] characters by loading two overlapping vectors (Vector512)
        return NotFound;
    }

    // Process the last [16-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

For the Length = 64 case, we go from finding a match in the regular search loop to finding a match in the overlapped fallback.
Loading the overlapped input & calculating the result from an overlapped match is slightly more expensive, which is enough to show up in this case where we're talking about ns differences.

@MihaZupan
Copy link
Member

MihaZupan commented Jan 24, 2024

(sorry about the spam)

With a small tweak, the results are much closer (varies quite a bit between runs, but still).

Method Toolchain Length Mean Error Ratio
IndexOfAny_Char main 64 3.641 ns 0.0034 ns 1.00
IndexOfAny_Char pr 64 4.381 ns 0.0682 ns 1.20
IndexOfAnyExcept_Char main 64 2.861 ns 0.1486 ns 1.00
IndexOfAnyExcept_Char pr 64 3.685 ns 0.0031 ns 1.29
IndexOfAny_Char main 128 4.160 ns 0.3170 ns 1.00
IndexOfAny_Char pr 128 4.053 ns 0.0011 ns 0.99
IndexOfAnyExcept_Char main 128 2.747 ns 0.0021 ns 1.00
IndexOfAnyExcept_Char pr 128 3.159 ns 0.0097 ns 1.15

I'll test how this impacts the Regex benchmarks I initially mentioned.

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Mar 18, 2024
@MihaZupan
Copy link
Member

Reran the numbers with main...MihaZupan:runtime:searchvalues-ascii-avx512-4
Looks like a ~0.5 - 1 ns regression for early matches, and a speedup to ~1.5x in throughput for longer inputs.
Regex is seeing more regressions than improvements.

Basic IndexOfAny benchmarks
Method Toolchain Length MatchAtStart Mean Error Ratio
IndexOfAny_Char main 33 True 3.469 ns 0.0609 ns 1.00
IndexOfAny_Char pr 33 True 4.335 ns 0.0757 ns 1.26
IndexOfAnyExcept_Char main 33 True 2.475 ns 0.0010 ns 1.00
IndexOfAnyExcept_Char pr 33 True 3.715 ns 0.0022 ns 1.50
IndexOfAny_Char main 65 True 3.401 ns 0.0504 ns 1.00
IndexOfAny_Char pr 65 True 4.096 ns 0.1228 ns 1.20
IndexOfAnyExcept_Char main 65 True 2.475 ns 0.0016 ns 1.00
IndexOfAnyExcept_Char pr 65 True 2.872 ns 0.0020 ns 1.16
IndexOfAny_Char main 10000 False 328.756 ns 0.1378 ns 1.00
IndexOfAny_Char pr 10000 False 210.471 ns 0.1152 ns 0.64
IndexOfAnyExcept_Char main 10000 False 374.419 ns 0.3652 ns 1.00
IndexOfAnyExcept_Char pr 10000 False 260.572 ns 0.0642 ns 0.70
LastIndexOfAny_Char main 10000 False 327.438 ns 0.1851 ns 1.00
LastIndexOfAny_Char pr 10000 False 210.978 ns 0.0945 ns 0.64
LastIndexOfAnyExcept_Char main 10000 False 384.517 ns 0.1233 ns 1.00
LastIndexOfAnyExcept_Char pr 10000 False 261.597 ns 0.0735 ns 0.68
Sherlock Regex benchmarks

https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L129

Toolchain Pattern Mean Error Ratio
main (?i)Sher[a-z]+|Hol[a-z]+ 103,522.91 ns 1,600.349 ns 1.00
pr (?i)Sher[a-z]+|Hol[a-z]+ 104,615.26 ns 2,585.538 ns 1.01
main (?i)Sherlock|Holmes|Watson 94,673.68 ns 4,766.057 ns 1.01
pr (?i)Sherlock|Holmes|Watson 95,724.15 ns 4,285.540 ns 1.02
main (?i)Sherlock|(...)er|John|Baker [49] 289,495.97 ns 4,839.174 ns 1.00
pr (?i)Sherlock|(...)er|John|Baker [49] 284,341.73 ns 6,128.478 ns 0.98
main (?i)the 378,116.92 ns 1,339.989 ns 1.00
pr (?i)the 379,105.12 ns 620.316 ns 1.00
main (?m)^Sherlock(...)rlock Holmes$ [37] 82,288.25 ns 3,928.339 ns 1.01
pr (?m)^Sherlock(...)rlock Holmes$ [37] 82,778.06 ns 3,832.797 ns 1.01
main (?s).* 53.28 ns 0.135 ns 1.00
pr (?s).* 52.75 ns 0.128 ns 0.99
main [^\\n]* 735,174.34 ns 22,699.109 ns 1.00
pr [^\\n]* 739,572.06 ns 28,992.773 ns 1.01
main [a-q][^u-z]{13}x 33,777.02 ns 801.044 ns 1.00
pr [a-q][^u-z]{13}x 33,651.23 ns 744.776 ns 1.00
main [a-zA-Z]+ing 4,645,316.04 ns 16,973.212 ns 1.00
pr [a-zA-Z]+ing 4,841,529.48 ns 8,956.256 ns 1.04
main \b\w+n\b 9,815,880.28 ns 33,565.510 ns 1.00
pr \b\w+n\b 11,018,320.07 ns 286,473.844 ns 1.12
main \p{L} 10,517,664.75 ns 8,946.311 ns 1.00
pr \p{L} 11,916,466.49 ns 409,511.232 ns 1.13
main \p{Ll} 10,673,143.82 ns 331,458.549 ns 1.00
pr \p{Ll} 10,787,613.37 ns 13,621.084 ns 1.01
main \p{Lu} 554,702.13 ns 13,089.124 ns 1.00
pr \p{Lu} 436,503.07 ns 1,401.226 ns 0.79
main \s[a-zA-Z]{0,12}ing\s 4,969,805.46 ns 14,676.201 ns 1.00
pr \s[a-zA-Z]{0,12}ing\s 5,237,948.60 ns 41,554.425 ns 1.05
main \w+ 5,147,282.40 ns 11,110.191 ns 1.00
pr \w+ 5,464,592.92 ns 14,554.391 ns 1.06
main \w+\s+Holmes 3,986,310.25 ns 17,035.155 ns 1.00
pr \w+\s+Holmes 4,256,361.76 ns 22,193.135 ns 1.07
main \w+\s+Holmes\s+\w+ 4,161,642.67 ns 10,531.738 ns 1.00
pr \w+\s+Holmes\s+\w+ 4,425,879.41 ns 6,260.327 ns 1.06
main Holmes.{0,25}(...).{0,25}Holmes [39] 68,435.58 ns 445.729 ns 1.00
pr Holmes.{0,25}(...).{0,25}Holmes [39] 69,696.18 ns 437.339 ns 1.02
main Sher[a-z]+|Hol[a-z]+ 65,242.66 ns 814.983 ns 1.00
pr Sher[a-z]+|Hol[a-z]+ 65,292.98 ns 691.412 ns 1.00
main Sherlock\s+Holmes 82,699.64 ns 3,829.205 ns 1.01
pr Sherlock\s+Holmes 86,351.23 ns 352.296 ns 1.05
main Sherlock|Holmes|Watson 74,601.68 ns 510.511 ns 1.00
pr Sherlock|Holmes|Watson 74,047.63 ns 292.165 ns 0.99
main Sherlock|Holm(...)er|John|Baker [45] 202,258.67 ns 1,713.216 ns 1.00
pr Sherlock|Holm(...)er|John|Baker [45] 212,293.86 ns 670.855 ns 1.05
main the\s+\w+ 458,843.06 ns 5,086.162 ns 1.00
pr the\s+\w+ 448,289.28 ns 1,171.229 ns 0.98

@JulieLeeMSFT
Copy link
Member

@MihaZupan, we are reaching preview 7 snap soon. Do you plan to finsih this work for .NET 9?

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jun 19, 2024
@MihaZupan
Copy link
Member

Change is being reverted in #104688

@MihaZupan MihaZupan reopened this Jul 10, 2024
@MihaZupan MihaZupan removed their assignment Jul 10, 2024
@MihaZupan MihaZupan removed the in-pr There is an active PR which will close this issue when it is merged label Jul 10, 2024
@MihaZupan MihaZupan modified the milestones: 9.0.0, 10.0.0 Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime avx512 Related to the AVX-512 architecture
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants