Light up IndexOfAnyAsciiSearcher for AVX512 #93222

stephentoub · 2023-10-09T14:00:58Z

The implementation that backs many IndexOfAny searches with SearchValues today only leverages Vector128/256:
https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs
We should teach it about Vector512.

(@MihaZupan has a prototype.)

The text was updated successfully, but these errors were encountered:

ghost · 2023-10-09T14:01:06Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

The implementation that backs many IndexOfAny searches with SearchValues today only leverages Vector128/256:
https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs
We should teach it about Vector512.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Runtime`
Milestone:	9.0.0

MihaZupan · 2023-10-09T14:13:55Z

For reference, the implementation could look something like this: MihaZupan/runtime@searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

Ruihan-Yin · 2024-01-18T19:50:02Z

Hi @stephentoub @MihaZupan
This is @Ruihan-Yin, from the .net team at Intel, would like to follow up on this issue.

From previous conversation,

For reference, the implementation could look something like this: MihaZupan/[email protected]:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

As there is already a prototype, I would like to know if there is any plan to submit a PR based on the existing works. Or, if no, will it be okay for us to take the prototype and continue the work to form a PR?

MihaZupan · 2024-01-22T22:04:12Z

Hey @Ruihan-Yin, I've pushed an updated version of my change that compiles again to MihaZupan/runtime@main...MihaZupan:runtime:searchvalues-ascii-avx512.

Benchmark results

BenchmarkDotNet v0.13.11, Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.100-preview.5.23303.2
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-SVIKBC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-FCGAUB : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

IterationCount=100  LaunchCount=3  WarmupCount=15

Method	Toolchain	Length	MatchAtStart	Mean	Error	Ratio
IndexOfAny	main	33	False	4.351 ns	0.0173 ns	1.00
IndexOfAny	pr	33	False	3.695 ns	0.0002 ns	0.85

IndexOfAnyExcept	main	33	False	4.033 ns	0.0303 ns	1.00
IndexOfAnyExcept	pr	33	False	3.928 ns	0.0113 ns	0.98

IndexOfAny	main	33	True	4.313 ns	0.2266 ns	1.00
IndexOfAny	pr	33	True	6.948 ns	0.8691 ns	1.77

IndexOfAnyExcept	main	33	True	2.774 ns	0.0114 ns	1.00
IndexOfAnyExcept	pr	33	True	4.419 ns	0.0026 ns	1.59

IndexOfAny	main	64	False	4.326 ns	0.0072 ns	1.00
IndexOfAny	pr	64	False	4.385 ns	0.2552 ns	1.01

IndexOfAnyExcept	main	64	False	3.830 ns	0.0040 ns	1.00
IndexOfAnyExcept	pr	64	False	3.894 ns	0.0023 ns	1.02

IndexOfAny	main	64	True	3.611 ns	0.0019 ns	1.00
IndexOfAny	pr	64	True	4.370 ns	0.0140 ns	1.21

IndexOfAnyExcept	main	64	True	2.741 ns	0.0008 ns	1.00
IndexOfAnyExcept	pr	64	True	4.412 ns	0.0017 ns	1.61

IndexOfAny	main	128	False	7.623 ns	0.2039 ns	1.00
IndexOfAny	pr	128	False	7.769 ns	0.7353 ns	0.99

IndexOfAnyExcept	main	128	False	6.232 ns	0.0022 ns	1.00
IndexOfAnyExcept	pr	128	False	5.690 ns	0.1762 ns	0.90

IndexOfAny	main	128	True	3.729 ns	0.0382 ns	1.00
IndexOfAny	pr	128	True	6.264 ns	0.8090 ns	1.63

IndexOfAnyExcept	main	128	True	2.736 ns	0.0009 ns	1.00
IndexOfAnyExcept	pr	128	True	4.086 ns	0.1197 ns	1.48

IndexOfAny	main	1000	False	37.310 ns	0.0296 ns	1.00
IndexOfAny	pr	1000	False	23.893 ns	0.0971 ns	0.64

IndexOfAnyExcept	main	1000	False	40.196 ns	0.0707 ns	1.00
IndexOfAnyExcept	pr	1000	False	27.997 ns	0.0461 ns	0.70

IndexOfAny	main	1000	True	3.871 ns	0.0502 ns	1.00
IndexOfAny	pr	1000	True	3.980 ns	0.0115 ns	1.03

IndexOfAnyExcept	main	1000	True	2.740 ns	0.0011 ns	1.00
IndexOfAnyExcept	pr	1000	True	4.197 ns	0.1730 ns	1.47

I avoided opening a PR as the regressions for early matches are quite noticeable.
If you have any ideas on how to improve on this, please feel free to continue the work.

Ruihan-Yin · 2024-01-24T19:14:02Z

Hi @MihaZupan , thanks for the update!
Just a few more questions to understand the data better:

In the presented numbers, what is the MatchAtStart/Early Match referring to?
Do we know any possible reason for the regression, like any red flag shown in profiler?

MihaZupan · 2024-01-24T19:30:12Z

The benchmark numbers above are for this:

public class IndexOfAnyAsciiBenchmark
{
    private static readonly SearchValues<char> s_chars = SearchValues.Create("abcABC123");
    private string _charBuffer;
    private string _charBufferExcept;

    [Params(8, 16, 32, 64, 128, 1000, 10_000)]
    public int Length;

    [Params(true, false)]
    public bool MatchAtStart;

    [GlobalSetup]
    public void Setup()
    {
        var charBuffer = new char[Length];
        var charBufferExcept = new char[Length];
        charBuffer.AsSpan().Fill('\n');
        charBufferExcept.AsSpan().Fill('b');

        if (MatchAtStart)
        {
            charBuffer[0] = 'b';
            charBufferExcept[0] = '\n';
        }

        _charBuffer = new string(charBuffer);
        _charBufferExcept = new string(charBufferExcept);
    }

    [Benchmark]
    public int IndexOfAny() => _charBuffer.AsSpan().IndexOfAny(s_chars);

    [Benchmark]
    public int IndexOfAnyExcept() => _charBufferExcept.AsSpan().IndexOfAnyExcept(s_chars);
}

Do we know any possible reason for the regression, like any red flag shown in profiler?

I haven't looked at this under a profiler, but I can get the codegen for you if that'd help.
The set of operations performed in the early match case is essentially the same, just with a wider vector.

Ruihan-Yin · 2024-01-24T19:39:21Z

Thanks for sharing!

MihaZupan · 2024-01-24T20:03:42Z

Example diffs for the Length = 64, MatchAtStart = true case: https://www.diffchecker.com/fSyCBb1H/
A bit nicer codegen if I force the helpers to be inlined even in the cold block: https://www.diffchecker.com/tu3uIa4f/

Method	Toolchain	Length	MatchAtStart	Mean	Error	Ratio
IndexOfAnyExcept	main	64	True	2.741 ns	0.0008 ns	1.00
IndexOfAnyExcept	pr	64	True	4.412 ns	0.0017 ns	1.61

Part of the issue here is how the search logic is structured.

Essentially it looks like this in main

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        // Process the input in chunks of 32 characters (Vector256)
    }

    // Process the last [1-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

and like this in pr with AVX512

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        if (length > 64)
        {
            // Process the input in chunks of 64 characters (Vector512)
        }

        // Process the last [1-64] characters by loading two overlapping vectors (Vector512)
        return NotFound;
    }

    // Process the last [16-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

For the Length = 64 case, we go from finding a match in the regular search loop to finding a match in the overlapped fallback.
Loading the overlapped input & calculating the result from an overlapped match is slightly more expensive, which is enough to show up in this case where we're talking about ns differences.

MihaZupan · 2024-01-24T20:35:26Z

(sorry about the spam)

With a small tweak, the results are much closer (varies quite a bit between runs, but still).

Method	Toolchain	Length	Mean	Error	Ratio
IndexOfAny_Char	main	64	3.641 ns	0.0034 ns	1.00
IndexOfAny_Char	pr	64	4.381 ns	0.0682 ns	1.20

IndexOfAnyExcept_Char	main	64	2.861 ns	0.1486 ns	1.00
IndexOfAnyExcept_Char	pr	64	3.685 ns	0.0031 ns	1.29

IndexOfAny_Char	main	128	4.160 ns	0.3170 ns	1.00
IndexOfAny_Char	pr	128	4.053 ns	0.0011 ns	0.99

IndexOfAnyExcept_Char	main	128	2.747 ns	0.0021 ns	1.00
IndexOfAnyExcept_Char	pr	128	3.159 ns	0.0097 ns	1.15

I'll test how this impacts the Regex benchmarks I initially mentioned.

MihaZupan · 2024-04-21T15:51:59Z

Reran the numbers with main...MihaZupan:runtime:searchvalues-ascii-avx512-4
Looks like a ~0.5 - 1 ns regression for early matches, and a speedup to ~1.5x in throughput for longer inputs.
Regex is seeing more regressions than improvements.

Basic IndexOfAny benchmarks

Method	Toolchain	Length	MatchAtStart	Mean	Error	Ratio
IndexOfAny_Char	main	33	True	3.469 ns	0.0609 ns	1.00
IndexOfAny_Char	pr	33	True	4.335 ns	0.0757 ns	1.26

IndexOfAnyExcept_Char	main	33	True	2.475 ns	0.0010 ns	1.00
IndexOfAnyExcept_Char	pr	33	True	3.715 ns	0.0022 ns	1.50



IndexOfAny_Char	main	65	True	3.401 ns	0.0504 ns	1.00
IndexOfAny_Char	pr	65	True	4.096 ns	0.1228 ns	1.20

IndexOfAnyExcept_Char	main	65	True	2.475 ns	0.0016 ns	1.00
IndexOfAnyExcept_Char	pr	65	True	2.872 ns	0.0020 ns	1.16


IndexOfAny_Char	main	10000	False	328.756 ns	0.1378 ns	1.00
IndexOfAny_Char	pr	10000	False	210.471 ns	0.1152 ns	0.64

IndexOfAnyExcept_Char	main	10000	False	374.419 ns	0.3652 ns	1.00
IndexOfAnyExcept_Char	pr	10000	False	260.572 ns	0.0642 ns	0.70

LastIndexOfAny_Char	main	10000	False	327.438 ns	0.1851 ns	1.00
LastIndexOfAny_Char	pr	10000	False	210.978 ns	0.0945 ns	0.64

LastIndexOfAnyExcept_Char	main	10000	False	384.517 ns	0.1233 ns	1.00
LastIndexOfAnyExcept_Char	pr	10000	False	261.597 ns	0.0735 ns	0.68

Sherlock Regex benchmarks

https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L129

Toolchain	Pattern	Mean	Error	Ratio
main	(?i)Sher[a-z]+\|Hol[a-z]+	103,522.91 ns	1,600.349 ns	1.00
pr	(?i)Sher[a-z]+\|Hol[a-z]+	104,615.26 ns	2,585.538 ns	1.01

main	(?i)Sherlock\|Holmes\|Watson	94,673.68 ns	4,766.057 ns	1.01
pr	(?i)Sherlock\|Holmes\|Watson	95,724.15 ns	4,285.540 ns	1.02

main	(?i)Sherlock\|(...)er\|John\|Baker [49]	289,495.97 ns	4,839.174 ns	1.00
pr	(?i)Sherlock\|(...)er\|John\|Baker [49]	284,341.73 ns	6,128.478 ns	0.98

main	(?i)the	378,116.92 ns	1,339.989 ns	1.00
pr	(?i)the	379,105.12 ns	620.316 ns	1.00

main	(?m)^Sherlock(...)rlock Holmes$ [37]	82,288.25 ns	3,928.339 ns	1.01
pr	(?m)^Sherlock(...)rlock Holmes$ [37]	82,778.06 ns	3,832.797 ns	1.01

main	(?s).*	53.28 ns	0.135 ns	1.00
pr	(?s).*	52.75 ns	0.128 ns	0.99

main	[^\\n]*	735,174.34 ns	22,699.109 ns	1.00
pr	[^\\n]*	739,572.06 ns	28,992.773 ns	1.01

main	[a-q][^u-z]{13}x	33,777.02 ns	801.044 ns	1.00
pr	[a-q][^u-z]{13}x	33,651.23 ns	744.776 ns	1.00

main	[a-zA-Z]+ing	4,645,316.04 ns	16,973.212 ns	1.00
pr	[a-zA-Z]+ing	4,841,529.48 ns	8,956.256 ns	1.04

main	\b\w+n\b	9,815,880.28 ns	33,565.510 ns	1.00
pr	\b\w+n\b	11,018,320.07 ns	286,473.844 ns	1.12

main	\p{L}	10,517,664.75 ns	8,946.311 ns	1.00
pr	\p{L}	11,916,466.49 ns	409,511.232 ns	1.13

main	\p{Ll}	10,673,143.82 ns	331,458.549 ns	1.00
pr	\p{Ll}	10,787,613.37 ns	13,621.084 ns	1.01

main	\p{Lu}	554,702.13 ns	13,089.124 ns	1.00
pr	\p{Lu}	436,503.07 ns	1,401.226 ns	0.79

main	\s[a-zA-Z]{0,12}ing\s	4,969,805.46 ns	14,676.201 ns	1.00
pr	\s[a-zA-Z]{0,12}ing\s	5,237,948.60 ns	41,554.425 ns	1.05

main	\w+	5,147,282.40 ns	11,110.191 ns	1.00
pr	\w+	5,464,592.92 ns	14,554.391 ns	1.06

main	\w+\s+Holmes	3,986,310.25 ns	17,035.155 ns	1.00
pr	\w+\s+Holmes	4,256,361.76 ns	22,193.135 ns	1.07

main	\w+\s+Holmes\s+\w+	4,161,642.67 ns	10,531.738 ns	1.00
pr	\w+\s+Holmes\s+\w+	4,425,879.41 ns	6,260.327 ns	1.06

main	Holmes.{0,25}(...).{0,25}Holmes [39]	68,435.58 ns	445.729 ns	1.00
pr	Holmes.{0,25}(...).{0,25}Holmes [39]	69,696.18 ns	437.339 ns	1.02

main	Sher[a-z]+\|Hol[a-z]+	65,242.66 ns	814.983 ns	1.00
pr	Sher[a-z]+\|Hol[a-z]+	65,292.98 ns	691.412 ns	1.00

main	Sherlock\s+Holmes	82,699.64 ns	3,829.205 ns	1.01
pr	Sherlock\s+Holmes	86,351.23 ns	352.296 ns	1.05

main	Sherlock\|Holmes\|Watson	74,601.68 ns	510.511 ns	1.00
pr	Sherlock\|Holmes\|Watson	74,047.63 ns	292.165 ns	0.99

main	Sherlock\|Holm(...)er\|John\|Baker [45]	202,258.67 ns	1,713.216 ns	1.00
pr	Sherlock\|Holm(...)er\|John\|Baker [45]	212,293.86 ns	670.855 ns	1.05

main	the\s+\w+	458,843.06 ns	5,086.162 ns	1.00
pr	the\s+\w+	448,289.28 ns	1,171.229 ns	0.98

JulieLeeMSFT · 2024-06-18T17:51:28Z

@MihaZupan, we are reaching preview 7 snap soon. Do you plan to finsih this work for .NET 9?

MihaZupan · 2024-07-10T18:00:35Z

Change is being reverted in #104688

stephentoub added the area-System.Runtime label Oct 9, 2023

stephentoub added this to the 9.0.0 milestone Oct 9, 2023

stephentoub mentioned this issue Oct 9, 2023

Intel architecture improvements for .NET 9 #93196

Closed

33 tasks

stephentoub changed the title ~~Ligh-up IndexOfAnyAsciiSearcher for AVX512~~ Light-up IndexOfAnyAsciiSearcher for AVX512 Oct 9, 2023

stephentoub changed the title ~~Light-up IndexOfAnyAsciiSearcher for AVX512~~ Light up IndexOfAnyAsciiSearcher for AVX512 Oct 9, 2023

BruceForstall added the avx512 Related to the AVX-512 architecture label Mar 18, 2024

JulieLeeMSFT assigned MihaZupan Jun 18, 2024

MihaZupan mentioned this issue Jun 19, 2024

Add Avx512 support to IndexOfAnyAsciiSearcher #103710

Merged

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Jun 19, 2024

MihaZupan closed this as completed in #103710 Jun 22, 2024

MihaZupan reopened this Jul 10, 2024

MihaZupan removed their assignment Jul 10, 2024

MihaZupan removed the in-pr There is an active PR which will close this issue when it is merged label Jul 10, 2024

MihaZupan modified the milestones: 9.0.0, 10.0.0 Jul 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

stephentoub commented Oct 9, 2023 •

edited

Loading

ghost commented Oct 9, 2023

MihaZupan commented Oct 9, 2023

Ruihan-Yin commented Jan 18, 2024

MihaZupan commented Jan 22, 2024 •

edited

Loading

Ruihan-Yin commented Jan 24, 2024

MihaZupan commented Jan 24, 2024

Ruihan-Yin commented Jan 24, 2024

MihaZupan commented Jan 24, 2024 •

edited

Loading

MihaZupan commented Jan 24, 2024 •

edited

Loading

MihaZupan commented Apr 21, 2024

JulieLeeMSFT commented Jun 18, 2024

MihaZupan commented Jul 10, 2024

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

Light up IndexOfAnyAsciiSearcher for AVX512 #93222

Comments

stephentoub commented Oct 9, 2023 • edited Loading

ghost commented Oct 9, 2023

MihaZupan commented Oct 9, 2023

Ruihan-Yin commented Jan 18, 2024

MihaZupan commented Jan 22, 2024 • edited Loading

Ruihan-Yin commented Jan 24, 2024

MihaZupan commented Jan 24, 2024

Ruihan-Yin commented Jan 24, 2024

MihaZupan commented Jan 24, 2024 • edited Loading

MihaZupan commented Jan 24, 2024 • edited Loading

MihaZupan commented Apr 21, 2024

JulieLeeMSFT commented Jun 18, 2024

MihaZupan commented Jul 10, 2024

stephentoub commented Oct 9, 2023 •

edited

Loading

MihaZupan commented Jan 22, 2024 •

edited

Loading

MihaZupan commented Jan 24, 2024 •

edited

Loading

MihaZupan commented Jan 24, 2024 •

edited

Loading