Vectorize `find_end` #4943

AlexGuteniev · 2024-09-09T20:07:53Z

📜 Overview

Similar to Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745 in the opposite direction
The algorithm is not very similar though. I have not attempted the one-index pcmpestri instruction and variable step, but it looked like multiple-indices pcmpestrm and fixed step would work better for this direction, because the former would return more partial false matches than in forward direction, as the match with higher index is more likely to be partial.
- If that's not convincing enough, I can try the other approach
- pcmpestrm approach could have been made more efficient if DevCom-10689455 is fixed, but honestly I don't hope for that much
Also restructured control flow a vit in __std_search_impl, the place where _Match_1st_16 is introduced. I basically implemented the same thing from scratch, and the new attempt gave clearer control flow implementation. They are not shared though, as _First1 > _Stop1 check is needed for variable step, but not needed for fixed step, so here's still some variation.
The types larger than 2 bytes were excluded from test and benchmark -- same way as we don't test replace with smaller elements.
The benchmark patterns list is expanded to test small and large needle closer to the other end. I deliberately left coverage for search + closer to start and find_end + closer to end, It is useful to test the case where startup cost contributes more.
Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

👿 The Evil case

Unlike many other algorithms, search algorithm run time highly depends on the data. It is possible to craft data which takes way more time than typical data of the same length.

The worst case could be haystack of a single repeating value, and the needle with the same repeating values and another value in the end. This is bad for plain search, although Boyer-Moore or similar algorithms are expected to handle that better.

⚠️ The results for find_end are much worse for that case ⚠️

⚠️ The results for search are also worse after vectorization for such case, but not that much as find_end ⚠️

It look possible to detect the evil case (by the amount of matched beginnings per some amount of data) and switch to another algorithm, or to modify the whole algorithm to be less affected. It can be done in this or subsequent PR.

I doubt how much the Evil case is important.

🏁 Benchmark results

Legend for the benchmark parameter:

STL/benchmarks/src/search.cpp

Lines 71 to 77 in 60825c4

    
           /* 0. Small, closer to end   */ {common_src_data, "aliquet"sv}, 
        
           /* 1. Large, closer to end   */ {common_src_data, "aliquet malesuada"sv}, 
        
           /* 2. Small, closer to begin */ {common_src_data, "pulvinar"sv}, 
        
           /* 3. Large, closer to begin */ {common_src_data, "dapibus elit interdum"sv}, 
        
           /* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>}, 
        
           /* 5. Large, evli */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>},

Benchmark	main	this
classic_find_endstd::uint8_t/0	102 ns	22.4 ns
classic_find_endstd::uint8_t/1	107 ns	23.8 ns
classic_find_endstd::uint8_t/2	1447 ns	298 ns
classic_find_endstd::uint8_t/3	1734 ns	384 ns
classic_find_endstd::uint8_t/4	5489 ns	4998 ns
classic_find_endstd::uint8_t/5	17246 ns	15661 ns
classic_find_endstd::uint16_t/0	103 ns	39.2 ns
classic_find_endstd::uint16_t/1	103 ns	44.7 ns
classic_find_endstd::uint16_t/2	1444 ns	630 ns
classic_find_endstd::uint16_t/3	1740 ns	822 ns
classic_find_endstd::uint16_t/4	6487 ns	9816 ns
classic_find_endstd::uint16_t/5	14878 ns	20222 ns
ranges_find_endstd::uint8_t/0	101 ns	22.9 ns
ranges_find_endstd::uint8_t/1	109 ns	24.9 ns
ranges_find_endstd::uint8_t/2	1451 ns	303 ns
ranges_find_endstd::uint8_t/3	1767 ns	390 ns
ranges_find_endstd::uint8_t/4	5363 ns	5087 ns
ranges_find_endstd::uint8_t/5	16254 ns	15942 ns
ranges_find_endstd::uint16_t/0	103 ns	39.4 ns
ranges_find_endstd::uint16_t/1	104 ns	46.1 ns
ranges_find_endstd::uint16_t/2	1457 ns	633 ns
ranges_find_endstd::uint16_t/3	1728 ns	811 ns
ranges_find_endstd::uint16_t/4	5343 ns	9832 ns
ranges_find_endstd::uint16_t/5	14883 ns	20222 ns

Re-testong of the search PR against then-main as there are more cases in the benchmark:

Benchmark	main	this
c_strstr/0	190 ns	188 ns
c_strstr/1	219 ns	212 ns
c_strstr/2	14.2 ns	13.8 ns
c_strstr/3	10.7 ns	9.92 ns
c_strstr/4	1478 ns	1416 ns
c_strstr/5	17965 ns	16897 ns
classic_searchstd::uint8_t/0	2193 ns	275 ns
classic_searchstd::uint8_t/1	2455 ns	305 ns
classic_searchstd::uint8_t/2	144 ns	30.6 ns
classic_searchstd::uint8_t/3	66.9 ns	17.0 ns
classic_searchstd::uint8_t/4	5315 ns	1439 ns
classic_searchstd::uint8_t/5	15060 ns	11813 ns
classic_searchstd::uint16_t/0	1460 ns	519 ns
classic_searchstd::uint16_t/1	1606 ns	571 ns
classic_searchstd::uint16_t/2	130 ns	56.1 ns
classic_searchstd::uint16_t/3	56.9 ns	27.1 ns
classic_searchstd::uint16_t/4	5342 ns	7312 ns
classic_searchstd::uint16_t/5	28964 ns	20472 ns
ranges_searchstd::uint8_t/0	2102 ns	275 ns
ranges_searchstd::uint8_t/1	2384 ns	301 ns
ranges_searchstd::uint8_t/2	147 ns	30.5 ns
ranges_searchstd::uint8_t/3	76.2 ns	17.0 ns
ranges_searchstd::uint8_t/4	5325 ns	1438 ns
ranges_searchstd::uint8_t/5	15573 ns	11803 ns
ranges_searchstd::uint16_t/0	1482 ns	519 ns
ranges_searchstd::uint16_t/1	1634 ns	571 ns
ranges_searchstd::uint16_t/2	155 ns	55.4 ns
ranges_searchstd::uint16_t/3	70.0 ns	28.1 ns
ranges_searchstd::uint16_t/4	5339 ns	7338 ns
ranges_searchstd::uint16_t/5	22691 ns	20377 ns
search_default_searcherstd::uint8_t/0	1963 ns	273 ns
search_default_searcherstd::uint8_t/1	2182 ns	301 ns
search_default_searcherstd::uint8_t/2	147 ns	30.4 ns
search_default_searcherstd::uint8_t/3	60.4 ns	16.6 ns
search_default_searcherstd::uint8_t/4	5816 ns	1441 ns
search_default_searcherstd::uint8_t/5	20702 ns	11753 ns
search_default_searcherstd::uint16_t/0	2443 ns	519 ns
search_default_searcherstd::uint16_t/1	2671 ns	607 ns
search_default_searcherstd::uint16_t/2	204 ns	55.6 ns
search_default_searcherstd::uint16_t/3	92.1 ns	27.4 ns
search_default_searcherstd::uint16_t/4	5676 ns	7294 ns
search_default_searcherstd::uint16_t/5	30609 ns	20342 ns

AlexGuteniev · 2024-09-10T06:44:39Z

Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

I can look into improving the "evil" case results by detecting it and switching strategy

Resolved adjacent add/edit conflict in stl/src/vector_algorithms.cpp.

stl/src/vector_algorithms.cpp

benchmarks/src/search.cpp

stl/src/vector_algorithms.cpp

benchmarks/src/search.cpp

stl/src/vector_algorithms.cpp

StephanTLavavej · 2024-10-22T19:50:41Z

Thanks! 😸 I pushed some fixes, please double-check.

Final perf results on my 5950X look good. The regressions for the highly pathological cases aren't too bad, and I don't think we need to add extra implementation complexity to avoid them.

Benchmark	Before	After	Speedup	Notes
`classic_find_end<std::uint8_t>/0`	76.8 ns	19.1 ns	4.02	Small, closer to end
`classic_find_end<std::uint8_t>/1`	73.7 ns	21.6 ns	3.41	Large, closer to end
`classic_find_end<std::uint8_t>/2`	1311 ns	195 ns	6.72	Small, closer to begin
`classic_find_end<std::uint8_t>/3`	1518 ns	282 ns	5.38	Large, closer to begin
`classic_find_end<std::uint8_t>/4`	5427 ns	6847 ns	0.79	Small, evil
`classic_find_end<std::uint8_t>/5`	16368 ns	19204 ns	0.85	Large, evil
`classic_find_end<std::uint16_t>/0`	80.3 ns	29.9 ns	2.69	Small, closer to end
`classic_find_end<std::uint16_t>/1`	76.6 ns	34.2 ns	2.24	Large, closer to end
`classic_find_end<std::uint16_t>/2`	1318 ns	453 ns	2.91	Small, closer to begin
`classic_find_end<std::uint16_t>/3`	1539 ns	579 ns	2.66	Large, closer to begin
`classic_find_end<std::uint16_t>/4`	8954 ns	13588 ns	0.66	Small, evil
`classic_find_end<std::uint16_t>/5`	27326 ns	24983 ns	1.09	Large, evil
`ranges_find_end<std::uint8_t>/0`	80.1 ns	19.4 ns	4.13	Small, closer to end
`ranges_find_end<std::uint8_t>/1`	76.0 ns	21.9 ns	3.47	Large, closer to end
`ranges_find_end<std::uint8_t>/2`	1322 ns	196 ns	6.74	Small, closer to begin
`ranges_find_end<std::uint8_t>/3`	1557 ns	282 ns	5.52	Large, closer to begin
`ranges_find_end<std::uint8_t>/4`	8970 ns	6845 ns	1.31	Small, evil
`ranges_find_end<std::uint8_t>/5`	27365 ns	19232 ns	1.42	Large, evil
`ranges_find_end<std::uint16_t>/0`	80.7 ns	30.4 ns	2.65	Small, closer to end
`ranges_find_end<std::uint16_t>/1`	76.1 ns	35.0 ns	2.17	Large, closer to end
`ranges_find_end<std::uint16_t>/2`	1318 ns	442 ns	2.98	Small, closer to begin
`ranges_find_end<std::uint16_t>/3`	1538 ns	570 ns	2.70	Large, closer to begin
`ranges_find_end<std::uint16_t>/4`	8953 ns	13605 ns	0.66	Small, evil
`ranges_find_end<std::uint16_t>/5`	27362 ns	25015 ns	1.09	Large, evil

AlexGuteniev · 2024-10-22T20:05:15Z

I observe that you dropped parens in (a == b) ? c : d here, but not in bitset from stream optimization. Is there a rule?

StephanTLavavej · 2024-10-22T20:14:56Z

Nah, I just wasn't consistently complaining. 😹 When an expression in a conditional operator is especially long then I don't find extra parens to be as objectionable.

StephanTLavavej · 2024-10-23T19:09:22Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-24T14:22:38Z

Thanks for finding a way to make this faster! 🔍 🔎 😹

Vectorize find_end

5de387c

AlexGuteniev requested a review from a team as a code owner September 9, 2024 20:07

StephanTLavavej added the performance Must go faster label Sep 9, 2024

StephanTLavavej self-assigned this Sep 9, 2024

Merge branch 'main' into find_end

60825c4

Resolved adjacent add/edit conflict in stl/src/vector_algorithms.cpp.

This comment was marked as resolved.

Sign in to view

AlexGuteniev and others added 11 commits October 17, 2024 23:09

Merge branch 'microsoft:main' into find_end

46a62fe

Merge branch 'main' into find_end

312141a

Fix comments.

b76dcd5

Use new-style CodeQL suppressions.

e664a8a

unsigned => unsigned int

47edc8b

Add _M_ARM64EC guards.

1290d77

The benchmark doesn't need <span>.

d85673c

Fix vectorized return for empty needles.

18862b9

Drop unnecessary parens.

cb3f947

Move _Size_el_2 within _Size_bytes_2 <= 16.

9193cdd

Simplify: _Size_el_2 is just _Count2 (avoiding truncation warnings).

70c55a6

StephanTLavavej reviewed Oct 22, 2024

View reviewed changes

StephanTLavavej approved these changes Oct 22, 2024

View reviewed changes

StephanTLavavej removed their assignment Oct 22, 2024

StephanTLavavej mentioned this pull request Oct 22, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 23, 2024

StephanTLavavej merged commit d15fd49 into microsoft:main Oct 24, 2024
39 checks passed

StephanTLavavej changed the title ~~Vectorize find_end~~ Vectorize find_end Oct 24, 2024

AlexGuteniev deleted the find_end branch October 24, 2024 14:24

StephanTLavavej mentioned this pull request Oct 25, 2024

Revert product changes for find_end vectorization #5041

Merged

AlexGuteniev mentioned this pull request Oct 26, 2024

Vectorize find_end, make sure ASan passes #5042

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vectorize `find_end` #4943

Vectorize `find_end` #4943

AlexGuteniev commented Sep 9, 2024 •

edited

Loading

AlexGuteniev commented Sep 10, 2024

This comment was marked as resolved.

StephanTLavavej commented Oct 22, 2024

AlexGuteniev commented Oct 22, 2024

StephanTLavavej commented Oct 22, 2024

StephanTLavavej commented Oct 23, 2024

StephanTLavavej commented Oct 24, 2024

	/* 0. Small, closer to end */ {common_src_data, "aliquet"sv},
	/* 1. Large, closer to end */ {common_src_data, "aliquet malesuada"sv},
	/* 2. Small, closer to begin */ {common_src_data, "pulvinar"sv},
	/* 3. Large, closer to begin */ {common_src_data, "dapibus elit interdum"sv},

	/* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>},
	/* 5. Large, evli */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>},

Vectorize find_end #4943

Vectorize find_end #4943

Conversation

AlexGuteniev commented Sep 9, 2024 • edited Loading

📜 Overview

👿 The Evil case

🏁 Benchmark results

AlexGuteniev commented Sep 10, 2024

This comment was marked as resolved.

StephanTLavavej commented Oct 22, 2024

AlexGuteniev commented Oct 22, 2024

StephanTLavavej commented Oct 22, 2024

StephanTLavavej commented Oct 23, 2024

StephanTLavavej commented Oct 24, 2024

Vectorize `find_end` #4943

Vectorize `find_end` #4943

AlexGuteniev commented Sep 9, 2024 •

edited

Loading