Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize find_end #4943

Merged
merged 13 commits into from
Oct 24, 2024
Merged

Vectorize find_end #4943

merged 13 commits into from
Oct 24, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Sep 9, 2024

📜 Overview

  • Similar to Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745 in the opposite direction
  • The algorithm is not very similar though. I have not attempted the one-index pcmpestri instruction and variable step, but it looked like multiple-indices pcmpestrm and fixed step would work better for this direction, because the former would return more partial false matches than in forward direction, as the match with higher index is more likely to be partial.
    • If that's not convincing enough, I can try the other approach
    • pcmpestrm approach could have been made more efficient if DevCom-10689455 is fixed, but honestly I don't hope for that much
  • Also restructured control flow a vit in __std_search_impl, the place where _Match_1st_16 is introduced. I basically implemented the same thing from scratch, and the new attempt gave clearer control flow implementation. They are not shared though, as _First1 > _Stop1 check is needed for variable step, but not needed for fixed step, so here's still some variation.
  • The types larger than 2 bytes were excluded from test and benchmark -- same way as we don't test replace with smaller elements.
  • The benchmark patterns list is expanded to test small and large needle closer to the other end. I deliberately left coverage for search + closer to start and find_end + closer to end, It is useful to test the case where startup cost contributes more.
  • Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

👿 The Evil case

Unlike many other algorithms, search algorithm run time highly depends on the data. It is possible to craft data which takes way more time than typical data of the same length.

The worst case could be haystack of a single repeating value, and the needle with the same repeating values and another value in the end. This is bad for plain search, although Boyer-Moore or similar algorithms are expected to handle that better.

⚠️ The results for find_end are much worse for that case ⚠️

⚠️ The results for search are also worse after vectorization for such case, but not that much as find_end ⚠️

It look possible to detect the evil case (by the amount of matched beginnings per some amount of data) and switch to another algorithm, or to modify the whole algorithm to be less affected. It can be done in this or subsequent PR.

I doubt how much the Evil case is important.

🏁 Benchmark results

Legend for the benchmark parameter:

/* 0. Small, closer to end */ {common_src_data, "aliquet"sv},
/* 1. Large, closer to end */ {common_src_data, "aliquet malesuada"sv},
/* 2. Small, closer to begin */ {common_src_data, "pulvinar"sv},
/* 3. Large, closer to begin */ {common_src_data, "dapibus elit interdum"sv},
/* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>},
/* 5. Large, evli */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>},

Benchmark main this
classic_find_endstd::uint8_t/0 102 ns 22.4 ns
classic_find_endstd::uint8_t/1 107 ns 23.8 ns
classic_find_endstd::uint8_t/2 1447 ns 298 ns
classic_find_endstd::uint8_t/3 1734 ns 384 ns
classic_find_endstd::uint8_t/4 5489 ns 4998 ns
classic_find_endstd::uint8_t/5 17246 ns 15661 ns
classic_find_endstd::uint16_t/0 103 ns 39.2 ns
classic_find_endstd::uint16_t/1 103 ns 44.7 ns
classic_find_endstd::uint16_t/2 1444 ns 630 ns
classic_find_endstd::uint16_t/3 1740 ns 822 ns
classic_find_endstd::uint16_t/4 6487 ns 9816 ns
classic_find_endstd::uint16_t/5 14878 ns 20222 ns
ranges_find_endstd::uint8_t/0 101 ns 22.9 ns
ranges_find_endstd::uint8_t/1 109 ns 24.9 ns
ranges_find_endstd::uint8_t/2 1451 ns 303 ns
ranges_find_endstd::uint8_t/3 1767 ns 390 ns
ranges_find_endstd::uint8_t/4 5363 ns 5087 ns
ranges_find_endstd::uint8_t/5 16254 ns 15942 ns
ranges_find_endstd::uint16_t/0 103 ns 39.4 ns
ranges_find_endstd::uint16_t/1 104 ns 46.1 ns
ranges_find_endstd::uint16_t/2 1457 ns 633 ns
ranges_find_endstd::uint16_t/3 1728 ns 811 ns
ranges_find_endstd::uint16_t/4 5343 ns 9832 ns
ranges_find_endstd::uint16_t/5 14883 ns 20222 ns

Re-testong of the search PR against then-main as there are more cases in the benchmark:

Benchmark main this
c_strstr/0 190 ns 188 ns
c_strstr/1 219 ns 212 ns
c_strstr/2 14.2 ns 13.8 ns
c_strstr/3 10.7 ns 9.92 ns
c_strstr/4 1478 ns 1416 ns
c_strstr/5 17965 ns 16897 ns
classic_searchstd::uint8_t/0 2193 ns 275 ns
classic_searchstd::uint8_t/1 2455 ns 305 ns
classic_searchstd::uint8_t/2 144 ns 30.6 ns
classic_searchstd::uint8_t/3 66.9 ns 17.0 ns
classic_searchstd::uint8_t/4 5315 ns 1439 ns
classic_searchstd::uint8_t/5 15060 ns 11813 ns
classic_searchstd::uint16_t/0 1460 ns 519 ns
classic_searchstd::uint16_t/1 1606 ns 571 ns
classic_searchstd::uint16_t/2 130 ns 56.1 ns
classic_searchstd::uint16_t/3 56.9 ns 27.1 ns
classic_searchstd::uint16_t/4 5342 ns 7312 ns
classic_searchstd::uint16_t/5 28964 ns 20472 ns
ranges_searchstd::uint8_t/0 2102 ns 275 ns
ranges_searchstd::uint8_t/1 2384 ns 301 ns
ranges_searchstd::uint8_t/2 147 ns 30.5 ns
ranges_searchstd::uint8_t/3 76.2 ns 17.0 ns
ranges_searchstd::uint8_t/4 5325 ns 1438 ns
ranges_searchstd::uint8_t/5 15573 ns 11803 ns
ranges_searchstd::uint16_t/0 1482 ns 519 ns
ranges_searchstd::uint16_t/1 1634 ns 571 ns
ranges_searchstd::uint16_t/2 155 ns 55.4 ns
ranges_searchstd::uint16_t/3 70.0 ns 28.1 ns
ranges_searchstd::uint16_t/4 5339 ns 7338 ns
ranges_searchstd::uint16_t/5 22691 ns 20377 ns
search_default_searcherstd::uint8_t/0 1963 ns 273 ns
search_default_searcherstd::uint8_t/1 2182 ns 301 ns
search_default_searcherstd::uint8_t/2 147 ns 30.4 ns
search_default_searcherstd::uint8_t/3 60.4 ns 16.6 ns
search_default_searcherstd::uint8_t/4 5816 ns 1441 ns
search_default_searcherstd::uint8_t/5 20702 ns 11753 ns
search_default_searcherstd::uint16_t/0 2443 ns 519 ns
search_default_searcherstd::uint16_t/1 2671 ns 607 ns
search_default_searcherstd::uint16_t/2 204 ns 55.6 ns
search_default_searcherstd::uint16_t/3 92.1 ns 27.4 ns
search_default_searcherstd::uint16_t/4 5676 ns 7294 ns
search_default_searcherstd::uint16_t/5 30609 ns 20342 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 9, 2024 20:07
@StephanTLavavej StephanTLavavej added the performance Must go faster label Sep 9, 2024
@StephanTLavavej StephanTLavavej self-assigned this Sep 9, 2024
@AlexGuteniev
Copy link
Contributor Author

  • Also added couple of "evil" patterns in benchmark. The goal for them is not to make them too much worse. I leave up to maintainers to decide if this goal is achieved

I can look into improving the "evil" case results by detecting it and switching strategy

Resolved adjacent add/edit conflict in stl/src/vector_algorithms.cpp.
@StephanTLavavej

This comment was marked as resolved.

stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
benchmarks/src/search.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Show resolved Hide resolved
benchmarks/src/search.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

Thanks! 😸 I pushed some fixes, please double-check.

Final perf results on my 5950X look good. The regressions for the highly pathological cases aren't too bad, and I don't think we need to add extra implementation complexity to avoid them.

Benchmark Before After Speedup Notes
classic_find_end<std::uint8_t>/0 76.8 ns 19.1 ns 4.02 Small, closer to end
classic_find_end<std::uint8_t>/1 73.7 ns 21.6 ns 3.41 Large, closer to end
classic_find_end<std::uint8_t>/2 1311 ns 195 ns 6.72 Small, closer to begin
classic_find_end<std::uint8_t>/3 1518 ns 282 ns 5.38 Large, closer to begin
classic_find_end<std::uint8_t>/4 5427 ns 6847 ns 0.79 Small, evil
classic_find_end<std::uint8_t>/5 16368 ns 19204 ns 0.85 Large, evil
classic_find_end<std::uint16_t>/0 80.3 ns 29.9 ns 2.69 Small, closer to end
classic_find_end<std::uint16_t>/1 76.6 ns 34.2 ns 2.24 Large, closer to end
classic_find_end<std::uint16_t>/2 1318 ns 453 ns 2.91 Small, closer to begin
classic_find_end<std::uint16_t>/3 1539 ns 579 ns 2.66 Large, closer to begin
classic_find_end<std::uint16_t>/4 8954 ns 13588 ns 0.66 Small, evil
classic_find_end<std::uint16_t>/5 27326 ns 24983 ns 1.09 Large, evil
ranges_find_end<std::uint8_t>/0 80.1 ns 19.4 ns 4.13 Small, closer to end
ranges_find_end<std::uint8_t>/1 76.0 ns 21.9 ns 3.47 Large, closer to end
ranges_find_end<std::uint8_t>/2 1322 ns 196 ns 6.74 Small, closer to begin
ranges_find_end<std::uint8_t>/3 1557 ns 282 ns 5.52 Large, closer to begin
ranges_find_end<std::uint8_t>/4 8970 ns 6845 ns 1.31 Small, evil
ranges_find_end<std::uint8_t>/5 27365 ns 19232 ns 1.42 Large, evil
ranges_find_end<std::uint16_t>/0 80.7 ns 30.4 ns 2.65 Small, closer to end
ranges_find_end<std::uint16_t>/1 76.1 ns 35.0 ns 2.17 Large, closer to end
ranges_find_end<std::uint16_t>/2 1318 ns 442 ns 2.98 Small, closer to begin
ranges_find_end<std::uint16_t>/3 1538 ns 570 ns 2.70 Large, closer to begin
ranges_find_end<std::uint16_t>/4 8953 ns 13605 ns 0.66 Small, evil
ranges_find_end<std::uint16_t>/5 27362 ns 25015 ns 1.09 Large, evil

@StephanTLavavej StephanTLavavej removed their assignment Oct 22, 2024
@AlexGuteniev
Copy link
Contributor Author

I observe that you dropped parens in (a == b) ? c : d here, but not in bitset from stream optimization. Is there a rule?

@StephanTLavavej
Copy link
Member

Nah, I just wasn't consistently complaining. 😹 When an expression in a conditional operator is especially long then I don't find extra parens to be as objectionable.

@StephanTLavavej StephanTLavavej self-assigned this Oct 23, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit d15fd49 into microsoft:main Oct 24, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

Thanks for finding a way to make this faster! 🔍 🔎 😹

@StephanTLavavej StephanTLavavej changed the title Vectorize find_end Vectorize find_end Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants