Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize std::search of 1 and 2 bytes elements with pcmpestri #4745

Merged
merged 49 commits into from
Sep 9, 2024

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Jun 24, 2024

Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.

For now 1 and 2 bytes element only. The same slightly modified approach can be used for 4 and 8 bytes elements, but need to test if there would be still a performance gain.

In benchmark results 0 is small needle, 1 is large needle.

Benchmark main ths
c_strstr/0 186 ns 184 ns
c_strstr/1 213 ns 213 ns
classic_searchstd::uint8_t/0 2045 ns 270 ns
classic_searchstd::uint8_t/1 2221 ns 302 ns
classic_searchstd::uint16_t/0 1588 ns 531 ns
classic_searchstd::uint16_t/1 1766 ns 586 ns
ranges_searchstd::uint8_t/0 1748 ns 268 ns
ranges_searchstd::uint8_t/1 1989 ns 306 ns
ranges_searchstd::uint16_t/0 1673 ns 585 ns
ranges_searchstd::uint16_t/1 1843 ns 600 ns
search_default_searcherstd::uint8_t/0 1494 ns 269 ns
search_default_searcherstd::uint8_t/1 1626 ns 309 ns
search_default_searcherstd::uint16_t/0 2002 ns 528 ns
search_default_searcherstd::uint16_t/1 2286 ns 599 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner June 24, 2024 12:12
@AlexGuteniev
Copy link
Contributor Author

The previous attempt was #4654 and it ended up being just memcmp removal; see #4654 (comment)

Resolved conflicts in xutility.
@StephanTLavavej

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as resolved.

stl/inc/xutility Outdated Show resolved Hide resolved
stl/inc/xutility Outdated Show resolved Hide resolved
stl/inc/functional Outdated Show resolved Hide resolved
benchmarks/src/search.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
tests/std/tests/VSO_0000000_vector_algorithms/test.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:

Benchmark main PR Speedup (Old/New)
c_strstr/0 142 ns 143 ns 0.99
c_strstr/1 157 ns 162 ns 0.97
classic_search<std::uint8_t>/0 1976 ns 160 ns 12.35
classic_search<std::uint8_t>/1 2153 ns 175 ns 12.30
classic_search<std::uint16_t>/0 1432 ns 310 ns 4.62
classic_search<std::uint16_t>/1 1557 ns 344 ns 4.53
ranges_search<std::uint8_t>/0 1561 ns 160 ns 9.76
ranges_search<std::uint8_t>/1 1689 ns 176 ns 9.60
ranges_search<std::uint16_t>/0 1594 ns 311 ns 5.13
ranges_search<std::uint16_t>/1 1747 ns 345 ns 5.06
search_default_searcher<std::uint8_t>/0 1660 ns 160 ns 10.38
search_default_searcher<std::uint8_t>/1 1796 ns 174 ns 10.32
search_default_searcher<std::uint16_t>/0 2222 ns 309 ns 7.19
search_default_searcher<std::uint16_t>/1 2421 ns 345 ns 7.02
Benchmark main PR Speedup (Old/New)
classic_search<std::uint32_t>/0 1970 ns 1979 ns 1.00
classic_search<std::uint32_t>/1 2151 ns 2148 ns 1.00
classic_search<std::uint64_t>/0 1423 ns 1387 ns 1.03
classic_search<std::uint64_t>/1 1566 ns 1527 ns 1.03
ranges_search<std::uint32_t>/0 1591 ns 1611 ns 0.99
ranges_search<std::uint32_t>/1 1729 ns 1760 ns 0.98
ranges_search<std::uint64_t>/0 1605 ns 1543 ns 1.04
ranges_search<std::uint64_t>/1 1761 ns 1691 ns 1.04
search_default_searcher<std::uint32_t>/0 2234 ns 1609 ns 1.39
search_default_searcher<std::uint32_t>/1 2408 ns 1752 ns 1.37
search_default_searcher<std::uint64_t>/0 1620 ns 2193 ns 0.74
search_default_searcher<std::uint64_t>/1 1761 ns 2366 ns 0.74

Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

@StephanTLavavej StephanTLavavej removed their assignment Sep 5, 2024
@AlexGuteniev
Copy link
Contributor Author

I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

I guess the biggest of codegen gremlin is exact loop alignment. The compiler only align functions to 16-byte boundary, whereas apparently like 32 or 64 bytes boundary in important. You may try /QIntel-jcc-erratum, (yes, even despite you run on AMD!) for both main and changed code, build whole import lib and the benchmark executable with it, and see if this variation disappears.

I've seen this happening even when changing unrelated functions. That's why it doesn't worth hunting for -- eventually we will add or change even more unrelated functions, and alignment would change again.

@StephanTLavavej
Copy link
Member

Thanks, makes sense!

@StephanTLavavej StephanTLavavej self-assigned this Sep 9, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit e931261 into microsoft:main Sep 9, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

🔍 🕵️ 🔎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants