Vectorize `std::search` of 1 and 2 bytes elements with `pcmpestri` #4745

AlexGuteniev · 2024-06-24T12:12:43Z

Different approach for both search and inner comparison (SSE4.2 instead of AVX2). This time the results are better.

For now, 1- and 2-byte elements only. The same approach, slightly modified, could be used for 4- and 8-byte elements, but I need to test whether there would still be a performance gain.

In the benchmark results, 0 is the small needle and 1 is the large needle.

| Benchmark | `main` | this |
| --- | --- | --- |
| `c_strstr/0` | 186 ns | 184 ns |
| `c_strstr/1` | 213 ns | 213 ns |
| `classic_search<std::uint8_t>/0` | 2045 ns | 270 ns |
| `classic_search<std::uint8_t>/1` | 2221 ns | 302 ns |
| `classic_search<std::uint16_t>/0` | 1588 ns | 531 ns |
| `classic_search<std::uint16_t>/1` | 1766 ns | 586 ns |
| `ranges_search<std::uint8_t>/0` | 1748 ns | 268 ns |
| `ranges_search<std::uint8_t>/1` | 1989 ns | 306 ns |
| `ranges_search<std::uint16_t>/0` | 1673 ns | 585 ns |
| `ranges_search<std::uint16_t>/1` | 1843 ns | 600 ns |
| `search_default_searcher<std::uint8_t>/0` | 1494 ns | 269 ns |
| `search_default_searcher<std::uint8_t>/1` | 1626 ns | 309 ns |
| `search_default_searcher<std::uint16_t>/0` | 2002 ns | 528 ns |
| `search_default_searcher<std::uint16_t>/1` | 2286 ns | 599 ns |
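
For readers unfamiliar with `pcmpestri`, here is a minimal sketch of how the instruction can drive a substring search over 1-byte elements. It is not the code from this PR: `find_bytes_sketch` is a hypothetical helper, it assumes the "equal ordered" comparison mode is the relevant one, it only handles needles of up to 16 bytes (longer ones fall back to `std::search`), and it copies every chunk into a local buffer for simplicity rather than speed. The SSE4.2 path is compiled only when the compiler defines `__SSE4_2__`; with other compilers the guard would need adjusting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#if defined(__SSE4_2__)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#endif

// Hypothetical helper for illustration; returns the index of the first match or npos.
inline std::size_t find_bytes_sketch(std::string_view hay, std::string_view needle) {
    if (needle.empty()) {
        return 0; // like std::search, an empty needle matches at the start
    }
#if defined(__SSE4_2__)
    if (needle.size() <= 16) {
        char needle_buf[16] = {};
        std::memcpy(needle_buf, needle.data(), needle.size());
        const __m128i needle_vec = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle_buf));
        const int needle_len     = static_cast<int>(needle.size());

        std::size_t pos = 0;
        while (pos + needle.size() <= hay.size()) {
            // Copy up to 16 haystack bytes so that we never read past the end.
            const std::size_t chunk_len = std::min<std::size_t>(16, hay.size() - pos);
            char chunk_buf[16]          = {};
            std::memcpy(chunk_buf, hay.data() + pos, chunk_len);
            const __m128i chunk_vec = _mm_loadu_si128(reinterpret_cast<const __m128i*>(chunk_buf));

            // "Equal ordered" reports the lowest index in the chunk where the needle
            // (or a prefix of it that runs off the end of the register) matches; 16 = none.
            const int idx = _mm_cmpestri(needle_vec, needle_len, chunk_vec, static_cast<int>(chunk_len),
                _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT);
            assert(idx >= 0 && idx <= 16);

            if (idx == 16) {
                pos += 16; // no full or partial match starts in this chunk
            } else if (static_cast<std::size_t>(idx) + needle.size() <= chunk_len) {
                return pos + static_cast<std::size_t>(idx); // the whole needle fits: full match
            } else {
                pos += static_cast<std::size_t>(idx); // partial match at the end: reload from there
            }
        }
        return std::string_view::npos;
    }
#endif
    // Portable fallback (also used for needles longer than 16 bytes in this sketch).
    const auto it = std::search(hay.begin(), hay.end(), needle.begin(), needle.end());
    return it == hay.end() ? std::string_view::npos : static_cast<std::size_t>(it - hay.begin());
}

inline void find_usage_example() {
    assert(find_bytes_sketch("hello world", "world") == 6);
    // The match straddles a 16-byte chunk boundary, exercising the partial-match path.
    const std::string hay = std::string(14, 'a') + "bcdef";
    assert(find_bytes_sketch(hay, "abcdef") == hay.find("abcdef"));
}
```

For 2-byte elements the same idea would presumably use `_SIDD_UWORD_OPS`, which compares 8 elements per register instead of 16; that halved width may be part of why the 2-byte speedup in the table is smaller.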

AlexGuteniev · 2024-06-24T13:31:36Z

The previous attempt was #4654 and it ended up being just memcmp removal; see #4654 (comment)

StephanTLavavej · 2024-09-05T23:01:30Z

Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:

| Benchmark | `main` | PR | Speedup (Old/New) |
| --- | --- | --- | --- |
| `c_strstr/0` | 142 ns | 143 ns | 0.99 |
| `c_strstr/1` | 157 ns | 162 ns | 0.97 |
| `classic_search<std::uint8_t>/0` | 1976 ns | 160 ns | 12.35 |
| `classic_search<std::uint8_t>/1` | 2153 ns | 175 ns | 12.30 |
| `classic_search<std::uint16_t>/0` | 1432 ns | 310 ns | 4.62 |
| `classic_search<std::uint16_t>/1` | 1557 ns | 344 ns | 4.53 |
| `ranges_search<std::uint8_t>/0` | 1561 ns | 160 ns | 9.76 |
| `ranges_search<std::uint8_t>/1` | 1689 ns | 176 ns | 9.60 |
| `ranges_search<std::uint16_t>/0` | 1594 ns | 311 ns | 5.13 |
| `ranges_search<std::uint16_t>/1` | 1747 ns | 345 ns | 5.06 |
| `search_default_searcher<std::uint8_t>/0` | 1660 ns | 160 ns | 10.38 |
| `search_default_searcher<std::uint8_t>/1` | 1796 ns | 174 ns | 10.32 |
| `search_default_searcher<std::uint16_t>/0` | 2222 ns | 309 ns | 7.19 |
| `search_default_searcher<std::uint16_t>/1` | 2421 ns | 345 ns | 7.02 |

| Benchmark | `main` | PR | Speedup (Old/New) |
| --- | --- | --- | --- |
| `classic_search<std::uint32_t>/0` | 1970 ns | 1979 ns | 1.00 |
| `classic_search<std::uint32_t>/1` | 2151 ns | 2148 ns | 1.00 |
| `classic_search<std::uint64_t>/0` | 1423 ns | 1387 ns | 1.03 |
| `classic_search<std::uint64_t>/1` | 1566 ns | 1527 ns | 1.03 |
| `ranges_search<std::uint32_t>/0` | 1591 ns | 1611 ns | 0.99 |
| `ranges_search<std::uint32_t>/1` | 1729 ns | 1760 ns | 0.98 |
| `ranges_search<std::uint64_t>/0` | 1605 ns | 1543 ns | 1.04 |
| `ranges_search<std::uint64_t>/1` | 1761 ns | 1691 ns | 1.04 |
| `search_default_searcher<std::uint32_t>/0` | 2234 ns | 1609 ns | 1.39 |
| `search_default_searcher<std::uint32_t>/1` | 2408 ns | 1752 ns | 1.37 |
| `search_default_searcher<std::uint64_t>/0` | 1620 ns | 2193 ns | 0.74 |
| `search_default_searcher<std::uint64_t>/1` | 1761 ns | 2366 ns | 0.74 |
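
As a reading aid for the "Speedup (Old/New)" column, here is a small sketch of the arithmetic: the `main` time divided by the PR time, so values above 1 mean the PR is faster and values below 1 mean it is slower. The function name `speedup` is made up for this illustration; the numbers are taken from the tables.

```cpp
#include <cassert>
#include <cstdio>

// Speedup as used in the tables: old (main) time divided by new (PR) time.
inline double speedup(double old_ns, double new_ns) {
    assert(new_ns > 0.0);
    return old_ns / new_ns;
}

inline void print_speedup_examples() {
    // classic_search<std::uint8_t>/0: 1976 ns on main, 160 ns with the PR.
    std::printf("%.2f\n", speedup(1976.0, 160.0));
    // search_default_searcher<std::uint64_t>/0: 1620 ns on main, 2193 ns with the PR.
    std::printf("%.2f\n", speedup(1620.0, 2193.0));
}
```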

Aside from `c_strstr`, which is of course unchanged, I'm seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.

I am mildly confused as to why performance for `search_default_searcher` seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the `if constexpr` should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
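
To illustrate what "completely vanishing" means, here is a minimal sketch of an element-size dispatch with `if constexpr`. The names `search_path`, `uses_vectorized_search`, and `search_sketch` are hypothetical and the exact condition used in the STL is an assumption. The point is only that the branch is chosen at compile time, so for 4- and 8-byte elements the vectorized branch is discarded and the generic code is all that gets instantiated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class search_path { vectorized, generic };

// Hypothetical condition: only 1- and 2-byte elements take the vectorized path.
template <class T>
constexpr bool uses_vectorized_search = sizeof(T) == 1 || sizeof(T) == 2;

// Returns which path was selected; writes the match position (or hay.size()) to pos.
template <class T>
search_path search_sketch(const std::vector<T>& hay, const std::vector<T>& needle, std::size_t& pos) {
    // Stand-in for the actual search work, which is not the subject of this sketch.
    const auto it = std::search(hay.begin(), hay.end(), needle.begin(), needle.end());
    pos           = static_cast<std::size_t>(it - hay.begin());

    if constexpr (uses_vectorized_search<T>) {
        // Only instantiated for 1- and 2-byte T; a real implementation would
        // call its vectorized routine here.
        return search_path::vectorized;
    } else {
        // For 4- and 8-byte T the branch above is discarded at compile time.
        return search_path::generic;
    }
}

inline void dispatch_usage_example() {
    std::size_t pos = 0;
    const std::vector<std::uint8_t> hay8{1, 2, 3, 4};
    const std::vector<std::uint8_t> needle8{3, 4};
    assert(search_sketch(hay8, needle8, pos) == search_path::vectorized && pos == 2);

    const std::vector<std::uint64_t> hay64{1, 2, 3, 4};
    const std::vector<std::uint64_t> needle64{3, 4};
    assert(search_sketch(hay64, needle64, pos) == search_path::generic && pos == 2);
}
```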

AlexGuteniev · 2024-09-06T03:21:35Z

> I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.

I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32- or 64-byte boundary is what matters. You may try `/QIntel-jcc-erratum` (yes, even though you run on AMD!) for both `main` and the changed code: build the whole import lib and the benchmark executable with it, and see if this variation disappears.

I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment will change again.

StephanTLavavej · 2024-09-06T03:25:44Z

Thanks, makes sense!

StephanTLavavej · 2024-09-09T07:01:46Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-09-09T19:29:59Z

🔍 🕵️ 🔎

AlexGuteniev and others added 28 commits May 5, 2024 13:47

vectorize search

73c96da

very tail fix

0c17a53

I 🧡 ADL

11c05ee

unify ipsum

d4fcc96

-newline

da5cf2e

strstr for competition

da157b1

missing progress

772c513

coverage

2c6c329

these tests are too long

81a6000

missing include

0b59b2e

default_searcher

f2806c5

ADL again

15e54a9

avoid memcmp in fallback

26646fe

partial review comment

0c473a4

Merge branch 'main' into search

3452fcc

Internal static assert sizeof(_Ty1) == sizeof(_Ty2).

629afd4

Use += and + instead of _RANGES next.

a24e6eb

Style: Return _Ptr_res1 instead of _Ptr_last1 when they're equal.

9d07a40

Style: In <algorithm> and <functional>, _Ptr_last1 doesn't need…

d57f9b6

… to be named.

Restore top-level constness for _UFirst2.

e51b98d

Benchmark classic search().

d4462a5

Simplify last_known_good_search().

95ba820

Who's a good search? You are! Yes you!

Revert vectorized implementation.

72a0d29

Drop memcmp paths from _Equal_rev_pred_unchecked and `_Equal_rev_…

38b32d6

…pred`. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`. `_Equal_rev_pred` is called by ranges `search`/`find_end`. This doesn't affect `equal` etc.

Merge remote-tracking branch 'upstream/main' into search

1e16233

Revert "Revert vectorized implementation."

f269d6c

This reverts commit 72a0d29.

drop 4 and 8 bytes search optimization for now

dc7eb5b

might restore one or both later

SSE4.2 madness

0926486

AlexGuteniev requested a review from a team as a code owner June 24, 2024 12:12

frederick-vs-ja reviewed Aug 28, 2024

Merge branch 'main' into search

93cdcf0

Resolved conflicts in xutility.

StephanTLavavej added 9 commits September 5, 2024 07:46

Avoid truncation warnings in _First1 + _Count2.

96a4d58

Style and comment nitpicks.

2a239a7

Benchmark: Use a constexpr array of string_view.

3bc1d56

Add const.

c1aaba7

Don't help the compiler - let it deduce _Ty.

6276567

Drop inconsistent _CSTD.

05e435d

input_needle is guaranteed non-empty here.

e7ec67a

Avoid permanently modifying the haystack.

9d11dcc

Bugfix: Use an unaligned load from _First2.

abae4ed

StephanTLavavej reviewed Sep 5, 2024

StephanTLavavej approved these changes Sep 5, 2024

StephanTLavavej removed their assignment Sep 5, 2024

AlexGuteniev added 2 commits September 7, 2024 15:18

_Count2 is more natural than _Last2

e96407b

-hiding

e0c843d

AlexGuteniev requested a review from StephanTLavavej September 7, 2024 16:05

StephanTLavavej approved these changes Sep 7, 2024

StephanTLavavej self-assigned this Sep 9, 2024

StephanTLavavej merged commit e931261 into microsoft:main Sep 9, 2024
39 checks passed

AlexGuteniev deleted the search branch September 9, 2024 19:30

AlexGuteniev mentioned this pull request Sep 9, 2024

Vectorize find_end #4943

Merged

StephanTLavavej mentioned this pull request Sep 11, 2024

Maintainer priorities #4700

Open
