Vectorize find_first_of for 4 and 8 byte elements #4587

Merged on Apr 19, 2024 (23 commits)

AlexGuteniev (Contributor) commented Apr 13, 2024

The specialized approach shuffles the needle so that every needle element appears in every lane position, and compares each shuffled needle with a part of the haystack. It then ORs all the comparison results together and, if there is a match, returns the lowest matching index.
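The shuffle-and-compare idea can be sketched with a portable scalar model. This is not the code from the PR: the "vector" is modeled as a `std::array` of four 32-bit lanes, and the names `kLanes`, `Block`, `lane_eq_mask`, and `find_first_of_shuffle_model` are hypothetical. The sketch assumes the needle has between 1 and `kLanes` elements, and it fills unused needle lanes by repeating needle elements so that padding cannot produce a false match.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4; // models a 128-bit vector of 32-bit lanes
using Block = std::array<std::uint32_t, kLanes>;

// Lane-wise equality: bit i of the result is set when a[i] == b[i].
unsigned lane_eq_mask(const Block& a, const Block& b) {
    unsigned mask = 0;
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (a[i] == b[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
}

// Returns the index of the first haystack element equal to any needle
// element, or haystack.size() if there is none.
// Assumption of this sketch: 1 <= needle.size() <= kLanes.
std::size_t find_first_of_shuffle_model(const std::vector<std::uint32_t>& haystack,
                                        const std::vector<std::uint32_t>& needle) {
    assert(!needle.empty() && needle.size() <= kLanes);

    // Fill every lane by repeating needle elements.
    Block needle_lanes{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        needle_lanes[i] = needle[i % needle.size()];
    }

    for (std::size_t pos = 0; pos < haystack.size(); pos += kLanes) {
        const std::size_t valid = std::min(kLanes, haystack.size() - pos);
        Block hay{};
        for (std::size_t i = 0; i < valid; ++i) {
            hay[i] = haystack[pos + i];
        }

        // Rotate the needle through every lane position and OR the masks.
        unsigned mask = 0;
        Block rotated = needle_lanes;
        for (std::size_t step = 0; step < kLanes; ++step) {
            mask |= lane_eq_mask(hay, rotated);
            std::rotate(rotated.begin(), rotated.begin() + 1, rotated.end());
        }

        mask &= (1u << valid) - 1; // ignore lanes past the end of the haystack
        if (mask != 0) {
            std::size_t lane = 0;
            while ((mask & 1u) == 0) { // lowest set bit is the first match
                mask >>= 1;
                ++lane;
            }
            return pos + lane;
        }
    }
    return haystack.size();
}
```

In a real SIMD version the inner rotation loop would be a fixed sequence of shuffle and compare instructions, and the lowest set bit would come from a movemask plus a bit scan; the model only shows why ORing the per-rotation masks yields the first matching haystack position.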

The generic approach looks up each haystack element in the needle. It is like `find` with the haystack and needle roles reversed, but instead of using `vpmovmskb`, checking the resulting index, and computing an offset, it uses `vptest`, because the index within the needle is not needed.
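A portable scalar model of this generic approach might look as follows. Again this is only a sketch, not the PR's implementation: the needle is processed in blocks of `kWidth` lanes, and `any_lane_equals` stands in for the broadcast, compare, and `vptest`-style "any lane set" check. All names here are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWidth = 4; // models a 256-bit vector of 64-bit lanes
using Lanes = std::array<std::uint64_t, kWidth>;

// Models broadcast + compare + "is any lane set": true if any of the first
// `valid` lanes equals `value`. No lane index is produced, only yes/no.
bool any_lane_equals(const Lanes& block, std::size_t valid, std::uint64_t value) {
    bool any = false;
    for (std::size_t i = 0; i < valid; ++i) {
        any = any || (block[i] == value);
    }
    return any;
}

// Returns the index of the first haystack element found anywhere in the
// needle, or haystack.size() if there is none.
std::size_t find_first_of_generic_model(const std::vector<std::uint64_t>& haystack,
                                        const std::vector<std::uint64_t>& needle) {
    for (std::size_t h = 0; h < haystack.size(); ++h) {
        // Search the needle block by block for the current haystack element.
        for (std::size_t n = 0; n < needle.size(); n += kWidth) {
            const std::size_t valid = std::min(kWidth, needle.size() - n);
            Lanes block{};
            for (std::size_t i = 0; i < valid; ++i) {
                block[i] = needle[n + i];
            }
            if (any_lane_equals(block, valid, haystack[h])) {
                return h; // the haystack position is already known
            }
        }
    }
    return haystack.size();
}
```

Because the outer loop walks the haystack one element at a time, the answer is the current haystack index as soon as any needle lane matches, which is why a yes/no test is sufficient here.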

It looks like we would get much better results if the specialized approach were generalized, instead of having a separate, simpler generic approach. But the complexity would grow.

If only the generic approach were used, the complexity would be reduced, but performance for small needles would be worse: for small needles and 64-bit elements, the generic approach even loses to the scalar implementation.

Benchmark results

Before:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           2.84 ns         2.19 ns    527058824
bm<uint32_t>/7/4           13.5 ns         11.4 ns    112000000
bm<uint32_t>/9/3           13.3 ns         8.06 ns    128000000
bm<uint32_t>/22/5          48.9 ns         39.1 ns     33185185
bm<uint32_t>/58/2          58.8 ns         41.4 ns     24888889
bm<uint32_t>/102/4          135 ns          105 ns     10000000
bm<uint32_t>/325/1         15.4 ns         12.7 ns    112000000
bm<uint32_t>/1011/11       3720 ns         3125 ns       640000
bm<uint32_t>/3056/7        6076 ns         4688 ns       280000
bm<uint64_t>/2/3           2.78 ns         2.15 ns    560000000
bm<uint64_t>/7/4           13.5 ns         10.8 ns    100000000
bm<uint64_t>/9/3           13.6 ns         10.3 ns    100000000
bm<uint64_t>/22/5          49.4 ns         31.2 ns     32000000
bm<uint64_t>/58/2          59.7 ns         45.8 ns     35840000
bm<uint64_t>/102/4          132 ns         89.1 ns     12800000
bm<uint64_t>/325/1         42.5 ns         32.2 ns     40727273
bm<uint64_t>/1011/11       3611 ns         2846 ns       560000
bm<uint64_t>/3056/7        6175 ns         2084 ns       757262

After:

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bm<uint32_t>/2/3           2.99 ns         1.90 ns    560000000
bm<uint32_t>/7/4           14.1 ns         6.58 ns    182857143
bm<uint32_t>/9/3           4.34 ns         1.67 ns   1000000000
bm<uint32_t>/22/5          7.78 ns         3.92 ns    358400000
bm<uint32_t>/58/2          4.80 ns         2.49 ns    670690059
bm<uint32_t>/102/4         12.0 ns         5.01 ns    321126400
bm<uint32_t>/325/1         15.7 ns         7.08 ns    218536585
bm<uint32_t>/1011/11       1002 ns          507 ns      2560000
bm<uint32_t>/3056/7         522 ns          455 ns      2986667
bm<uint64_t>/2/3           2.84 ns         1.48 ns   1000000000
bm<uint64_t>/7/4           13.8 ns         6.44 ns    344615385
bm<uint64_t>/9/3           4.93 ns         2.86 ns    995555556
bm<uint64_t>/22/5          26.5 ns         13.7 ns    136533334
bm<uint64_t>/58/2          10.1 ns         4.94 ns    344615385
bm<uint64_t>/102/4         20.8 ns         10.8 ns    100000000
bm<uint64_t>/325/1         38.0 ns         17.4 ns     74666667
bm<uint64_t>/1011/11       1313 ns         1094 ns      1000000
bm<uint64_t>/3056/7        3186 ns         1496 ns       814545

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 13, 2024 20:29
@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 13, 2024
@StephanTLavavej StephanTLavavej self-assigned this Apr 13, 2024
AlexGuteniev (Contributor, Author) commented Apr 14, 2024

It looks like we would get much better results if the specialized approach were generalized, instead of having a separate, simpler generic approach. But the complexity would grow.

On the other hand, despite requiring more code, it would be a uniform approach.

AlexGuteniev and others added 6 commits April 15, 2024 07:34
`__fallback` => `_Fallback`
`__shuffle_step` => `_Shuffle_step`
`__shuffle_impl` => `_Shuffle_impl`
`__pcmpestri_impl` => `_Impl_pcmpestri`
`__4_8_impl` => `_Impl_4_8`
@StephanTLavavej StephanTLavavej removed their assignment Apr 15, 2024
@StephanTLavavej StephanTLavavej self-assigned this Apr 18, 2024
StephanTLavavej (Member) commented

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@AlexGuteniev AlexGuteniev deleted the 48 branch April 18, 2024 19:33
@AlexGuteniev AlexGuteniev restored the 48 branch April 18, 2024 19:34
@AlexGuteniev AlexGuteniev reopened this Apr 18, 2024
@StephanTLavavej StephanTLavavej merged commit 1b06c52 into microsoft:main Apr 19, 2024
35 checks passed
StephanTLavavej (Member) commented

Thanks for finding so many ways to improve performance! 🕵️ 💡 🚀

@AlexGuteniev AlexGuteniev deleted the 48 branch April 19, 2024 03:28