vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and (for some types) fallback #4570

Merged
StephanTLavavej merged 4 commits into microsoft:main from find_avoid_fallback
Apr 12, 2024
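
The title describes keeping the AVX2 path from handing its leftover elements to an SSE path and then to a scalar fallback. The sketch below shows one way that can be done for 32-bit elements, using a masked load for the tail. It is an illustration under assumptions, not the code from this pull request: `find_u32_sketch` and `lowest_set_bit` are hypothetical names, and the conversation here does not show which technique the change actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: index of the lowest set bit of a nonzero mask.
inline unsigned lowest_set_bit(unsigned mask) {
    unsigned k = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++k;
    }
    return k;
}

// Hypothetical sketch, not the STL's code: returns the index of the first
// element equal to value, or n if there is none.
std::size_t find_u32_sketch(const std::uint32_t* data, std::size_t n, std::uint32_t value) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#if defined(__AVX2__)
    const __m256i needle = _mm256_set1_epi32(static_cast<int>(value));

    // Main loop: eight 32-bit elements per 256-bit vector.
    for (; n - i >= 8; i += 8) {
        const __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        const __m256i eq    = _mm256_cmpeq_epi32(chunk, needle);
        const unsigned bits = static_cast<unsigned>(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
        if (bits != 0) {
            return i + lowest_set_bit(bits);
        }
    }

    // Tail of 1..7 elements: one masked load, so no SSE step and no scalar loop.
    if (i != n) {
        const __m256i lanes = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        const __m256i live  = _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(n - i)), lanes);
        const __m256i chunk = _mm256_maskload_epi32(reinterpret_cast<const int*>(data + i), live);
        // Masked-off lanes read as zero, so clear them before testing for a match.
        const __m256i eq    = _mm256_and_si256(_mm256_cmpeq_epi32(chunk, needle), live);
        const unsigned bits = static_cast<unsigned>(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
        if (bits != 0) {
            return i + lowest_set_bit(bits);
        }
    }
    return n;
#else
    // Portable path so the sketch still builds when AVX2 is not enabled.
    for (; i < n; ++i) {
        if (data[i] == value) {
            return i;
        }
    }
    return n;
#endif
}
```

The AVX2 branch is only compiled when the compiler defines `__AVX2__` (for example with `-mavx2` on GCC/Clang or `/arch:AVX2` on MSVC). A masked 32-bit load works at 4-byte granularity, which may be why the title limits the "avoid fallback" part to some types: 1-byte and 2-byte elements can leave a remainder smaller than one 32-bit lane.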

Conversation

AlexGuteniev
Contributor

Performance-wise, this is slightly better in the specific cases that I added to the benchmark.

Before:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
bm<uint8_t, Op::FindSized>/8021/3056         39.8 ns         26.1 ns     49777778
bm<uint8_t, Op::FindSized>/63/62             5.46 ns         4.17 ns    389565217
bm<uint8_t, Op::FindSized>/31/30             4.91 ns         3.74 ns    471578947
bm<uint8_t, Op::FindSized>/15/14             4.21 ns         3.10 ns    448000000
bm<uint8_t, Op::FindSized>/7/6               2.35 ns         1.67 ns    814545455
bm<uint8_t, Op::FindUnsized>/8021/3056       79.2 ns         55.1 ns     26352941
bm<uint8_t, Op::FindUnsized>/63/62           2.11 ns         1.69 ns    896000000
bm<uint8_t, Op::FindUnsized>/31/30           1.65 ns         1.08 ns    995555556
bm<uint8_t, Op::FindUnsized>/15/14           1.41 ns        0.828 ns   1000000000
bm<uint8_t, Op::FindUnsized>/7/6             1.41 ns        0.797 ns   1000000000
bm<uint8_t, Op::Count>/8021/3056              101 ns         71.7 ns     26352941
bm<uint8_t, Op::Count>/63/62                 8.54 ns         5.71 ns    320000000
bm<uint8_t, Op::Count>/31/30                 8.99 ns         6.14 ns    224000000
bm<uint8_t, Op::Count>/15/14                 8.24 ns         5.19 ns    280000000
bm<uint8_t, Op::Count>/7/6                   3.29 ns         2.20 ns    597333333
bm<uint16_t, Op::FindSized>/8021/3056        78.6 ns         62.2 ns     21853659
bm<uint16_t, Op::FindSized>/63/62            4.47 ns         2.83 ns    597333333
bm<uint16_t, Op::FindSized>/31/30            3.77 ns         2.19 ns    471578947
bm<uint16_t, Op::FindSized>/15/14            3.11 ns         2.17 ns    640000000
bm<uint16_t, Op::FindSized>/7/6              2.56 ns         1.85 ns    896000000
bm<uint16_t, Op::Count>/8021/3056             187 ns          146 ns      8960000
bm<uint16_t, Op::Count>/63/62                5.43 ns         3.41 ns    426666667
bm<uint16_t, Op::Count>/31/30                4.69 ns         3.43 ns    446505005
bm<uint16_t, Op::Count>/15/14                4.02 ns         2.70 ns    497777778
bm<uint16_t, Op::Count>/7/6                  3.54 ns         2.49 ns    746666667
bm<uint32_t, Op::FindSized>/8021/3056         151 ns          112 ns     10000000
bm<uint32_t, Op::FindSized>/63/62            4.95 ns         3.81 ns    426666667
bm<uint32_t, Op::FindSized>/31/30            3.52 ns         2.47 ns   1000000000
bm<uint32_t, Op::FindSized>/15/14            3.09 ns         2.04 ns    597333333
bm<uint32_t, Op::FindSized>/7/6              2.47 ns         1.99 ns    689230769
bm<uint32_t, Op::Count>/8021/3056             365 ns          253 ns      8960000
bm<uint32_t, Op::Count>/63/62                5.74 ns         4.34 ns    298666667
bm<uint32_t, Op::Count>/31/30                4.77 ns         3.05 ns    358400000
bm<uint32_t, Op::Count>/15/14                3.83 ns         2.54 ns    497777778
bm<uint32_t, Op::Count>/7/6                  2.93 ns         2.01 ns    814545455
bm<uint64_t, Op::FindSized>/8021/3056         290 ns          202 ns      6892308
bm<uint64_t, Op::FindSized>/63/62            7.47 ns         5.13 ns    320000000
bm<uint64_t, Op::FindSized>/31/30            4.53 ns         2.87 ns    560000000
bm<uint64_t, Op::FindSized>/15/14            3.31 ns         2.44 ns   1000000000
bm<uint64_t, Op::FindSized>/7/6              2.83 ns         1.88 ns    896000000
bm<uint64_t, Op::Count>/8021/3056             790 ns          580 ns      2800000
bm<uint64_t, Op::Count>/63/62                8.74 ns         6.14 ns    203636364
bm<uint64_t, Op::Count>/31/30                5.68 ns         3.81 ns    389565217
bm<uint64_t, Op::Count>/15/14                4.79 ns         2.89 ns    448000000
bm<uint64_t, Op::Count>/7/6                  4.04 ns         2.52 ns    527058824

After:

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
bm<uint8_t, Op::FindSized>/8021/3056         41.9 ns         39.4 ns     34461538
bm<uint8_t, Op::FindSized>/63/62             3.06 ns         2.91 ns    527058824
bm<uint8_t, Op::FindSized>/31/30             10.9 ns         10.0 ns    100000000
bm<uint8_t, Op::FindSized>/15/14             4.30 ns         4.04 ns    344615385
bm<uint8_t, Op::FindSized>/7/6               2.37 ns         2.18 ns    746666667
bm<uint8_t, Op::FindUnsized>/8021/3056       74.1 ns         68.8 ns     19063830
bm<uint8_t, Op::FindUnsized>/63/62           2.13 ns         1.99 ns    689230769
bm<uint8_t, Op::FindUnsized>/31/30           1.65 ns         1.62 ns    995555556
bm<uint8_t, Op::FindUnsized>/15/14           1.41 ns         1.40 ns    995555556
bm<uint8_t, Op::FindUnsized>/7/6             1.43 ns         1.30 ns    995555556
bm<uint8_t, Op::Count>/8021/3056             97.6 ns         72.4 ns     14451613
bm<uint8_t, Op::Count>/63/62                 3.77 ns         2.70 ns    497777778
bm<uint8_t, Op::Count>/31/30                 9.84 ns         7.01 ns    182857143
bm<uint8_t, Op::Count>/15/14                 9.27 ns         6.20 ns    199111111
bm<uint8_t, Op::Count>/7/6                   3.06 ns         2.10 ns    527058824
bm<uint16_t, Op::FindSized>/8021/3056        81.4 ns         54.3 ns     25600000
bm<uint16_t, Op::FindSized>/63/62            3.07 ns         2.17 ns    640000000
bm<uint16_t, Op::FindSized>/31/30            2.73 ns         1.82 ns    814545455
bm<uint16_t, Op::FindSized>/15/14            3.11 ns         1.93 ns    527058824
bm<uint16_t, Op::FindSized>/7/6              2.60 ns         1.71 ns    640000000
bm<uint16_t, Op::Count>/8021/3056             200 ns          138 ns     11200000
bm<uint16_t, Op::Count>/63/62                3.99 ns         3.10 ns    448000000
bm<uint16_t, Op::Count>/31/30                3.48 ns         2.54 ns    597333333
bm<uint16_t, Op::Count>/15/14                4.00 ns         2.46 ns    407272727
bm<uint16_t, Op::Count>/7/6                  3.29 ns         2.39 ns    746666667
bm<uint32_t, Op::FindSized>/8021/3056         148 ns          112 ns     11200000
bm<uint32_t, Op::FindSized>/63/62            4.00 ns         2.78 ns    640000000
bm<uint32_t, Op::FindSized>/31/30            2.81 ns         2.24 ns    689230769
bm<uint32_t, Op::FindSized>/15/14            2.68 ns         1.86 ns    597333333
bm<uint32_t, Op::FindSized>/7/6              2.44 ns         1.86 ns    746666667
bm<uint32_t, Op::Count>/8021/3056             365 ns          286 ns      6400000
bm<uint32_t, Op::Count>/63/62                4.56 ns         2.77 ns    389565217
bm<uint32_t, Op::Count>/31/30                3.60 ns         2.55 ns    471578947
bm<uint32_t, Op::Count>/15/14                2.84 ns         2.00 ns    640000000
bm<uint32_t, Op::Count>/7/6                  3.11 ns         2.32 ns    497777778
bm<uint64_t, Op::FindSized>/8021/3056         291 ns          199 ns      8145455
bm<uint64_t, Op::FindSized>/63/62            6.87 ns         5.17 ns    344615385
bm<uint64_t, Op::FindSized>/31/30            3.87 ns         2.88 ns    471578947
bm<uint64_t, Op::FindSized>/15/14            2.92 ns         1.88 ns    896000000
bm<uint64_t, Op::FindSized>/7/6              2.74 ns         1.95 ns    746666667
bm<uint64_t, Op::Count>/8021/3056             759 ns          472 ns      2715152
bm<uint64_t, Op::Count>/63/62                8.98 ns         5.49 ns    190638298
bm<uint64_t, Op::Count>/31/30                4.82 ns         3.56 ns    527058824
bm<uint64_t, Op::Count>/15/14                3.56 ns         2.62 ns    560000000
bm<uint64_t, Op::Count>/7/6                  3.08 ns         2.06 ns    689230769
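
To make the two tables easier to compare, the sketch below computes before/after ratios for a few rows. The `speedup` helper and `Row` struct are hypothetical names introduced here; the numbers are the "Time" column values copied from the tables above. These are single runs, so small differences may be noise.

```cpp
#include <cassert>

// Hypothetical helper: values above 1.0 mean the "After" run was faster.
double speedup(double before_ns, double after_ns) {
    assert(before_ns > 0.0 && after_ns > 0.0);
    return before_ns / after_ns;
}

struct Row {
    const char* name;
    double before_ns; // "Time" column, Before table
    double after_ns;  // "Time" column, After table
};

// A few rows copied from the tables above.
const Row rows[] = {
    {"bm<uint8_t, Op::FindSized>/63/62", 5.46, 3.06},
    {"bm<uint8_t, Op::Count>/63/62", 8.54, 3.77},
    {"bm<uint8_t, Op::FindSized>/31/30", 4.91, 10.9},
    {"bm<uint64_t, Op::Count>/15/14", 4.79, 3.56},
};
```

Calling `speedup(rows[k].before_ns, rows[k].after_ns)` gives a ratio above 1 for the two `/63/62` rows and the `uint64_t` row, and below 1 for `bm<uint8_t, Op::FindSized>/31/30`, which was slower in the After run.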

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 10, 2024 12:27
@AlexGuteniev AlexGuteniev changed the title from "vector_algorithms.cpp: Make AVX2 path avoid SSE path and fallback" to "vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and fallback" Apr 10, 2024
@AlexGuteniev AlexGuteniev changed the title from "vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and fallback" to "vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and (potentially) fallback" Apr 10, 2024
@AlexGuteniev AlexGuteniev changed the title from "vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and (potentially) fallback" to "vector_algorithms.cpp: find, find_last, count: make AVX2 path avoid SSE path and (for some types) fallback" Apr 10, 2024
@StephanTLavavej StephanTLavavej added the performance Must go faster label Apr 10, 2024
@StephanTLavavej StephanTLavavej self-assigned this Apr 10, 2024
Review thread on stl/src/vector_algorithms.cpp (outdated, resolved)
Review thread on benchmarks/src/find_and_count.cpp (resolved)
@StephanTLavavej
Member

Thanks, this looks awesome! I pushed an essentially stylistic change to how we deal with _Traits::_Shift - I think the result is clearer, but please meow if you disagree. 😻

@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 886ef10 into microsoft:main Apr 12, 2024
35 checks passed
@AlexGuteniev AlexGuteniev deleted the find_avoid_fallback branch April 12, 2024 18:24
@StephanTLavavej
Member

Thanks for continuing to refine these vectorized implementations! 📈 ⛰️ 🎉
