Vectorize `std::search` of 1- and 2-byte elements with `pcmpestri` #4745
Who's a good search? You are! Yes you!
…pred`. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`. `_Equal_rev_pred` is called by ranges `search`/`find_end`. This doesn't affect `equal` etc.
This reverts commit 72a0d29.
might restore one or both later
The previous attempt was #4654 and it ended up being just |
Resolved conflicts in xutility.
Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:
Aside from I am mildly confused as to why performance for |
I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32- or 64-byte boundary is important. You may try /QIntel-jcc-erratum (yes, even though you run on AMD!) for both. I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment would change again.
Thanks, makes sense!
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
🔍 🕵️ 🔎
This uses a different approach for both the search and the inner comparison (SSE4.2 instead of AVX2). This time the results are better.
For now, only 1- and 2-byte elements are handled. The same approach, slightly modified, could be used for 4- and 8-byte elements, but it still needs testing to see whether there would be a performance gain.
In the benchmark results, 0 is the small needle and 1 is the large needle.
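For readers unfamiliar with `pcmpestri`, here is a minimal sketch of how the instruction can drive a substring search over 1-byte elements. It is not the code from this PR: `sketch_search` and `SKETCH_USE_PCMPESTRI` are hypothetical names, the needle is assumed to fit in one 16-byte register, and each chunk is copied into a local buffer to avoid reading past the end of the input, which a tuned implementation would handle differently. The SSE4.2 path is only compiled when the compiler advertises it (for example `-msse4.2` on GCC/Clang); otherwise the sketch falls back to `std::search`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

#if defined(__SSE4_2__) || defined(__AVX__)
#include <nmmintrin.h> // SSE4.2 intrinsics, including _mm_cmpestri
#define SKETCH_USE_PCMPESTRI 1
#else
#define SKETCH_USE_PCMPESTRI 0
#endif

// Hypothetical helper for illustration only. Returns the offset of the first
// occurrence of `needle` in `hay`, or std::string_view::npos if there is none.
inline std::size_t sketch_search(std::string_view hay, std::string_view needle) {
    constexpr std::size_t npos = std::string_view::npos;
    if (needle.empty()) {
        return 0;
    }
    if (needle.size() > hay.size()) {
        return npos;
    }
#if SKETCH_USE_PCMPESTRI
    if (needle.size() <= 16) { // assumption: needle fits in one XMM register
        unsigned char nbuf[16] = {};
        std::memcpy(nbuf, needle.data(), needle.size());
        const __m128i n = _mm_loadu_si128(reinterpret_cast<const __m128i*>(nbuf));
        const int nlen = static_cast<int>(needle.size());

        std::size_t pos = 0;
        while (pos + needle.size() <= hay.size()) {
            // Copy up to 16 haystack bytes so the load never overreads.
            const std::size_t avail = std::min<std::size_t>(16, hay.size() - pos);
            unsigned char hbuf[16] = {};
            std::memcpy(hbuf, hay.data() + pos, avail);
            const __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hbuf));

            // "Equal ordered" mode: index of the first position in `h` where
            // the needle (or the part of it that fits in the register) matches;
            // 16 means no candidate in this chunk.
            const int idx = _mm_cmpestri(n, nlen, h, static_cast<int>(avail),
                _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT);

            if (idx == 16) {
                pos += 16; // nothing here, move to the next chunk
                continue;
            }
            if (static_cast<std::size_t>(idx) + needle.size() <= avail) {
                return pos + static_cast<std::size_t>(idx); // full match
            }
            // Only a prefix matched at the end of the register: restart the
            // comparison with the candidate at the start of the next load.
            pos += static_cast<std::size_t>(idx);
        }
        return npos;
    }
#endif
    // Portable fallback (also used for needles longer than 16 bytes).
    const auto it = std::search(hay.begin(), hay.end(), needle.begin(), needle.end());
    return it == hay.end() ? npos : static_cast<std::size_t>(it - hay.begin());
}
```

Calling `sketch_search("0123456789abcdeXYZ", "eXY")` exercises the case where a candidate straddles the 16-byte boundary: the first load reports only a partial match near the end of the register, and the second load, restarted at that candidate, confirms it.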