Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<xutility>: optimize _Find_unchecked for data sizes other than 1 #2379

Closed
AlexGuteniev opened this issue Dec 5, 2021 · 4 comments · Fixed by #2434
Closed

<xutility>: optimize _Find_unchecked for data sizes other than 1 #2379

AlexGuteniev opened this issue Dec 5, 2021 · 4 comments · Fixed by #2434
Labels
fixed Something works now, yay! performance Must go faster

Comments

@AlexGuteniev
Copy link
Contributor

_Find_unchecked is the implementation of std::find.

Currently it is optimized for 8 bit size array elements by using memchr.

It is possible to expand this optimization for 16 bit size array elements by using wmemchr.

For greater size operands, like 32 or 64 bit, it is possible to implement the optimization either, but it takes some manual implementation.

Assuming 32 bit element, the comparison using SSE2 can be made with _mm_cmpeq_epi32. Then with _mm_movemask_epi8, mask can be extracted. If it is nonzero, countr_zero may be used to determine the position of the first match. This only requires SSE2, which is x64 baseline.

With AVX2, there are 256-bit variables available and 64 bit data sizes are possible.

The implementations should probably go to vector_algorithm.cpp

@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Dec 5, 2021

Note there's a PR currently in review in the same area: #2380

@AlexGuteniev
Copy link
Contributor Author

Correction, we can't rely on wmemchr as it is not vectorized and thus slow.
This is reported as DevCom-1614562.
Need to wait till that is fixed, or we can implement 16-bit version on our own as well.

@AlexGuteniev
Copy link
Contributor Author

Inspected memchr under the debugger.
Apparently it implements exactly the algorithm I suggest.
But it is just SSE2, and we deserve AVX2 already.
Besides, it suffers from DevCom-1615707, which is relevant for ranges find with unreachable_sentinel, and may be relevant elsewhere.

So, looks like should implement a custom solution even for size 1.

@AlexGuteniev
Copy link
Contributor Author

Relevant test: std\tests\Dev11_0316853_find_memchr_optimization
Also ranges find test: std\tests\P0896R4_ranges_alg_find
Also parallel find test: std\tests\P0024R2_parallel_algorithms_find

AlexGuteniev added a commit to AlexGuteniev/STL that referenced this issue Dec 18, 2021
New vector search to resolve microsoft#2379 and test that it fixes microsoft#2431 find
@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants