-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mismatch
vectorization
#4495
mismatch
vectorization
#4495
Conversation
Thanks @frederick-vs-ja for #4138, especially for the coverage |
GH 4146 originally started this convention.
This comment was marked as resolved.
This comment was marked as resolved.
/azp run STL-ASan-CI |
Azure Pipelines successfully started running 1 pipeline(s). |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
also lets test test instead of explaining it
This comment was marked as resolved.
This comment was marked as resolved.
stl/src/vector_algorithms.cpp
Outdated
} else if (_Use_sse2()) { | ||
const size_t _Count_bytes_sse = (_Count * sizeof(_Ty)) & ~size_t{0xF}; | ||
|
||
for (; _Result != _Count_bytes_sse; _Result += 0x10) { | ||
const __m128i _Elem1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First1_ch + _Result)); | ||
const __m128i _Elem2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First2_ch + _Result)); | ||
const auto _Bingo = | ||
static_cast<unsigned int>(_mm_movemask_epi8(_Traits::_Cmp_sse(_Elem1, _Elem2))) ^ 0xFFFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐞 Bug: This has _Use_sse2()
as the guard for _Traits::_Cmp_sse()
, which potentially needs more:
STL/stl/src/vector_algorithms.cpp
Lines 1830 to 1836 in 8e2d724
static __m128i _Cmp_sse(const __m128i _Lhs, const __m128i _Rhs) noexcept { | |
return _mm_cmpeq_epi64(_Lhs, _Rhs); // SSE4.1 | |
} | |
static bool _Sse_available() noexcept { | |
return _Use_sse42(); // for pcmpeqq on _Cmp_sse | |
} |
I'll fix this occurrence, but this has been repeatedly hazardous, so I'll also file a followup issue.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Thanks for improving the performance of this useful algorithm! 🚀 🏃 🐆 |
Not only
mismatch
, also useful in futurelexicographical_compare
Tail masking
A novel for
vector_algorithms.cpp
thing here is also using AVX2 mask for tail and not yielding to SSE2. This allowsUnfortunately, SSE2 doesn't have usable mask instructions, and AVX2 has 4-byte granularity as its finest mask, so still have at most 3 byte tails with AVX2 and up to 15 byte tails with SSE2.
I think other vector algorithms can benefit from this approach too.
If ASan hisses at this, this would be ASan bug, as AVX2 masked loads/stores are well documented and empirically proved to skip reading/writing fairly: no data races created, no exceptions on inaccessible memory, etc.
Integer indexing
Other vector algorithms use pointer increments instead of index. This allows instructions with memory operand avoid complex indexing, whereas keeping everything else in loops the same. The compiler may perform this optimization itself, but may not, so pointer increment is useful.
With two arrays algorithms, where both arrays are processed in the same direction, the situation is ambiguous. Integer indexing requires one increment, whereas pointer indexing takes two increments, so integer indexing saves an instruction. The compiler may perform an optimization to make one pointer relative to the other, and use that pointer instead of using index, to make some of accesses simple-indexed. Surprising only MSVC does this optimization, clang and gcc do not. This optimization cannot be done manually, as it is UB in C++, and is very squirrelly, so only the compiler can do this.
As the situation on index vs pointer is ambiguous here, and the difference is expected to be small, I just used the simpler approach, that is using index.
Benchmark
Before:
After: