
mismatch vectorization #4495

Merged
merged 37 commits into from
Mar 28, 2024
Conversation

Contributor

@AlexGuteniev AlexGuteniev commented Mar 21, 2024

Not only mismatch: this will also be useful for a future lexicographical_compare.

Tail masking

A technique new to vector_algorithms.cpp here is using an AVX2 mask for the tail, rather than yielding to SSE2. This allows:

  • Not entering SSE2 code path, avoiding its startup cost
  • Reducing maximum tail size, making the algorithms more efficient

Unfortunately, SSE2 doesn't have usable mask instructions, and the finest AVX2 mask granularity is 4 bytes, so we still have tails of up to 3 bytes with AVX2 and up to 15 bytes with SSE2.
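To make the granularity point concrete, here is a minimal sketch (illustrative names, not the PR's actual code) of how a sub-32-byte tail splits under 4-byte mask granularity: the AVX2 mask covers whole 32-bit lanes, and at most 3 residual bytes remain below it:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: split a tail of `bytes` (< 32) into the part an AVX2
// 32-bit-lane mask can cover and the residual that still needs scalar code.
struct TailSplit {
    std::size_t masked_bytes;   // coverable by a masked load/store
    std::size_t residual_bytes; // 0..3 bytes, below the 4-byte mask granularity
};

TailSplit split_avx2_tail(std::size_t bytes) {
    TailSplit s;
    s.masked_bytes   = bytes & ~std::size_t{3}; // whole 4-byte lanes
    s.residual_bytes = bytes & std::size_t{3};  // at most 3 bytes left over
    return s;
}

// Build the per-lane mask an AVX2 masked access would use: lane i is enabled
// (all bits set, i.e. -1) when its 4 bytes fall inside the masked region.
void make_lane_mask(std::size_t masked_bytes, std::int32_t mask[8]) {
    for (std::size_t lane = 0; lane < 8; ++lane) {
        mask[lane] = (lane * 4 < masked_bytes) ? -1 : 0;
    }
}
```

A real implementation would feed such a lane mask to an instruction like _mm256_maskload_epi32, whose top bit per lane controls whether that lane is read at all.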

I think other vector algorithms can benefit from this approach too.

If ASan hisses at this, it would be an ASan bug: AVX2 masked loads/stores are well documented and empirically proven to skip the masked-out reads/writes faithfully, so no data races are created, no exceptions occur on inaccessible memory, etc.

Integer indexing

Other vector algorithms use pointer increments instead of an index. This lets instructions with memory operands avoid complex indexing, while keeping everything else in the loop the same. The compiler may perform this optimization itself, but it may not, so the explicit pointer increment is useful.

With two-array algorithms, where both arrays are processed in the same direction, the situation is ambiguous. Integer indexing requires one increment per iteration, whereas pointer indexing requires two, so integer indexing saves an instruction. The compiler may optimize by making one pointer relative to the other and using that instead of an index, so that some accesses are simply indexed. Surprisingly, only MSVC performs this optimization; Clang and GCC do not. It cannot be done manually, as it is UB in C++ and very fragile, so only the compiler can do it.

As the index-vs-pointer situation is ambiguous here, and the difference is expected to be small, I just used the simpler approach: an index.
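As an illustration of the two loop shapes being weighed (scalar stand-ins, not the vectorized code), the index form uses one induction variable while the pointer form gets simple memory operands at the cost of two increments:

```cpp
#include <cstddef>

// Index form: one induction variable, but both memory operands use
// base-plus-index addressing (e.g. [rcx + rax] and [rdx + rax]).
std::size_t mismatch_index(const unsigned char* a, const unsigned char* b, std::size_t n) {
    std::size_t i = 0;
    for (; i != n; ++i) {
        if (a[i] != b[i]) {
            break;
        }
    }
    return i;
}

// Pointer form: simple [reg] memory operands, but two increments per iteration.
std::size_t mismatch_pointer(const unsigned char* a, const unsigned char* b, std::size_t n) {
    const unsigned char* const a0  = a;
    const unsigned char* const end = a + n;
    for (; a != end; ++a, ++b) {
        if (*a != *b) {
            break;
        }
    }
    return static_cast<std::size_t>(a - a0);
}
```

Both return the position of the first mismatch (or n); the difference is only in the instruction mix the compiler is nudged toward.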

Benchmark

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t>/8/3              4.05 ns         4.08 ns    172307692
bm<uint8_t>/24/22            11.8 ns         12.0 ns     64000000
bm<uint8_t>/105/-1           43.7 ns         43.9 ns     16000000
bm<uint8_t>/4021/3056         989 ns          984 ns       746667
bm<uint16_t>/8/3             4.77 ns         4.76 ns    144516129
bm<uint16_t>/24/22           8.91 ns         9.07 ns     89600000
bm<uint16_t>/105/-1          41.6 ns         41.7 ns     17230769
bm<uint16_t>/4021/3056        983 ns          977 ns       640000
bm<uint32_t>/8/3             4.59 ns         4.55 ns    154482759
bm<uint32_t>/24/22           18.7 ns         18.8 ns     37333333
bm<uint32_t>/105/-1          88.9 ns         87.9 ns      7466667
bm<uint32_t>/4021/3056       2353 ns         2354 ns       298667
bm<uint64_t>/8/3             3.94 ns         4.01 ns    179200000
bm<uint64_t>/24/22           12.0 ns         12.2 ns     64000000
bm<uint64_t>/105/-1          44.5 ns         44.5 ns     15448276
bm<uint64_t>/4021/3056       1020 ns         1025 ns       746667

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t>/8/3              5.57 ns         5.58 ns    112000000
bm<uint8_t>/24/22            5.56 ns         5.58 ns    112000000
bm<uint8_t>/105/-1           9.80 ns         9.77 ns     64000000
bm<uint8_t>/4021/3056         109 ns          110 ns      6400000
bm<uint16_t>/8/3             4.80 ns         4.76 ns    144516129
bm<uint16_t>/24/22           5.84 ns         5.72 ns    112000000
bm<uint16_t>/105/-1          9.10 ns         9.21 ns     74666667
bm<uint16_t>/4021/3056        126 ns          126 ns      5600000
bm<uint32_t>/8/3             4.55 ns         4.55 ns    154482759
bm<uint32_t>/24/22           5.71 ns         5.78 ns    100000000
bm<uint32_t>/105/-1          17.3 ns         17.3 ns     40727273
bm<uint32_t>/4021/3056        247 ns          246 ns      2800000
bm<uint64_t>/8/3             4.56 ns         4.55 ns    154482759
bm<uint64_t>/24/22           7.35 ns         7.32 ns     89600000
bm<uint64_t>/105/-1          20.1 ns         19.9 ns     34461538
bm<uint64_t>/4021/3056        507 ns          502 ns      1120000

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 21, 2024 09:07
@AlexGuteniev
Contributor Author

Thanks @frederick-vs-ja for #4138, especially for the coverage

@StephanTLavavej StephanTLavavej self-assigned this Mar 21, 2024
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 21, 2024
@StephanTLavavej
Member

/azp run STL-ASan-CI

Azure Pipelines successfully started running 1 pipeline(s).

Comment on lines 2142 to 2149
} else if (_Use_sse2()) {
    const size_t _Count_bytes_sse = (_Count * sizeof(_Ty)) & ~size_t{0xF};

    for (; _Result != _Count_bytes_sse; _Result += 0x10) {
        const __m128i _Elem1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First1_ch + _Result));
        const __m128i _Elem2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First2_ch + _Result));
        const auto _Bingo =
            static_cast<unsigned int>(_mm_movemask_epi8(_Traits::_Cmp_sse(_Elem1, _Elem2))) ^ 0xFFFF;
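As a hedged sketch of how a _Bingo-style mask like the one above is consumed (illustrative names, not the PR's exact code): _mm_movemask_epi8 packs one bit per byte lane, XORing with 0xFFFF sets bits where bytes differ, and the lowest set bit gives the byte offset of the first mismatch:

```cpp
#include <cstddef>

// Given the 16 per-byte comparison bits from _mm_movemask_epi8 (bit i set
// when byte i of the two 16-byte blocks compared equal), return the offset
// of the first mismatching byte, or 16 if the whole block matched.
std::size_t first_mismatch_offset(unsigned int equal_mask) {
    const unsigned int bingo = (equal_mask ^ 0xFFFFu) & 0xFFFFu; // bits set where bytes differ
    if (bingo == 0) {
        return 16; // no mismatch in this 16-byte block
    }
    std::size_t offset = 0; // real code would use a bit-scan intrinsic instead of a loop
    while (((bingo >> offset) & 1u) == 0u) {
        ++offset;
    }
    return offset;
}
```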
Member
🐞 Bug: This has _Use_sse2() as the guard for _Traits::_Cmp_sse(), which potentially needs more:

static __m128i _Cmp_sse(const __m128i _Lhs, const __m128i _Rhs) noexcept {
    return _mm_cmpeq_epi64(_Lhs, _Rhs); // SSE4.1
}
static bool _Sse_available() noexcept {
    return _Use_sse42(); // for pcmpeqq on _Cmp_sse
}

I'll fix this occurrence, but this has been repeatedly hazardous, so I'll also file a followup issue.
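A minimal sketch of the guard pattern the review is pointing at, with hypothetical names and a made-up IsaLevel enum standing in for the real CPU feature detection: the guard for a vector code path must require the newest instruction set that path uses, not merely SSE2.

```cpp
// Hypothetical CPU feature levels for illustration (not the STL's actual
// detection machinery). Scoped enums compare by their underlying value.
enum class IsaLevel { sse2, sse42, avx2 };

struct Traits64 {
    // _Cmp_sse would use pcmpeqq (SSE4.1), so the availability check must
    // require the SSE4.2 level; guarding with plain SSE2 would be the bug.
    static bool sse_available(IsaLevel cpu) {
        return cpu >= IsaLevel::sse42;
    }
};

struct Traits8 {
    // pcmpeqb is plain SSE2, so SSE2 alone suffices for the byte case.
    static bool sse_available(IsaLevel cpu) {
        return cpu >= IsaLevel::sse2;
    }
};
```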

@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit ffd735a into microsoft:main Mar 28, 2024
35 checks passed
@StephanTLavavej
Copy link
Member

Thanks for improving the performance of this useful algorithm! 🚀 🏃 🐆
