`mismatch` vectorization #4495

AlexGuteniev · 2024-03-21T09:07:45Z

This is not only for `mismatch`; it will also be useful for a future `lexicographical_compare` vectorization.

Tail masking

A novelty for vector_algorithms.cpp here is using an AVX2 mask for the tail instead of falling back to SSE2. This allows:

- Not entering the SSE2 code path, avoiding its startup cost
- Reducing the maximum tail size, making the algorithm more efficient

Unfortunately, SSE2 doesn't have usable mask instructions, and the finest AVX2 mask has 4-byte granularity, so there are still tails of up to 3 bytes with AVX2 and up to 15 bytes with SSE2.

I think other vector algorithms can benefit from this approach too.

If ASan complains about this, it would be an ASan bug: AVX2 masked loads/stores are well documented and have been shown empirically to genuinely skip the masked-off reads/writes, so no data races are created, no exceptions are raised on inaccessible memory, etc.
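As a rough illustration of the tail masking described above, here is a minimal sketch. The helper names (`tail_mask_lanes`, `scalar_tail_avx2_masked`, `scalar_tail_sse2`, `load_tail_avx2`) are hypothetical and are not the identifiers used in vector_algorithms.cpp. The mask is built from a sliding window over a small table of all-ones and zero dwords, and the intrinsic part is compiled only when AVX2 is enabled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Hypothetical helper: the 8 dword lanes of an AVX2 mask that selects the first
// `dwords` 4-byte lanes. A masked load reads a lane only if its high bit is set.
inline std::array<std::uint32_t, 8> tail_mask_lanes(const std::size_t dwords) noexcept {
    assert(dwords <= 8);
    // Sliding window: 8 all-ones entries followed by 8 zero entries.
    static constexpr std::uint32_t window[16] = {
        ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, ~0u, 0, 0, 0, 0, 0, 0, 0, 0};
    std::array<std::uint32_t, 8> lanes{};
    for (std::size_t i = 0; i < 8; ++i) {
        lanes[i] = window[(8 - dwords) + i];
    }
    return lanes;
}

// Bytes left for a scalar loop once the vector paths are done.
// A 4-byte-granular AVX2 mask leaves at most 3 bytes; 16-byte SSE2 steps leave at most 15.
inline std::size_t scalar_tail_avx2_masked(const std::size_t bytes) noexcept { return bytes % 4; }
inline std::size_t scalar_tail_sse2(const std::size_t bytes) noexcept { return bytes % 16; }

#ifdef __AVX2__
// Loads the first `dwords` 4-byte lanes starting at `p`; masked-off lanes come back as zero.
inline __m256i load_tail_avx2(const void* const p, const std::size_t dwords) noexcept {
    const auto lanes = tail_mask_lanes(dwords);
    const __m256i mask = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(lanes.data()));
    return _mm256_maskload_epi32(static_cast<const int*>(p), mask);
}
#endif
```

For a 31-byte remainder, for example, the masked AVX2 step would cover 7 dwords (28 bytes) and leave 3 bytes for the scalar loop, whereas a 16-byte SSE2 step would leave 15.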

Integer indexing

Other vector algorithms use pointer increments instead of an index. This lets instructions with a memory operand avoid complex addressing, while keeping everything else in the loop the same. The compiler may perform this optimization itself, but it is not guaranteed to, so the explicit pointer increment is useful.

With two-array algorithms, where both arrays are processed in the same direction, the situation is ambiguous. Integer indexing requires one increment, whereas pointer indexing takes two, so integer indexing saves an instruction. The compiler may optimize by making one pointer relative to the other and using that pointer in place of the index, so that some of the accesses use simple addressing. Surprisingly, only MSVC performs this optimization; clang and gcc do not. It cannot be done manually, as it is UB in C++ and very squirrelly, so only the compiler can do it.

As the index vs. pointer situation is ambiguous here, and the difference is expected to be small, I used the simpler approach: an index.
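To make the trade-off concrete, here is a scalar sketch of the two loop shapes. The function names are hypothetical, and this is not the vectorized code from the PR: the index form advances one induction variable for both arrays, while the pointer form advances one cursor per array.

```cpp
#include <cassert>
#include <cstddef>

// Index form: a single induction variable `i` serves both arrays.
inline std::size_t mismatch_by_index(
    const unsigned char* const a, const unsigned char* const b, const std::size_t count) noexcept {
    assert(count == 0 || (a != nullptr && b != nullptr));
    std::size_t i = 0;
    for (; i != count; ++i) {
        if (a[i] != b[i]) {
            break;
        }
    }
    return i; // position of the first mismatch, or `count` if none
}

// Pointer form: each array has its own cursor, so the loop needs two increments.
inline std::size_t mismatch_by_pointers(
    const unsigned char* a, const unsigned char* b, const std::size_t count) noexcept {
    assert(count == 0 || (a != nullptr && b != nullptr));
    const unsigned char* const first = a;
    const unsigned char* const last  = a + count;
    for (; a != last; ++a, ++b) {
        if (*a != *b) {
            break;
        }
    }
    return static_cast<std::size_t>(a - first);
}
```

Both forms return the same position; which one compiles to the tighter loop depends on the compiler, as discussed above.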

Benchmark

Before:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t>/8/3              4.05 ns         4.08 ns    172307692
bm<uint8_t>/24/22            11.8 ns         12.0 ns     64000000
bm<uint8_t>/105/-1           43.7 ns         43.9 ns     16000000
bm<uint8_t>/4021/3056         989 ns          984 ns       746667
bm<uint16_t>/8/3             4.77 ns         4.76 ns    144516129
bm<uint16_t>/24/22           8.91 ns         9.07 ns     89600000
bm<uint16_t>/105/-1          41.6 ns         41.7 ns     17230769
bm<uint16_t>/4021/3056        983 ns          977 ns       640000
bm<uint32_t>/8/3             4.59 ns         4.55 ns    154482759
bm<uint32_t>/24/22           18.7 ns         18.8 ns     37333333
bm<uint32_t>/105/-1          88.9 ns         87.9 ns      7466667
bm<uint32_t>/4021/3056       2353 ns         2354 ns       298667
bm<uint64_t>/8/3             3.94 ns         4.01 ns    179200000
bm<uint64_t>/24/22           12.0 ns         12.2 ns     64000000
bm<uint64_t>/105/-1          44.5 ns         44.5 ns     15448276
bm<uint64_t>/4021/3056       1020 ns         1025 ns       746667

After:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
bm<uint8_t>/8/3              5.57 ns         5.58 ns    112000000
bm<uint8_t>/24/22            5.56 ns         5.58 ns    112000000
bm<uint8_t>/105/-1           9.80 ns         9.77 ns     64000000
bm<uint8_t>/4021/3056         109 ns          110 ns      6400000
bm<uint16_t>/8/3             4.80 ns         4.76 ns    144516129
bm<uint16_t>/24/22           5.84 ns         5.72 ns    112000000
bm<uint16_t>/105/-1          9.10 ns         9.21 ns     74666667
bm<uint16_t>/4021/3056        126 ns          126 ns      5600000
bm<uint32_t>/8/3             4.55 ns         4.55 ns    154482759
bm<uint32_t>/24/22           5.71 ns         5.78 ns    100000000
bm<uint32_t>/105/-1          17.3 ns         17.3 ns     40727273
bm<uint32_t>/4021/3056        247 ns          246 ns      2800000
bm<uint64_t>/8/3             4.56 ns         4.55 ns    154482759
bm<uint64_t>/24/22           7.35 ns         7.32 ns     89600000
bm<uint64_t>/105/-1          20.1 ns         19.9 ns     34461538
bm<uint64_t>/4021/3056        507 ns          502 ns      1120000

AlexGuteniev · 2024-03-21T10:19:36Z

Thanks @frederick-vs-ja for #4138, especially for the coverage

StephanTLavavej · 2024-03-21T23:55:52Z

/azp run STL-ASan-CI

azure-pipelines · 2024-03-21T23:56:02Z

Azure Pipelines successfully started running 1 pipeline(s).

StephanTLavavej · 2024-03-27T18:55:49Z

stl/src/vector_algorithms.cpp

+        } else if (_Use_sse2()) {
+            const size_t _Count_bytes_sse = (_Count * sizeof(_Ty)) & ~size_t{0xF};
+
+            for (; _Result != _Count_bytes_sse; _Result += 0x10) {
+                const __m128i _Elem1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First1_ch + _Result));
+                const __m128i _Elem2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First2_ch + _Result));
+                const auto _Bingo =
+                    static_cast<unsigned int>(_mm_movemask_epi8(_Traits::_Cmp_sse(_Elem1, _Elem2))) ^ 0xFFFF;


🐞 Bug: This has _Use_sse2() as the guard for _Traits::_Cmp_sse(), which potentially needs more:

STL/stl/src/vector_algorithms.cpp, lines 1830 to 1836 in 8e2d724:

    static __m128i _Cmp_sse(const __m128i _Lhs, const __m128i _Rhs) noexcept {
        return _mm_cmpeq_epi64(_Lhs, _Rhs); // SSE4.1
    }

    static bool _Sse_available() noexcept {
        return _Use_sse42(); // for pcmpeqq on _Cmp_sse
    }

I'll fix this occurrence, but this has been repeatedly hazardous, so I'll also file a followup issue.
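The hazard can be sketched in isolation: if the instruction-set requirement lives on the traits class, a dispatcher that asks the traits cannot pick a comparison the CPU lacks. Everything below (`CpuFeatures`, `Traits8`, `Traits64`, `Path`, `choose_path`) is a hypothetical model for illustration, not the STL's actual dispatch code.

```cpp
#include <cassert>

// Hypothetical model of the detected CPU features.
struct CpuFeatures {
    bool sse2;
    bool sse42;
    bool avx2;
};

// 8-bit elements: the byte compare (pcmpeqb) only needs SSE2.
struct Traits8 {
    static bool sse_available(const CpuFeatures& cpu) noexcept { return cpu.sse2; }
};

// 64-bit elements: pcmpeqq is an SSE4.1 instruction, gated here on the SSE4.2 check
// in the same way as the quoted _Sse_available().
struct Traits64 {
    static bool sse_available(const CpuFeatures& cpu) noexcept { return cpu.sse42; }
};

enum class Path { avx2, sse, scalar };

// The guard asks the traits rather than hard-coding an SSE2 check, so a 64-bit
// compare is never selected on a CPU that only has SSE2.
template <class Traits>
Path choose_path(const CpuFeatures& cpu) noexcept {
    assert(!cpu.sse42 || cpu.sse2); // model invariant: SSE4.2 implies SSE2
    if (cpu.avx2) {
        return Path::avx2;
    }
    if (Traits::sse_available(cpu)) {
        return Path::sse;
    }
    return Path::scalar;
}
```

With an SSE2-only CPU in this model, `choose_path<Traits8>` selects the SSE path while `choose_path<Traits64>` falls back to the scalar path.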

StephanTLavavej · 2024-03-27T23:31:54Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-03-28T16:36:05Z

Thanks for improving the performance of this useful algorithm! 🚀 🏃 🐆

mismatch vectorization

9c4e4c2

AlexGuteniev requested a review from a team as a code owner March 21, 2024 09:07

AlexGuteniev added 4 commits March 21, 2024 11:17

format

4c008fd

more format

b69c04c

wrong step

38633a0

ADL

ac2786d

AlexGuteniev added 4 commits March 21, 2024 12:46

types

2d0cff9

constant

f45dbbd

let's have sized functions as usual

b0d6ece

format

4fd7b43

AlexGuteniev mentioned this pull request Mar 21, 2024

benchmark: it is built with /Ob1, so vector algorithm dispatcher is noticeable #4496

Open

AlexGuteniev added 4 commits March 21, 2024 16:17

inline, see microsoft#4496

fef5885

transition

2bd29b0

format

5b1dc1b

better format

abc7e96

StephanTLavavej self-assigned this Mar 21, 2024

StephanTLavavej added the performance Must go faster label Mar 21, 2024

Alcaro reviewed Mar 21, 2024

StephanTLavavej added 2 commits March 21, 2024 16:05

Merge branch 'main' into mismatch

8a897d7

_Mismatch => __std_mismatch_impl for consistency.

9c92384

GH 4146 originally started this convention.

ASan provocation!

57d8aee

AlexGuteniev added 3 commits March 24, 2024 07:48

really different

1e6b258

expand comment on overrun

a0b05c8

off by one

5253b71

also lets test test instead of explaining it

AlexGuteniev and others added 16 commits March 24, 2024 13:30

Whole range

b9e8065

Fix preprocessor comments.

3b01b63

Style: inline before return type.

4771515

Drop unnecessary _STD move() calls.

69ccaae

Store size_t _Pos, cast to each difference type.

929fcfc

Bugfix: Guard with _Traits::_Sse_available().

bdd374e

Style: alignas before static constexpr.

1284aa5

Add an empty line, then clang-format suppression can be removed.

b0fbdca

Save 200 bytes by using a sliding window.

afa152d

Mark _Avx2_tail_mask_32 as noexcept.

d0a7ce1

Add const.

0d0e423

Drop unnecessary std::.

90e5f37

Include more headers.

82c651b

PadsSizes => PadSizes

d69c3d3

Comment cleanups.

eab5917

Move the definition of test_mismatch_containers(), no other changes.

f52e2d9

StephanTLavavej reviewed Mar 27, 2024

StephanTLavavej approved these changes Mar 27, 2024

StephanTLavavej mentioned this pull request Mar 27, 2024

vector_algorithms.cpp: Remove the distinction between SSE2 and SSE4.2 #4536

Closed

StephanTLavavej assigned StephanTLavavej and unassigned StephanTLavavej Mar 27, 2024

StephanTLavavej merged commit ffd735a into microsoft:main Mar 28, 2024
35 checks passed

AlexGuteniev deleted the mismatch branch March 28, 2024 16:45

StephanTLavavej mentioned this pull request Mar 28, 2024

vector_algorithms.cpp: Fix ARM64EC build break #4538

Merged
