Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX2 vectorization for very large bitsets #4422

Merged
merged 6 commits into from
Feb 29, 2024

Conversation

AlexGuteniev
Copy link
Contributor

Not sure if it worth merging due to complexity growth, and noticeable improvement only for larger bitsets

Turned out that the existing vectorization is fine for bitsets beyond 256 bits (32 bytes), and AVX2 upgrade is harmful.
For very large bitsets the improvement is noticeable.

Results

Before:
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    211 ns          206 ns      2800000
BM_bitset_to_string<64, char>                   2662 ns         2668 ns       263529
BM_bitset_to_string<512, char>                  3730 ns         3767 ns       186667
BM_bitset_to_string_large_single<char>           230 ns          230 ns      2986667
BM_bitset_to_string<7, wchar_t>                  131 ns          131 ns      5600000
BM_bitset_to_string<64, wchar_t>                2488 ns         2511 ns       298667
BM_bitset_to_string<512, wchar_t>               4360 ns         4349 ns       154483
BM_bitset_to_string_large_single<wchar_t>        306 ns          300 ns      2240000

After:
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_bitset_to_string<15, char>                    224 ns          223 ns      2800000
BM_bitset_to_string<64, char>                   2696 ns         2668 ns       263529
BM_bitset_to_string<512, char>                  3313 ns         3278 ns       224000
BM_bitset_to_string_large_single<char>           157 ns          157 ns      4072727
BM_bitset_to_string<7, wchar_t>                  128 ns          126 ns      4977778
BM_bitset_to_string<64, wchar_t>                2421 ns         2400 ns       280000
BM_bitset_to_string<512, wchar_t>               3245 ns         3223 ns       203636
BM_bitset_to_string_large_single<wchar_t>        209 ns          209 ns      3446154

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner February 24, 2024 11:53
@StephanTLavavej StephanTLavavej added the performance Must go faster label Feb 24, 2024
@StephanTLavavej StephanTLavavej self-assigned this Feb 24, 2024
@StephanTLavavej StephanTLavavej removed their assignment Feb 25, 2024
@StephanTLavavej StephanTLavavej self-assigned this Feb 25, 2024
char _Tmp[32];
_mm256_storeu_si256(reinterpret_cast<__m256i*>(_Tmp), _Elems);
const char* const _Tmpd = _Tmp + (32 - _Size_bits);
memcpy(_Dest, _Tmpd, _Size_bits);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could take advantage of at least remaining of 32-bit unit available due to using array of units in bitset.
Here we can use AVX2 masked store _mm256_maskstore_epi32, and and 4-byte memcpy above.
I'm not sure if the gain worth doing this, as the improvement only applies to the tail part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: this applies to the above memcpy only, To write more here we should over-reserve string, which though seems also feasible.

@AlexGuteniev
Copy link
Contributor Author

Created DevCom-10601346 based on suboptimal AVX2 codegen.
This change might give better results if the issue is fixed.

Note `bits < 64 && str.size() != N` preparing to handle
values of N that aren't evenly divisible by 64.

Note `b.template to_string<wchar_t>()` disambiguation.
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
stl/src/vector_algorithms.cpp Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 8b081e2 into microsoft:main Feb 29, 2024
35 checks passed
@StephanTLavavej
Copy link
Member

Thanks for keeping those vector units busy! 📈 🎉 😹

@AlexGuteniev AlexGuteniev deleted the bitset branch February 29, 2024 22:04
@AlexGuteniev
Copy link
Contributor Author

Created DevCom-10601346 based on suboptimal AVX2 codegen.
This change might give better results if the issue is fixed.

No, DevCom-10601346 fix didn't make the noticeable improvement overall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants