AVX2 vectorization for very large bitsets #4422
char _Tmp[32];
_mm256_storeu_si256(reinterpret_cast<__m256i*>(_Tmp), _Elems);
const char* const _Tmpd = _Tmp + (32 - _Size_bits);
memcpy(_Dest, _Tmpd, _Size_bits);
We could take advantage of at least the remainder of the 32-bit unit, which is available because bitset stores its bits in an array of units.
Here we could use the AVX2 masked store `_mm256_maskstore_epi32`, and a 4-byte `memcpy` above.
I'm not sure the gain is worth doing this, as the improvement only applies to the tail part.
Correction: this applies to the above `memcpy` only. To write more here, we would have to over-reserve the string, though that also seems feasible.
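For readers following this thread, here is a minimal portable sketch of the tail handling in the quoted hunk, without intrinsics. It assumes a full 32-character block has already been produced (by `_mm256_storeu_si256` in the PR) and that only the last `count` characters belong in the destination. The name `copy_tail` and its parameters are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the PR): given a fully populated
// 32-character block, copy only its last `count` characters to `dest`.
// This mirrors `_Tmp + (32 - _Size_bits)` followed by `memcpy` in the hunk.
inline void copy_tail(char* dest, const char (&block)[32], std::size_t count) {
    assert(count <= 32); // the tail can never exceed one block
    const char* const src = block + (32 - count);
    std::memcpy(dest, src, count);
}
```

The temporary buffer keeps the write within `count` characters of the destination. Writing a whole block directly would require the over-reserved string mentioned above.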
Created DevCom-10601346 based on suboptimal AVX2 codegen.
Note `bits < 64 && str.size() != N` preparing to handle values of N that aren't evenly divisible by 64. Note `b.template to_string<wchar_t>()` disambiguation.
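As a small illustration of the `template` disambiguation noted above: when the bitset's type depends on a template parameter, the call to the member template `to_string` needs the `template` keyword so that `<` is parsed as the start of a template argument list. The wrapper name `to_wide` is hypothetical and is only a sketch, not the test code from this PR.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Hypothetical wrapper (not from the PR). `b` has a type that depends on N,
// so `to_string<wchar_t>` must be prefixed with `template`; without it the
// compiler would treat `<` as a less-than operator.
template <std::size_t N>
std::wstring to_wide(const std::bitset<N>& b) {
    return b.template to_string<wchar_t>();
}
```

Outside a template, where the bitset type is not dependent, plain `b.to_string<wchar_t>()` is enough.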
I'm speculatively mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Thanks for keeping those vector units busy! 📈 🎉 😹
No, the DevCom-10601346 fix didn't make a noticeable improvement overall.
Not sure if it is worth merging, given the complexity growth and that the improvement is noticeable only for larger bitsets.
It turned out that the existing vectorization is fine for bitsets beyond 256 bits (32 bytes), and the AVX2 upgrade is harmful.
For very large bitsets the improvement is noticeable.
Results