Refactor SIMD code and add support for SSE2 and NEON #32
Hi from one Martin to another!
I came to work on Opal since it's used in STAR, and I'm trying to improve the portability of some bioinformatics tools to new platforms (namely AArch64, which is becoming more prevalent with the new Mac M1 chip). Even though it looks like you're not in the field anymore, I hope you can find some time to review and merge this PR!
The previous SIMD handling was not very compatible with NEON because some SSE/AVX intrinsics have no strict equivalents in NEON, so I removed the macro-based aliasing and moved all SIMD implementations into the `Simd<T>` templates. Doing so also let me add an emulation layer for SSE2 for when SSE4.1 is not available. In theory, it should even be possible to add a "fake" SIMD implementation with just one lane as a fallback for other machines, but NEON support should already cover the vast majority of modern Arm machines.

Since the code was common between `Simd<T>` and `SimdSw<T>` (except for the signedness of min/max operations on char vectors), I merged the 6 different template implementations into just 4, and added the remaining SIMD operations, such as load, store, and bitwise AND, to the templates as well.

Test
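For illustration only (this is not the code from the PR): the kind of SSE2 emulation described above can be sanity-checked against a scalar reference. SSE4.1 provides `_mm_max_epi8` for signed 8-bit lanes, while SSE2 only has the unsigned `_mm_max_epu8`. One common workaround, assumed here, is to flip the sign bit before and after the unsigned max. The names `max_i8x16` and `check_against_scalar` are hypothetical, and on non-x86 targets the sketch falls back to a plain loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics only, no SSE4.1
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: element-wise max of 16 signed bytes.
// XOR with 0x80 maps signed order onto unsigned order (-128 -> 0,
// 127 -> 255), so the unsigned SSE2 max gives the right lane, and a
// second XOR restores the original encoding.
void max_i8x16(const int8_t* a, const int8_t* b, int8_t* out) {
#if SKETCH_HAS_SSE2
    const __m128i bias = _mm_set1_epi8(static_cast<char>(-128));
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i m =
        _mm_max_epu8(_mm_xor_si128(va, bias), _mm_xor_si128(vb, bias));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(m, bias));
#else
    // Scalar fallback so the sketch still builds on other targets.
    for (int i = 0; i < 16; ++i) out[i] = std::max(a[i], b[i]);
#endif
}

// Compares max_i8x16 with std::max for every pair of int8_t values.
void check_against_scalar() {
    int8_t a[16], b[16], out[16];
    for (int x = -128; x <= 127; ++x) {
        for (int base = -128; base <= 127; base += 16) {
            for (int i = 0; i < 16; ++i) {
                a[i] = static_cast<int8_t>(x);
                b[i] = static_cast<int8_t>(base + i);
            }
            max_i8x16(a, b, out);
            for (int i = 0; i < 16; ++i) {
                assert(out[i] == std::max(a[i], b[i]));
            }
        }
    }
}
```

Calling `check_against_scalar()` from a test exercises all 65,536 input pairs. The same pattern (vector path versus scalar reference) would apply to a NEON specialization, where `vmaxq_s8` handles signed bytes directly.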
The NEON code was tested on the Raspberry Pi 4: