Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor SIMD code and add support for SSE2 and NEON #32

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

althonos
Copy link

Hi from a Martin to another!

I came to work on Opal since it's used in STAR, and I'm trying to improve the portability of some bioinformatics tools to new platforms (namely Aarch64 which is becoming more prevalent with the new Mac M1 chip). Even though it looks you're not in the field anymore, I hope you can find some time to review and merge this PR!

The current way SIMD was handled was not so compatible with NEON because some SSE/AVX intrinsics do not have strict equivalents in NEON, so I removed the macro-based aliasing and made all SIMD implementations in the Simd<T> templates. Doing so also let me add an emulation layer for SSE2 in the event SSE4.1 is not available. In theory, it should even be possible to add a "fake" SIMD implementation with just one lane to be used as a fallback on some other machines that way, but getting NEON support should already cover the vast majority of modern Arm machines.

Since the code was common between Simd<T> and SimdSw<T> (except for the signedness of min/max operations on char vectors), I merged the 6 different template implementations into just 4, and added the remaining SIMD operations like load/store/bitwise-and to the template as well.

Test

The NEON code was tested on the Raspberry Pi 4:

$ uname -a
Linux chloroplast 5.10.0-17-arm64 #1 SMP Debian 5.10.136-1 (2022-08-13) aarch64 GNU/Linux

$ ./test
Starting Opal!
Using ARM NEON!
Time: 7.257904
Maximum: 573

Starting normal!
Time: 10.553968
Maximum: 573

Times faster: 1.454134

$ ./opal_aligner -x1 ../test_data/query/O74807.fasta ../test_data/db/uniprot_sprot15.fasta
Using SW alignment mode.
Reading query fasta file...
Read query sequence, 110 residues.

Reading database fasta file...
Read 15 database sequences, 4491 residues total.
Whole database read: 15 database sequences, 4491 residues in total.

Comparing query to database...
Finished!

#<i>: <score> (<query start>, <target start>) (<query end>, <target end>)
#0: 155 (?, ?) (106, 161)
#1: 178 (?, ?) (109, 229)
#2: 169 (?, ?) (107, 314)
#3: 119 (?, ?) (108, 145)
#4: 152 (?, ?) (109, 437)
#5: 68 (?, ?) (102, 58)
#6: 143 (?, ?) (106, 214)
#7: 151 (?, ?) (105, 181)
#8: 182 (?, ?) (109, 304)
#9: 94 (?, ?) (84, 74)
#10: 126 (?, ?) (87, 126)
#11: 169 (?, ?) (108, 398)
#12: 138 (?, ?) (99, 239)
#13: 181 (?, ?) (106, 927)
#14: 108 (?, ?) (97, 84)

Cpu time of searching: 0.01
GCUPS (giga cell updates per second): 0.07

@althonos
Copy link
Author

althonos commented May 30, 2023

I also fixed a segmentation fault in NW with full alignment mode, where the substitution matrix may have been negatively indexed after reaching the sequence edges, see althonos/pyopal#1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant