-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use NEON on ARM hardware via sse2neon.h #578
Conversation
b9ff5c2
to
1032406
Compare
I'm not sure why the 32-bit Windows build is failing. The compiler flags and defaults have stayed the same for x86. Unfortunately I don't have a good way to replicate that case locally. |
7e6b7aa
to
4c3d0e7
Compare
The autoconf changes were adapted from: https://github.com/glennrp/libpng/blob/libpng16/configure.ac
I've fixed a configuration problem with 32bit win builds in tests, now the appveyor build passes. |
@albarrentine anything I can do to make it easier to review this? |
compiles on Intel Macs with default options ✅ . There are some builds out there that are currently using |
Thanks for testing @albarrentine, I've now split the flag into the old |
LGTM, appreciated! |
@reisub it's been reported in #592 that this PR caused a regression with I have confirmed the failure on a machine I have, the log simply says: * Suite libpostal_trie_tests:
.free(): invalid next size (fast) Would you mind running the installer on a stock-standard x86 ubuntu machine (using the default install instructions) to see if you can confirm that |
More detail is provided in #597 |
Hey @missinglink, I can reproduce the issue locally on an x86_64 machine running Debian 12. @brianmacy you seem to have a good understanding of the issue, any chance you could look at this? |
Personally, I would just rip out all the code that attempts to support
SSE/NEON. Then when someone has the time we can just implement real
vector/matrix classes from the C++ STL that will leverage whatever is
available on the platform.
…On Tue, Jul 4, 2023 at 3:46 PM Dino Kovač ***@***.***> wrote:
Hey @missinglink <https://github.com/missinglink>, I can reproduce the
issue locally running on an x86_64 machine running Debian 12.
However the issues is outside of my area of expertise and unfortunately I
don't have time to look into this soon.
@brianmacy <https://github.com/brianmacy> you seem to have a good
understanding of the issue, any chance you could look at this?
—
Reply to this email directly, view it on GitHub
<#578 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AF6OZVCGX4FHTTYFZOQTHS3XORXJLANCNFSM5TTR2XUQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
The tests pass on Mac x86, and it's a bit weird that the NEON patch would affect Ubuntu x86 which should just be using the exact same code as before. Agreed that it's probably better to pull out at least the NEON portion if it's causing issues (maybe just remove the sse2neon header and just disable SSE on those architectures by default - most production usage is on Linux x86 and this patch is mainly used to make libpostal build on newer Macs). The primary usage of SSE in libpostal is simply to support a faster vectorized Not going to pull in C++ and STL. This is a C-only library with practically no dependencies, and that's done on purpose. If a little #ifdef/automake magic needs to be sprinkled throughout for various platforms, so be it. Maybe I'm unclear on this, but it doesn't seem like memory alignment and SSE vectorization should impact the trie at all. All of the uses of aligned memory allocation in libpostal are confined to an explicit "*_new_aligned" which is not used in trie.c, trie_search.c, or trie_utils.c and the vectorized instructions only apply for double_array/double_matrix which are not used by the trie at all. It only uses uint32_array for the base and check structures as well as an unsigned char array for storing the tails for storing the tails, none of which use vectorization. However, aligned_new and vectorized exp are used by the crf_context, which is the next test to be run after trie, so I'm wondering if this is some kind of output buffering issue and it's in fact failing on the crf_context rather than the trie. Someone who has this image readily set up, can you try running the tests under valgrind and confirm where the memory issue is occurring? |
@reisub I've moved us over from Travis to Github Actions so it's easy to test PRs again. Tried a few different ideas in #632 but it did look like this pull request was the main issue. As mentioned in #592 disabling SSE didn't seem to be the issue re: breaking Ubuntu, but I've put in some changes related to SSE and #597 anyway. If you'd like to resubmit a PR and ensure that the build passes (optionally can run tests on ARM for Mac/Linux with Github Actions as part of the build matrix), I'd be happy to merge something. Until then the README guidance for disabling SSE on Mac ARM architectures remains in place. After running a few tries at the Github Actions build with and without these changes, I ran a diff on the output between the minimal non-working version and the minimal working version (both with SSE disabled to eliminate that variable) and did note one strange difference which was that the version including this PR's changes for some reason wasn't compiling file_utils.c as part of the tests. Nothing's immediately jumping out from these changes but perhaps it will to a fresher pair of eyes. The relevant part of the diff was:
|
I've added https://github.com/DLTcollab/sse2neon so the code stays almost the same and the configure process will pick the best flags based on architecture.
The default config will use either SSE2 or NEON (based on processor architecture).
The hardware optimizations can be disabled by passing
--disable-sse2
and--disable-neon
to./configure
.I was able to run the test suite on one of my projects that uses libpostal extensively with no problems. For my use case there was no speedup so I couldn't verify that the NEON instructions speed anything up.
Resolves #433.
Resolves #551.