Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use NEON on ARM hardware via sse2neon.h #578

Merged
merged 2 commits into from
Jul 11, 2022

Conversation

reisub
Copy link
Contributor

@reisub reisub commented Apr 17, 2022

I've added https://github.com/DLTcollab/sse2neon so the code stays almost the same and the configure process will pick the best flags based on architecture.

The default config will use either SSE2 or NEON (based on processor architecture).
The hardware optimizations can be disabled by passing --disable-sse2 and --disable-neon to ./configure.

I was able to run the test suite on one of my projects that uses libpostal extensively with no problems. For my use case there was no speedup so I couldn't verify that the NEON instructions speed anything up.

Resolves #433.
Resolves #551.

@reisub reisub force-pushed the sse2neon branch 2 times, most recently from b9ff5c2 to 1032406 Compare April 17, 2022 17:31
@reisub
Copy link
Contributor Author

reisub commented Apr 19, 2022

I'm not sure why the 32-bit Windows build is failing. The compiler flags and defaults have stayed the same for x86.
Maybe a completely separate issue?

Unfortunately I don't have a good way to replicate that case locally.

@reisub reisub force-pushed the sse2neon branch 2 times, most recently from 7e6b7aa to 4c3d0e7 Compare June 16, 2022 10:46
@reisub reisub marked this pull request as draft June 16, 2022 10:51
@reisub
Copy link
Contributor Author

reisub commented Jun 16, 2022

I've fixed a configuration problem with 32bit win builds in tests, now the appveyor build passes.

@reisub reisub marked this pull request as ready for review June 16, 2022 13:56
@reisub
Copy link
Contributor Author

reisub commented Jun 20, 2022

@albarrentine anything I can do to make it easier to review this?

@albarrentine
Copy link
Contributor

compiles on Intel Macs with default options ✅ . There are some builds out there that are currently using --disable-sse2 (ARM builds other than Mac M1), so think I'd prefer to keep that flag name to maintain compatibility, unless there's some autoconf fu to alias it so that --disable-sse2 still works for people pulling from Github

@reisub
Copy link
Contributor Author

reisub commented Jul 5, 2022

Thanks for testing @albarrentine, I've now split the flag into the old --disable-sse2 and --disable-neon.
That should provide enough control, while keeping backwards compatibility and being consistent.
Do you see any issues with that setup?

@albarrentine albarrentine merged commit 544d510 into openvenues:master Jul 11, 2022
@albarrentine
Copy link
Contributor

LGTM, appreciated!

@missinglink
Copy link
Contributor

@reisub it's been reported in #592 that this PR caused a regression with make check on x86 linux machines.

I have confirmed the failure on a machine I have, the log simply says:

* Suite libpostal_trie_tests:
.free(): invalid next size (fast)

Would you mind running the installer on a stock-standard x86 ubuntu machine (using the default install instructions) to see if you can confirm that make check errors and, if so, what might be causing it?

@missinglink
Copy link
Contributor

More detail is provided in #597

@reisub
Copy link
Contributor Author

reisub commented Jul 4, 2023

Hey @missinglink, I can reproduce the issue locally on an x86_64 machine running Debian 12.
However the issue is outside of my area of expertise and unfortunately I don't have time to look into this soon.

@brianmacy you seem to have a good understanding of the issue, any chance you could look at this?

@brianmacy
Copy link

brianmacy commented Jul 4, 2023 via email

@albarrentine
Copy link
Contributor

The tests pass on Mac x86, and it's a bit weird that the NEON patch would affect Ubuntu x86 which should just be using the exact same code as before.

Agreed that it's probably better to pull out at least the NEON portion if it's causing issues (maybe just remove the sse2neon header and just disable SSE on those architectures by default - most production usage is on Linux x86 and this patch is mainly used to make libpostal build on newer Macs). The primary usage of SSE in libpostal is simply to support a faster vectorized exp, which is useful for Conditional Random Fields since they make heavier use of the exponential. However, in practice it's more than fast enough without SSE enabled (i.e. just using exp from math.h and whatever optimizations the compiler may add). Often for machine learning it can be useful to trade a tiny bit of numerical precision for speed, so something like the fast approximate non-vectorized versions from Paul Mineiro's approach could also do the trick and may be a way to sidestep the issue entirely, pending thorough testing: https://github.com/etheory/fastapprox/blob/master/fastapprox/src/fastexp.h

Not going to pull in C++ and STL. This is a C-only library with practically no dependencies, and that's done on purpose. If a little #ifdef/automake magic needs to be sprinkled throughout for various platforms, so be it.

Maybe I'm unclear on this, but it doesn't seem like memory alignment and SSE vectorization should impact the trie at all. All of the uses of aligned memory allocation in libpostal are confined to an explicit "*_new_aligned" which is not used in trie.c, trie_search.c, or trie_utils.c and the vectorized instructions only apply for double_array/double_matrix which are not used by the trie at all. It only uses uint32_array for the base and check structures as well as an unsigned char array for storing the tails for storing the tails, none of which use vectorization. However, aligned_new and vectorized exp are used by the crf_context, which is the next test to be run after trie, so I'm wondering if this is some kind of output buffering issue and it's in fact failing on the crf_context rather than the trie. Someone who has this image readily set up, can you try running the tests under valgrind and confirm where the memory issue is occurring?

@albarrentine
Copy link
Contributor

albarrentine commented Jul 7, 2023

@reisub I've moved us over from Travis to Github Actions so it's easy to test PRs again.

Tried a few different ideas in #632 but it did look like this pull request was the main issue. As mentioned in #592 disabling SSE didn't seem to be the issue re: breaking Ubuntu, but I've put in some changes related to SSE and #597 anyway.

If you'd like to resubmit a PR and ensure that the build passes (optionally can run tests on ARM for Mac/Linux with Github Actions as part of the build matrix), I'd be happy to merge something. Until then the README guidance for disabling SSE on Mac ARM architectures remains in place.

After running a few tries at the Github Actions build with and without these changes, I ran a diff on the output between the minimal non-working version and the minimal working version (both with SSE disabled to eliminate that variable) and did note one strange difference which was that the version including this PR's changes for some reason wasn't compiling file_utils.c as part of the tests. Nothing's immediately jumping out from these changes but perhaps it will to a fresher pair of eyes. The relevant part of the diff was:

gcc -DHAVE_CONFIG_H -I.. -I/usr/local/include -Wfloat-equal -Wpointer-arith -std=gnu99 -DLIBPOSTAL_DATA_DIR='"/home/runner/work/libpostal/libpostal/data/libpostal"' -g -O3 -Wfloat-equal -Wpointer-arith -std=gnu99 -DLIBPOSTAL_DATA_DIR='"/home/runner/work/libpostal/libpostal/data/libpostal"' -g -MT ../src/test_libpostal-file_utils.o -MD -MP -MF ../src/.deps/test_libpostal-file_utils.Tpo -c -o ../src/test_libpostal-file_utils.o test -f '../src/file_utils.c' || echo './'../src/file_utils.c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make not working on M1 Mac By default SSE2 flags wrongly set on ARM CPU
4 participants