Tuning for Apple chips #2814
There is no specific code for Apple's desktop-to-be processor there. As far as the internet tells, an ISA profile is already present.
There is no specific support for big.LITTLE configurations, Apple or otherwise. If you find Apple's Accelerate framework outperforming OpenBLAS, describe the regression here, as usual. As hardware becomes available, more people will actually use it and report what they find lacking.
Well, that is why I said microarchitectural tuning, not architectural tuning ;) Accelerate will obviously be a good point of comparison.
Does
feature detection is in the
Thanks for the data, rough first draft is in #2816
Apple's M1 appears to offer Intel AMX-like capabilities accessed via an arm64 ISA extension. This extension is in use by Apple's own Accelerate framework. Prototype code for using these matrix intrinsics may be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f
That's not prototype code, nor an intrinsics header of sorts; that is an early attempt to document an undocumented co-processor.
Uh, thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available, e.g. through benchmarking.
Ulp. You're right. I scanned the gist too quickly. That's not prototype code, rather a documentation effort.
Fair enough regarding the IP concern.
If anyone could benchmark Rosetta 2 roughly and tell what works best from the x86 world: https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment#3616843
@brada4 as I understand it, Rosetta is more of a runtime x86 emulation environment to make x86 binaries run at all - I do not see how this would provide any insight compared to benchmarking the existing ARMV8 (and potentially thunderx2) kernels and working from there.
That emulated x86 will be around for 3-5 years (judging by the "smooth" PPC to x86 transition years ago).
Hello everyone, I had some fun over the past few days benchmarking R and Python, lately also compiled with OpenBLAS. See threads here: https://twitter.com/dngman/status/1342580260815200257?s=20 and https://twitter.com/fxcoudert/status/1342598509418176514?s=20 If there is anything I can do to accelerate this effort, e.g. with testing, please let me know.
One trivial change to try (if you can spare the time) would be to edit
@martin-frbg is there a quick way to run a standard benchmark in OpenBLAS? I see there's a benchmark directory but not much info on how to use it…
No decent framework currently - just a bunch of individual files either inherited from GotoBLAS or inspired by those. Run
I've run the benchmark suite on OpenBLAS (develop branch) compiled for:
For comparison, I've included results for veclib, where these benchmarks compiled cleanly. Many did not, and some tests segfaulted. You'll also note that many of the tests appear to have underruns; I've not yet had the opportunity to dig in and understand why.

Results:

Next up: try the KERNEL.NEOVERSEN1 and KERNEL.THUNDERX3T110 replacements for ARMV8.
Thanks - the underruns make me suspect that _POSIX_TIMERS (for clock_gettime presence) is not defined on Big Sur, which would make the benchmarks fall back to gettimeofday() with only millisecond resolution. For most, but unfortunately not all, of the benchmarks you can set the environment variable OPENBLAS_LOOPS to some "suitable" repeat value to get measurable execution times.
This version of benchmark/bench.h would probably work for OSX: |
I get the following when building the tests with the modified
Strange, looks like it ignored the declaration of
@martin-frbg you can't call
Right, @danielchalef, can you move that line
Why doesn't it define
The tests compiled. However, the math now appears off:
Off by a factor of 1e9, probably (reporting nanoseconds instead of seconds), though something else seems to affect the very first call.
@martin-frbg Your suggestion to set
@fxcoudert from pocoproject/poco#1453, apparently clock_gettime was added in OSX 10.12, but the presence of _POSIX_TIMERS may depend on the minimum SDK version setting at compile time (?). Anyway, we'd probably want this to work with OSX < 10.12.
dgemm results on a MacBook Pro M1. OpenBLAS compiled with Xcode / clang version 12.0.0. The test was run 10 times, with the first run discarded.

OpenBLAS (with VORTEX/ARMV8 kernel) vs Veclib:

OpenBLAS VORTEX/ARMV8 vs NEOVERSEN1 vs THUNDERX3T110 kernels (all on the M1):

A little difficult to see given the similarity in results and scale. See the charts below for some interesting matrix dimension results.

tl;dr: Veclib significantly outperforms OpenBLAS, likely because it uses native, hardware-based matrix multiplication acceleration. The NEOVERSEN1 kernel appears to offer better results on the M1 than the default ARMV8 kernel.
Does anybody work on optimizing gemm on the M1? Actually, I am interested in it.
I also ran some quick benchmarks in the context of the numpy / scipy stack. Here are the results: https://gist.github.com/ogrisel/87dcf2c3ab8a304ededf75934b116b61#gistcomment-3614885

In float64, OpenBLAS VORTEX is not too bad, but for float32 Accelerate/vecLib is really impressive (~2.6x faster).

I also noticed that OpenBLAS / OpenMP detects 8 cores on the Apple M1 (it has 4 performance cores and 4 efficiency cores), but apparently I get better performance by setting

I can open a dedicated issue for the oversubscription problem.

Edit: a first version of this post / bench results used

Command I used to introspect the number of cores actually used:
Something is not right. Numpy is fine with Apple's
It is rumoured/known that Apple uses a proprietary co-processor in the M1 for Accelerate that no other library can reach. It should be much better at either GEMM.
I am not sure I understand. Here I used this setup to get numpy to use Accelerate for BLAS calls and netlib for LAPACK calls. This is using isuruf/vecLibFort.
R is a bit further along in supporting your laptop:
Would be interesting (in the short term, say two years) to select the optimal core type for Rosetta - no AVX supported, so only about five candidates remain.
Good remark, here are the GEMM results with m, n, k = 4096: OpenBLAS (4 threads)
OpenBLAS (8 threads)
Accelerate / vecLib
Comments:
I'd let the big boys play and wait for R to ship an official release where the user can switch between the default Accelerate and NetLib BLAS, then sneak OpenBLAS in place of the NetLib one.
OpenBLAS can now be built for Apple chips using the port at https://github.com/iains/gcc-darwin-arm64.
The build succeeds and seems to run fine, so it might be time to think about tuning for this microarchitecture.
If anybody is interested in working on this issue, I can probably facilitate hardware access.