Tuning for Apple chips #2814
There is no specific code for Apple's desktop-to-be processor there. As far as the internet tells, an ISA profile is already present.
There is no specific support for big.LITTLE configurations, Apple or otherwise. If you find Apple's Accelerate framework outperforming OpenBLAS, describe the regression here, as usual. As hardware becomes available, more people will actually use it and report what they find lacking.
Well, that is why I said microarchitectural tuning, not architectural tuning ;) Accelerate will obviously be a good point of comparison.
Does
feature detection is in the
Thanks for the data, rough first draft is in #2816
Apple's M1 appears to offer Intel AMX-like capabilities accessed via an arm64 ISA extension. This extension is in use by Apple's own Accelerate framework. Prototype code for using these matrix intrinsics may be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f
That's not prototype code, nor an intrinsics header of sorts; that is an early attempt to document an undocumented co-processor.
Uh, thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available, e.g. through benchmarking.
Ulp. You're right. I scanned the gist too quickly. That's not prototype code, rather a documentation effort.
Fair enough regarding the IP concern.
If anyone could benchmark Rosetta 2 roughly and tell what works best from the x86 world: https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment#3616843
@brada4 as I understand it, Rosetta is more of a runtime x86 emulation environment to make x86 binaries run at all - I do not see how this would provide any insight compared to benchmarking the existing ARMV8 (and potentially thunderx2) kernels and working from there.
That emulated x86 will be around for 3-5 years (judging by the "smooth" PPC to x86 transition years ago).
Hello everyone, I had some fun over the past few days benchmarking R and Python, lately also compiled with OpenBLAS. See threads here: https://twitter.com/dngman/status/1342580260815200257?s=20 and https://twitter.com/fxcoudert/status/1342598509418176514?s=20 If there is anything I can do to accelerate this effort, e.g. with testing, please let me know.
One trivial change to try (if you can spare the time) would be to edit
@martin-frbg is there a quick way to run a standard benchmark in OpenBLAS? I see there's a benchmark directory but not much info on how to use it…
No decent framework currently - just a bunch of individual files either inherited from GotoBLAS or inspired by those. Run
I've run the benchmark suite on OpenBLAS (develop branch) compiled for:
For comparison, I've included results for veclib, where these benchmarks compiled cleanly. Many did not, and some tests segfaulted. You'll also note that many of the tests appear to have underruns; I've not yet had the opportunity to dig in and understand why.

Results:

Next up: try the KERNEL.NEOVERSEN1 and KERNEL.THUNDERX3T110 replacements for ARMV8.
Thanks - the underruns make me suspect that _POSIX_TIMERS (for clock_gettime presence) is not defined on Big Sur, which would make the benchmarks fall back to gettimeofday() with only millisecond resolution. For most, but unfortunately not all, of the benchmarks you can set the environment variable OPENBLAS_LOOPS to some "suitable" repeat value to get measurable execution times.
This version of benchmark/bench.h would probably work for OSX: |
I get the following when building the tests with the modified
Strange, looks like it ignored the declaration of
@martin-frbg you can't call
Right, @danielchalef, can you move that line
Why doesn't it define
The tests compiled. However, the math now appears off:
Off by a factor of 1e9, probably (reporting nanoseconds instead of seconds), though something else seems to affect the very first call.
@martin-frbg Your suggestion to set
@fxcoudert from pocoproject/poco#1453, apparently clock_gettime was added in OSX 10.12, but the presence of _POSIX_TIMERS may depend on the minimum SDK version setting at compile time (?). Anyway, we'd probably want this to work with OSX < 10.12.
dgemm results on a MacBook Pro M1. OpenBLAS compiled with Xcode / clang version 12.0.0. The test was run 10 times, with the first run discarded.

OpenBLAS (with VORTEX/ARMV8 kernel) vs Veclib:

OpenBLAS VORTEX/ARMV8 vs NEOVERSEN1 vs THUNDERX3T110 kernels (all on the M1):

A little difficult to see given the similarity in results and scale. See the charts below for some interesting matrix dimension results.

tl;dr: Veclib significantly outperforms OpenBLAS, likely because it uses native, hardware-based matrix multiplication acceleration. The NEOVERSEN1 kernel appears to offer better results on the M1 than the default ARMV8 kernel.
Does anybody work on optimizing gemm on the M1? Actually, I am interested in it.
I also ran some quick benchmarks in the context of the numpy / scipy stack. Here are the results: https://gist.github.com/ogrisel/87dcf2c3ab8a304ededf75934b116b61#gistcomment-3614885

In float64, OpenBLAS VORTEX is not too bad, but for float32 Accelerate/vecLib is really impressive (~2.6x faster).

I also noticed that OpenBLAS / OpenMP detects 8 cores on the Apple M1 (it has 4 performance cores and 4 efficiency cores), but apparently I get better performance by setting

I can open a dedicated issue for the oversubscription problem.

Edit: a first version of this post / bench results used

Command I used to introspect the number of cores actually used:
Something is not right. Numpy is fine with Apple's
It is rumoured/known that Apple uses a proprietary co-processor in the M1 for Accelerate that no other library can reach. It should be much better at either GEMM.
I am not sure I understand. Here I used this setup to get numpy to use Accelerate for BLAS calls and netlib for LAPACK calls. This is using isuruf/vecLibFort.
R is a bit further along in supporting your laptop:
Would be interesting (in the short term, say two years) to select the optimal core type for Rosetta - no AVX supported, so only about five candidates remain.
Good remark, here are the GEMM results with m, n, k = 4096: OpenBLAS (4 threads)
OpenBLAS (8 threads)
Accelerate / vecLib
Comments:
I'd let the big boys play and wait for R to ship an official release where the user can switch between the default Accelerate and NetLib BLAS, then sneak OpenBLAS in place of the NetLib one.
OpenBLAS can now be built for Apple chips using the port at https://github.com/iains/gcc-darwin-arm64.
The build succeeds and seems to run fine, so it might be time to think about tuning for this microarchitecture.
If anybody is interested in working on this issue, I can probably facilitate hardware access.