Tuning for Apple chips #2814

Open
Keno opened this issue Sep 1, 2020 · 38 comments
Comments

@Keno
Contributor

Keno commented Sep 1, 2020

OpenBLAS can now be built for Apple chips using the port at https://github.com/iains/gcc-darwin-arm64.
The build succeeds and seems to run fine, so it might be time to think about tuning for this microarchitecture.
If anybody is interested in working on this issue, I can probably facilitate hardware access.

@brada4
Contributor

brada4 commented Sep 2, 2020

There is no code specific to Apple's desktop-to-be processor over there. As far as the internet can tell, its ISA profile is already supported:

Instruction set A64 – ARMv8.4-A

There is no specific support for big.LITTLE configurations, Apple or otherwise.

If you find Apple's Accelerate framework outperforming OpenBLAS, describe the regression here, as usual. As hardware becomes available, more people will actually use it and report what they find unfair.

@Keno
Contributor Author

Keno commented Sep 2, 2020

Well, that is why I said microarchitectural tuning, not architectural tuning ;) Accelerate will obviously be a good point of comparison.

@martin-frbg
Collaborator

Does sysctl -n machdep.cpu.brand_string (as search engines tell me) return anything usable for identification? (For #2804, I just made the code return ARMV8 by default.) If it does, we could start by giving it its own unique TARGET (which would allow assigning appropriate compiler options). The next trivial step could be to repurpose the ThunderX2T99 kernels (as they should be the most advanced we have right now) and see how they fare compared to generic ARMV8 or CortexA57.

@Keno
Contributor Author

Keno commented Sep 2, 2020

sysctl machdep
machdep.user_idle_level: 128
machdep.wake_abstime: 1258169647836
machdep.time_since_reset: 538131617
machdep.wake_conttime: 46263198095284
machdep.deferred_ipi_timeout: 64000
machdep.cpu.cores_per_package: 8
machdep.cpu.core_count: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8
machdep.cpu.brand_string: Apple processor
machdep.lck_mtx_adaptive_spin_mode: 1
machdep.virtual_address_size: 47

Feature detection is in the hw sysctl, though:

hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 17179869184
hw.activecpu: 8
hw.physicalcpu: 8
hw.physicalcpu_max: 8
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: 131287967
hw.cacheconfig: 8 1 1 0 0 0 0 0 0 0
hw.cachesize: 4008591360 131072 8388608 0 0 0 0 0 0 0
hw.pagesize: 16384
hw.pagesize32: 16384
hw.cachelinesize: 64
hw.l1icachesize: 131072
hw.l1dcachesize: 131072
hw.l2cachesize: 8388608
hw.tbfrequency: 24000000
hw.packages: 1
hw.osenvironment: 
hw.ephemeral_storage: 0
hw.use_recovery_securityd: 0
hw.use_kernelmanagerd: 1
hw.serialdebugmode: 0
hw.optional.floatingpoint: 1
hw.optional.watchpoint: 4
hw.optional.breakpoint: 6
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_2_fhm: 0
hw.optional.amx_version: 0
hw.optional.ucnormal_mem: 0
hw.optional.arm64: 1
hw.targettype: J273a
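
For readers who want to turn a dump like the one above into something a build script could branch on, here is a minimal Python sketch. It parses sysctl-style `key: value` text and decodes `hw.cpufamily` and the cache sizes. The helper names (`parse_sysctl`, `describe`) are made up for illustration and are not part of OpenBLAS; the sample text is copied from the output above.

```python
def parse_sysctl(text):
    """Parse `key: value` lines as printed by `sysctl hw` into a dict of strings."""
    entries = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            entries[key.strip()] = value.strip()
    return entries


def describe(entries):
    """Return a few decoded fields that matter for picking a TARGET."""
    return {
        "cpufamily_hex": hex(int(entries["hw.cpufamily"])),
        "l1d_kib": int(entries["hw.l1dcachesize"]) // 1024,
        "l2_mib": int(entries["hw.l2cachesize"]) // (1024 * 1024),
        "cacheline_bytes": int(entries["hw.cachelinesize"]),
        "has_neon_fp16": entries.get("hw.optional.neon_fp16") == "1",
    }


# Subset of the `sysctl hw` output posted above.
SAMPLE = """\
hw.cpufamily: 131287967
hw.cachelinesize: 64
hw.l1dcachesize: 131072
hw.l2cachesize: 8388608
hw.optional.neon_fp16: 1
"""

info = describe(parse_sysctl(SAMPLE))
print(info)
```

The decoded family value, 0x7d34b9f, appears to correspond to the `CPUFAMILY_ARM_VORTEX_TEMPEST` constant in Apple's `mach/machine.h`; treat that mapping as something to verify against the SDK headers rather than as a given.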

@martin-frbg
Collaborator

Thanks for the data, rough first draft is in #2816

@danielchalef

Apple's M1 appears to offer Intel AMX-like capabilities accessed via an arm64 ISA extension. This extension is in use by Apple's own Accelerate framework. Prototype code for using these matrix intrinsics may be found here: https://gist.github.com/dougallj/7a75a3be1ec69ca550e7c36dc75e0d6f

@brada4
Contributor

brada4 commented Dec 28, 2020

That's not prototype code, nor an intrinsics header of sorts; that is an early attempt to document an undocumented co-processor.

@martin-frbg
Collaborator

Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available, e.g. through benchmarking.

@danielchalef

> That's not prototype code, nor an intrinsics header of sorts; that is an early attempt to document an undocumented co-processor.

Ulp. You're right. I scanned the gist too quickly. That's not prototype code, rather a documentation effort.

@danielchalef

> Uh thanks, a deep link into a gist that looks as if it was supposed to be private, and has comments about being reverse-engineered from Apple's intellectual property? I am not sure I would want to go there, least of all when nobody has even attempted to make proper use of what is openly available, e.g. through benchmarking.

Fair enough regarding the IP concern.

@brada4
Contributor

brada4 commented Dec 29, 2020

It would help if anyone could roughly benchmark Rosetta 2 and tell which target from the x86 world works best: https://developer.apple.com/documentation/apple_silicon/about_the_rosetta_translation_environment#3616843

@martin-frbg
Collaborator

@brada4 as I understand it Rosetta is more like a runtime x86 emulation environment to make x86 binaries run at all - I do not see how this would provide any insight compared to benchmarking the existing ARMV8 (and potentially thunderx2) kernels and working from there.

@brada4
Contributor

brada4 commented Dec 29, 2020

That emulated x86 will be around for 3-5 years (judging by the "smooth" PPC-to-x86 transition years ago).

@dengemann

Hello everyone, I had some fun the past days benchmarking R and Python, lately also compiled with openblas. See threads here https://twitter.com/dngman/status/1342580260815200257?s=20 and https://twitter.com/fxcoudert/status/1342598509418176514?s=20

If there is anything that I can do to accelerate this effort e.g. with testing etc. please let me know.

@martin-frbg
Collaborator

One trivial change to try (if you can spare the time) would be to edit kernel/arm64/KERNEL.VORTEX so that it includes either KERNEL.NEOVERSEN1 or KERNEL.THUNDERX3T110 instead of the more generic KERNEL.ARMV8.
This is just a stab in the dark though - no guarantees that this will actually make OpenBLAS faster, just that it would then use more recent BLAS kernels for server-class cpus rather than a smallest common denominator capable of running on old phones.

@fxcoudert

@martin-frbg is there a quick way to run a standard benchmark in openblas? I see there's a benchmark directory but not much info on how to use that…

@martin-frbg
Collaborator

No decent framework currently - just a bunch of individual files either inherited from GotoBLAS or inspired by those. Run make in the benchmark directory, and then execute one of the generated *.goto files with optional arguments of initial dimension, final dimension, step size, e.g. dlinpack.goto 1000 10000 50 to get a simple printout of problem size vs. MFlops to feed into e.g. gnuplot.
The scripts subdirectory contains similarly trivial scripts for Python, Octave and R.

@danielchalef

I've run the benchmark suite on OpenBLAS (develop branch) compiled for:

  • arm64 / VORTEX with LLVM shipped with Big Sur
  • x86_64 using homebrew's gcc-10 toolchain and run using Rosetta. These tests were run twice in order to cache the translation.

For comparison, I've included results for veclib, where these benchmarks compiled cleanly. Many did not and some tests segfaulted.

You'll also note that many of the tests appear to have underruns. I've not yet had the opportunity to dig in to understand why this happened.

Results:
https://github.com/danielchalef/openblas-benchmark-m1

Next up: Try the KERNEL.NEOVERSEN1 and KERNEL.THUNDERX3T110 replacement for ARMV8.

@martin-frbg
Collaborator

martin-frbg commented Dec 30, 2020

Thanks - the underruns make me suspect that _POSIX_TIMERS (for clock_gettime presence) is not defined on Big Sur, which would make the benchmarks fall back to gettimeofday() with only millisecond resolution. For most but unfortunately not all of the benchmarks you can set the environment variable OPENBLAS_LOOPS to some "suitable" repeat value to get measurable execution times.
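
To illustrate why a repeat count helps when the timer is coarse, here is a minimal Python sketch (not the code in benchmark/bench.h). It times a batch of calls and divides by the loop count, so the measured interval spans many timer ticks. `time_per_call`, `tiny_workload` and the `loops` parameter are hypothetical names that merely mirror the role of OPENBLAS_LOOPS.

```python
import time


def time_per_call(fn, loops):
    """Average wall-clock seconds per call of fn, measured over `loops` calls."""
    if loops < 1:
        raise ValueError("loops must be at least 1")
    start = time.perf_counter()
    for _ in range(loops):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / loops


def tiny_workload():
    # Stand-in for a small BLAS call that may finish within one timer tick.
    sum(i * i for i in range(100))


# Resolution of the clock used above, as reported by the Python runtime.
print("timer resolution:", time.get_clock_info("perf_counter").resolution)
print("1 loop  :", time_per_call(tiny_workload, 1))
print("20 loops:", time_per_call(tiny_workload, 20))
```

With a single loop, a very short call can register as zero elapsed time on a coarse clock; averaging over more loops trades benchmark run time for a measurable interval.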

@martin-frbg
Collaborator

This version of benchmark/bench.h would probably work for OSX:
bench.h.txt

@danielchalef

> This version of benchmark/bench.h would probably work for OSX:
> bench.h.txt

I get the following when making the tests with the modified bench.h:

./bench.h:82:21: error: expected parameter declarator
 mach_timebase_info(&info);
                    ^
./bench.h:82:21: error: expected ')'
./bench.h:82:20: note: to match this '('
 mach_timebase_info(&info);
                   ^
./bench.h:82:2: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
 mach_timebase_info(&info);
 ^
1 warning and 2 errors generated.
make: *** [sgemm.o] Error 1

@martin-frbg
Collaborator

Strange, looks like it ignored the declaration of info as a mach_timebase_info_data_t on the preceding line - but this is only cobbled together from various sources on the internet, not even compile-tested as I do not have any Apple hardware here.

@fxcoudert

@martin-frbg you can't call mach_timebase_info() outside of an actual function

@martin-frbg
Collaborator

Right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please?

@fxcoudert

fxcoudert commented Dec 30, 2020

CLOCK_REALTIME should be available, though, with microsecond resolution:

$ cat a.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int main (void){
  struct timespec tp;
  int result;

  result = clock_getres(CLOCK_REALTIME, &tp);
  printf("result: %d\n", result);
  printf("tp.tv_sec: %lld\n", (long long) tp.tv_sec);
  printf("tp.tv_nsec: %lld\n", (long long) tp.tv_nsec);
}
$ clang a.c && ./a.out
result: 0
tp.tv_sec: 0
tp.tv_nsec: 1000

Why it doesn't define _POSIX_TIMERS is beyond me…

@danielchalef

danielchalef commented Dec 30, 2020

> Right, @danielchalef can you move that line mach_timebase_info(&info); into the getsec() function immediately after the #elif defined(__APPLE__) there, please?

The tests compiled. However, the math now appears off:
dgemm.goto

          SIZE                   Flops             Time
 M=   1, N=   1, K=   1 :        0.00 MFlops 15125.000000 sec
 M=   2, N=   2, K=   2 :        0.00 MFlops 458.333333 sec
 M=   3, N=   3, K=   3 :        0.00 MFlops 458.333333 sec
 M=   4, N=   4, K=   4 :        0.00 MFlops 458.333333 sec
 M=   5, N=   5, K=   5 :        0.00 MFlops 583.333333 sec

@martin-frbg
Collaborator

Off by only 1e9 probably (reporting nanoseconds instead of seconds), though something else seems to affect the very first call.
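
A quick arithmetic check supports the nanosecond reading. hw.tbfrequency in the sysctl dump earlier in this thread is 24000000, so one timebase tick lasts 1e9 / 24e6 = 125/3 ≈ 41.67 ns, and the reported 458.333333 is exactly 11 such ticks. The Python sketch below redoes that conversion; the 125/3 ratio is an assumption derived from that frequency rather than a value read from mach_timebase_info(), and `ticks_to_seconds` is a hypothetical helper.

```python
from fractions import Fraction

TB_FREQUENCY_HZ = 24_000_000  # hw.tbfrequency from the sysctl dump in this thread

# Nanoseconds per tick as an exact ratio; assumed (not verified) to match
# the numer/denom pair that mach_timebase_info() would report here.
NS_PER_TICK = Fraction(1_000_000_000, TB_FREQUENCY_HZ)


def ticks_to_seconds(ticks):
    """Convert mach_absolute_time()-style ticks to seconds."""
    nanoseconds = ticks * NS_PER_TICK
    # Final scaling to seconds: the step the draft bench.h probably omitted.
    return float(nanoseconds) / 1e9


for ticks in (363, 11, 14):
    ns = float(ticks * NS_PER_TICK)
    print(f"{ticks:4d} ticks = {ns:.6f} ns = {ticks_to_seconds(ticks):.3e} s")
```

Under that assumption, the three distinct values in the table (15125, 458.333333 and 583.333333) correspond to 363, 11 and 14 ticks, which is consistent with the "Time" column holding nanoseconds.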

@danielchalef

danielchalef commented Dec 30, 2020

@martin-frbg Your suggestion to set OPENBLAS_LOOPS to a larger number works. I'll upload dgemm results later today.

@martin-frbg
Collaborator

@fxcoudert from pocoproject/poco#1453 apparently clock_gettime was added in OSX 10.12 but the presence of _POSIX_TIMERS may depend on the minimum SDK version setting at compile time (?) Anyway we'd probably want this to work with OSX < 10.12

@danielchalef

danielchalef commented Dec 30, 2020

dgemm results on a MacBook Pro M1. OpenBLAS compiled with Xcode / clang version 12.0.0. The test was run 10 times with the first run discarded. OPENBLAS_LOOPS was set to 20 in order to avoid the underflow discussed above.

OpenBLAS (with VORTEX/ARMV8 kernel) vs Veclib

[image: visualization]

OpenBLAS VORTEX/ARMV8 vs NEOVERSEN1 vs THUNDERX3T110 kernels (all on the M1):

A little difficult to see given the similarity in results and scale. See charts below for some interesting matrix dimension results.

[image: visualization (2)]

[image: visualization (3)]

tl;dr Veclib significantly outperforms OpenBLAS, likely because it is using native, hardware-based matrix multiplication acceleration. The NEOVERSEN1 kernel appears to offer better results for the M1 than the default ARMV8 kernel.

@xianyi
Collaborator

xianyi commented Dec 31, 2020

Is anybody working on optimizing GEMM for the M1? I am interested in it.

@ogrisel
Contributor

ogrisel commented Feb 1, 2021

I also ran some quick benchmarks in the context of the numpy / scipy stack. Here are the results:

https://gist.github.com/ogrisel/87dcf2c3ab8a304ededf75934b116b61#gistcomment-3614885

In float64, OpenBLAS VORTEX is not too bad. But for float32 Accelerate/vecLib is really impressive (~2.6x faster).

I also noticed that OpenBLAS / OpenMP detects 8 cores on Apple M1 (it has 4 performance cores and 4 efficiency cores) but apparently I get better performance by setting OMP_NUM_THREADS=4. So I suspect a bit of oversubscription when using the "efficiency cores". Maybe they do not support the vortex instructions?

I can open a dedicated issue for the oversubscription problem.

Edit: a first version of this post / bench results used OPENBLAS_NUM_THREADS=4 instead of OMP_NUM_THREADS=4. Only the latter works with the openblas build from conda-forge. I fixed the numbers accordingly.

Command I used to introspect the number of cores actually used:

% python -m threadpoolctl -i numpy
[
  {
    "filepath": "/Users/ogrisel/miniforge3/envs/openblas/lib/libopenblasp-r0.3.12.dylib",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.12",
    "num_threads": 8,
    "threading_layer": "openmp"
  },
  {
    "filepath": "/Users/ogrisel/miniforge3/envs/openblas/lib/libomp.dylib",
    "prefix": "libomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 8
  }
]

@brada4
Contributor

brada4 commented Feb 1, 2021

Something is not right. Numpy is fine with Apple's cblas.h + Accelerate alone, and the same netlib SVD is used in both cases (-O2 desirable).

It is rumoured/known that Apple's Accelerate uses a proprietary co-processor in the M1 that nobody else can reach. It should be much better at either GEMM.

@ogrisel
Contributor

ogrisel commented Feb 1, 2021

> Numpy is fine with Apple's cblas.h + Accelerate alone

I am not sure I understand. Here I used this setup to get numpy to use Accelerate for BLAS calls and netlib for LAPACK calls. This is using isuruf/vecLibFort.

@brada4
Contributor

brada4 commented Feb 1, 2021

R is a bit more evolved towards supporting your laptop:
https://developer.r-project.org/Blog/public/2020/11/02/will-r-work-on-apple-silicon/index.html
Please try with bigger data sets; 4/8 MB is crumbs for modern CoD CPUs or multi-socket ones. It probably fits in the M1 cache, so that BLAS L1 is faster than any RAM chip made so far.

@brada4
Contributor

brada4 commented Feb 1, 2021

It would be interesting (in the short term, about 2 years) to select the optimal core type for Rosetta; AVX is not supported, so about 5 remain.

@ogrisel
Contributor

ogrisel commented Feb 2, 2021

> Please try with bigger data sets; 4/8 MB is crumbs for modern CoD CPUs or multi-socket ones. It probably fits in the M1 cache, so that BLAS L1 is faster than any RAM chip made so far.

Good remark, here are the GEMM results with m, n, k = 4096:

OpenBLAS (4 threads)

[float32] np.dot: 392.533 ms, 350.2 GFLOP/s
[float64] np.dot: 812.844 ms, 169.1 GFLOP/s

OpenBLAS (8 threads)

[float32] np.dot: 333.046 ms, 412.8 GFLOP/s
[float64] np.dot: 728.308 ms, 188.8 GFLOP/s

Accelerate / vecLib

[float32] np.dot: 191.281 ms, 718.7 GFLOP/s
[float64] np.dot: 712.759 ms, 192.9 GFLOP/s

Comments:

  • So now, the 8 threads are better used by OpenBLAS: no oversubscription anymore... interesting.
  • vecLib is still significantly faster with float32 but only by a factor of 1.7 now.
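
The reported rates can be reproduced from the timings, assuming the usual count of 2·m·n·k floating-point operations for a matrix product. The sketch below is a plain Python check of that arithmetic, not the benchmark script from the gist; `gemm_gflops` is a hypothetical helper.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOP/s for an (m x k) @ (k x n) product, counting 2*m*n*k operations."""
    return 2.0 * m * n * k / seconds / 1e9


# (label, milliseconds, reported GFLOP/s) copied from the results above.
RESULTS = [
    ("OpenBLAS 4 threads float32", 392.533, 350.2),
    ("OpenBLAS 4 threads float64", 812.844, 169.1),
    ("OpenBLAS 8 threads float32", 333.046, 412.8),
    ("OpenBLAS 8 threads float64", 728.308, 188.8),
    ("Accelerate/vecLib float32", 191.281, 718.7),
    ("Accelerate/vecLib float64", 712.759, 192.9),
]

for label, ms, reported in RESULTS:
    computed = gemm_gflops(4096, 4096, 4096, ms / 1000.0)
    print(f"{label:28s} computed {computed:6.1f}  reported {reported:6.1f}")

# float32 advantage of vecLib over OpenBLAS with 8 threads
print("float32 ratio:", round(718.7 / 412.8, 2))
```

The computed values land within a fraction of a percent of the reported ones, and the float32 ratio comes out near 1.74, consistent with the "factor of 1.7" above.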

@brada4
Contributor

brada4 commented Feb 2, 2021

I'd let the big boys play and wait for R to ship an official release where the user can switch between the default Accelerate and netlib BLAS, then sneak OpenBLAS in place of the netlib one.
It is hard, if not impossible, to believe that the much-touted extra piece of silicon meant for matrix acceleration turns out to be such a flop.
