
Workstation SKX is mis-identified #352

Open
loveshack opened this issue Oct 7, 2019 · 26 comments

@loveshack
Contributor

I have a desktop SKX, model W-2123, which the cpuid code identifies as haswell (obvious with configure auto).

It turns out that it doesn't report the AVX-512 VPUs because it fails to parse the model name. I fixed it with #351.

@fgvanzee
Member

fgvanzee commented Oct 7, 2019

@loveshack Thanks for your contribution, Dave. I was not even aware of this W line of parts. (I was confused at first because "desktop SKX" seemed contradictory; up until now we knew all desktop Skylakes to be sans AVX-512. But it seems the W is for workstation, which makes sense in that it's targeted at configurations that want AVX-512 but don't necessarily have space for the server-grade part, and/or don't need as many cores.)

@fgvanzee
Member

fgvanzee commented Oct 7, 2019

@devinamatthews Could you give your stamp of approval to this patch? @dnparikh seems to recall there being an issue of 1 VPU vs 2 VPUs, but I don't have any memory of this one way or another.

@devinamatthews
Member

Comments on #351

@devinamatthews
Member

> up until now we knew all desktop Skylakes to be sans AVX-512

Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake!

@fgvanzee
Member

fgvanzee commented Oct 7, 2019

> Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake!

Ugh. I know nothing, then.

Thanks for your comments, Devin.

@mratsim

mratsim commented Dec 30, 2019

My Skylake-X i9-9980XE is also misidentified as Haswell.

@devinamatthews
Member

@mratsim can you send the output of /proc/cpuinfo or the equivalent on your platform?

@mratsim

mratsim commented Dec 30, 2019

Here you go:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz
stepping	: 4
microcode	: 0x2000043
cpu MHz		: 1406.852
cache size	: 25344 KB
physical id	: 0
siblings	: 36
core id		: 0
cpu cores	: 18
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips	: 6002.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

The full output is at https://gist.github.com/mratsim/419062e11ee1f66daa62c7fe4c13dc5d

@mratsim

mratsim commented Dec 30, 2019

If that helps, for my own BLAS implementation purposes (see pytorch/pytorch#26534 (comment)), I only test whether the CPU implements AVX512F (with the CPUID instruction: leaf 7 -> EBX -> bit 16 set); see my CPU detection code.
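
A minimal sketch of that check in C (assuming GCC/Clang's <cpuid.h> helper; this is only an illustration, not my actual detection code):

```c
#include <cpuid.h>    /* GCC/Clang helper for the CPUID instruction */
#include <stdbool.h>
#include <stdio.h>

/* True if CPUID leaf 7, sub-leaf 0 reports AVX512F in EBX bit 16. */
static bool has_avx512f(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;                     /* leaf 7 not supported */
    return (ebx >> 16) & 1u;
}

int main(void)
{
    printf("AVX512F: %s\n", has_avx512f() ? "yes" : "no");
    return 0;
}
```

(A fully robust check would also confirm OS support for the AVX-512 state via XGETBV/XCR0; the snippet above only mirrors the leaf-7 bit test described here.)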

@jeffhammond
Member

@mratsim Can you run https://github.com/jeffhammond/vpu-count to see if that detects AVX-512 2x FMA correctly?


@jeffhammond
Member

@mratsim For context, the reason BLIS detects your CPU as Haswell is that it incorrectly thinks it has only 1 FMA unit. The handful of SKX processors that have only 1 FMA should be treated as Haswell by BLIS, for reasons discussed in #351.

@mratsim

mratsim commented Dec 31, 2019

That's a good point. In my own code I assume that anyone who cares about numerical computing and got an AVX-512 CPU was informed enough to get one with 2 FMA units.

Here is the output of the test.x script:

$  ./test.x 
0x0: 16,756e6547,6c65746e,49656e69
Intel? yes
0x1: 50654,16400800,7ffefbbf,bfebfbff
signature:  0x050654
model:      0x55=85
family:     0x06=6
ext model:  0x05=5
Skylake server? yes
0x7: 0,d39ffffb,0,c000000
Skylake AVX-512 detected
cpu_name = Intel(R) Core(TM) i9-9980XE CPU 
cpu_name[9] = C
cpu_name[17] =  
CPU has 2 AVX-512 VPUs

The empirical script, however, sometimes gives me 1 and sometimes 2, but my CPU is overclocked (4.1 GHz all-core turbo, 4.0 GHz all-core AVX turbo, 3.5 GHz all-core AVX-512 turbo; see perf profile), so overclocking plus a non-performance CPU governor might throw off the script.

@loveshack
Contributor Author

loveshack commented Dec 31, 2019 via email

@jeffhammond
Member

jeffhammond commented Dec 31, 2019

@loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX. See SkylakeX on the Performance Wiki page for details.

Anyone can trivially fix the problem in this issue by setting the configuration name explicitly, which is how I just built an SKX binary on my HSW workstation. Using the build system options effectively is much easier than switching BLAS libraries.

./configure skx

I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license. I am not going to ask for a license clarification on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise.

@jeffhammond
Member

@mratsim You are right that people buy the 2 FMA parts when they are building HPC systems, but there are a lot of academics and software developers at small firms who buy the low-end server CPUs in their workstations. I too was surprised but my activities on this front were motivated by reports of strange DGEMM performance with Skylake Xeon 4xxx SKUs on GitHub.

@loveshack
Contributor Author

loveshack commented Jan 2, 2020 via email

@jeffhammond
Member

> > @loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX.
>
> Aggressively addressing something different when confronted with uncomfortable facts is the sort of tactic I expect from disreputable politicians. BLIS certainly knows how to drive off outside contributors.

Please note that I am an outside contributor.

> I have actually compared development BLIS and OpenBLAS -- believe it or not -- rather than talking through my hat. (The releases both use avx2 on my W-series box, but OpenBLAS does rather better.) The OpenBLAS author also claims to be able to outperform MKL on avx2.

Please post the data.

> > Anyone can trivially fix the problem in this issue by setting the configuration name explicitly. Using the build system options effectively is much easier than switching BLAS libraries.
>
> First you have to understand the undocumented issue, then on heterogeneous systems you need to build N copies of the library and ensure they're correctly used at run time. I see that sort of mess, and the consequences. [Switching ELF dynamically-linked BLAS is trivial, and is supported by my packaging.]

What is not documented? Are you suggesting that the auto and skx configuration options are not properly documented?

> > I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license.
>
> https://github.com/jeffhammond/vpu-count/blob/master/empirical.c is not MIT licensed according to the header. It also won't compile with GCC.

True, but I am not telling anyone to use it, so why does it matter?

> > I am not going to ask for a license change on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise.
>
> If it's junk it would be helpful to warn potential users.

It is not junk. It just isn't recommended for most users.

https://github.com/jeffhammond/vpu-count/blob/master/README.md#usage has been modified to address this.

> The Gromacs version appears more robust; it uses a different timer. Anyhow, all bets are off for performance under such conditions, and it would only be a fallback if you're going to default to assuming one FMA unit.

As I've said many times in the past, the default should be 2 FMA units on server platforms. The server parts with 1 FMA unit are the exception.

> Since you raise variance, note that I'm entitled to ignore measurements without error bars, like the published BLIS ones.

This comment is not made in good faith and has been ignored.

@mratsim

mratsim commented Jan 3, 2020

> For what it's worth, it looks as if OpenBLAS now has decent (but unreleased) skylakex support, so if you just want a good BLAS, it will probably be the best option.

Since we are veering off-topic (though the original problem report was yours, and it's understood, with a potential fix underway), allow me to expand on my use of BLAS libraries.

I interact with BLAS / BLIS wearing 3 different hats.

As a regular user

I focus on data science workloads; while much of the compute-intensive work is offloaded to the GPU, there are still many cases where a CPU BLAS is needed, for example Principal Component Analysis.

As I have an Intel CPU, linking to MKL gives me the best performance. I recompiled the latest OpenBLAS from source and it gave me 2.75 TFlops on my machine, MKL reached 3.37 TFlops, and the theoretical peak is 4.03 TFlops (at the 3.5 GHz all-core AVX-512 turbo).
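
As a sanity check on that peak figure, here is the arithmetic spelled out (a rough sketch; it assumes 18 cores, the 3.5 GHz all-core AVX-512 turbo, 2 FMA units per core, and 16 fp32 lanes per 512-bit FMA):

```c
#include <stdio.h>

int main(void)
{
    const double cores      = 18;    /* i9-9980XE core count (from cpuinfo above) */
    const double ghz        = 3.5;   /* all-core AVX-512 turbo */
    const double fma_units  = 2;     /* VPUs per core, as detected by test.x */
    const double lanes_fp32 = 16;    /* a 512-bit register holds 16 floats */
    const double flops_fma  = 2;     /* one FMA counts as multiply + add */

    /* 18 * 3.5 * 2 * 16 * 2 = 4032 GFlops, i.e. ~4.03 TFlops single precision. */
    printf("fp32 peak: %.0f GFlops\n",
           cores * ghz * fma_units * lanes_fp32 * flops_fma);
    return 0;
}
```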

There is one library that requires BLIS and is the industry standard in Natural Language Processing: spaCy (https://github.com/explosion/spaCy). The reason is the flexibility in strides that other BLAS libraries don't provide; see https://github.com/explosion/cython-blis.
spaCy is at 15k GitHub stars and is used everywhere in NLP. I'd like to see BLIS correctly detect my CPU so that NLP workloads, which are becoming huge (e.g. Wikipedia is in the terabytes, although it's usually processed on GPU), use the full extent of my CPU.
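
To make the stride point concrete, here is a rough sketch against BLIS's typed API, which takes an explicit row stride and column stride for every operand; it is only an illustration of the interface, not code from spaCy or cython-blis:

```c
#include <stdio.h>
#include "blis.h"

int main(void)
{
    /* A 4x8 row-major buffer; we multiply using only its even columns,
       i.e. a strided 4x4 view, without copying or packing it ourselves. */
    float abuf[4*8], b[4*4], c[4*4];
    for (int i = 0; i < 4*8; i++) abuf[i] = (float)i;
    for (int i = 0; i < 4*4; i++) { b[i] = 1.0f; c[i] = 0.0f; }

    float alpha = 1.0f, beta = 0.0f;
    dim_t m = 4, n = 4, k = 4;

    /* Each operand is (pointer, row stride, column stride):
       the A view has row stride 8 (buffer width) and column stride 2. */
    bli_sgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
               &alpha, abuf, 8, 2,
                       b,    4, 1,   /* plain row-major 4x4 */
               &beta,  c,    4, 1 );

    printf("c[0][0] = %f\n", (double)c[0]);
    return 0;
}
```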

As a user-facing library developer

I develop Arraymancer, a tensor library for the Nim programming language; think of it as Numpy + Sklearn + PyTorch, but scoped only to data science.
Like many other libraries, I stand on top of BLAS and users can compile in any library they desire.
I even provide a specialized BLIS backend and compilation flag that avoids making a tensor contiguous before doing a matrix multiplication. https://github.com/mratsim/Arraymancer/blob/v0.5.2/src/tensor/backend/blis_api.nim

I encountered the following difficulties with BLAS libraries. Note that many of these issues are not in the hands of BLAS developers, but ultimately, as the user-facing library dev, it's me who has to deal with them:

  • Inconsistent deployment:
    • On Windows I never managed to get LAPACK working in my CI.
    • On Archlinux the default OpenBLAS is mispackaged (mixed with netlib cblas) and gives wrong results for float64: https://bugs.archlinux.org/task/63054
    • Distributions sometimes provide the cblas API in blas.so and sometimes in cblas.so, making it a pain to work with. I am not even sure how to deal with the naming on Windows.
  • Reusing the Principal Component Analysis example, OpenBLAS is 2x slower than MKL here: Randomized pca [Ready] mratsim/Arraymancer#384 (comment) (though 40% of that can probably be attributed to AVX512). It is implemented with a mix of gemm, gesdd, getrf, laswp, triu, tril, geqrf and orgqrf (I tried ormqr but perf is abysmal for all BLAS). While gemm is very important, good primitives across the board are also important, and MKL delivers.
  • BLAS libraries are not composable when they use OpenMP.
    For many deep-learning cases, you want to do batched matrix multiplication (within a convolution for example) or call a BLAS from an already parallel region. Furthermore, a deep-learning library does not know the problem size: sometimes parallelizing over the batch is enough because you have 256 1000x1000 matrices which would saturate any CPU, and sometimes you have 3x1980x720 and you need the inner parallelism. This led to Julia developers adding hooks to OpenBLAS for a pluggable threading backend: WIP: allow threading backend to be replaced by caller OpenMathLib/OpenBLAS#2255 (and FFTW: partr threads backend FFTW/fftw3#175)
  • MKL with TBB is abysmally slow. MKL+TBB is at 1 TFlops on my machine, versus 2.8 TFlops with GNU OpenMP and 3.1 TFlops (rarely up to 3.3) with Intel OpenMP. This is float32 flops. TBB was supposed to bring composability, etc., but it's unusable.
  • MKL uses SSE on AMD CPUs and has no ARM support.
  • Besides BLIS, none supports strided matrices, which are very common in machine learning as you slice tensors and matrices very often (and the reason spaCy went full BLIS).
  • All BLAS are slow with small matrices (besides libxsmm), which are quite common in deep learning as you convolve over small patches of images or text.
  • Besides MKL-DNN, none supports a custom epilogue to apply in-place updates like deep-learning activations (ReLU, sigmoid, tanh), see: MKLDNN+AMD BLIS path for PyTorch pytorch/pytorch#26534 (comment)
  • Besides MKL, none has a batched or prepacked API.

As a low-level linear algebra building blocks developer

All of the composability and deployment woes led me to 2 things:

1. Developing my own BLAS

The goal behind developing my own BLAS is to understand the intricacies of these libraries.

Like many others, I am using the BLIS approach instead of the GotoBLAS/OpenBLAS approach, for its ease of use.

The performance is also there.
On my CPU, my own BLAS reaches 2.7~2.8 TFlops, similar to Intel MKL+GNU OpenMP or OpenBLAS. There is a caveat though: this is only with GCC OpenMP; with Clang/ICC I only reach 2.1 TFlops, probably because the libkomp underneath doesn't properly support #pragma omp taskloop, which I found necessary to parallelize both the ic and jr loops. I couldn't find the trick that OpenBLAS / BLIS use to parallelize multiple loops.
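
For reference, a toy sketch of the nested taskloop pattern I mean (parallelizing an outer ic loop and an inner jr loop; the names and block counts are illustrative, not the real kernel):

```c
#include <stdio.h>
#include <omp.h>

enum { N_IC = 8, N_JR = 16 };              /* illustrative block counts */

static void microkernel(int ic, int jr)    /* stand-in for the real work */
{
    (void)ic; (void)jr;
}

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Tasks over the ic blocks... */
        #pragma omp taskloop
        for (int ic = 0; ic < N_IC; ic++) {
            /* ...and nested tasks over the jr micro-panels of each block. */
            #pragma omp taskloop
            for (int jr = 0; jr < N_JR; jr++)
                microkernel(ic, jr);
        }
    }
    printf("done\n");
    return 0;
}
```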

In short, even if BLIS usage is lower than OpenBLAS or MKL, it is the leading learning platform and introduction to high-performance computing.

I could also extend BLIS to prepacked GEMM with minimal effort.

2. Replacing OpenMP

To tackle composability issues, vectorization, optimization, and also autodifferentiation, I started to write my own linear algebra and deep learning compiler, implemented as an embedded DSL in Nim, so that it can be seamlessly used in user code. However, I quickly hit the limits of OpenMP again.

As I think the limits of OpenMP are fundamental, and also given the bad state of some features in one runtime or the other (no work-stealing in GCC, no taskloop in Clang/ICC), I developed my own multithreading runtime from scratch, Weave, with the goal of being the backend of my high-performance libraries.

The runtime is now pretty solid; I ported my own BLAS to it and can reach 2.65 TFlops with nested parallelism. There is overhead in work-stealing that GCC OpenMP doesn't have, but in contrast I don't suffer from the load-imbalance, threshold, or grain-size issues that are plaguing PyTorch: https://github.com/zy97140/omp-benchmark-for-pytorch.

So now I want to benchmark that runtime against the BLIS approach to parallelization, to check what state-of-the-art speedup (time serial / time parallel) the BLIS approach brings beyond my core kernel. As a comparison, the Intel MKL + Intel OpenMP speedup is 15~16x and OpenBLAS is 14x, while my runtime is at 15~15.5x (if I allow workers to back off when they can't steal work) or 16.9x (if I don't allow them to back off).

Summary

  1. As a user, BLIS is not a matter of choice for NLP (spaCy requires it); otherwise I just use MKL.
  2. As a tensor library dev, I am BLAS-agnostic and also use the specialized bli_gemm API. However, dealing with distribution, paths, and parallel composition is painful.
  3. As a multithreading runtime dev, I want to benchmark my parallel speedup against BLIS's parallel speedup, as I'm using the BLIS approach and want to make sure the speedups are comparable.

@devinamatthews
Member

@mratsim w.r.t. threading I have been meaning for some time to port BLIS to my TCI threading library that I use in TBLIS. This library can use either thread-based (OpenMP, pthreads) or task-based (TBB, GCD, PPL) back-ends. On Haswell TBLIS+TBB can beat MKL+TBB quite handily (perf. only a few % lower than OpenMP), although KNL had some teething issues. Haven't tested SKX but I would be hopeful. I would also be interested in seeing if Weave is something that could be used as a back-end in TCI.

w.r.t. the rest, I really don't see BLIS as a BLAS library (role 1)--MKL is free, so why the heck wouldn't people use that? What is unique about BLIS is a) for library developers (role 2) you get great interface extensions like general stride and now mixed-domain and mixed-precision, 1m, and more to come in the future, and b) for low-level developers (role 3) you get a nice toolbox of linear algebra pieces to build new operations with (this isn't the easiest thing right now, we are working on making this much more powerful in the future). For example, I don't really care at all about GEMM; I care about tensor contraction, row- and column-scaled multiplication, three-matrix SYRK-like multiplication, GEMM-then-cwise-dot, etc. that don't even have standard interfaces or existing high-performance implementations.

@rvdg
Collaborator

rvdg commented Jan 3, 2020 via email

@devinamatthews
Member

@mratsim is your CPU still misidentified? If so please send the full output of configure.

@mratsim

mratsim commented Aug 13, 2020

Both test.x and empirical.x properly detect 2 VPUs (as of commit jeffhammond/vpu-count@b20db6d)

@devinamatthews
Member

@mratsim The code in BLIS is slightly different from @jeffhammond's code. Can you test with BLIS? Configuring with configure auto should show that it selects the skx2 sub-configuration.
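
One way to double-check what a built copy of BLIS actually selected at run time is a tiny query program (a sketch; it assumes the bli_arch_query_id()/bli_arch_string() helpers present in recent BLIS versions):

```c
#include <stdio.h>
#include "blis.h"

int main(void)
{
    /* Ask BLIS which sub-configuration its hardware detection chose. */
    arch_t id = bli_arch_query_id();
    printf("BLIS selected sub-configuration: %s\n", bli_arch_string(id));
    return 0;
}
```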

@mratsim

mratsim commented Aug 14, 2020

As of commit 9c5b485 this is my ./configure auto output

configure: detected Linux kernel version 5.7.12-arch1-1.
configure: python interpeter search list is: python python3 python2.
configure: using 'python' python interpreter.
configure: found python version 3.8.5 (maj: 3, min: 8, rev: 5).
configure: python 3.8.5 appears to be supported.
configure: C compiler search list is: gcc clang cc.
configure: using 'gcc' C compiler.
configure: C++ compiler search list is: g++ clang++ c++.
configure: using 'g++' C++ compiler (for sandbox only).
configure: found gcc version 10.1.0 (maj: 10, min: 1, rev: 0).
configure: checking for blacklisted configurations due to gcc 10.1.0.
configure: checking gcc 10.1.0 against known consequential version ranges.
configure: found assembler ('as') version 2.34.0 (maj: 2, min: 34, rev: 0).
configure: checking for blacklisted configurations due to as 2.34.0.
configure: reading configuration registry...done.
configure: determining default version string.
configure: found '.git' directory; assuming git clone.
configure: executing: git describe --tags.
configure: got back 0.7.0-38-g9c5b485d.
configure: truncating to 0.7.0-38.
configure: starting configuration of BLIS 0.7.0-38.
configure: configuring with official version string.
configure: found shared library .so version '3.0.0'.
configure:   .so major version: 3
configure:   .so minor.build version: 0.0
configure: automatic configuration requested.
configure: hardware detection driver returned 'skx'.
configure: checking configuration against contents of 'config_registry'.
configure: configuration 'skx' is registered.
configure: 'skx' is defined as having the following sub-configurations:
configure:    skx
configure: which collectively require the following kernels:
configure:    skx haswell zen
configure: checking sub-configurations:
configure:   'skx' is registered...and exists.
configure: checking sub-configurations' requisite kernels:
configure:   'skx' kernels...exist.
configure:   'haswell' kernels...exist.
configure:   'zen' kernels...exist.
configure: no install prefix option given; defaulting to '/usr/local'.
configure: no install exec_prefix option given; defaulting to PREFIX.
configure: no install libdir option given; defaulting to EXECPREFIX/lib.
configure: no install includedir option given; defaulting to PREFIX/include.
configure: no install sharedir option given; defaulting to PREFIX/share.
configure: final installation directories:
configure:   prefix:      /usr/local
configure:   exec_prefix: ${prefix}
configure:   libdir:      ${exec_prefix}/lib
configure:   includedir:  ${prefix}/include
configure:   sharedir:    ${prefix}/share
configure: NOTE: the variables above can be overridden when running make.
configure: no preset CFLAGS detected.
configure: no preset LDFLAGS detected.
configure: debug symbols disabled.
configure: disabling verbose make output. (enable with 'make V=1'.)
configure: disabling ARG_MAX hack.
configure: building BLIS as both static and shared libraries.
configure: exporting only public symbols within shared library.
configure: threading is disabled.
configure: requesting slab threading in jr and ir loops.
configure: internal memory pools for packing blocks are enabled.
configure: internal memory pools for small blocks are enabled.
configure: memory tracing output is disabled.
configure: libmemkind not found; disabling.
configure: compiler appears to support #pragma omp simd.
configure: the BLAS compatibility layer is enabled.
configure: the CBLAS compatibility layer is disabled.
configure: mixed datatype support is enabled.
configure: mixed datatype optimizations requiring extra memory are enabled.
configure: small matrix handling is enabled.
configure: the BLIS API integer size is automatically determined.
configure: the BLAS/CBLAS API integer size is 32-bit.
configure: configuring for conventional gemm implementation.
configure: creating ./config.mk from ./build/config.mk.in
configure: creating ./bli_config.h from ./build/bli_config.h.in
configure: creating ./obj/skx
configure: creating ./obj/skx/config/skx
configure: creating ./obj/skx/kernels/skx
configure: creating ./obj/skx/kernels/haswell
configure: creating ./obj/skx/kernels/zen
configure: creating ./obj/skx/ref_kernels/skx
configure: creating ./obj/skx/frame
configure: creating ./obj/skx/blastest
configure: creating ./obj/skx/testsuite
configure: creating ./lib/skx
configure: creating ./include/skx
configure: mirroring ./config/skx to ./obj/skx/config/skx
configure: mirroring ./kernels/skx to ./obj/skx/kernels/skx
configure: mirroring ./kernels/haswell to ./obj/skx/kernels/haswell
configure: mirroring ./kernels/zen to ./obj/skx/kernels/zen
configure: mirroring ./ref_kernels to ./obj/skx/ref_kernels
configure: mirroring ./ref_kernels to ./obj/skx/ref_kernels/skx
configure: mirroring ./frame to ./obj/skx/frame
configure: creating makefile fragments in ./obj/skx/config/skx
configure: creating makefile fragments in ./obj/skx/kernels/skx
configure: creating makefile fragments in ./obj/skx/kernels/haswell
configure: creating makefile fragments in ./obj/skx/kernels/zen
configure: creating makefile fragments in ./obj/skx/ref_kernels
configure: creating makefile fragments in ./obj/skx/frame
configure: configured to build within top-level directory of source distribution.

@mratsim

mratsim commented Aug 14, 2020

When grep-ing in the repo, I don't see skx2 anywhere. What branch should I use?
