Workstation SKX is mis-identified #352
Comments
@loveshack Thanks for your contribution, Dave. I was not even aware of this W line of parts. (I was confused at first because "desktop SKX" seemed contradictory; up until now we knew all desktop Skylakes to be sans AVX-512. But it seems the W is for workstation, which makes sense in that it's targeted at configurations that want AVX-512 but don't necessarily have space for the server-grade part, and/or don't need as many cores.) |
@devinamatthews Could you give your stamp of approval to this patch? @dnparikh seems to recall there being an issue of 1 VPU vs 2 VPUs, but I don't have any memory of this one way or another. |
Comments on #351 |
Except for Skylake-X i7 and i9! And Cannon Lake! And Cascade Lake! |
Ugh. I know nothing, then. Thanks for your comments, Devin. |
My Skylake-X i9-9980XE is also misidentified as Haswell. |
@mratsim can you send the output of |
Here you go:
The full output is at https://gist.github.com/mratsim/419062e11ee1f66daa62c7fe4c13dc5d |
In case it helps, for my own BLAS implementation purposes (see pytorch/pytorch#26534 (comment)), I only test whether the CPU implements AVX512F (CPUID instruction, leaf 7 -> EBX -> bit 16 set); see my CPU detection code |
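For reference, a minimal C sketch of the check described above (CPUID leaf 7, subleaf 0, EBX bit 16), using GCC/Clang's `<cpuid.h>` builtins. This is only an illustration, not the detection code linked above (which is in Nim), and it omits the XGETBV/OS-support check that a robust detector would also perform:

```c
#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

// AVX512F is reported in CPUID leaf 7, subleaf 0, EBX bit 16.
static bool has_avx512f(void)
{
    unsigned int eax, ebx, ecx, edx;
    // __get_cpuid_count returns 0 if leaf 7 is not supported.
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 16) & 1;
}

int main(void)
{
    printf("AVX512F: %s\n", has_avx512f() ? "yes" : "no");
    return 0;
}
```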
@mratsim Can you run https://github.com/jeffhammond/vpu-count to see if that detects AVX-512 2x FMA correctly? |
@fgvanzee @devinamatthews @loveshack I have to wonder if addressing my comment here (https://github.com/flame/blis/pull/351/files/94b34d38f6dffc074a4f12a1936c0ddba51f47ee..5597169dda1c4cb0c447cb589f8d5c2a5418a259#diff-81ef49aa7330af78381263dcf0acbea8) would solve this problem. |
That's a good point; in my own code I assumed that everyone who cares about numerical computing and got an AVX-512 CPU was informed enough to get one with 2 FMA units. Here is the output of the test.x script:
The empirical script, however, sometimes gives me 1 and sometimes gives me 2, but my CPU is overclocked (4.1 GHz all-core turbo, 4.0 GHz all-core AVX turbo, 3.5 GHz all-core AVX-512 turbo; see perf profile), so the overclocking plus a non-performance CPU governor might throw off the script. |
My Skylake-X i9-9980XE is also misidentified as Haswell.
That should be fixed by the changes I submitted, which don't seem to be
wanted. (If the CPU isn't recognized, the cpuid code might take a
fraction of a second to run measurement code, but that needs a
suitably-licensed implementation; Gromacs' version has GCC-compatible syntax, but is
GPL. Otherwise, it's probably best just to default to the normal case
of 2×FMA.)
For what it's worth, it looks as if OpenBLAS now has decent (but
unreleased) skylakex support, so if you just want a good BLAS, it will
probably be the best option.
|
@loveshack Your suggestion of OpenBLAS here is total garbage. Unless there has been a rewrite for SKX, it's nowhere near as fast on SKX. See SkylakeX on the Performance wiki page for details. Anyone can trivially fix the problem in this issue by setting the configuration name explicitly, which is how I just built a SKX binary on my HSW workstation: `./configure skx`. Using the build system options effectively is much easier than switching BLAS libraries. I don't know what you are talking about with the licensing issues. vpu-count.c uses the MIT license. I am not going to ask for a license clarification on empirical.c because that method is worse in every way that matters to us. It is shown here once again giving wrong answers to @mratsim due to system noise. |
@mratsim You are right that people buy the 2 FMA parts when they are building HPC systems, but there are a lot of academics and software developers at small firms who buy the low-end server CPUs in their workstations. I too was surprised, but my activities on this front were motivated by reports of strange DGEMM performance with Skylake Xeon 4xxx SKUs on GitHub. |
@loveshack Your suggestion of OpenBLAS here is total garbage. Unless
there has been a rewrite for SKX, it's nowhere near as fast on SKX.
Aggressively addressing something different when confronted with
uncomfortable facts is the sort of tactic I expect from disreputable
politicians. BLIS certainly knows how to drive off outside
contributors.
I have actually compared development BLIS and OpenBLAS -- believe it or
not -- rather than talking through my hat. (The releases both use avx2
on my W-series box, but OpenBLAS does rather better.) The OpenBLAS
author also claims to be able to outperform MKL on avx2.
Anyone can *trivially* fix the problem in this issue by setting the
configuration name explicitly. Using the build system options
effectively is much easier than switching BLAS libraries.
First you have to understand the undocumented issue, then on
heterogeneous systems you need to build N copies of the library and
ensure they're correctly used at run time. I see that sort of mess, and
the consequences. [Switching ELF dynamically-linked BLAS is trivial,
and is supported by my packaging.]
On the other hand, I contributed code to fix this issue in all the cases
I could find, to diagnose it, and to override the micro-arch selection
dynamically.
I don't know what you are talking about with the licensing issues.
[vpu-count.c](https://github.com/jeffhammond/vpu-count/blob/master/vpu-count.c)
uses the MIT license.
https://github.com/jeffhammond/vpu-count/blob/master/empirical.c is not
MIT licensed according to the header. It also won't compile with GCC.
I am not going to ask for a license change on
[empirical.c](https://github.com/jeffhammond/vpu-count/blob/master/empirical.c)
because that method is worse in every way that matters to us. It is
shown here once again giving wrong answers to @mratsim due to system
noise.
If it's junk it would be helpful to warn potential users. The Gromacs
version appears more robust; it uses a different timer. Anyhow, all
bets are off for performance under such conditions, and it would only be
a fallback if you're going to default to assuming one FMA unit.
Since you raise variance, note that I'm entitled to ignore measurements
without error bars, like the published BLIS ones.
|
Please note that I am an outside contributor.
Please post the data.
What is not documented? Are you suggesting that the
True, but I am not telling anyone to use it, so why does it matter?
It is not junk. It just isn't recommended for most users. https://github.com/jeffhammond/vpu-count/blob/master/README.md#usage has
As I've said many times in the past, the default should be 2 FMA units on server platforms. The server parts with 1 FMA unit are the exception.
This comment is not made in good faith and has been ignored. |
Since we are veering off-topic (but the original problem was yours, and it's understood, with a potential fix underway), allow me to expand on my use of BLAS libraries. I interact with BLAS / BLIS wearing 3 different hats.

As a regular user

I focus on data science workloads. While a lot of the compute-intensive work is offloaded to the GPU, there are still many cases where a CPU BLAS is needed, for example Principal Component Analysis. As I have an Intel CPU, linking to MKL gives me the best performance. I recompiled the latest OpenBLAS from source and it gave me 2.75 TFlops on my machine, MKL reached 3.37 TFlops, and the theoretical peak is 4.032 TFlops (3.5 GHz all-core AVX-512 turbo). There is one library, the industry standard in Natural Language Processing, that requires BLIS: Spacy / https://github.com/explosion/spaCy. The reason is the flexibility in strides that other BLAS libraries don't provide, see https://github.com/explosion/cython-blis.

As a user-facing library developer

I develop Arraymancer, a tensor library for the Nim programming language; think of it as Numpy + Sklearn + PyTorch, but scoped to data science only. I encountered the following difficulties with BLAS libraries. Note that many of these issues are not in the hands of BLAS developers, but ultimately, as the user-facing library dev, it's me who has to deal with them:
As a low-level linear algebra building blocks developer

All of the composability and deployment woes led me to 2 things:

1. Developing my own BLAS

The goal behind developing my own BLAS is to understand the intricacies of such libraries. Like many others, I am using the BLIS approach instead of the GotoBLAS/OpenBLAS approach for its ease of use:
The performance is also there. In short, even if BLIS usage is lower than OpenBLAS or MKL, it is the leading learning platform and introduction to high-performance computing. I could also extend BLIS to prepacked GEMM with minimal effort.

2. Replacing OpenMP

To tackle composability issues, vectorization, optimization, and also autodifferentiation, I started to write my own linear algebra and deep learning compiler, implemented as an embedded DSL in Nim, so that it can be seamlessly used in user code. However, I quickly hit the limits of OpenMP again. As I think OpenMP's limits are fundamental, and also given the bad state of some features in one runtime or the other (no work-stealing in GCC, no taskloop in Clang/ICC), I developed my own multithreading runtime from scratch, Weave, with the goal of being the backend for my high-performance libraries. The runtime is now pretty solid; I ported my own BLAS to it and can reach 2.65 TFlops with nested parallelism. There is overhead in work-stealing that GCC OpenMP doesn't have, but in contrast I don't suffer from the load-imbalance, threshold, or grain-size issues that are plaguing PyTorch: https://github.com/zy97140/omp-benchmark-for-pytorch. So now I want to benchmark that runtime against the BLIS approach to parallelization to check, beyond my core kernel, what state-of-the-art speedup (time parallel / time serial) the BLIS approach brings. As a comparison, Intel MKL + Intel OpenMP speedup is 15~16x, OpenBLAS is 14x, while my runtime is at 15~15.5x (if I allow workers to back off when they can't steal work) or 16.9x (if I don't allow them to back off).

Summary
|
@mratsim w.r.t. threading I have been meaning for some time to port BLIS to my TCI threading library that I use in TBLIS. This library can use either thread-based (OpenMP, pthreads) or task-based (TBB, GCD, PPL) back-ends. On Haswell TBLIS+TBB can beat MKL+TBB quite handily (perf. only a few % lower than OpenMP), although KNL had some teething issues. Haven't tested SKX but I would be hopeful. I would also be interested in seeing if Weave is something that could be used as a back-end in TCI. w.r.t. the rest, I really don't see BLIS as a BLAS library (role 1)--MKL is free, so why the heck wouldn't people use that? What is unique about BLIS is a) for library developers (role 2) you get great interface extensions like general stride and now mixed-domain and mixed-precision, 1m, and more to come in the future, and b) for low-level developers (role 3) you get a nice toolbox of linear algebra pieces to build new operations with (this isn't the easiest thing right now, we are working on making this much more powerful in the future). For example, I don't really care at all about GEMM; I care about tensor contraction, row- and column-scaled multiplication, three-matrix SYRK-like multiplication, GEMM-then-cwise-dot, etc. that don't even have standard interfaces or existing high-performance implementations. |
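As an illustration of the general-stride interface mentioned in the two comments above, here is a minimal sketch against BLIS's documented typed API (`bli_dgemm`, `bli_dprintm`), where every matrix is passed with an explicit (row stride, column stride) pair instead of a single leading dimension. This is a sketch, not code from this thread:

```c
#include "blis.h"

int main(void)
{
    // 2x2 double matrices, row-major: row stride 2, column stride 1.
    // Any (rs, cs) pair is accepted, which is the general-stride
    // flexibility that cython-blis relies on.
    double a[4] = { 1.0, 2.0,
                    3.0, 4.0 };
    double b[4] = { 1.0, 0.0,
                    0.0, 1.0 };
    double c[4] = { 0.0, 0.0,
                    0.0, 0.0 };
    double alpha = 1.0, beta = 0.0;

    // c := alpha * a * b + beta * c
    bli_dgemm( BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE,
               2, 2, 2,
               &alpha,
               a, 2, 1,
               b, 2, 1,
               &beta,
               c, 2, 1 );

    // Print c using the same (rs, cs) convention.
    bli_dprintm( "c:", 2, 2, c, 2, 1, "%5.1f", "" );
    return 0;
}
```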
Folks,
Let me second that. With BLIS, we have always been willing to give up 5% for flexibility, maintainability, and extendability. We are always delighted when people take building blocks or ideas from BLIS and build their own. We don’t feel threatened when others opt for other solutions or roll their own. One of our greatest delights comes from people realizing that providing BLAS-like functionality is not just for experts. Indeed, we have a MOOC for that (which will start again on Jan. 15: https://www.edx.org/course/laff-on-programming-for-high-performance). Competition is a wonderful thing.
And on quite a few occasions, BLIS is the fastest, as a bonus.
Have a BLISful New Year
Robert
|
@mratsim is your CPU still misidentified? If so please send the full output of configure. |
Both test.x and empirical.x properly detect 2 VPUs (as of commit jeffhammond/vpu-count@b20db6d) |
@mratsim The code in BLIS is slightly different from @jeffhammond's code. Can you test with BLIS? Configuring with |
As of commit 9c5b485 this is my
|
When grepping in the repo, I don't see skx2 anywhere. What branch should I use? |
I have a desktop SKX, model W-2123, which the cpuid code identifies as haswell (obvious with `./configure auto`). It turns out that it doesn't report avx512 vpus due to not parsing the model name. I fixed it with #351.
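For context, model-name-based detection reads the 48-byte CPUID brand string (leaves 0x80000002..0x80000004). A rough C sketch is below; the " W-" substring check is illustrative only, not the exact logic in #351 or vpu-count:

```c
#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int regs[12];
    char brand[49] = { 0 };

    // Each of the three extended leaves returns 16 bytes of the brand
    // string in EAX/EBX/ECX/EDX.
    for (unsigned int i = 0; i < 3; i++)
        __get_cpuid(0x80000002 + i,
                    &regs[4*i], &regs[4*i+1], &regs[4*i+2], &regs[4*i+3]);

    memcpy(brand, regs, 48);
    printf("brand: %s\n", brand);

    // Workstation Skylake-X parts carry "W-" in the brand string,
    // e.g. "Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz".
    if (strstr(brand, " W-") != NULL)
        printf("looks like a Xeon W (workstation SKX) part\n");
    return 0;
}
```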