Fix parsing in vpu_count on workstation SKX #351
Since all Xeon W have 2 VPUs, we should probably just check the "W" part and return 2. The current code doesn't handle new processors W-32XX, W-22XX, and W-3175X.
> Since all Xeon W have 2 VPUs,
Not according to
https://ark.intel.com/content/www/us/en/ark/products/125038/intel-xeon-w-2102-processor-8-25m-cache-2-90-ghz.html
See also
https://github.com/jeffhammond/vpu-count/blob/master/vpu-count.c, but
that's broken under Linux 4.19, at least.
I now realize both BLIS' and Jeff's code fail according to
https://ark.intel.com/content/www/us/en/ark/products/codename/37572/skylake.html
e.g. i9 and i7 with 2 FMA.
The logic clearly needs revisiting. Is there a more systematic approach
than going through the product list (which doesn't have W210x)? Also
Wikipedia notes that Gold 5122 has two units, not the one that would be
identified currently.
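To make the model-number question concrete, here is a minimal sketch of pulling the numeric model out of a Xeon W brand string and mapping it to a VPU count. The function names are hypothetical and this is not the code in BLIS or in vpu-count.c. It assumes, per the ARK page linked above and the follow-up in this thread, that W-2102 and W-2104 are the single-unit exceptions, and it assumes two units for every other W model, which is a guess for future parts rather than a documented fact.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the numeric part of a Xeon W model
   (e.g. 2123 for "W-2123", 3175 for "W-3175X"), or -1 if the brand
   string has no "Xeon(R) W-<digits>" token. */
static int example_xeon_w_model(const char *brand)
{
    static const char prefix[] = "Xeon(R) W-";
    const char *p = strstr(brand, prefix);
    int model;

    if (p == NULL)
        return -1;
    p += strlen(prefix);
    if (sscanf(p, "%d", &model) != 1)
        return -1;
    return model;
}

/* Hypothetical policy: W-2102 and W-2104 are treated as single-VPU;
   every other Xeon W is ASSUMED to have two. Returns -1 when the
   string is not a Xeon W at all, so the caller can try other rules. */
static int example_xeon_w_vpu_count(const char *brand)
{
    int model = example_xeon_w_model(brand);

    if (model < 0)
        return -1;
    if (model == 2102 || model == 2104)
        return 1;
    return 2;
}
```

A table of exceptions like this still has to be maintained by hand, which is the problem raised above. Returning -1 for anything unrecognized at least keeps "unknown" distinct from "one unit".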
Ah, 2102 and 2104 are conveniently omitted from https://ark.intel.com/content/www/us/en/ark/products/series/125035/intel-xeon-w-processor.html. It would be great if you could fix up the logic. The only other way to detect is to run and time two loops (highly unrolled): one over fma only and one with fma+permute. If they take the same amount of time then there are 2 VPUs.
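The comparison step of that timing approach can be sketched on its own, leaving out the unrolled AVX-512 loops. The function below is hypothetical. It takes the two measured times from the caller and reports only whether they agree within a relative tolerance; the tolerance is an assumption to be tuned. How "same" or "different" maps to a VPU count should be validated on a machine whose count is documented before it is relied on.

```c
/* Hypothetical helper: compare the time of the FMA-only loop with the
   time of the FMA+permute loop.
   Returns  1 if they agree within rel_tol (relative to t_fma),
            0 if they differ by more than that,
           -1 if either timing is not positive (measurement failed). */
static int example_timings_match(double t_fma, double t_fma_permute,
                                 double rel_tol)
{
    double diff;

    if (t_fma <= 0.0 || t_fma_permute <= 0.0)
        return -1;

    diff = t_fma_permute - t_fma;
    if (diff < 0.0)
        diff = -diff;

    return (diff <= rel_tol * t_fma) ? 1 : 0;
}
```

In practice each loop would be timed several times and the minimum passed in, since a single run is easily disturbed by frequency changes or other processes.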
> Ah, 2102 and 2104 are conveniently omitted from
> https://ark.intel.com/content/www/us/en/ark/products/series/125035/intel-xeon-w-processor.html.
> It would be great if you could fix up the logic.
I'm not sure how to do it reliably without a systematic technique or at
least assuming the answer is two for future processors.
I don't know how complete Jeff's code is, but maybe it's best just to
use that if you're happy with the licence notice. I misspoke about it
not recognizing i9/i7; maybe I was looking at an old version. It does
miss the D- series, at least, but it looks as if you want those counted
as Haswell. There are also, for instance, i9s with AVX-512 but no FMA
count documented; does that mean they don't have FMA?
[If a single unit isn't useful for GEMM, I wonder what it is useful for.
I haven't researched that, and really can't keep up...]
> The only other way to
> detect is to run and time two loops (highly unrolled): one over fma
> only and one with fma+permute. If they take the same amount of time
> then there are 2 VPUs.
Is that a reasonable way to do it -- as a fallback? -- for dispatch at
run time? (The code at
https://github.com/jeffhammond/vpu-count/blob/master/empirical.c is
icc-specific, and isn't usable without a licence.)
If you want to use the empirical code, just build a standalone binary and run it during configure. Running the empirical test as part of a BLAS library is gross (which is why I created my repo in the first place).
I asked the product marketing owner to get the answer.
It's complicated. 1 FMA doesn't help GEMM because the frequency is lower with 1x512 than 2x256 (don't ask me why). AVX3 (aka AVX-512) has other uses besides GEMM where it has upside versus AVX2.
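For dispatch, this means AVX-512 support alone is not enough to pick the AVX-512 kernels. A minimal sketch of that decision follows. The enum and function names are hypothetical, not BLIS identifiers. It assumes the caller has already established that the CPU supports at least the Haswell feature set, and it treats an unknown VPU count the same as a single unit, which is a conservative choice rather than something settled in this thread.

```c
/* Hypothetical sub-configuration labels for illustration only. */
typedef enum {
    EXAMPLE_CONFIG_HASWELL,
    EXAMPLE_CONFIG_SKX
} example_config_t;

/* Choose the AVX-512 configuration only when AVX-512 is present AND
   two 512-bit FMA units were identified. Anything else (one unit, or
   an unknown count such as -1) falls back to the AVX2 configuration.
   ASSUMES the CPU is already known to support the Haswell feature set. */
static example_config_t example_choose_config(int has_avx512, int n_vpus)
{
    if (has_avx512 && n_vpus == 2)
        return EXAMPLE_CONFIG_SKX;
    return EXAMPLE_CONFIG_HASWELL;
}
```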
I guess if @jeffhammond is happy tracking down the specs for all future AVX-512 products then we can just use his version. @fgvanzee ?
> If you want to use the empirical code, just build a standalone binary
> and run it during configure. Running the empirical test as part of a
> BLAS library is gross (which is why I created my repo in the first
> place).
The code isn't distributable without a licence (or buildable without
icc). I'd be more interested in the x86_64 target than auto. I agree
that's probably not something you want to do at run time, but I wonder
whether something like that is better than returning inferior results
for suitable hardware. (I don't know how long it takes.) The repo is
obviously helpful, thanks.
Is it known what MKL does? I assume it has to do the same dance if the
hardware doesn't report the info.
> It's complicated. 1 FMA doesn't help GEMM because the frequency is
> lower with 1x512 than 2x256 (don't ask me why). AVX3 (aka AVX-512)
> has other uses besides GEMM where it has upside versus AVX2.
Yes, I realize it's complicated... I guess this isn't the place for
discussion. Anyhow, I hadn't realized a single unit ran slower, thanks.
They are both gross (albeit different magnitudes of grossness). I'd rather try to code all of the model number logic if that is just a few tweaks away from what we have already.
I agree that all BLIS needs to care about is whether AVX-512 is supported. Thanks for reminding us of this, Jeff.
Agreed. (I assume by "his version" you mean the code that is currently in BLIS?) I see two paths forward:
Am I missing anything?
@fgvanzee I meant copy https://github.com/jeffhammond/vpu-count/blob/master/vpu-count.c over periodically and adjust the interface as needed. It is MIT licensed so shouldn't be an issue.
…nment Intended particularly for diagnosing mis-selection of SKX through unknown, or incorrect, number of VPUs.
> I agree that all BLIS needs to care about is whether AVX-512 is
> supported. Thanks for reminding us of this, Jeff.
?? If that's the case, you don't need to check for multiple FMA units
(conditional on AVX-512), which Jeff says is necessary.
> I guess if @jeffhammond is happy tracking down the specs for all
> future AVX-512 products then we can just use his version. @fgvanzee
> ?
I think I've got further with that than Jeff has, per an issue against
his repo. Also, the parsing in BLIS actually looks more robust, given
the change I needed for W- compared with what Jeff had reported.
Great! What I meant is that I am happy for anyone but me to keep it up to date 😄.
It sounds as if you don't want it, but as I'd done most of the work
already, I pushed changes to my branch to update the models supported
and allow reporting the number of VPUs chosen and architecture selected
generally. That might save someone effort, but I may well have made
mistakes, especially as I didn't try to scrape the ARK listings. I
assumed ranges of models similar to how it was done already.
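The reporting described here can be sketched as an environment-controlled message. Everything in this block is illustrative: the variable name `EXAMPLE_ARCH_DEBUG` and both functions are hypothetical, not the names used in the branch. The sketch assumes the message goes to stderr and that an unset variable, an empty value, or "0" all mean "off".

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, chosen only for this example. */
#define EXAMPLE_DEBUG_ENV "EXAMPLE_ARCH_DEBUG"

/* Interpret an environment value: unset (NULL), empty, or "0" is off;
   anything else is on. Kept separate from getenv() so the rule can be
   checked without touching the real environment. */
static int example_flag_is_on(const char *value)
{
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Print the chosen sub-configuration and VPU count to stderr, but only
   when the user asked for it through the environment. */
static void example_report_selection(const char *config_name, int n_vpus)
{
    if (example_flag_is_on(getenv(EXAMPLE_DEBUG_ENV)))
        fprintf(stderr, "selected sub-configuration '%s' (VPU count: %d)\n",
                config_name, n_vpus);
}
```

With something like this in place, a user who suspects mis-selection can set the variable for a single run and see which configuration was chosen, without rebuilding.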
@loveshack Dave, I apologize for letting this issue slip through the cracks. I think what happened here is that I had trouble following along with some of the conversation, which caused me to place the issue on the back burner until everyone with an interest in it worked out their differences / came to a consensus, but then I failed to realize when that pan was done simmering. :) I'll start taking a look at this shortly, and hopefully others can chime in to help us get this PR resolved.
Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that relates to single-VPU Skylake-Xs.
@loveshack I resolved the trivial conflict in the copyright portion of the license header. @devinamatthews If you have a moment, please comment on this before we merge.
* Fix parsing in vpu_count on workstation SKX
* Document Skylake-X as Haswell for single FMA
* Update vpu_count for Skylake and Cascade Lake models
* Support printing the configuration selected, controlled by the environment

  Intended particularly for diagnosing mis-selection of SKX through unknown, or incorrect, number of VPUs.
* Move bli_log outside the cpp condition, and use it where intended
* Add Fixme comment (Skylake D)
* Mostly superficial edits to commits towards flame#351.

  Details:
  - Moved architecture/sub-config logging-related code from bli_cpuid.c to bli_arch.c, tweaked names, and added more set/get layering.
  - Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
  - Content, whitespace changes to new bullet in HardwareSupport.md that relates to single-VPU Skylake-Xs.
* Fix comment typos

Co-authored-by: Field G. Van Zee <[email protected]>
This correctly treats
Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz