why kernel size is d6x8 in zen #253

Closed
VirtualEarth opened this issue Sep 19, 2018 · 8 comments

@VirtualEarth

Why is the kernel size d6x8 in zen? According to the paper, GEMM attains its best performance when the micro-kernel is GEBP, so shouldn't the kernel size be d8x6?

@devinamatthews
Member

In that paper, the GEBP part refers to the shape of the algorithm w.r.t. the L2 cache (e.g. the "macrokernel" in BLIS parlance). The part that corresponds to the microkernel in BLIS (a term not explicitly used in that paper) is sections 6.1-6.2. This paper presents a more detailed analysis of the block sizes as used in BLIS.
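
For what it's worth, here is a minimal sketch of that layering, written from scratch rather than taken from the BLIS source, with illustrative (roughly Haswell-like) double-precision blocksizes: three outer cache-blocking loops, the macrokernel (the GEBP-shaped part with respect to the L2 cache), and an MR x NR microkernel innermost.

    #include <stddef.h>

    /* Illustrative blocksizes only; the real values live in each BLIS configuration. */
    #define MC 72
    #define NC 4080
    #define KC 256
    #define MR 6
    #define NR 8

    /* Microkernel analogue: C[MR x NR] += A[MR x kc] * B[kc x NR], row-major,
     * lda/ldb/ldc = leading dimensions. In BLIS this is the hand-written SIMD
     * kernel that keeps the MR x NR tile of C in registers. */
    static void microkernel(size_t kc,
                            const double *A, size_t lda,
                            const double *B, size_t ldb,
                            double *C, size_t ldc)
    {
        for (size_t p = 0; p < kc; ++p)
            for (size_t i = 0; i < MR; ++i)
                for (size_t j = 0; j < NR; ++j)
                    C[i*ldc + j] += A[i*lda + p] * B[p*ldb + j];
    }

    /* C[m x n] += A[m x k] * B[k x n], all row-major. Packing and edge-case
     * handling are omitted, so m must be a multiple of MR and n of NR. */
    void gemm_blocked(size_t m, size_t n, size_t k,
                      const double *A, const double *B, double *C)
    {
        for (size_t jc = 0; jc < n; jc += NC)          /* NC-wide panel of B and C      */
            for (size_t pc = 0; pc < k; pc += KC)      /* KC-deep block: the GEBP shape */
            {
                size_t kc = (k - pc < KC) ? k - pc : KC;
                for (size_t ic = 0; ic < m; ic += MC)  /* MC x KC block of A (L2 cache) */
                    /* The two loops below plus the microkernel form the "macrokernel". */
                    for (size_t jr = jc; jr < jc + NC && jr < n; jr += NR)
                        for (size_t ir = ic; ir < ic + MC && ir < m; ir += MR)
                            microkernel(kc,
                                        &A[ir*k + pc], k,
                                        &B[pc*n + jr], n,
                                        &C[ir*n + jr], n);
            }
    }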

@VirtualEarth
Author

Thanks, but I still don't understand. On Haswell the latency and throughput of FMA are 5 and 0.5, and Nvec is 4.

So mr is 4 and nr is 4?

@fgvanzee
Member

@VirtualEarth Turn your attention to Eq. 1:

  mr nr >= Nvec Lvfma Nvfma

On Intel Haswell/Broadwell/Skylake/Kabylake, Nvec is 4, Lvfma is 5, and Nvfma is 2. Thus, the register blocksize product mr nr must be at least 40 in order to overcome the floating-point latency. 6 x 8 = 48 more than satisfies this, but 8 x 6 would also be fine. The former is biased toward row-oriented SIMD output while the latter is better for column-oriented SIMD output. In BLIS, it almost never actually matters which one you pick because high-level logic will transpose the entire operation and make the output matrix C appear to be row- or column-stored, depending on the SIMD output "preference" of the microkernel.
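
Plugging those numbers in explicitly:

  Nvec * Lvfma * Nvfma = 4 * 5 * 2 = 40
  mr * nr = 6 * 8 = 48 >= 40    (and 8 * 6 = 48 satisfies it equally)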

@VirtualEarth
Author

Thanks, I got it.

@devinamatthews
Member

@VirtualEarth Thanks for your interest in BLIS; I hope all of your questions got answered. I'll close the issue now. If you do have more questions, this would make a great thread on the blis-discuss group.

@fgvanzee
Member

@VirtualEarth One more thing: I realized that your issue asks "why kernel size is d6x8 in zen?" A microkernel for Intel Haswell-like architectures happens to work on AMD Zen-based architectures, though not the other way around. (Zen can only execute one 256-bit FMA per cycle, whereas Haswell can execute two, and therefore the microtile size for Zen can be smaller.) I plan to eventually rename the zen kernel set to haswell. Then, we can (optionally) experiment with kernels that specifically target Ryzen/Epyc.
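
To make that concrete (assuming, for the sake of the arithmetic, that the FMA latency on Zen is comparable to Haswell's, which the comment above does not state): with Nvfma = 1 instead of 2, Eq. 1 only requires

  mr * nr >= Nvec * Lvfma * Nvfma = 4 * 5 * 1 = 20

so a smaller microtile already covers the FMA latency on Zen.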

@mert-kurttutan

Hi @devinamatthews,
Here you mentioned that Lvfma is 5 for Intel Haswell/Broadwell/Skylake/Kabylake, but according to the Intel intrinsics guide it seems that Lvfma is 4. Am I missing something?

@devinamatthews
Member

It was 5 on Haswell, which is the architecture the kernel was originally written for. The latency dropped to 4 in SKL (I think?), but this doesn't change the basic design of the kernel. Note that the math using Lvfma gives a minimum kernel size; it's almost always best to pick a kernel as large as possible to reduce the bandwidth required from cache.
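
Concretely, taking Lvfma = 4 together with the Skylake parameters quoted earlier in the thread (Nvec = 4, Nvfma = 2), the bound becomes

  mr * nr >= 4 * 4 * 2 = 32

which the existing 6 x 8 = 48 microtile still satisfies with room to spare, consistent with the kernel design not needing to change.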
