Explore performance of _mm256_blend_ps vs _mm256_shuffle_ps #31
Comments
You mean the part where we de-stripe the ab vectors, right? Those 8 … This is a "fidelity" loss compared with the original BLIS AVX sgemm kernel in asm, but maybe it's just for the better, a good thing with intrinsics hopefully.

Keep the ideas coming!
Now that I am almost at the end of the implementation, I notice that the main possible source of savings comes right at the end, when writing the kernel results back to memory. If you perform a shuffle + permute, you optimize for row-major storage in the C matrix with …. The same for blend + permute: you optimize for column-major storage, i.e. …. All other cases are handled with ….

It becomes most obvious if I demonstrate it. Here I have taken the same assumptions that BLIS makes, namely that the matrix a is column major and the matrix b is row major. The below is for f64, but the argument stays the same for f32 (just more terms).

Blend + permute
So the final results in this scheme are:
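The worked example did not survive here, so as a sketch, here is a small Python model of the two instructions' lane semantics for f64. The rotated-diagonal stripe pattern of the accumulators (`ab0`–`ab3`) is my assumption of the BLIS-style layout, not taken verbatim from the kernel:

```python
def blend_pd(a, b, imm):
    # _mm256_blend_pd: per element, pick from b where the mask bit is set.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def permute2f128_pd(a, b, imm):
    # _mm256_permute2f128_pd: each 128-bit half of the result selects
    # one of the four input halves.
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# Assumed striped accumulators: ab0..ab3 hold rotated diagonals of C.
c = lambda i, j: f"c{i}{j}"
ab0 = [c(0,0), c(1,1), c(2,2), c(3,3)]
ab1 = [c(0,1), c(1,0), c(2,3), c(3,2)]
ab2 = [c(0,2), c(1,3), c(2,0), c(3,1)]
ab3 = [c(0,3), c(1,2), c(2,1), c(3,0)]

# Step 1: blend adjacent accumulators (mask 0b1010 takes odd lanes from b).
t0 = blend_pd(ab0, ab1, 0b1010)   # (c00, c10, c22, c32)
t1 = blend_pd(ab1, ab0, 0b1010)   # (c01, c11, c23, c33)
t2 = blend_pd(ab2, ab3, 0b1010)   # (c02, c12, c20, c30)
t3 = blend_pd(ab3, ab2, 0b1010)   # (c03, c13, c21, c31)

# Step 2: permute 128-bit halves to finish the de-stripe.
col0 = permute2f128_pd(t0, t2, 0x30)  # low half of t0, high half of t2
col2 = permute2f128_pd(t0, t2, 0x12)
col1 = permute2f128_pd(t1, t3, 0x30)
col3 = permute2f128_pd(t1, t3, 0x12)

print(col0)  # ['c00', 'c10', 'c20', 'c30'] -- a full column of C
```

Each result vector holds one column of C (first index varies, second is fixed), i.e. the column-major case.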
Shuffle + permute

First shuffling instead of blending gives the following results:
The final results are then:
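Again as a sketch (same assumed stripe pattern as in the blend example, not the kernel's actual code), replacing the blend with `_mm256_shuffle_pd` and keeping the same 128-bit permute:

```python
def shuffle_pd(a, b, imm):
    # _mm256_shuffle_pd: within each 128-bit half, pick one element of a
    # and one of b according to the immediate bits.
    return [a[imm & 1],            b[(imm >> 1) & 1],
            a[2 + ((imm >> 2) & 1)], b[2 + ((imm >> 3) & 1)]]

def permute2f128_pd(a, b, imm):
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# Same assumed striped accumulators as before.
c = lambda i, j: f"c{i}{j}"
ab0 = [c(0,0), c(1,1), c(2,2), c(3,3)]
ab1 = [c(0,1), c(1,0), c(2,3), c(3,2)]
ab2 = [c(0,2), c(1,3), c(2,0), c(3,1)]
ab3 = [c(0,3), c(1,2), c(2,1), c(3,0)]

# Step 1: shuffle instead of blend.
s0 = shuffle_pd(ab0, ab1, 0b0000)  # (c00, c01, c22, c23)
s1 = shuffle_pd(ab1, ab0, 0b1111)  # (c10, c11, c32, c33)
s2 = shuffle_pd(ab2, ab3, 0b0000)  # (c02, c03, c20, c21)
s3 = shuffle_pd(ab3, ab2, 0b1111)  # (c12, c13, c30, c31)

# Step 2: the same 128-bit permute now yields rows of C.
row0 = permute2f128_pd(s0, s2, 0x20)  # low halves of s0 and s2
row2 = permute2f128_pd(s0, s2, 0x13)
row1 = permute2f128_pd(s1, s3, 0x20)
row3 = permute2f128_pd(s1, s3, 0x13)

print(row0)  # ['c00', 'c01', 'c02', 'c03'] -- a full row of C
```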
So now the first index stays fixed while the second index changes: that's a row-major layout.
While implementing the dgemm kernel, I noticed that one can choose to either a) use `_mm256_blend_ps` followed by `_mm256_permute2f128_ps`, or b) use `_mm256_shuffle_ps` followed by `_mm256_permute2f128_ps`, to achieve the same goal (this is at the end, when scaling the product of `a` and `b` by `alpha`, and `c` by `beta`). Doing the first operation leads to packed SIMD vectors containing a column (of 8 rows) each, while doing the second operation gives rows (containing 8 columns each).

Currently, the `sgemm` kernel implements option b), where `_mm256_shuffle_ps` has latency 1 and throughput 1. Doing option a) we'd get latency 1 but throughput 0.33 (on most Intel architectures). It's worth investigating whether this improves performance.