Explore performance of _mm256_blend_ps vs _mm256_shuffle_ps #31

Closed
SuperFluffy opened this issue Nov 29, 2018 · 3 comments

SuperFluffy (Contributor) commented Nov 29, 2018

While implementing the dgemm kernel, I noticed that one can choose either to a) use _mm256_blend_ps followed by _mm256_permute2f128_ps, or b) use _mm256_shuffle_ps followed by _mm256_permute2f128_ps, to achieve the same goal (this is at the end, when scaling the product of a and b by alpha, and c by beta).

The first operation leads to packed SIMD vectors each containing a column (of 8 rows), while the second gives rows (each containing 8 columns).

Currently, the sgemm kernel implements option b), where _mm256_shuffle_ps has latency 1 and throughput 1. With option a), _mm256_blend_ps has latency 1 but throughput 0.33 (on most Intel architectures).

It's worth investigating whether this improves performance.
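To make the difference concrete, the lane semantics of the two candidate intrinsics can be modeled in plain Python (a sketch of the documented Intel semantics, not the kernel itself): `_mm256_blend_ps` picks each of the 8 f32 lanes from one of its two inputs according to one immediate bit per lane, while `_mm256_shuffle_ps` also reorders elements within each 128-bit half.

```python
def blend_ps(a, b, imm8):
    # _mm256_blend_ps model: bit i of imm8 selects lane i from b, else from a.
    return [b[i] if (imm8 >> i) & 1 else a[i] for i in range(8)]

def shuffle_ps(a, b, imm8):
    # _mm256_shuffle_ps model: within each 128-bit half (4 f32 lanes),
    # result lanes 0-1 come from a and lanes 2-3 from b, each element
    # chosen inside its half by a 2-bit field of the immediate.
    sel = [(imm8 >> (2 * k)) & 3 for k in range(4)]
    out = []
    for base in (0, 4):  # lower half, then upper half
        out += [a[base + sel[0]], a[base + sel[1]],
                b[base + sel[2]], b[base + sel[3]]]
    return out

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
print(blend_ps(a, b, 0b10101010))    # [0, 11, 2, 13, 4, 15, 6, 17]
print(shuffle_ps(a, b, 0b01000100))  # [0, 1, 10, 11, 4, 5, 14, 15]
```

Blend can only keep each element in its own lane position, which is exactly why the two sequences end up transposing the result differently.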

bluss (Owner) commented Nov 29, 2018

You mean the part where we de-stripe the ab vectors, right? Those eight _mm256_shuffle_ps followed by eight _mm256_permute2f128_ps already all compile to vblend, but I wouldn't mind changing the intrinsics and confirming it compiles to the same thing.

This is a "fidelity" loss compared with the original BLIS AVX sgemm kernel in asm, but maybe it's for the better; hopefully a good thing with intrinsics.

bluss (Owner) commented Nov 29, 2018

Keep the ideas coming!

SuperFluffy (Contributor, Author) commented Nov 30, 2018

Now that I am almost at the end of the implementation, I notice that the main potential source of savings comes right at the end, when writing the kernel results back to memory:

If you perform shuffle + permute, you optimize for row-major storage of the C matrix, with csc=1. This way you can use storeu_ps and store 8 elements of a row with one operation.

The same goes for blend + permute: you optimize for column-major storage, i.e. rsc=1, storing 8 elements of a column in one operation.

All other cases are handled with _mm256_extractf128_ps and finally _mm_store_ss, where you fall back to SSE instructions! This is necessary for general-stride matrices, but not for column- or row-major ones.
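To sketch why general strides force the scalar fallback, here is a hypothetical pure-Python model of the write-back (the names `write_back`, `rsc`, `csc` follow the discussion above, not the crate's actual API). With csc=1, a whole row of the micro-kernel result is contiguous and could be covered by one storeu_ps-style vector write; with arbitrary strides, each element lands at its own offset, which is the extract + store_ss path.

```python
def write_back(block, c, rsc, csc):
    # block: 4x4 micro-kernel result (block[i][j] = element at row i, col j).
    # c: flat output buffer; rsc/csc: row/column strides in elements.
    # Mirrors the scalar fallback: one element stored at a time.
    for i in range(4):
        for j in range(4):
            c[i * rsc + j * csc] = block[i][j]

block = [[10 * i + j for j in range(4)] for i in range(4)]

# Row-major (csc = 1): each row of `block` is a contiguous run of 4 elements,
# so a single vector store could replace the inner loop.
c_rowmajor = [0] * 16
write_back(block, c_rowmajor, rsc=4, csc=1)
assert c_rowmajor[0:4] == block[0]   # row 0 is contiguous

# General strides: the elements of one row are scattered through memory,
# so only scalar stores remain.
c_strided = [0] * 32
write_back(block, c_strided, rsc=8, csc=2)
assert c_strided[0:8:2] == block[0]  # row 0 occupies every other slot
```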

It becomes most obvious if I demonstrate it. Here I take the same assumptions that BLIS makes, namely that matrix a is column-major and matrix b is row-major. The below is for f64, but the argument is the same for f32 (just with more terms).

Blend + permute

Step 0.0
a0 b0 | a1 b1 | a2 b2 | a3 b3
a0 b1 | a1 b0 | a2 b3 | a3 b2
=> _mm256_blend_pd with 0b1010
a0 b0 | a1 b0 | a2 b2 | a3 b2 (only columns 0 and 2)
                                                     
Step 0.1
a0 b1 | a1 b0 | a2 b3 | a3 b2 (flipped the order)
a0 b0 | a1 b1 | a2 b2 | a3 b3
=> _mm256_blend_pd with 0b1010
a0 b1 | a1 b1 | a2 b3 | a3 b3 (only columns 1 and 3)
                                                     
Step 0.2
a0 b2 | a1 b3 | a2 b0 | a3 b1
a0 b3 | a1 b2 | a2 b1 | a3 b0
=> _mm256_blend_pd with 0b1010
a0 b2 | a1 b2 | a2 b0 | a3 b0 (only columns 0 and 2)
                                                     
Step 0.3
a0 b3 | a1 b2 | a2 b1 | a3 b0 (flipped the order)
a0 b2 | a1 b3 | a2 b0 | a3 b1
=> _mm256_blend_pd with 0b1010
a0 b3 | a1 b3 | a2 b1 | a3 b1 (only columns 1 and 3)
                                                     
Step 1.0 (combining steps 0.0 and 0.2)
                                                     
a0 b0 | a1 b0 | a2 b2 | a3 b2
a0 b2 | a1 b2 | a2 b0 | a3 b0
=> _mm256_permute2f128_pd with 0x30 = 0b0011_0000
a0 b0 | a1 b0 | a2 b0 | a3 b0
                                                     
Step 1.1 (combining steps 0.0 and 0.2)
                                                     
a0 b0 | a1 b0 | a2 b2 | a3 b2
a0 b2 | a1 b2 | a2 b0 | a3 b0
=> _mm256_permute2f128_pd with 0x12 = 0b0001_0010
a0 b2 | a1 b2 | a2 b2 | a3 b2
                                                     
Step 1.2 (combining steps 0.1 and 0.3)
a0 b1 | a1 b1 | a2 b3 | a3 b3
a0 b3 | a1 b3 | a2 b1 | a3 b1
=> _mm256_permute2f128_pd with 0x30 = 0b0011_0000
a0 b1 | a1 b1 | a2 b1 | a3 b1
                                                     
Step 1.3 (combining steps 0.1 and 0.3)
a0 b1 | a1 b1 | a2 b3 | a3 b3
a0 b3 | a1 b3 | a2 b1 | a3 b1
=> _mm256_permute2f128_pd with 0x12 = 0b0001_0010
a0 b3 | a1 b3 | a2 b3 | a3 b3

So the final results in this scheme are:

a0 b0 | a1 b0 | a2 b0 | a3 b0
 
a0 b2 | a1 b2 | a2 b2 | a3 b2
                                                     
a0 b1 | a1 b1 | a2 b1 | a3 b1
                                                     
a0 b3 | a1 b3 | a2 b3 | a3 b3

So a0 b0 | a1 b0 | a2 b0 | a3 b0 is a packed __m256d 4-vector holding column 0.
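The blend + permute steps above can be checked mechanically with a small pure-Python model of the two intrinsics, using symbolic "ai bj" lane values (the zeroing bits of permute2f128 are ignored in this sketch):

```python
def blend_pd(a, b, imm):
    # _mm256_blend_pd model: bit i of imm selects lane i from b, else from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def permute2f128_pd(a, b, imm):
    # _mm256_permute2f128_pd model: each 128-bit half of the result is picked
    # by a 2-bit selector (0 = a low, 1 = a high, 2 = b low, 3 = b high).
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# The four striped ab registers from the walkthrough.
r0 = ["a0b0", "a1b1", "a2b2", "a3b3"]
r1 = ["a0b1", "a1b0", "a2b3", "a3b2"]
r2 = ["a0b2", "a1b3", "a2b0", "a3b1"]
r3 = ["a0b3", "a1b2", "a2b1", "a3b0"]

s00 = blend_pd(r0, r1, 0b1010)  # step 0.0: columns 0 and 2
s01 = blend_pd(r1, r0, 0b1010)  # step 0.1: columns 1 and 3
s02 = blend_pd(r2, r3, 0b1010)  # step 0.2
s03 = blend_pd(r3, r2, 0b1010)  # step 0.3

col0 = permute2f128_pd(s00, s02, 0x30)  # step 1.0
col2 = permute2f128_pd(s00, s02, 0x12)  # step 1.1
col1 = permute2f128_pd(s01, s03, 0x30)  # step 1.2
col3 = permute2f128_pd(s01, s03, 0x12)  # step 1.3

assert col0 == ["a0b0", "a1b0", "a2b0", "a3b0"]
assert col1 == ["a0b1", "a1b1", "a2b1", "a3b1"]
assert col2 == ["a0b2", "a1b2", "a2b2", "a3b2"]
assert col3 == ["a0b3", "a1b3", "a2b3", "a3b3"]
```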

Shuffle + permute

Shuffling first instead of blending gives the following results:

Step 0.0
a0 b0 | a1 b1 | a2 b2 | a3 b3
a0 b1 | a1 b0 | a2 b3 | a3 b2
=> _mm256_shuffle_pd with 0b0000
a0 b0 | a0 b1 | a2 b2 | a2 b3 (only rows 0 and 2)
                                                           
Step 0.1
a0 b1 | a1 b0 | a2 b3 | a3 b2 (flipped the order)
a0 b0 | a1 b1 | a2 b2 | a3 b3
=> _mm256_shuffle_pd with 0b1111
a1 b0 | a1 b1 | a3 b2 | a3 b3 (only rows 1 and 3)
                                                           
Next, we perform the same operation on the other two rows:
                                                           
Step 0.2
a0 b2 | a1 b3 | a2 b0 | a3 b1
a0 b3 | a1 b2 | a2 b1 | a3 b0
=> _mm256_shuffle_pd with 0b0000
a0 b2 | a0 b3 | a2 b0 | a2 b1 (only rows 0 and 2)
                                                           
Step 0.3
a0 b3 | a1 b2 | a2 b1 | a3 b0
a0 b2 | a1 b3 | a2 b0 | a3 b1
=> _mm256_shuffle_pd with 0b1111
a1 b2 | a1 b3 | a3 b0 | a3 b1 (only rows 1 and 3)
                                                      
Step 1.0 (combining Steps 0.0 and 0.2):
a0 b0 | a0 b1 | a2 b2 | a2 b3
a0 b2 | a0 b3 | a2 b0 | a2 b1
=> _mm256_permute2f128_pd with 0x20 = 0b0010_0000
a0 b0 | a0 b1 | a0 b2 | a0 b3
                                                           
Step 1.1 (combining Steps 0.0 and 0.2):
a0 b0 | a0 b1 | a2 b2 | a2 b3
a0 b2 | a0 b3 | a2 b0 | a2 b1
=> _mm256_permute2f128_pd with 0x13 = 0b0001_0011
a2 b0 | a2 b1 | a2 b2 | a2 b3
                                                           
Step 1.2 (combining Steps 0.1 and 0.3):
a1 b0 | a1 b1 | a3 b2 | a3 b3
a1 b2 | a1 b3 | a3 b0 | a3 b1
=> _mm256_permute2f128_pd with 0x20 = 0b0010_0000
a1 b0 | a1 b1 | a1 b2 | a1 b3
                                                           
Step 1.3 (combining Steps 0.1 and 0.3):
a1 b0 | a1 b1 | a3 b2 | a3 b3
a1 b2 | a1 b3 | a3 b0 | a3 b1
=> _mm256_permute2f128_pd with 0x13 = 0b0001_0011
a3 b0 | a3 b1 | a3 b2 | a3 b3

The final results are then:

a0 b0 | a0 b1 | a0 b2 | a0 b3
                                                           
a2 b0 | a2 b1 | a2 b2 | a2 b3
                                                           
a1 b0 | a1 b1 | a1 b2 | a1 b3
                                                           
a3 b0 | a3 b1 | a3 b2 | a3 b3

So now, within each vector, the first index stays fixed while the second index varies: that's a row-major layout.
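As with the blend variant, the shuffle + permute steps can be replayed with a pure-Python model of the two intrinsics on symbolic lane values (zeroing bits of permute2f128 again ignored):

```python
def shuffle_pd(a, b, imm):
    # _mm256_shuffle_pd model: per 128-bit half, lane 0 comes from a and
    # lane 1 from b, each picking its element inside the half via one imm bit.
    return [a[imm & 1], b[(imm >> 1) & 1],
            a[2 + ((imm >> 2) & 1)], b[2 + ((imm >> 3) & 1)]]

def permute2f128_pd(a, b, imm):
    # _mm256_permute2f128_pd model: 2-bit selector per result half
    # (0 = a low, 1 = a high, 2 = b low, 3 = b high).
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# The same four striped ab registers as before.
r0 = ["a0b0", "a1b1", "a2b2", "a3b3"]
r1 = ["a0b1", "a1b0", "a2b3", "a3b2"]
r2 = ["a0b2", "a1b3", "a2b0", "a3b1"]
r3 = ["a0b3", "a1b2", "a2b1", "a3b0"]

s00 = shuffle_pd(r0, r1, 0b0000)  # step 0.0: rows 0 and 2
s01 = shuffle_pd(r1, r0, 0b1111)  # step 0.1: rows 1 and 3
s02 = shuffle_pd(r2, r3, 0b0000)  # step 0.2
s03 = shuffle_pd(r3, r2, 0b1111)  # step 0.3

row0 = permute2f128_pd(s00, s02, 0x20)  # step 1.0
row2 = permute2f128_pd(s00, s02, 0x13)  # step 1.1
row1 = permute2f128_pd(s01, s03, 0x20)  # step 1.2
row3 = permute2f128_pd(s01, s03, 0x13)  # step 1.3

assert row0 == ["a0b0", "a0b1", "a0b2", "a0b3"]
assert row1 == ["a1b0", "a1b1", "a1b2", "a1b3"]
assert row2 == ["a2b0", "a2b1", "a2b2", "a2b3"]
assert row3 == ["a3b0", "a3b1", "a3b2", "a3b3"]
```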

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 3, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
bluss closed this as completed in #36 on Dec 7, 2018