Explore performance of _mm256_blend_ps vs _mm256_shuffle_ps #31

Closed
SuperFluffy opened this issue Nov 29, 2018 · 3 comments

SuperFluffy (Contributor) commented Nov 29, 2018

While implementing the dgemm kernel, I noticed that one can choose either to a) use _mm256_blend_ps followed by _mm256_permute2f128_ps, or b) use _mm256_shuffle_ps followed by _mm256_permute2f128_ps, to achieve the same goal (this is at the end, when scaling the product of a and b by alpha, and c by beta).

The first operation leads to packed SIMD vectors each containing a column (of 8 rows), while the second gives rows (each containing 8 columns).

Currently, the sgemm kernel implements option b), where _mm256_shuffle_ps has latency 1 and throughput 1. With option a), _mm256_blend_ps has latency 1 but throughput 0.33 (on most Intel architectures).

It's worth investigating whether this improves performance.
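To make the difference concrete, the lane semantics of the two candidate intrinsics can be modeled in plain Python (a sketch of the documented Intel semantics, not the kernel itself): `_mm256_blend_ps` picks each of the 8 f32 lanes from one of its two inputs according to one immediate bit per lane, while `_mm256_shuffle_ps` also reorders elements within each 128-bit half.

```python
def blend_ps(a, b, imm8):
    # _mm256_blend_ps model: bit i of imm8 selects lane i from b, else from a.
    return [b[i] if (imm8 >> i) & 1 else a[i] for i in range(8)]

def shuffle_ps(a, b, imm8):
    # _mm256_shuffle_ps model: within each 128-bit half (4 f32 lanes),
    # result lanes 0-1 come from a and lanes 2-3 from b, each element
    # chosen inside its half by a 2-bit field of the immediate.
    sel = [(imm8 >> (2 * k)) & 3 for k in range(4)]
    out = []
    for base in (0, 4):  # lower half, then upper half
        out += [a[base + sel[0]], a[base + sel[1]],
                b[base + sel[2]], b[base + sel[3]]]
    return out

a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
print(blend_ps(a, b, 0b10101010))    # [0, 11, 2, 13, 4, 15, 6, 17]
print(shuffle_ps(a, b, 0b01000100))  # [0, 1, 10, 11, 4, 5, 14, 15]
```

Blend can only keep each element in its own lane position, which is exactly why the two sequences end up transposing the result differently.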

bluss (Owner) commented Nov 29, 2018

You mean the part where we de-stripe the ab vectors, right? Those eight _mm256_shuffle_ps followed by eight _mm256_permute2f128_ps already all compile to vblend, but I wouldn't mind changing the intrinsics and confirming it compiles to the same thing.

This is a "fidelity" loss compared with the original BLIS AVX sgemm kernel in asm, but maybe it's for the better; hopefully a good thing with intrinsics.

bluss (Owner) commented Nov 29, 2018

Keep the ideas coming!

SuperFluffy (Contributor, Author) commented Nov 30, 2018

Now that I am almost at the end of the implementation, I notice that the main potential source of savings comes right at the end, when writing the kernel results back to memory:

If you perform shuffle + permute, you optimize for row-major storage of the C matrix, with csc=1. This way you can use storeu_ps and store 8 elements of a row with one operation.

The same goes for blend + permute: you optimize for column-major storage, i.e. rsc=1, storing 8 elements of a column in one operation.

All other cases are handled with _mm256_extractf128_ps and finally _mm_store_ss, where you fall back to SSE instructions! This is necessary for general-stride matrices, but not for column- or row-major ones.
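To sketch why general strides force the scalar fallback, here is a hypothetical pure-Python model of the write-back (the names `write_back`, `rsc`, `csc` follow the discussion above, not the crate's actual API). With csc=1, a whole row of the micro-kernel result is contiguous and could be covered by one storeu_ps-style vector write; with arbitrary strides, each element lands at its own offset, which is the extract + store_ss path.

```python
def write_back(block, c, rsc, csc):
    # block: 4x4 micro-kernel result (block[i][j] = element at row i, col j).
    # c: flat output buffer; rsc/csc: row/column strides in elements.
    # Mirrors the scalar fallback: one element stored at a time.
    for i in range(4):
        for j in range(4):
            c[i * rsc + j * csc] = block[i][j]

block = [[10 * i + j for j in range(4)] for i in range(4)]

# Row-major (csc = 1): each row of `block` is a contiguous run of 4 elements,
# so a single vector store could replace the inner loop.
c_rowmajor = [0] * 16
write_back(block, c_rowmajor, rsc=4, csc=1)
assert c_rowmajor[0:4] == block[0]   # row 0 is contiguous

# General strides: the elements of one row are scattered through memory,
# so only scalar stores remain.
c_strided = [0] * 32
write_back(block, c_strided, rsc=8, csc=2)
assert c_strided[0:8:2] == block[0]  # row 0 occupies every other slot
```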

It becomes most obvious if I demonstrate it. Here I take the same assumptions that BLIS makes, namely that matrix a is column-major and matrix b is row-major. The below is for f64, but the argument is the same for f32 (just with more terms).

Blend + permute

Step 0.0
a0 b0 | a1 b1 | a2 b2 | a3 b3
a0 b1 | a1 b0 | a2 b3 | a3 b2
=> _mm256_blend_pd with 0b1010
a0 b0 | a1 b0 | a2 b2 | a3 b2 (only columns 0 and 2)
                                                     
Step 0.1
a0 b1 | a1 b0 | a2 b3 | a3 b2 (flipped the order)
a0 b0 | a1 b1 | a2 b2 | a3 b3
=> _mm256_blend_pd with 0b1010
a0 b1 | a1 b1 | a2 b3 | a3 b3 (only columns 1 and 3)
                                                     
Step 0.2
a0 b2 | a1 b3 | a2 b0 | a3 b1
a0 b3 | a1 b2 | a2 b1 | a3 b0
=> _mm256_blend_pd with 0b1010
a0 b2 | a1 b2 | a2 b0 | a3 b0 (only columns 0 and 2)
                                                     
Step 0.3
a0 b3 | a1 b2 | a2 b1 | a3 b0 (flipped the order)
a0 b2 | a1 b3 | a2 b0 | a3 b1
=> _mm256_blend_pd with 0b1010
a0 b3 | a1 b3 | a2 b1 | a3 b1 (only columns 1 and 3)
                                                     
Step 1.0 (combining steps 0.0 and 0.2)
                                                     
a0 b0 | a1 b0 | a2 b2 | a3 b2
a0 b2 | a1 b2 | a2 b0 | a3 b0
=> _mm256_permute2f128_pd with 0x30 = 0b0011_0000
a0 b0 | a1 b0 | a2 b0 | a3 b0
                                                     
Step 1.1 (combining steps 0.0 and 0.2)
                                                     
a0 b0 | a1 b0 | a2 b2 | a3 b2
a0 b2 | a1 b2 | a2 b0 | a3 b0
=> _mm256_permute2f128_pd with 0x12 = 0b0001_0010
a0 b2 | a1 b2 | a2 b2 | a3 b2
                                                     
Step 1.2 (combining steps 0.1 and 0.3)
a0 b1 | a1 b1 | a2 b3 | a3 b3
a0 b3 | a1 b3 | a2 b1 | a3 b1
=> _mm256_permute2f128_pd with 0x30 = 0b0011_0000
a0 b1 | a1 b1 | a2 b1 | a3 b1
                                                     
Step 1.3 (combining steps 0.1 and 0.3)
a0 b1 | a1 b1 | a2 b3 | a3 b3
a0 b3 | a1 b3 | a2 b1 | a3 b1
=> _mm256_permute2f128_pd with 0x12 = 0b0001_0010
a0 b3 | a1 b3 | a2 b3 | a3 b3

So the final results in this scheme are:

a0 b0 | a1 b0 | a2 b0 | a3 b0
 
a0 b2 | a1 b2 | a2 b2 | a3 b2
                                                     
a0 b1 | a1 b1 | a2 b1 | a3 b1
                                                     
a0 b3 | a1 b3 | a2 b3 | a3 b3

So a0 b0 | a1 b0 | a2 b0 | a3 b0 is a packed __m256d 4-vector holding column 0.
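The blend + permute steps above can be checked mechanically with a small pure-Python model of the two intrinsics, using symbolic "ai bj" lane values (the zeroing bits of permute2f128 are ignored in this sketch):

```python
def blend_pd(a, b, imm):
    # _mm256_blend_pd model: bit i of imm selects lane i from b, else from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def permute2f128_pd(a, b, imm):
    # _mm256_permute2f128_pd model: each 128-bit half of the result is picked
    # by a 2-bit selector (0 = a low, 1 = a high, 2 = b low, 3 = b high).
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# The four striped ab registers from the walkthrough.
r0 = ["a0b0", "a1b1", "a2b2", "a3b3"]
r1 = ["a0b1", "a1b0", "a2b3", "a3b2"]
r2 = ["a0b2", "a1b3", "a2b0", "a3b1"]
r3 = ["a0b3", "a1b2", "a2b1", "a3b0"]

s00 = blend_pd(r0, r1, 0b1010)  # step 0.0: columns 0 and 2
s01 = blend_pd(r1, r0, 0b1010)  # step 0.1: columns 1 and 3
s02 = blend_pd(r2, r3, 0b1010)  # step 0.2
s03 = blend_pd(r3, r2, 0b1010)  # step 0.3

col0 = permute2f128_pd(s00, s02, 0x30)  # step 1.0
col2 = permute2f128_pd(s00, s02, 0x12)  # step 1.1
col1 = permute2f128_pd(s01, s03, 0x30)  # step 1.2
col3 = permute2f128_pd(s01, s03, 0x12)  # step 1.3

assert col0 == ["a0b0", "a1b0", "a2b0", "a3b0"]
assert col1 == ["a0b1", "a1b1", "a2b1", "a3b1"]
assert col2 == ["a0b2", "a1b2", "a2b2", "a3b2"]
assert col3 == ["a0b3", "a1b3", "a2b3", "a3b3"]
```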

Shuffle + permute

Shuffling first instead of blending gives the following results:

Step 0.0
a0 b0 | a1 b1 | a2 b2 | a3 b3
a0 b1 | a1 b0 | a2 b3 | a3 b2
=> _mm256_shuffle_pd with 0b0000
a0 b0 | a0 b1 | a2 b2 | a2 b3 (only rows 0 and 2)
                                                           
Step 0.1
a0 b1 | a1 b0 | a2 b3 | a3 b2 (flipped the order)
a0 b0 | a1 b1 | a2 b2 | a3 b3
=> _mm256_shuffle_pd with 0b1111
a1 b0 | a1 b1 | a3 b2 | a3 b3 (only rows 1 and 3)
                                                           
Next, we perform the same operation on the other two rows:
                                                           
Step 0.2
a0 b2 | a1 b3 | a2 b0 | a3 b1
a0 b3 | a1 b2 | a2 b1 | a3 b0
=> _mm256_shuffle_pd with 0b0000
a0 b2 | a0 b3 | a2 b0 | a2 b1 (only rows 0 and 2)
                                                           
Step 0.3
a0 b3 | a1 b2 | a2 b1 | a3 b0
a0 b2 | a1 b3 | a2 b0 | a3 b1
=> _mm256_shuffle_pd with 0b1111
a1 b2 | a1 b3 | a3 b0 | a3 b1 (only rows 1 and 3)
                                                      
Step 1.0 (combining Steps 0.0 and 0.2):
a0 b0 | a0 b1 | a2 b2 | a2 b3
a0 b2 | a0 b3 | a2 b0 | a2 b1
=> _mm256_permute2f128_pd with 0x20 = 0b0010_0000
a0 b0 | a0 b1 | a0 b2 | a0 b3
                                                           
Step 1.1 (combining Steps 0.0 and 0.2):
a0 b0 | a0 b1 | a2 b2 | a2 b3
a0 b2 | a0 b3 | a2 b0 | a2 b1
=> _mm256_permute2f128_pd with 0x13 = 0b0001_0011
a2 b0 | a2 b1 | a2 b2 | a2 b3
                                                           
Step 1.2 (combining Steps 0.1 and 0.3):
a1 b0 | a1 b1 | a3 b2 | a3 b3
a1 b2 | a1 b3 | a3 b0 | a3 b1
=> _mm256_permute2f128_pd with 0x20 = 0b0010_0000
a1 b0 | a1 b1 | a1 b2 | a1 b3
                                                           
Step 1.3 (combining Steps 0.1 and 0.3):
a1 b0 | a1 b1 | a3 b2 | a3 b3
a1 b2 | a1 b3 | a3 b0 | a3 b1
=> _mm256_permute2f128_pd with 0x13 = 0b0001_0011
a3 b0 | a3 b1 | a3 b2 | a3 b3

The final results are then:

a0 b0 | a0 b1 | a0 b2 | a0 b3
                                                           
a2 b0 | a2 b1 | a2 b2 | a2 b3
                                                           
a1 b0 | a1 b1 | a1 b2 | a1 b3
                                                           
a3 b0 | a3 b1 | a3 b2 | a3 b3

So now, within each vector, the first index stays fixed while the second index varies: that's a row-major layout.
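As with the blend variant, the shuffle + permute steps can be replayed with a pure-Python model of the two intrinsics on symbolic lane values (zeroing bits of permute2f128 again ignored):

```python
def shuffle_pd(a, b, imm):
    # _mm256_shuffle_pd model: per 128-bit half, lane 0 comes from a and
    # lane 1 from b, each picking its element inside the half via one imm bit.
    return [a[imm & 1], b[(imm >> 1) & 1],
            a[2 + ((imm >> 2) & 1)], b[2 + ((imm >> 3) & 1)]]

def permute2f128_pd(a, b, imm):
    # _mm256_permute2f128_pd model: 2-bit selector per result half
    # (0 = a low, 1 = a high, 2 = b low, 3 = b high).
    halves = [a[0:2], a[2:4], b[0:2], b[2:4]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

# The same four striped ab registers as before.
r0 = ["a0b0", "a1b1", "a2b2", "a3b3"]
r1 = ["a0b1", "a1b0", "a2b3", "a3b2"]
r2 = ["a0b2", "a1b3", "a2b0", "a3b1"]
r3 = ["a0b3", "a1b2", "a2b1", "a3b0"]

s00 = shuffle_pd(r0, r1, 0b0000)  # step 0.0: rows 0 and 2
s01 = shuffle_pd(r1, r0, 0b1111)  # step 0.1: rows 1 and 3
s02 = shuffle_pd(r2, r3, 0b0000)  # step 0.2
s03 = shuffle_pd(r3, r2, 0b1111)  # step 0.3

row0 = permute2f128_pd(s00, s02, 0x20)  # step 1.0
row2 = permute2f128_pd(s00, s02, 0x13)  # step 1.1
row1 = permute2f128_pd(s01, s03, 0x20)  # step 1.2
row3 = permute2f128_pd(s01, s03, 0x13)  # step 1.3

assert row0 == ["a0b0", "a0b1", "a0b2", "a0b3"]
assert row1 == ["a1b0", "a1b1", "a1b2", "a1b3"]
assert row2 == ["a2b0", "a2b1", "a2b2", "a2b3"]
assert row3 == ["a3b0", "a3b1", "a3b2", "a3b3"]
```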

SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 3, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
SuperFluffy added a commit to SuperFluffy/matrixmultiply that referenced this issue Dec 4, 2018
bluss closed this as completed in #36 on Dec 7, 2018