256 and 512 variants of lab 'core_bound/compiler_intrinsics_1' #75

XCemaXX · 2023-08-20T17:08:29Z

XCemaXX
Aug 20, 2023
Collaborator

Hi. I'm asking for help with lab core_bound/compiler_intrinsics_1

For 256-bit registers I tried loading the data like this:

__m128i sub_u8 =  _mm_loadu_si128((const __m128i_u*)(first + i));
__m256i sub = _mm256_cvtepu8_epi16(sub_u8);

I think this part is OK; I checked the values in the debugger.

// I changed the 128-bit functions to:
_mm256_sub_epi16
_mm256_add_epi16
_mm256_slli_si256(a, imm) // with shifts of 2, 4, 8, 16 bytes; I'm not sure this is right
_mm256_storeu_si256
_mm256_extract_epi16(res, 15);
_mm256_set1_epi16
// change loop step:
for (; i+15 < limit - pos; i += 16)

With these changes validation fails for me. Is my error in the shift function, because it shifts each 128-bit half separately instead of the whole 256-bit value?
The description of the function differs from the 128-bit version: 'Shifts each 128-bit half of the 256-bit integer vector a left by imm bytes, shifting in zero bytes, and returns the result. If imm is greater than 15, the returned result is all zeroes.'

I also tried a 512-bit version with:

__m256i sub_u8 =  _mm256_loadu_si256((const __m256i_u*)(first + i));
__m512i sub = _mm512_cvtepu8_epi16(sub_u8);
_mm512_sub_epi16
_mm512_add_epi16
_mm512_slli_si512 // there is no such instruction. Any alternatives?
_mm512_storeu_si512
_mm512_extract_epi16
_mm512_set1_epi16

There is actually no such shift function.
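
A possible alternative, offered as an untested suggestion rather than something from the lab: AVX-512BW has `_mm512_maskz_permutexvar_epi16(k, idx, a)`, which moves 16-bit elements across all lanes and zeroes the elements whose mask bit is clear. The sketch below is only a portable scalar model of that behavior; the names `Words32`, `model_maskz_permutexvar`, `shift_left_elems` and `iota_words` are hypothetical helpers, and whether this approach is fast enough is not measured here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Words32 = std::array<std::uint16_t, 32>;  // element 0 = lowest 16 bits

// Scalar model of _mm512_maskz_permutexvar_epi16(k, idx, a):
// dst[i] = (bit i of k is set) ? a[idx[i] & 31] : 0.
Words32 model_maskz_permutexvar(std::uint32_t k, const Words32& idx,
                                const Words32& a) {
  Words32 r{};  // zero-initialized
  for (std::size_t i = 0; i < 32; ++i)
    if ((k >> i) & 1u) r[i] = a[idx[i] & 31u];
  return r;
}

// Shift the whole 512-bit vector left by n 16-bit elements, filling with zeros.
Words32 shift_left_elems(const Words32& a, std::size_t n) {
  Words32 idx{};
  std::uint32_t k = 0;
  for (std::size_t i = n; i < 32; ++i) {
    idx[i] = static_cast<std::uint16_t>(i - n);  // element i reads element i - n
    k |= (1u << i);                              // keep only elements with a source
  }
  return model_maskz_permutexvar(k, idx, a);
}

// Test input: elements 1, 2, ..., 32.
Words32 iota_words() {
  Words32 a{};
  for (std::size_t i = 0; i < 32; ++i) a[i] = static_cast<std::uint16_t>(i + 1);
  return a;
}
```

In a real port the index vector and mask for each shift amount would be constants prepared outside the loop.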

dendibakh · 2023-08-21T15:23:32Z

dendibakh
Aug 21, 2023
Maintainer

Hi @XCemaXX, you are trying to port the solution to AVX2, correct?
I don't know the exact algorithm you use (can you share the full code of the loop?), but I think the problem could be that _mm_slli_si128 and _mm256_slli_si256 are not functionally equivalent.
Example:

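_mm_slli_si128 (X, 4) // shift left by 4 bytes (one 32-bit element; most significant element shown first)
input:    X1 X2 X3 X4
output:   X2 X3 X4 0

_mm256_slli_si256 (XY, 4) // shift left by 4 bytes, applied to each 128-bit lane separately
input:    X1 X2 X3 X4 | Y1 Y2 Y3 Y4
output:   X2 X3 X4 0  | Y2 Y3 Y4 0
                      ^
              128-bit lane split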
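
The difference can also be checked with a small scalar model. This is only a sketch of the documented behavior, not the intrinsics themselves; `Bytes32`, `model_slli_si256`, `model_full_shift_left` and `iota32` are hypothetical names, and byte 0 is the least significant byte.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;  // byte 0 = least significant

// Scalar model of _mm256_slli_si256(a, n): each 16-byte lane is shifted
// independently, so no byte moves from the low lane into the high lane.
Bytes32 model_slli_si256(const Bytes32& a, std::size_t n) {
  assert(n < 256);       // the real immediate is 8 bits wide
  Bytes32 r{};           // zero-initialized
  if (n > 15) return r;  // documented: all zeroes when imm > 15
  for (std::size_t lane = 0; lane < 32; lane += 16)
    for (std::size_t i = n; i < 16; ++i) r[lane + i] = a[lane + i - n];
  return r;
}

// What a 256-bit port of a 128-bit byte shift needs: one shift over all 32 bytes.
Bytes32 model_full_shift_left(const Bytes32& a, std::size_t n) {
  Bytes32 r{};
  for (std::size_t i = n; i < 32; ++i) r[i] = a[i - n];
  return r;
}

// Test input: bytes 1, 2, ..., 32.
Bytes32 iota32() {
  Bytes32 a{};
  for (std::size_t i = 0; i < 32; ++i) a[i] = static_cast<std::uint8_t>(i + 1);
  return a;
}
```

For a shift of 2 bytes the two models differ only in bytes 16 and 17, which is exactly where data has to cross the lane boundary. For a shift of 16 bytes the per-lane model returns all zeros.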

I always use the following printf-style debugging to dump the contents of a vector:

  auto printVector8 = [](auto vec, auto name) {
    uint8_t debug[32];
    _mm256_storeu_si256((__m256i*)debug, vec);
    std::cout << name;
    for (int i = 0; i < 32; i++)
      std::cout << (int)debug[i] << " ";
    std::cout << "\n";
  };
  auto printVector16 = [](auto vec, auto name) {
    uint16_t debug[16];
    _mm256_storeu_si256((__m256i*)debug, vec);
    std::cout << name;
    for (int i = 0; i < 16; i++)
      std::cout << (int)debug[i] << " ";
    std::cout << "\n";
  };
  // ... etc
  printVector8(YMM1, "YMM1    :");

Hope that helps.

4 replies

XCemaXX Aug 23, 2023
Collaborator Author

Yes, I'm trying to port the solution to AVX2.
I used the algorithm from the lesson video, but I changed the 128-bit functions to 256-bit functions (described above). I didn't post the algorithm because I don't want to write a spoiler.
Thanks, you confirmed my assumptions about _mm_slli_si128 and _mm256_slli_si256. And thank you for the debugging tip.
Now I need to find out how to make this change:
input: X1 X2 X3 X4 | Y1 Y2 Y3 Y4
output: X2 X3 X4 Y1 | Y2 Y3 Y4 0
I'm not the first to encounter this problem. I found these on Stack Overflow:
https://stackoverflow.com/questions/25248766/emulating-shifts-on-32-bytes-with-avx
https://stackoverflow.com/questions/49367822/how-to-implement-lane-crossing-logical-bit-wise-shift-rotate-left-and-right-in

But it seems inefficient to save Y1 and shift it, execute _mm256_slli_si256, and then add the shifted Y1 to the result of the previous command. The algorithm has 4 such shifts (by 2, 4, 8, and 16 bytes). Anyway, I will try to implement it and check the performance.
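
To see why all four shifts must cross the lane boundary, here is a scalar sketch. It assumes the loop builds an in-register inclusive prefix sum over sixteen 16-bit elements, which is what the shift amounts of 2, 4, 8, and 16 bytes suggest; the thread does not show the actual algorithm. `Words16`, `shift_elems`, `prefix_sum` and `ones` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Words16 = std::array<std::uint16_t, 16>;  // element 0 = lowest 16 bits

// Shift left by n elements, filling with zeros.
// lane_width = 16 models a true 256-bit shift;
// lane_width = 8 models the per-128-bit-lane behavior of _mm256_slli_si256.
Words16 shift_elems(const Words16& v, std::size_t n, std::size_t lane_width) {
  Words16 r{};  // zero-initialized
  for (std::size_t i = 0; i < 16; ++i)
    if (i % lane_width >= n) r[i] = v[i - n];
  return r;
}

// Log-step inclusive prefix sum: shifts by 1, 2, 4, 8 elements,
// i.e. 2, 4, 8, 16 bytes for 16-bit elements.
Words16 prefix_sum(Words16 v, std::size_t lane_width) {
  for (std::size_t n = 1; n < 16; n *= 2) {
    const Words16 s = shift_elems(v, n, lane_width);
    for (std::size_t i = 0; i < 16; ++i)
      v[i] = static_cast<std::uint16_t>(v[i] + s[i]);
  }
  return v;
}

// Test input: sixteen ones, whose prefix sum should be 1, 2, ..., 16.
Words16 ones() {
  Words16 v{};
  v.fill(1);
  return v;
}
```

With `lane_width = 16` the result for sixteen ones is 1 through 16. With `lane_width = 8` the upper half restarts at 1, which would explain a failing validation.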

dendibakh Aug 24, 2023
Maintainer

Cool! I'm glad to hear you're making progress...
I think I implemented it with AVX2, I just don't remember, it was ~1.5 years ago, maybe more. :)
I think you can still do it with shuffles. Although it will be a cross-lane shuffle, which is more expensive than an in-lane shuffle.
Since you have 16-bit integers, _mm256_permute_ps will not work, but this might do the job: _mm256_blendv_epi8(YMM1, YMM_ALL_ZEROS, mask), where you need to have different masks for different shifts.
Those blendV instructions are expensive, so it would be interesting to compare performance with SSE implementation.

dendibakh Aug 24, 2023
Maintainer

Also, check out this website: https://www.officedaytime.com/simd512e/
There are some nice visualizations for asm instructions.

XCemaXX Aug 27, 2023
Collaborator Author

Thanks for the help and the link 👍

XCemaXX · 2023-08-27T05:08:19Z

XCemaXX
Aug 27, 2023
Collaborator Author

I implemented the shift like this:
#define real_mm256_slli_si256(A, N) _mm256_alignr_epi8((A), _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(0, 0, 2, 0)), 16 - (N))
And got times:
LinuxIntelCoffeelake
variant  benchmark          Time     CPU      Iterations
sse128   bench_partial_sum  8.13 us  8.13 us  172180
avx256   bench_partial_sum  7.48 us  7.48 us  187505
no ci    bench_partial_sum  22.0 us  22.0 us  63762

LinuxIntelAlderlake
variant  benchmark          Time     CPU      Iterations
sse128   bench_partial_sum  4.31 us  4.31 us  324683
avx256   bench_partial_sum  5.35 us  5.35 us  261570
no ci    bench_partial_sum  17.2 us  17.2 us  81229

WinZen3
variant  benchmark          Time     CPU      Iterations
sse128   bench_partial_sum  6.24 us  4.27 us  373333
avx256   bench_partial_sum  7.48 us  4.98 us  213333
no ci    bench_partial_sum  26.3 us  15.2 us  81455

LinuxWSL Skylake (myPC)
variant  benchmark          Time     CPU      Iterations
sse128   bench_partial_sum  9.46 us  9.46 us  73882
avx256   bench_partial_sum  8.87 us  8.87 us  75924
no ci    bench_partial_sum  26.0 us  26.0 us  27474
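
The real_mm256_slli_si256 macro can be sanity-checked with a scalar model of the two intrinsics it combines. This is a sketch based on the documented semantics, not on running the intrinsics; `Bytes32` and the `model_*` helpers are hypothetical names. `_MM_SHUFFLE(0, 0, 2, 0)` evaluates to 0x08, which for `_mm256_permute2x128_si256` zeroes the low lane and copies the low lane of A into the high lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;  // byte 0 = least significant

// Model of _mm256_permute2x128_si256(a, a, 0x08):
// low lane zeroed (imm bit 3 set), high lane = low lane of a.
Bytes32 model_permute_0x08(const Bytes32& a) {
  Bytes32 t{};  // zero-initialized
  for (std::size_t i = 0; i < 16; ++i) t[16 + i] = a[i];
  return t;
}

// Model of _mm256_alignr_epi8(a, b, k): per 128-bit lane, take 16 bytes
// starting at byte k of the 32-byte value (a_lane : b_lane), b_lane lowest.
Bytes32 model_alignr(const Bytes32& a, const Bytes32& b, std::size_t k) {
  Bytes32 r{};
  for (std::size_t lane = 0; lane < 32; lane += 16)
    for (std::size_t j = 0; j < 16; ++j) {
      const std::size_t src = j + k;
      if (src < 16) r[lane + j] = b[lane + src];
      else if (src < 32) r[lane + j] = a[lane + src - 16];
    }
  return r;
}

// Model of the macro: alignr(A, permute(A), 16 - N).
Bytes32 model_real_slli(const Bytes32& a, std::size_t n) {
  assert(n <= 16);  // 16 - n must not go negative
  return model_alignr(a, model_permute_0x08(a), 16 - n);
}

// Reference: shift all 32 bytes left by n, filling with zeros.
Bytes32 model_full_shift_left(const Bytes32& a, std::size_t n) {
  Bytes32 r{};
  for (std::size_t i = n; i < 32; ++i) r[i] = a[i - n];
  return r;
}

// Test input: bytes 1, 2, ..., 32.
Bytes32 iota32() {
  Bytes32 a{};
  for (std::size_t i = 0; i < 32; ++i) a[i] = static_cast<std::uint8_t>(i + 1);
  return a;
}
```

Under this model the macro matches a full 256-bit byte shift for every shift amount from 0 to 16, including the 2, 4, 8, and 16 byte shifts used here.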
