Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AVX2 implementation of first_max_element #2806

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

redzic
Copy link
Collaborator

@redzic redzic commented Sep 15, 2021

This also now requires BMI1 and BMI2 for AVX2 in CpuFeatureLevel (for tzcnt).

@tdaede
Copy link
Collaborator

tdaede commented Sep 15, 2021

Resulting assembly:

example::first_max_element_avx2:
        vmovdqu xmm0, xmmword ptr [rdi]
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]
        vpshufd xmm1, xmm0, 238
        vpmaxsd xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 85
        vpmaxsd xmm0, xmm0, xmm1
        vmovd   edx, xmm0
        vpbroadcastd    ymm0, xmm0
        vpcmpeqd        ymm0, ymm0, ymmword ptr [rdi]
        vmovmskps       eax, ymm0
        tzcnt   eax, eax
        vzeroupper
        ret

@tdaede
Copy link
Collaborator

tdaede commented Sep 15, 2021

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

@coveralls
Copy link
Collaborator

coveralls commented Sep 15, 2021

Coverage Status

Coverage decreased (-0.1%) to 83.273% when pulling cff1af9 on redzic:find-first-max-avx2 into 2ec4e67 on xiph:master.

@redzic
Copy link
Collaborator Author

redzic commented Sep 15, 2021

It looks like the autovectorizer fails to use ymm registers for the max. Maybe an intrinsic is needed there too?

I don't think it's possible to use ymm registers for getting the max of 8 32-bit integers, or if it is there probably wouldn't be any benefit.

As far as I understand, this is how the algorithm LLVM uses works:

max:
        ; rdi is a pointer to the first element, and it is statically known that there are 8 elements

        ; load the first 4 elements of the slice into xmm0
        vmovdqu xmm0, xmmword ptr [rdi]

        ; do a packed max the last 4 elements of the slice (rdi + 16)
        ; basically, xmm0 now looks like this (where x represents the original 8-element array):
        ; xmm0 = [max(x[0], x[4]), max(x[1], x[5]), max(x[2], x[6]), max(x[3], x[7])]
        ;
        ; at this point, we can forget about the entire array and just worry about finding the max of
        ; the 4 elements that are now in xmm0, since the max is definitely inside xmm0
        vpmaxsd xmm0, xmm0, xmmword ptr [rdi + 16]

        ; shuffle: xmm1 = xmm0[2,3,2,3]
        vpshufd xmm1, xmm0, 238

        ; packed max of xmm0 and xmm1, store the result in xmm0
        ; here is what the registers look like before this step:
        ; xmm0 = [x[0], x[1], x[2], x[3]]
        ; xmm1 = [x[2], x[3], x[2], x[3]]
        ; so the last 2 elements of xmm0 after this instruction are going to be irrelevant,
        ; and as you can see, the max will be either the first or second element of xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; shuffle : xmm1 = xmm0[1,1,1,1]
        ; xmm0 = [max(x[0], x[2]), max(x[1], x[3]), ...]
        ; xmm1 = [max(x[1], x[3]), max(x[1], x[3]), ...]
        ; (other elements omitted since we don't care about them any more)
        vpshufd xmm1, xmm0, 85

        ; packed max of xmm0 and xmm1, store the result in xmm0
        vpmaxsd xmm0, xmm0, xmm1

        ; now xmm0 = [max(max(x[0], x[2]), max(x[1], x[3])), ...]

        ; so move the first element of xmm0, which is the final max, into eax
        vmovd   eax, xmm0
        ret

@tdaede
Copy link
Collaborator

tdaede commented Sep 15, 2021

Ah yeah sorry, yeah the first max comparison can only be 4 vs 4.

src/cdef.rs Outdated Show resolved Hide resolved
This also now requires BMI1 and BMI2 for AVX2 in `CpuFeatureLevel`.
@KyleSiefring
Copy link
Collaborator

I'm confused. Isn't this already optimized from the assembly called from here? https://github.com/xiph/rav1e/blob/master/src/asm/x86/cdef.rs#L195

@redzic
Copy link
Collaborator Author

redzic commented Sep 16, 2021

@KyleSiefring I think you're right, I didn't notice this.

If I'm not mistaken, the actual assembly corresponding to this particular function seems to be in here:

rav1e/src/x86/cdef_avx2.asm

Lines 1773 to 1785 in 2ec4e67

pmaxsd xm2, xm0, xm1
pshufd xm3, xm2, q1032
pmaxsd xm2, xm3
pshufd xm3, xm2, q2301
pmaxsd xm2, xm3 ; best cost
; find the idx using minpos
; make everything other than the best cost negative via subtraction
; find the min of unsigned 16-bit ints to sort out the negative values
psubd xm4, xm1, xm2
psubd xm3, xm0, xm2
packssdw xm3, xm4
phminposuw xm3, xm3

However, I think there could still be at least some value for making first_max_element in the rust code take a &[i32; 8], as it allows the compiler to eliminate some bounds checks and generate less code.

@lu-zero
Copy link
Collaborator

lu-zero commented Sep 16, 2021

Could you please split the scalar &[i32; 8] part from the patch?

Your simd code would be good in the not-nasm situation but maybe it would need further discussion.

@tdaede
Copy link
Collaborator

tdaede commented Sep 17, 2021

Could we turn this into an implementation for a non-nasm platform, e.g. ARM/PPC/RISC-V, that doesn't implement CDEF yet?

@redzic
Copy link
Collaborator Author

redzic commented Sep 18, 2021

We probably can, I just have to figure out how to do that. I will mark the PR as WIP for now as I try to learn how to write the SIMD implementation for other platforms.

@redzic redzic marked this pull request as draft September 18, 2021 00:39
@redzic
Copy link
Collaborator Author

redzic commented Nov 16, 2021

@lu-zero @tdaede Now that std::simd has reached nightly, we could probably implement a portable SIMD version of this function with it, but gate it behind nightly, similar to how aarch64 intrinsics are currently being used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants