Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difficulty with 8 nested loops? #126

Open
mcabbott opened this issue Jun 8, 2020 · 10 comments
Open

Difficulty with 8 nested loops? #126

mcabbott opened this issue Jun 8, 2020 · 10 comments

Comments

@mcabbott
Copy link

mcabbott commented Jun 8, 2020

From https://stackoverflow.com/questions/62245905/optimisation-of-4d-tensor-rotation, a strange error:

Q    = rand(3,3);
C    = rand(3,3,3,3);
Crot = zeros(3,3,3,3);
function rotation_avx!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
    # aux = 0.0
    @avx for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for m = 1:3
                        for n = 1:3
                            for o = 1:3
                                for p = 1:3
                                    aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                end
                            end
                        end
                    end
                    Crot[i,j,k,l] += aux
                end
            end
        end
    end
end
rotation_avx!(Crot, Q, C)
# ERROR: InexactError: trunc(Int64, Inf)
# Stacktrace:
#  [1] trunc at ./float.jl:703 [inlined]
#  [2] floor at ./float.jl:363 [inlined]
#  [3] solve_unroll_constU(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Int64) at /Users/me/.julia/packages/LoopVectorization/lmGAZ/src/determinestrategy.jl:301
#  [4] solve_unroll(::Symbol, ::Symbol, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Int64, ::Symbol, ::LoopVectorization.Loop, ::LoopVectorization.Loop) at /Users/me/.julia/packages/LoopVectorization/lmGAZ/src/determinestrategy.jl:389

It works if I use @avx for m = 1:3, i.e. only on the inner loops.

It also works when generated by @tullio Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p], which has each group of loops in the opposite order but seems otherwise similar. But it takes almost a whole minute to compile.

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

I fixed the bug, but it still has trouble with 8 nested loops:

julia> using LoopVectorization

julia> Q    = rand(3,3);

julia> C    = rand(3,3,3,3);

julia> Crot = zeros(3,3,3,3);

julia> ls = LoopVectorization.@avx_debug for i = 1:3
               for j = 1:3
                   for k = 1:3
                       for l = 1:3
                           aux = 0.0 # surely!
                           for m = 1:3
                               for n = 1:3
                                   for o = 1:3
                                       for p = 1:3
                                           aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                       end
                                   end
                               end
                           end
                           Crot[i,j,k,l] += aux
                       end
                   end
               end
           end;
OPS = Tuple{:numericconstant,Symbol("##0.0#256"),LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000000000, 0x0000000000005678, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000051, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000062, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000073, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x04),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000084, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x04, 0x05),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000005678, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x05, 0x06),:LoopVectorization,:vmul,LoopVectorization.OperationStruct(0x0000000006273845, 0x0000000000000000, 0x0000000000000000, 0x0000000003040506, LoopVectorization.compute, 0x00, 0x07),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000051627384, 0x0000000000005678, 0x0000000000000000, 0x0000000000020701, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:identity,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x0000000000000008, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x06, 0x08),:LoopVectorization,:vadd,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x0000000000000a09, LoopVectorization.compute, 0x00, 0x08),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x000000000000000b, LoopVectorization.memstore, 0x06, 0x09)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000501, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000602, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000703, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000804, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:C,Symbol("##vptr##_C")}(0x0000000001010101, 0x0000000005060708, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Crot,Symbol("##vptr##_Crot")}(0x0000000001010101, 0x0000000001020304, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(1, LoopVectorization.HardFloat)},Tuple{}}
LPSYM = Tuple{:i,:j,:k,:l,:m,:n,:o,:p}
LB = NTuple{8,VectorizationBase.StaticUnitRange{1,3}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,3}(Ptr{Float64} @0x00007f91c79e3410, (24, 72, 216)), VectorizationBase.PackedStridedPointer{Float64,3}(Ptr{Float64} @0x00007f91c8155470, (24, 72, 216)))

julia> @time LoopVectorization.lower(ls);
 37.202224 seconds (421.04 M allocations: 35.271 GiB, 8.04% gc time)

julia> @time LoopVectorization.choose_order(ls)
 37.241371 seconds (421.03 M allocations: 35.271 GiB, 8.05% gc time)
([:k, :l, :j, :i, :m, :p, :n, :o], :i, :l, :i, 3, 3)

Lowering (converting the LoopSet object into a Julia Expr) takes about 37 seconds, about 100% of which is just picking the order to evaluate the loops.

At least it seems fast:

julia> @benchmark rotation_avx!($Crot, $Q, $C)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     991.455 ns (0.00% GC)
  median time:      999.182 ns (0.00% GC)
  mean time:        1.001 μs (0.00% GC)
  maximum time:     2.954 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     11

julia> @benchmark rotation!($Crot, $Q, $C)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.214 μs (0.00% GC)
  median time:      5.220 μs (0.00% GC)
  mean time:        5.232 μs (0.00% GC)
  maximum time:     7.567 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark rotation_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.657 μs (0.00% GC)
  median time:      4.662 μs (0.00% GC)
  mean time:        4.678 μs (0.00% GC)
  maximum time:     7.952 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

I replaced the += with = for the sake of being easier to test after running a benchmark; the three versions:

julia> function rotation_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
           # aux = 0.0
           @inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
               Base.Cartesian.@nexprs 3 j -> begin
                   Base.Cartesian.@nexprs 3 k -> begin
                       Base.Cartesian.@nexprs 3 l -> begin
                           aux = 0.0 # surely!
                           Base.Cartesian.@nexprs 3 m -> begin
                               Base.Cartesian.@nexprs 3 n -> begin
                                   Base.Cartesian.@nexprs 3 o -> begin
                                       Base.Cartesian.@nexprs 3 p -> begin
                                           aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                       end
                                   end
                               end
                           end
                           Crot[i,j,k,l] = aux
                       end
                   end
               end
           end
       end
rotation_nexpr! (generic function with 1 method)

function rotation_avx!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
           # aux = 0.0
           @avx for i = 1:3
               for j = 1:3
                   for k = 1:3
                       for l = 1:3
                           aux = 0.0 # surely!
                           for m = 1:3
                               for n = 1:3
                                   for o = 1:3
                                       for p = 1:3
                                           aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                       end
                                   end
                               end
                           end
                           Crot[i,j,k,l] = aux
                       end
                   end
               end
           end
       end

julia> function rotation!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
           # aux = 0.0
           @inbounds @fastmath for i = 1:3
               for j = 1:3
                   for k = 1:3
                       for l = 1:3
                           aux = 0.0 # surely!
                           for m = 1:3
                               for n = 1:3
                                   for o = 1:3
                                       for p = 1:3
                                           aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                       end
                                   end
                               end
                           end
                           Crot[i,j,k,l] = aux
                       end
                   end
               end
           end
       end
rotation! (generic function with 1 method)

So if you save 3.6 microseconds per evaluation (compared to the @nexpr version), you'll just need to evaluate it around 10 million times to pay for the excessive compilation cost...

Its worth thinking about how to speed up choose_order here.
The code is really bad in terms of computational complexity.
Letting N be the number of loops, it iterates through all possible loop orders (N!) all possible two-loop combinations to unroll (N choose 2), and all possible loops to vectorize (N), making for nine million combinations

julia> factorial(8) * binomial(8,2) * 8
9031680

The code itself could also be optimized by caching repeated calculations, for example. Probably also worth just looking over to see if there are low hanging fruit.

@mcabbott
Copy link
Author

mcabbott commented Jun 8, 2020

Thanks, yes I can see how all orders of 8 loops is quite a few to check!

There is indeed low-hanging fruit, in that splitting this up into successive multiplications saves a lot of steps. Although bizarrely I get the wrong answer right now:

function rot2_ein!(Crot, Q, C)
    @einsum mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
    @einsum Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
rot2_ein!(Crot,Q,C)  res # true 
@btime rot2_ein!($Crot, $Q, $C);  # 1.585 μs (2 allocations: 784 bytes)


@tullio avx=true tensor=false
function rot2_avx!(Crot, Q, C)
    @tullio mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
    @tullio Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
@time rot2_avx!(Crot,Q,C)  res # false?!, 3 sec
extrema(rot2_avx!(Crot,Q,C) .- res) # (-2.769, 3.819)

@btime rot2_avx!($Crot, $Q, $C);  # 895.073 ns

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

I can reproduce the wrong answer, but it seems to be a problem in Tullio:

julia> function rot2_tullio!(Crot, Q, C)
           @tullio avx=false mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
           @tullio avx=false Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
       end
rot2_tullio! (generic function with 1 method)

julia> @time rot2_tullio!(Crot,Q,C)  res # false?!, 3 sec
  0.073518 seconds (255.17 k allocations: 12.422 MiB)
false

julia> extrema(rot2_tullio!(Crot,Q,C) .- res) # (-2.769, 3.819)
(-2.4461462850848257, 0.36586935225542216)

You can also manually unroll 1 or 2 of the inner most loops:

julia> function rotation_avx_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
           # aux = 0.0
           @avx for i = 1:3
               for j  1:3
                   for k  1:3
                       for l  1:3
                           aux = 0.0 # surely!
                           for m = 1:3
                               for n = 1:3
                                   Base.Cartesian.@nexprs 3 o -> begin
                                       Base.Cartesian.@nexprs 3 p -> begin
                                           aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
                                       end
                                   end
                               end
                           end
                           Crot[i,j,k,l] = aux
                       end
                   end
               end
           end
       end
rotation_avx_nexpr! (generic function with 1 method)

julia> @benchmark rotation_avx_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     933.516 ns (0.00% GC)
  median time:      942.323 ns (0.00% GC)
  mean time:        943.301 ns (0.00% GC)
  maximum time:     1.730 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     31

julia> Crot  res
true

@mcabbott
Copy link
Author

mcabbott commented Jun 8, 2020

Right, the wrong answer is my fault, sorry I should have said. But I don't see why yet.

That unrolling helps a lot. If I'm counting right, these nested loops are still 9^4 operations, vs. 9*4 * 27 operations if you do 4 different matrix multiplications, a factor 6.75 less.

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

Isn't it, for all in one nested loop (unrolling inside like I did above doesn't really help runtime, just compile time):
5 * 3 ^ 8 = 32805
floating point operations, or perhaps more usefully
4 * 3^8 = 26244
multiplications. I say the multiplications are more useful here for comparing the expected speed of the different algorithms, because as long as the number of additions exceeds the number of multiplications, the additions are "free" (assuming the CPU has an FMA instruction set).
Versus:
8 * 3^5 = 1944 floating point ops, or
4 * 3^5 = 972 multiplications.

Meaning it should be about 27 times faster?

Three implementations:

function rotation!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4}, mid = Array{Float64}(undef, 3, 3, 3, 3))
    @inbounds @fastmath for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for m = 1:3
                        aux += Q[m,i] * C[m,j,k,l];
                    end
                    mid[i,j,k,l] = aux
                end
            end
        end
    end
    @inbounds @fastmath for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for n = 1:3
                        aux += Q[n,j] * mid[i,n,k,l];
                    end
                    Crot[i,j,k,l] = aux
                end
            end
        end
    end
    @inbounds @fastmath for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for o = 1:3
                        aux += Q[o,k] * Crot[i,j,o,l];
                    end
                    mid[i,j,k,l] = aux
                end
            end
        end
    end
    @inbounds @fastmath for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for p = 1:3
                        aux += Q[p,l] * mid[i,j,k,p];
                    end
                    Crot[i,j,k,l] = aux
                end
            end
        end
    end
end

function rotation_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
    @inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
        Base.Cartesian.@nexprs 3 j -> begin
            Base.Cartesian.@nexprs 3 k -> begin
                Base.Cartesian.@nexprs 3 l -> begin
                    mid1_i_j_k_l = 0.0 # surely!
                    Base.Cartesian.@nexprs 3 m -> begin
                        mid1_i_j_k_l += Q[m,i] * C[m,j,k,l];
                    end
                end
            end
        end
    end
    @inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
        Base.Cartesian.@nexprs 3 j -> begin
            Base.Cartesian.@nexprs 3 k -> begin
                Base.Cartesian.@nexprs 3 l -> begin
                    mid2_i_j_k_l = 0.0 # surely!
                    Base.Cartesian.@nexprs 3 n -> begin
                        mid2_i_j_k_l += Q[n,j] * mid1_i_n_k_l
                    end
                end
            end
        end
    end
    @inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
        Base.Cartesian.@nexprs 3 j -> begin
            Base.Cartesian.@nexprs 3 k -> begin
                Base.Cartesian.@nexprs 3 l -> begin
                    mid3_i_j_k_l = 0.0 # surely!
                    Base.Cartesian.@nexprs 3 o -> begin
                        mid3_i_j_k_l += Q[o,k] * mid2_i_j_o_l
                    end
                end
            end
        end
    end
    @inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
        Base.Cartesian.@nexprs 3 j -> begin
            Base.Cartesian.@nexprs 3 k -> begin
                Base.Cartesian.@nexprs 3 l -> begin
                    mid4_i_j_k_l = 0.0 # surely!
                    Base.Cartesian.@nexprs 3 p -> begin
                        mid4_i_j_k_l += Q[p,l] * mid3_i_j_k_p
                    end
                    Crot[i,j,k,l] = mid4_i_j_k_l
                end
            end
        end
    end
end



function rotation_avx!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4}, mid = Array{Float64}(undef, 3, 3, 3, 3))
    @avx for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for m = 1:3
                        aux += Q[m,i] * C[m,j,k,l];
                    end
                    mid[i,j,k,l] = aux
                end
            end
        end
    end
    @avx for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for n = 1:3
                        aux += Q[n,j] * mid[i,n,k,l];
                    end
                    Crot[i,j,k,l] = aux
                end
            end
        end
    end
    @avx for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for o = 1:3
                        aux += Q[o,k] * Crot[i,j,o,l];
                    end
                    mid[i,j,k,l] = aux
                end
            end
        end
    end
    @avx for i = 1:3
        for j = 1:3
            for k = 1:3
                for l = 1:3
                    aux = 0.0 # surely!
                    for p = 1:3
                        aux += Q[p,l] * mid[i,j,k,p];
                    end
                    Crot[i,j,k,l] = aux
                end
            end
        end
    end
end

The two loop-versions accept a preallocated buffer, but the @nexpr version doesn't need it.

julia> mid = similar(Crot);

julia> @benchmark rotation!($Crot, $Q, $C, $mid)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     408.830 ns (0.00% GC)
  median time:      409.735 ns (0.00% GC)
  mean time:        410.452 ns (0.00% GC)
  maximum time:     555.030 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     200

julia> @benchmark rotation_avx!($Crot, $Q, $C, $mid)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     170.711 ns (0.00% GC)
  median time:      172.669 ns (0.00% GC)
  mean time:        172.980 ns (0.00% GC)
  maximum time:     186.600 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     750

julia> @benchmark rotation_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     145.776 ns (0.00% GC)
  median time:      149.655 ns (0.00% GC)
  mean time:        149.475 ns (0.00% GC)
  maximum time:     174.536 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     836

It is also (now) the fastest. Checking the assembly, this was despite the fact it was not SIMD at all.

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

I think LoopVectorization is making a bad decision above (the decision looks bad to me, anyway).
I'm looking into why now.

EDIT: Nevermind, the decision makes sense. =/

@mcabbott
Copy link
Author

mcabbott commented Jun 8, 2020

That's pretty qiuck. LoopVectorization is doing some unnrolling then, but not as much?

Good catch on the counting, I missed a factor 4.

BTW, am I right to think that the transformation here from the naiive loops to the staged algorithm is way out of scope for these automatic transformations? It would require understanding that you can distribute * over +, and hence pull e.g. Q[m,i] * out of the innermost 3 loops... this is too hard to see from the loopset I presume? This factorisation is what TensorOperations does, but working at a different level. (And ends up slow I think because of doing permutedims & calling BLAS on such small arrays.)

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

LoopVectorization completely unrolls it and uses SIMD, but it seems that is slower.

LoopVectorization's assembly features a huge number of address calculations.
SIMD-ing the first multiplication isn't especially efficient either, because the three elements of the vector then have to be summed up before storing, since it prefers not to vectorize the i-loop thanks to those loads not being contiguous in Q[m,i].

The addressing calculations are something I've struggled with for a while, because LLVM is very bad at them.
I recently made changes to improve them, but those changes actually get turned off for loops with compile-time known numbers of iterations, because those changes inadvertently prevented LLVM from seeing how many iterations there are.
I should be more clever in controlling them [EDIT: no, changing the current behavior regresses performance]. E.g., LoopVectorization itself manually unrolls many of these loops, so it makes sense to turn them on there.

I'll also make LoopVectorization aware of the vector width (it isn't aware while optimizing, as an old hold-over from the days before it used a generated function to actually know the element types of the arrays), so that it wont need LLVM to perform cleanup (which is why it was important for LLVM to be aware of the number of iterations).

@chriselrod
Copy link
Member

chriselrod commented Jun 8, 2020

BTW, am I right to think that the transformation here from the naiive loops to the staged algorithm is way out of scope for these automatic transformations? It would require understanding that you can distribute * over +, and hence pull e.g. Q[m,i] * out of the innermost 3 loops... this is too hard to see from the loopset I presume? This factorisation is what TensorOperations does, but working at a different level. (And ends up slow I think because of doing permutedims & calling BLAS on such small arrays.)

That's not something it can do right now, but I'm inclined to call these transformations in scope.

Perhaps the way to go about it would be to

  1. Give LoopVectorization a preallocated buffer to work with. Eg. CONST BUFFER = [UInt8[] for _ in 1:Threads.nthreads()].
  2. Update split_loops.jl to use this buffer as a means of splitting apart dependency chains (by storing and loading from the thread-local buffer).
  3. Figure out when you're allowed to delete loops outright when splitting.

It should also be taught about the distributive property more generally, and use it in its modelling. Currently, it still just naively pattern matches to turn a * b + c into a muladd. It uses fast flags to let LLVM do that more generally (when it doesn't use asm call to avoid LLVM's bad decisions), but the naive substitution makes cost modeling more accurate.

@chriselrod
Copy link
Member

As of LoopVectorization 0.12.40, compile times should be much better here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants