Difficulty with 8 nested loops? #126
I fixed the bug, but it still has trouble with 8 nested loops:

```julia
julia> using LoopVectorization
julia> Q = rand(3,3);
julia> C = rand(3,3,3,3);
julia> Crot = zeros(3,3,3,3);
julia> ls = LoopVectorization.@avx_debug for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for m = 1:3
for n = 1:3
for o = 1:3
for p = 1:3
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] += aux
end
end
end
end;
OPS = Tuple{:numericconstant,Symbol("##0.0#256"),LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000000000, 0x0000000000005678, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000051, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000062, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000073, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x04),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000084, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x04, 0x05),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000005678, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x05, 0x06),:LoopVectorization,:vmul,LoopVectorization.OperationStruct(0x0000000006273845, 0x0000000000000000, 0x0000000000000000, 0x0000000003040506, LoopVectorization.compute, 0x00, 0x07),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000051627384, 0x0000000000005678, 0x0000000000000000, 0x0000000000020701, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:identity,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x0000000000000008, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x06, 0x08),:LoopVectorization,:vadd,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x0000000000000a09, LoopVectorization.compute, 0x00, 0x08),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000001234, 0x0000000000005678, 0x0000000000000000, 0x000000000000000b, LoopVectorization.memstore, 0x06, 0x09)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000501, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000602, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000703, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Q,Symbol("##vptr##_Q")}(0x0000000000000101, 0x0000000000000804, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:C,Symbol("##vptr##_C")}(0x0000000001010101, 0x0000000005060708, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:Crot,Symbol("##vptr##_Crot")}(0x0000000001010101, 0x0000000001020304, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(1, LoopVectorization.HardFloat)},Tuple{}}
LPSYM = Tuple{:i,:j,:k,:l,:m,:n,:o,:p}
LB = NTuple{8,VectorizationBase.StaticUnitRange{1,3}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f91acceb570, (24,)), VectorizationBase.PackedStridedPointer{Float64,3}(Ptr{Float64} @0x00007f91c79e3410, (24, 72, 216)), VectorizationBase.PackedStridedPointer{Float64,3}(Ptr{Float64} @0x00007f91c8155470, (24, 72, 216)))
julia> @time LoopVectorization.lower(ls);
37.202224 seconds (421.04 M allocations: 35.271 GiB, 8.04% gc time)
julia> @time LoopVectorization.choose_order(ls)
37.241371 seconds (421.03 M allocations: 35.271 GiB, 8.05% gc time)
([:k, :l, :j, :i, :m, :p, :n, :o], :i, :l, :i, 3, 3)
```

Lowering (converting the LoopSet object into a Julia expression) takes essentially as long as `choose_order` alone, so the search over loop orders dominates the compile time. At least it seems fast:

```julia
julia> @benchmark rotation_avx!($Crot, $Q, $C)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 991.455 ns (0.00% GC)
median time: 999.182 ns (0.00% GC)
mean time: 1.001 μs (0.00% GC)
maximum time: 2.954 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 11
julia> @benchmark rotation!($Crot, $Q, $C)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.214 μs (0.00% GC)
median time: 5.220 μs (0.00% GC)
mean time: 5.232 μs (0.00% GC)
maximum time: 7.567 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 6
julia> @benchmark rotation_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.657 μs (0.00% GC)
median time: 4.662 μs (0.00% GC)
mean time: 4.678 μs (0.00% GC)
maximum time: 7.952 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 7
```

I replaced the loops with `Base.Cartesian.@nexprs`:

```julia
julia> function rotation_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
# aux = 0.0
@inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
Base.Cartesian.@nexprs 3 j -> begin
Base.Cartesian.@nexprs 3 k -> begin
Base.Cartesian.@nexprs 3 l -> begin
aux = 0.0 # surely!
Base.Cartesian.@nexprs 3 m -> begin
Base.Cartesian.@nexprs 3 n -> begin
Base.Cartesian.@nexprs 3 o -> begin
Base.Cartesian.@nexprs 3 p -> begin
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] = aux
end
end
end
end
end
rotation_nexpr! (generic function with 1 method)
function rotation_avx!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
# aux = 0.0
@avx for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for m = 1:3
for n = 1:3
for o = 1:3
for p = 1:3
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] = aux
end
end
end
end
end
julia> function rotation!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
# aux = 0.0
@inbounds @fastmath for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for m = 1:3
for n = 1:3
for o = 1:3
for p = 1:3
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] = aux
end
end
end
end
end
rotation! (generic function with 1 method)
```

So if you save 3.6 microseconds per evaluation (compared to the `@nexprs` version), it would take roughly ten million evaluations to amortize the 37-second compile time. It's worth thinking about how to speed this up; the number of candidate configurations `choose_order` considers is large:

```julia
julia> factorial(8) * binomial(8,2) * 8
9031680
```

That is `factorial(8)` loop orders, times `binomial(8,2)` choices of which two loops to unroll, times 8 choices of which loop to vectorize. The code itself could also be optimized by caching repeated calculations, for example. Probably also worth just looking it over to see if there is low-hanging fruit.
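As a generic illustration of that kind of caching (hypothetical names, not LoopVectorization's actual internals), memoizing an expensive cost model over candidate loop orders might look like:

```julia
# Cache cost-model evaluations so repeated candidate orders are not re-evaluated.
const ORDER_COSTS = Dict{NTuple{8,Symbol},Float64}()

function cached_cost(order::NTuple{8,Symbol}, evaluate::Function)
    # get! calls `evaluate` only on a cache miss and stores the result
    get!(() -> evaluate(order), ORDER_COSTS, order)
end

# Usage sketch: cached_cost((:i,:j,:k,:l,:m,:n,:o,:p), order -> rand())
```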
Thanks, yes I can see how all orders of 8 loops is quite a few to check! There is indeed low-hanging fruit, in that splitting this up into successive multiplications saves a lot of steps. Although bizarrely I get the wrong answer right now:

```julia
function rot2_ein!(Crot, Q, C)
@einsum mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
@einsum Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
rot2_ein!(Crot,Q,C) ≈ res # true
@btime rot2_ein!($Crot, $Q, $C); # 1.585 μs (2 allocations: 784 bytes)
@tullio avx=true tensor=false
function rot2_avx!(Crot, Q, C)
@tullio mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
@tullio Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
@time rot2_avx!(Crot,Q,C) ≈ res # false?!, 3 sec
extrema(rot2_avx!(Crot,Q,C) .- res) # (-2.769, 3.819)
@btime rot2_avx!($Crot, $Q, $C); # 895.073 ns
```
I can reproduce the wrong answer, but it seems to be a problem in Tullio:

```julia
julia> function rot2_tullio!(Crot, Q, C)
@tullio avx=false mid[m,n,k,l] := Q[o,k] * Q[p,l] * C[m,n,o,p]
@tullio avx=false Crot[i,j,k,l] = Q[m,i] * Q[n,j] * mid[m,n,k,l]
end
rot2_tullio! (generic function with 1 method)
julia> @time rot2_tullio!(Crot,Q,C) ≈ res # false?!, 3 sec
0.073518 seconds (255.17 k allocations: 12.422 MiB)
false
julia> extrema(rot2_tullio!(Crot,Q,C) .- res) # (-2.769, 3.819)
(-2.4461462850848257, 0.36586935225542216)
```

You can also manually unroll one or two of the innermost loops:

```julia
julia> function rotation_avx_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
# aux = 0.0
@avx for i = 1:3
for j ∈ 1:3
for k ∈ 1:3
for l ∈ 1:3
aux = 0.0 # surely!
for m = 1:3
for n = 1:3
Base.Cartesian.@nexprs 3 o -> begin
Base.Cartesian.@nexprs 3 p -> begin
aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p];
end
end
end
end
Crot[i,j,k,l] = aux
end
end
end
end
end
rotation_avx_nexpr! (generic function with 1 method)
julia> @benchmark rotation_avx_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 933.516 ns (0.00% GC)
median time: 942.323 ns (0.00% GC)
mean time: 943.301 ns (0.00% GC)
maximum time: 1.730 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 31
julia> Crot ≈ res
true
```
Right, the wrong answer is my fault, sorry, I should have said. But I don't see why yet. That unrolling helps a lot. If I'm counting right, these nested loops are still 9^4 operations, vs. 9 * 4 * 27 operations if you do 4 successive matrix multiplications, a factor of 6.75 fewer.
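A quick check of that count, treating one trip through an innermost loop body as one operation:

```julia
9^4              # 6561 innermost iterations for the single fused loop nest
9 * 4 * 27       # 972 iterations across the four successive multiplications
9^4 / (9*4*27)   # 6.75
```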
Isn't it, for all in one nested loop (unrolling inside like I did above doesn't really help runtime, just compile time), 4 * 3^8 = 26244 multiplications, four per innermost iteration, versus 4 * 3^5 = 972 for the staged version? Meaning it should be about 27 times faster?

Three implementations:

```julia
function rotation!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4}, mid = Array{Float64}(undef, 3, 3, 3, 3))
@inbounds @fastmath for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for m = 1:3
aux += Q[m,i] * C[m,j,k,l];
end
mid[i,j,k,l] = aux
end
end
end
end
@inbounds @fastmath for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for n = 1:3
aux += Q[n,j] * mid[i,n,k,l];
end
Crot[i,j,k,l] = aux
end
end
end
end
@inbounds @fastmath for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for o = 1:3
aux += Q[o,k] * Crot[i,j,o,l];
end
mid[i,j,k,l] = aux
end
end
end
end
@inbounds @fastmath for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for p = 1:3
aux += Q[p,l] * mid[i,j,k,p];
end
Crot[i,j,k,l] = aux
end
end
end
end
end
function rotation_nexpr!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4})
@inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
Base.Cartesian.@nexprs 3 j -> begin
Base.Cartesian.@nexprs 3 k -> begin
Base.Cartesian.@nexprs 3 l -> begin
mid1_i_j_k_l = 0.0 # surely!
Base.Cartesian.@nexprs 3 m -> begin
mid1_i_j_k_l += Q[m,i] * C[m,j,k,l];
end
end
end
end
end
@inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
Base.Cartesian.@nexprs 3 j -> begin
Base.Cartesian.@nexprs 3 k -> begin
Base.Cartesian.@nexprs 3 l -> begin
mid2_i_j_k_l = 0.0 # surely!
Base.Cartesian.@nexprs 3 n -> begin
mid2_i_j_k_l += Q[n,j] * mid1_i_n_k_l
end
end
end
end
end
@inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
Base.Cartesian.@nexprs 3 j -> begin
Base.Cartesian.@nexprs 3 k -> begin
Base.Cartesian.@nexprs 3 l -> begin
mid3_i_j_k_l = 0.0 # surely!
Base.Cartesian.@nexprs 3 o -> begin
mid3_i_j_k_l += Q[o,k] * mid2_i_j_o_l
end
end
end
end
end
@inbounds @fastmath Base.Cartesian.@nexprs 3 i -> begin
Base.Cartesian.@nexprs 3 j -> begin
Base.Cartesian.@nexprs 3 k -> begin
Base.Cartesian.@nexprs 3 l -> begin
mid4_i_j_k_l = 0.0 # surely!
Base.Cartesian.@nexprs 3 p -> begin
mid4_i_j_k_l += Q[p,l] * mid3_i_j_k_p
end
Crot[i,j,k,l] = mid4_i_j_k_l
end
end
end
end
end
function rotation_avx!(Crot::Array{Float64,4},Q::Array{Float64,2},C::Array{Float64,4}, mid = Array{Float64}(undef, 3, 3, 3, 3))
@avx for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for m = 1:3
aux += Q[m,i] * C[m,j,k,l];
end
mid[i,j,k,l] = aux
end
end
end
end
@avx for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for n = 1:3
aux += Q[n,j] * mid[i,n,k,l];
end
Crot[i,j,k,l] = aux
end
end
end
end
@avx for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for o = 1:3
aux += Q[o,k] * Crot[i,j,o,l];
end
mid[i,j,k,l] = aux
end
end
end
end
@avx for i = 1:3
for j = 1:3
for k = 1:3
for l = 1:3
aux = 0.0 # surely!
for p = 1:3
aux += Q[p,l] * mid[i,j,k,p];
end
Crot[i,j,k,l] = aux
end
end
end
end
end
```

The two loop versions accept a preallocated buffer, but the `@nexprs` version doesn't need one:

```julia
julia> mid = similar(Crot);
julia> @benchmark rotation!($Crot, $Q, $C, $mid)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 408.830 ns (0.00% GC)
median time: 409.735 ns (0.00% GC)
mean time: 410.452 ns (0.00% GC)
maximum time: 555.030 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 200
julia> @benchmark rotation_avx!($Crot, $Q, $C, $mid)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 170.711 ns (0.00% GC)
median time: 172.669 ns (0.00% GC)
mean time: 172.980 ns (0.00% GC)
maximum time: 186.600 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 750
julia> @benchmark rotation_nexpr!($Crot, $Q, $C)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 145.776 ns (0.00% GC)
median time: 149.655 ns (0.00% GC)
mean time: 149.475 ns (0.00% GC)
maximum time: 174.536 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 836
```

It is also (now) the fastest, despite the fact that, checking the assembly, it is not SIMD at all.
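For anyone who wants to check the SIMD claim themselves, the generated assembly can be inspected with standard tooling (assuming the functions above are defined in the session):

```julia
using InteractiveUtils  # exports @code_native (already loaded in the REPL)

# Packed SIMD double-precision instructions on x86 end in `pd` (e.g. vfmadd231pd);
# scalar ones end in `sd`:
@code_native debuginfo=:none rotation_nexpr!(Crot, Q, C)
@code_native debuginfo=:none rotation_avx!(Crot, Q, C, mid)
```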
I think LoopVectorization is making a bad decision above (the decision looks bad to me, anyway). EDIT: Never mind, the decision makes sense. =/
That's pretty quick. LoopVectorization is doing some unrolling then, but not as much? Good catch on the counting, I missed a factor of 4. BTW, am I right to think that the transformation here from the naive loops to the staged algorithm is way out of scope for these automatic transformations? It would require understanding that you can distribute the multiplications over the sums.
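Concretely, the factorization behind the staged algorithm is:

```latex
C^{\mathrm{rot}}_{ijkl}
  = \sum_{m,n,o,p} Q_{mi}\, Q_{nj}\, Q_{ok}\, Q_{pl}\, C_{mnop}
  = \sum_{p} Q_{pl} \sum_{o} Q_{ok} \sum_{n} Q_{nj} \sum_{m} Q_{mi}\, C_{mnop}
```

Each factor is constant with respect to the sums over the other indices, so it can be pulled outside, turning one 8-index contraction into four successive single-index contractions. That distributivity is exactly what an automatic transformation would need to recognize.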
LoopVectorization completely unrolls it and uses SIMD, but it seems that is slower: LoopVectorization's assembly features a huge number of address calculations. The addressing calculations are something I've struggled with for a while, because LLVM is very bad at them. I'll also make LoopVectorization aware of the vector width (it isn't aware while optimizing, an old hold-over from the days before it used a generated function to actually know the element types of the arrays), so that it won't need LLVM to perform cleanup (which is why it was important for LLVM to be aware of the number of iterations).
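For context, the vector width in question is the number of SIMD lanes per register, which VectorizationBase can report (assuming a version where the query is spelled this way):

```julia
using VectorizationBase

# Float64 lanes per SIMD register on the host CPU,
# e.g. 4 with AVX2 or 8 with AVX-512:
VectorizationBase.pick_vector_width(Float64)
```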
That's not something it can do right now, but I'm inclined to call these transformations in scope. Perhaps the way to go about it would be to …
It should also be taught the distributive property more generally, and use it in its modelling. Currently, it still just naively pattern-matches to turn multiply-add pairs like `a * b + c` into fused `muladd` calls.
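A toy version of that pattern matching, operating on surface syntax (LoopVectorization's real implementation works on its own internal representation, so this is only an illustration):

```julia
# Rewrite  a * b + c  into  muladd(a, b, c)  by matching the expression tree.
function fuse_muladd(ex)
    ex isa Expr || return ex
    args = map(fuse_muladd, ex.args)       # rewrite children first
    if ex.head == :call && args[1] == :+ && length(args) == 3 &&
       args[2] isa Expr && args[2].head == :call &&
       args[2].args[1] == :* && length(args[2].args) == 3
        a, b = args[2].args[2], args[2].args[3]
        return :(muladd($a, $b, $(args[3])))
    end
    return Expr(ex.head, args...)
end

fuse_muladd(:(x * y + z))   # :(muladd(x, y, z))
```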
As of LoopVectorization 0.12.40, compile times should be much better here.
From https://stackoverflow.com/questions/62245905/optimisation-of-4d-tensor-rotation, a strange error:

It works if I use `@avx for m = 1:3`, i.e. only on the inner loops. It also works when generated by `@tullio Crot[i,j,k,l] += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]`, which has each group of loops in the opposite order but seems otherwise similar. But it takes almost a whole minute to compile.
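For reference, a sketch of the working variant described above, with `@avx` applied only to the inner contraction loops (assuming the same scalar reduction into `aux` used in the examples above):

```julia
using LoopVectorization

function rotation_inner_avx!(Crot, Q, C)
    for i = 1:3, j = 1:3, k = 1:3, l = 1:3
        aux = 0.0
        # Only the four contraction loops are handed to @avx:
        @avx for m = 1:3, n = 1:3, o = 1:3, p = 1:3
            aux += Q[m,i] * Q[n,j] * Q[o,k] * Q[p,l] * C[m,n,o,p]
        end
        Crot[i,j,k,l] += aux
    end
end
```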