Summing along a dimension of a PermutedDimsArray could be faster #38774

pdeffebach · 2020-12-08T17:30:58Z

I've been wanting to store all my matrices with particular axes, but some matrices need fast iteration over columns and other matrices need fast iteration over rows. The solution to this is to use PermutedDimsArrays.

However, it looks like you don't get the full performance benefit of this strategy when using views. Below is an MWE:

julia> function summatrows(x)
       N = size(x, 1)
       z = Vector{Float64}(undef, N)
       @inbounds for i in 1:N
           z[i] = sum(@view x[i, :])
       end
       z
       end;
julia> function summatcols(x)
       N = size(x, 2)
       z = Vector{Float64}(undef, N)
       @inbounds for i in 1:N
           z[i] = sum(@view x[:, i])
       end
       z
       end;
julia> x = rand(1000, 1000);
julia> y = transpose(permutedims(x));
julia> using BenchmarkTools;
julia> @btime summatcols($x);
  354.489 μs (1 allocation: 7.94 KiB)
julia> @btime summatrows($x);
  2.249 ms (1 allocation: 7.94 KiB)
julia> @btime summatcols($y);
  2.345 ms (1 allocation: 7.94 KiB)
julia> @btime summatrows($y);
  1.269 ms (1 allocation: 7.94 KiB)

The last timing should be around 350 μs, but it is instead more than 3 times that.

Note that I think the @view may be the problem. Consider a scenario that only depends on the order of the loops. There, PermutedDimsArray works as expected.

julia> function sumcolumnsfast(x::AbstractMatrix)
           s = 0.0
           for i in 1:size(x, 2)
               for j in 1:size(x, 1)
                   s += x[j, i]
               end
           end
           return s
       end
sumcolumnsfast (generic function with 1 method)
julia> function sumrowsfast(x::AbstractMatrix)
           s = 0.0
           for i in 1:size(x, 1)
               for j in 1:size(x, 2)
                   s += x[i, j]
               end
           end
           return s
       end
sumrowsfast (generic function with 1 method)
julia> x = rand(1000, 1000);
julia> y = PermutedDimsArray(permutedims(x), (2, 1));
julia> y == x
true
julia> sumcolumnsfast(x) ≈ sumrowsfast(x)
true
julia> @btime sumcolumnsfast($x)
  1.270 ms (0 allocations: 0 bytes)
499894.0109807858
julia> @btime sumrowsfast($x)
  1.856 ms (0 allocations: 0 bytes)
499894.0109807898
julia> @btime sumcolumnsfast($y)
  1.836 ms (0 allocations: 0 bytes)
499894.0109807858
julia> @btime sumrowsfast($y)
  1.278 ms (0 allocations: 0 bytes)
499894.0109807898

xref #34847. I commented on that issue, but this might not be the same problem, so I'm filing a new issue here.

Thank you!


timholy · 2020-12-08T17:44:39Z

Most likely it's a cache-order effect. See https://julialang.org/blog/2013/09/fast-numeric/#write_cache-friendly_codes
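As a small illustration of the layout argument (a sketch, not taken from the linked post): Julia arrays are column-major, and `strides` reports how far apart consecutive elements along each dimension sit in memory. The 4×3 size below is an arbitrary choice for readability.

```julia
x = rand(4, 3)
y = PermutedDimsArray(permutedims(x), (2, 1))  # same values as x, different memory layout

strides(x)  # (1, 4): moving down a column of x is a unit step in memory
strides(y)  # (3, 1): moving along a row of y is a unit step in memory
```

A loop that walks the unit-stride dimension in its innermost position touches memory contiguously; walking the other dimension jumps by a whole column (or row) each step.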

mcabbott · 2020-12-08T17:45:53Z

I agree that summatrows(x) is being asked to do something cache-unfriendly, and is expected to be slow. The view / Transpose wrapper puzzle is why summatrows(y) isn't faster, since that ought to be cache-friendly again.

Without figuring out high-tech things like #34847, is it possible that some views of Transposes should be made to un-wrap and view the original object? This already happens for views of views:

julia> view(x, 1:10, 1) |> typeof
SubArray{Float64, 1, Matrix{Float64}, Tuple{UnitRange{Int64}, Int64}, true}

julia> view(view(x, :, 1), 1:10) |> typeof
SubArray{Float64, 1, Matrix{Float64}, Tuple{UnitRange{Int64}, Int64}, true}
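The un-wrapping idea can be sketched by hand, under the assumption that `y` is a `Transpose` of a real matrix: row `i` of `y` is column `i` of `parent(y)`, so the view can be taken on the parent directly. The names `rowview` and `summatrows_unwrapped` are hypothetical helpers for illustration, not functions in Base or LinearAlgebra.

```julia
using LinearAlgebra

# Hypothetical helper: express a row view of a Transpose as a column view
# of its parent, so the resulting SubArray covers contiguous memory.
rowview(y::Transpose{<:Real,<:AbstractMatrix}, i::Integer) = view(parent(y), :, i)

# Same loop as summatrows above, but viewing the parent instead of the wrapper.
function summatrows_unwrapped(y::Transpose{<:Real,<:AbstractMatrix})
    N = size(y, 1)
    z = Vector{Float64}(undef, N)
    for i in 1:N
        z[i] = sum(rowview(y, i))
    end
    z
end

x = rand(100, 100)
y = transpose(permutedims(x))
summatrows_unwrapped(y) ≈ vec(sum(x, dims=2))  # row sums of y match row sums of x
```

The restriction to real element types is deliberate: `transpose` is recursive, so for matrices of matrices the parent's elements would differ from the wrapper's.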

Also, I think summatcols is just sum(x, dims=1), for which #33029 discusses something similar. The lower two times here (reducing y) could almost just dispatch to the upper two, but instead seem to do something more naive:

julia> summatcols(x) == vec(sum(x, dims=1))
true

julia> @btime sum($x, dims=1);
  293.548 μs (1 allocation: 7.94 KiB)

julia> @btime sum($x, dims=2);
  320.756 μs (5 allocations: 8.02 KiB)

julia> @btime sum($y, dims=1);
  1.430 ms (1 allocation: 7.94 KiB)

julia> @btime sum($y, dims=2);
  1.558 ms (5 allocations: 8.02 KiB)
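What "dispatch to the upper two" could mean, as a rough sketch only: for a `Transpose` of a real matrix, a reduction along one dimension equals the transposed reduction of the parent along the other dimension. `sum_via_parent` is a hypothetical name, and this is not how Base implements `sum`.

```julia
using LinearAlgebra

# Hypothetical helper: reduce the parent along the opposite dimension,
# then transpose the result back into the wrapper's orientation.
function sum_via_parent(y::Transpose{<:Real,<:AbstractMatrix}, dim::Integer)
    dim in (1, 2) || throw(ArgumentError("dim must be 1 or 2"))
    return transpose(sum(parent(y), dims = 3 - dim))
end

x = rand(50, 40)
y = transpose(permutedims(x))

sum_via_parent(y, 1) ≈ sum(y, dims=1)  # 1×40 in both cases
sum_via_parent(y, 2) ≈ sum(y, dims=2)  # 50×1 in both cases
```

The comparison uses `≈` rather than `==` because the two paths may add the elements in a different order.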

jishnub · 2021-01-27T16:50:58Z

Looking at a section of the mapreduce operation:

julia/base/reducedim.jl

Lines 279 to 284 in 527d6b6

    
    @inbounds for IA in CartesianIndices(indsAt)
        IR = Broadcast.newindex(IA, keep, Idefault)
        @simd for i in axes(A, 1)
            R[i,IR] = op(R[i,IR], f(A[i,IA]))
        end
    end

it does look like a cache-unfriendly indexing of A if it is a transpose, and flipping the order of the loops improves performance.
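The two loop orders can be compared on a simplified 2-D case. This is a sketch specialised to summing each row of a matrix, with hypothetical function names, not the actual `reducedim.jl` code: the first function keeps the first dimension innermost, as in the quoted snippet, and the second flips the loops.

```julia
using LinearAlgebra

# First dimension innermost, mirroring the quoted loop: unit-stride for a
# Matrix, strided for a Transpose of a Matrix.
function rowsums_dim1_inner(A::AbstractMatrix)
    R = zeros(eltype(A), size(A, 1))
    @inbounds for j in axes(A, 2)
        @simd for i in axes(A, 1)
            R[i] += A[i, j]
        end
    end
    return R
end

# Flipped order: second dimension innermost, which is the unit-stride
# direction when A is a Transpose of a Matrix.
function rowsums_dim2_inner(A::AbstractMatrix)
    R = zeros(eltype(A), size(A, 1))
    @inbounds for i in axes(A, 1)
        s = zero(eltype(A))
        @simd for j in axes(A, 2)
            s += A[i, j]
        end
        R[i] = s
    end
    return R
end

x = rand(200, 300)
y = transpose(permutedims(x))
rowsums_dim1_inner(y) ≈ rowsums_dim2_inner(y)  # same result, different memory access order
```

Both functions can be timed with `@btime` on `x` and `y` to see which order suits which layout; no timings are claimed here.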

mcabbott mentioned this issue Jan 31, 2021

Simplify some views of Adjoint matrices #39467

Merged

kshyatt added the arrays and performance (Must go faster) labels Feb 6, 2021

vtjnash closed this as completed in #39467 Apr 16, 2021
