Suboptimal dot product speed for last dimension SubArray slicing #42305

JeffFessler · 2021-09-19T02:57:42Z

In both v1.6.3 and v1.7-rc1, the dot product between a dense matrix and a same-size SubArray produced by a view that slices along the last dimension falls back on the generic dot method for AbstractArrays rather than calling BLAS.dot. This can be improved by adding a dot method specialized for such SubArray views, as illustrated in the code below. However, the hack below does not yet seem worthy of a PR because it is specific to slicing a 3D array along the last dimension, as in @view array3d[:,:,slice]. That is a pretty common way to slice, but any new method should be more general. I would make a PR if I knew how to write a type like SubArray{T, N-1, Array{T, N}, Tuple{Base.Slice{Base.OneTo{Int}}, ..., Base.Slice{Base.OneTo{Int}}, Int}, true} to express "last dimension sliced".

using LinearAlgebra: dot
import LinearAlgebra # BLAS.dot
using BenchmarkTools: @btime

x = rand(100,200)
y = rand(100,200,2)
y = @view y[:,:,1] # view of slice along last dim - a common use case

function f1(x, y) # basic dot product
    dot(x, y)
end

function f2(x, y) # dot product with vec()
    dot(vec(x), vec(y)) # this is almost optimally fast, but allocates per `@btime`
end

function f3(x, y) # call BLAS.dot directly
    LinearAlgebra.BLAS.dot(length(x), x, 1, y, 1) # this works because the SubArray data is contiguous
end

Slice2{T} = SubArray{T, 2, Array{T, 3}, Tuple{Base.Slice{Base.OneTo{Int}}, Base.Slice{Base.OneTo{Int}}, Int}, true}
mydot(x::Array{S,2}, y::Slice2) where {S} = LinearAlgebra.BLAS.dot(length(x), x, 1, y, 1) # a hack that solves it

function f4(x, y) # proposed
    mydot(x, y)
end

@assert f1(x, y) ≈ f2(x, y) ≈ f3(x, y) ≈ f4(x, y)
@assert f1(x, y) != f2(x, y) # they call different dot methods !?
@assert f2(x, y) == f3(x, y)

@btime f1($x, $y) # 17.9 μs (0 allocations: 0 bytes)
@btime f2($x, $y) #  1.2 μs (2 allocations: 80 bytes)
@btime f3($x, $y) #  1.1 μs (0 allocations: 0 bytes)
@btime f4($x, $y) #  1.1 μs (0 allocations: 0 bytes) <= this is the goal
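
One possible way to express "contiguous view of an Array" without spelling out the index tuple is the internal alias Base.FastContiguousSubArray, which covers SubArrays with fast linear indexing whose first index is a slice or unit range. The following is only a sketch under stated assumptions, not a proposed patch: the name contigdot is hypothetical, the alias is unexported and could change between Julia versions, and the method is restricted to Float32/Float64 because BLAS.dot handles only real elements.

```julia
using LinearAlgebra

# Hypothetical generalization of `mydot`: accept a dense Array and any
# contiguous view of an Array with the same real BLAS element type.
# `Base.FastContiguousSubArray` is an internal, unexported alias (assumption:
# it keeps its current meaning in the Julia version being used).
function contigdot(x::Array{T},
                   y::Base.FastContiguousSubArray{T,<:Any,<:Array{T}}) where {T<:LinearAlgebra.BlasReal}
    length(x) == length(y) ||
        throw(DimensionMismatch("x has length $(length(x)), y has length $(length(y))"))
    # Both arguments are contiguous in memory, so unit increments are valid.
    return LinearAlgebra.BLAS.dot(length(x), x, 1, y, 1)
end

# Every other combination of arguments uses the generic method.
contigdot(x, y) = dot(x, y)

a = rand(100, 200)
b = @view rand(100, 200, 2)[:, :, 1]     # the 3D case from the example above
c = rand(4, 5, 6)
d = @view rand(4, 5, 6, 3)[:, :, :, 2]   # a 4D array sliced along its last dimension

@assert contigdot(a, b) ≈ dot(a, b)
@assert contigdot(c, d) ≈ dot(c, d)
```

Because the signature does not fix the number of dimensions, the same method applies to the 3D and 4D cases; views that are not contiguous fall through to the generic fallback.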

Pinging @dkarrasch as being one of the most recent people who committed to the general dot methods, albeit 2 years ago in #32739 😄


N5N3 · 2021-09-19T11:35:24Z

We can extend the BLAS.dot path to arbitrary StridedArrays.

If IndexStyle(x, y) isa IndexLinear, then we can call the low-level API safely.
Otherwise we invoke the generic version.

Since IndexStyle(x, y) is determined by the types alone, this won't hurt runtime performance.

BTW, the dot product with view(randn(100, 100), 1:2:100, :) could in theory also be computed with BLAS.dot.
The solution above does not help with it.
Maybe a faster generic dot, using @simd, a shared iterator, etc., would be better.
(I doubt it's worth doing layout checks at run time.)
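
The trait-based dispatch suggested above could look roughly like the following sketch. The names traitdot and _traitdot are hypothetical, and the sketch adds assumptions that the comment does not state: it is limited to real BLAS element types, it uses stride(x, 1) as the BLAS increment, and it falls back to the generic method when a stride is not positive. Edge cases such as zero-dimensional views are not considered.

```julia
using LinearAlgebra

# Hypothetical entry point: choose an implementation from the combined
# IndexStyle, which is determined by the argument types.
traitdot(x::StridedArray{T}, y::StridedArray{T}) where {T<:LinearAlgebra.BlasReal} =
    _traitdot(IndexStyle(x, y), x, y)
traitdot(x, y) = dot(x, y)   # all other arguments: generic method

# Linear indexing on both sides: each array is traversed with one constant
# memory stride, so a single BLAS call covers all elements.
function _traitdot(::IndexLinear, x, y)
    n = length(x)
    n == length(y) || throw(DimensionMismatch("lengths $n and $(length(y)) differ"))
    incx, incy = stride(x, 1), stride(y, 1)
    # Assumption of this sketch: sidestep BLAS conventions for negative increments.
    (incx > 0 && incy > 0) || return dot(x, y)
    return LinearAlgebra.BLAS.dot(n, x, incx, y, incy)
end

# Cartesian indexing on either side: use the generic method.
_traitdot(::IndexStyle, x, y) = dot(x, y)

x = rand(100, 200)
y = @view rand(100, 200, 2)[:, :, 1]   # IndexLinear: takes the BLAS path
z = @view rand(200, 200)[1:2:200, :]   # IndexCartesian: takes the fallback
u = rand(10)
w = @view rand(20)[1:2:20]             # IndexLinear with stride 2

@assert traitdot(x, y) ≈ dot(x, y)
@assert traitdot(x, z) ≈ dot(x, z)
@assert traitdot(u, w) ≈ dot(u, w)
```

The view z is the case mentioned above that this approach does not speed up: its IndexStyle is IndexCartesian, so it goes through the generic method.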

N5N3 · 2022-07-18T13:17:19Z

The posted example has been fixed by #44758.
On master we have

@btime f1($x, $y) # 1.330 μs (0 allocations: 0 bytes)
@btime f2($x, $y) # 1.320 μs (2 allocations: 80 bytes)
@btime f3($x, $y) # 1.290 μs (0 allocations: 0 bytes)
@btime f4($x, $y) # 1.280 μs (0 allocations: 0 bytes)

dkarrasch added the "linear algebra" and "performance" labels on Sep 19, 2021

N5N3 closed this as completed Jul 18, 2022
