Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal dot product speed for last dimension SubArray slicing #42305

Closed
JeffFessler opened this issue Sep 19, 2021 · 2 comments
Closed

Suboptimal dot product speed for last dimension SubArray slicing #42305

JeffFessler opened this issue Sep 19, 2021 · 2 comments
Labels
linear algebra Linear algebra performance Must go faster

Comments

@JeffFessler
Copy link
Contributor

In both v1.6.3 and v1.7-rc1 the dot product between a dense matrix and a matched size SubArray produced by a view that slices along the last dimension falls back on the general dot product for AbstractArrays rather than efficiently using BLAS.dot. This can be improved by making a new dot method tailored for such SubArray views, as illustrated in the code below. However the hack below seems not yet worthy of a PR because it is very specific to slicing a 3D array along the last dimension like @view array3d[:,:,slice]. That is a pretty common way to slice, but it would be better for any new method to be more general. I would make a PR if I knew how to make a type like SubArray{T, N-1, Array{T, N}, Tuple{Base.Slice{Base.OneTo{Int}}, ..., Base.Slice{Base.OneTo{Int}}, Int}, true) to express "last dimension sliced".

using LinearAlgebra: dot
import LinearAlgebra # BLAS.dot
using BenchmarkTools: @btime

x = rand(100,200)
y = rand(100,200,2)
y = @view y[:,:,1] # view of slice along last dim - a common use case

function f1(x, y) # basic dot product
    dot(x, y)
end

function f2(x, y) # dot product with vec()
    dot(vec(x), vec(y)) # this is almost optimally fast, but allocates per `@btime`
end

function f3(x, y) # call BLAS.dot directly
    LinearAlgebra.BLAS.dot(length(x), x, 1, y, 1) # this works because the SubArray data is contiguous
end

Slice2{T} = SubArray{T, 2, Array{T, 3}, Tuple{Base.Slice{Base.OneTo{Int}}, Base.Slice{Base.OneTo{Int}}, Int}, true}
mydot(x::Array{S,2}, y::Slice2) where {S} = LinearAlgebra.BLAS.dot(length(x), x, 1, y, 1) # a hack that solves it

function f4(x, y) # proposed
    mydot(x, y)
end

@assert f1(x, y)  f2(x, y)  f3(x, y)  f4(x, y)
@assert f1(x, y) != f2(x, y) # they call different dot methods !?
@assert f2(x, y) == f3(x, y)

@btime f1($x, $y) # 17.9 μs (0 allocations: 0 bytes)
@btime f2($x, $y) #  1.2 μs (2 allocations: 80 bytes)
@btime f3($x, $y) #  1.1 μs (0 allocations: 0 bytes)
@btime f4($x, $y) #  1.1 μs (0 allocations: 0 bytes) <= this is the goal

Pinging @dkarrasch as being one of the most recent people who committed to the general dot methods, albeit 2 years ago in #32739 😄

@dkarrasch dkarrasch added linear algebra Linear algebra performance Must go faster labels Sep 19, 2021
@N5N3
Copy link
Member

N5N3 commented Sep 19, 2021

We can extend the BLAS.dot to arbitrary StridedArray.

  1. If IndexStyle(x, y) isa IndexLinear, then we can call low level api safely.
  2. Otherwise invoke the general version.

Since IndexStyle(x, y) is type based, this won't do harm to (runtime) performance.

BTW, view(randn(100, 100), 1:2:100, :) can also be calculated with BLAS.dot theoretically.
The above solution does not help with it.
Maybe a faster general dot, with @simd, shared iterater etc. , is better.
(I doubt whether it's worth doing layout checks at run time.)

@N5N3
Copy link
Member

N5N3 commented Jul 18, 2022

The posted example have been fixed by #44758.
On master we have

@btime f1($x, $y) # 1.330 μs (0 allocations: 0 bytes)
@btime f2($x, $y) # 1.320 μs (2 allocations: 80 bytes)
@btime f3($x, $y) # 1.290 μs (0 allocations: 0 bytes)
@btime f4($x, $y) # 1.280 μs (0 allocations: 0 bytes)

@N5N3 N5N3 closed this as completed Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra performance Must go faster
Projects
None yet
Development

No branches or pull requests

3 participants