
Linear solve in Float32 #196

Open
ctkelley opened this issue Aug 15, 2020 · 18 comments

Comments

@ctkelley

Hi

Using LinearAlgebra, if A is dense and Float32 and b is a Float64 vector, A\b returns a Float64 result.

However, if A is a BandedMatrix, A\b fails if b is Float64. All is well if b is Float32. I don't think I understand what you're doing well enough to formulate a useful PR.
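For reference, the dense promotion behaviour can be checked with a minimal snippet (matrix and variable names are mine, chosen for illustration):

```julia
using LinearAlgebra

A = Float32[2 0; 0 4]   # dense Float32 matrix
b = [1.0, 2.0]          # Float64 right-hand side
x = A \ b               # promotes internally: the result is a Float64 vector
```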

Here is an example

julia> P=BandedMatrix{Float32}(rand(8,8),(2,2));

julia> P
8×8 BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}:
 (Float32 entries within bandwidths (2,2); the banded display did not survive transcription)

julia> b=rand(8,);

julia> P\b
ERROR: MethodError: no method matching ldiv!(::BandedMatrices.BandedLU{Float32,BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}}, ::Array{Float64,1})
Closest candidates are:
  ldiv!(::Number, ::AbstractArray) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:251
  ldiv!(::LowerTriangular{T,var"#s796"} where var"#s796"<:(StridedArray{T, 2} where T), ::StridedVecOrMat{T}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/triangular.jl:767
  ldiv!(::Transpose{T,var"#s826"} where var"#s826"<:(LU{T,var"#s825"} where var"#s825"<:(StridedArray{T, 2} where T)), ::StridedVecOrMat{T}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/lu.jl:399
  ...
Stacktrace:
 [1] ldiv!(::Array{Float64,1}, ::BandedMatrices.BandedLU{Float32,BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}}, ::Array{Float64,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/factorization.jl:139
 [2] _ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:74 [inlined]
 [3] copyto! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:92 [inlined]
 [4] ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:84 [inlined]
 [5] _ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:73 [inlined]
 [6] copyto! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:92 [inlined]
 [7] copy at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:21 [inlined]
 [8] materialize at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:22 [inlined]
 [9] ldiv at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:78 [inlined]
 [10] \(::BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}, ::Array{Float64,1}) at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:119
 [11] top-level scope at REPL[50]:1
@dlfivefifty
Member

Interestingly, in the stdlib it works almost by accident: it's not converting the RHS to Float32 or the LHS to Float64 and calling BLAS for the triangular solves, but rather falling back to the generic naivesub! routine.

This is a bit annoying to replicate as we do not yet have a Julia native BandedLU. We do have a Julia native QR though, and that works:

julia> qr(P) \ b
8-element Array{Float64,1}:
 -2.3678433850975633
  1.7256988915603115
  3.4832536067367204
  0.4916213305479896
  0.5015980815362554
 -8.826729518502406
  8.672489866500378
 13.338910315594969

Some options, from smallest change to biggest:

  1. Add overload
function ldiv!(A::BandedLU{T}, b::AbstractVector{V}) where {T,V}
    TV = promote_type(T, V)
    ldiv!(convert(BandedLU{TV}, A), convert(AbstractVector{TV}, b))
end

This will unfortunately allocate.
  2. Make qr the default factorisation for banded matrices. This will slow down \, though if I remember correctly which of the two is faster actually depends on the bandwidth.
  3. Write a Julia native BandedLU ldiv!. This is easier than it sounds, as one just needs to work out the pivoting: we already have Julia native banded triangular solves, and in this case we only need to implement the solve, not the computation of the factorisation. But it requires more effort than I'm willing to put in right now.
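For option 3, the solve phase reduces to applying the pivots followed by two triangular solves. A minimal dense sketch of that structure, assuming an LU factorisation with P*A = L*U (the banded version would swap in the banded triangular solves; `lu_solve!` is an illustrative name, not package code):

```julia
using LinearAlgebra

# Sketch: given L, U and the pivot vector p from an LU factorisation with
# P*A = L*U, solve A*x = b in place, overwriting b with the solution.
function lu_solve!(L, U, p, b::AbstractVector)
    permute!(b, p)                  # apply the row pivots: b ← b[p]
    ldiv!(LowerTriangular(L), b)    # forward substitution
    ldiv!(UpperTriangular(U), b)    # back substitution
    return b
end
```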

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@dlfivefifty
Member

dlfivefifty commented Aug 15, 2020

👍 I think you are right, a more sensible definition is

function ldiv!(A::BandedLU{T}, b::AbstractVector) where T
    c = ldiv!(A, convert(AbstractVector{T}, b))
    copyto!(b, c)
end

@dlfivefifty
Member

Your code, by the way, is faster and allocates far less than using SparseArrays and SuiteSparse

Good to hear. If you aren't already, I recommend using MKL: many of its banded implementations are 4x faster than OpenBLAS, last I checked.

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@dlfivefifty
Member

The best is to make a PR against BandedMatrices.jl, but if you want to do it at the REPL this should work:

julia> using BandedMatrices

julia> import BandedMatrices: BandedLU

julia> using LinearAlgebra

julia> LinearAlgebra.ldiv!(A::BandedLU{T}, b::AbstractVector) where T = copyto!(b, ldiv!(A, convert(AbstractVector{T}, b)))

julia> P=BandedMatrix{Float32}(rand(8,8),(2,2));

julia> b=rand(8,);

julia> P\b
8-element Array{Float64,1}:
  0.731564998626709
  0.7191325426101685
  0.6498026847839355
 -2.1065616607666016
  1.2697333097457886
  0.6517024040222168
 -0.6199814677238464
  0.9412044882774353

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020

I've been using qr! and with larger problems I'm getting killed with allocations in the solve phase. Here's an example. The timings and allocations are very consistent over several trials.

julia> n=10^6; T=Float64
julia> A=brand(T,n,n,2,4);
julia> for ip = 3:6
           A[band(ip)] .= 0.0;
       end
julia> x=rand(n);
julia> b=A*x;
julia> A32=Float32.(A);
julia> @time AF=qr!(A);
  0.037117 seconds (3 allocations: 7.630 MiB)
julia> @time AF32=qr!(A32);
  0.033772 seconds (3 allocations: 3.815 MiB)
julia> @time y=AF\b;
  0.018406 seconds (4 allocations: 7.630 MiB)
# And y is correct
julia> norm(y-x)
6.39544e-08
# Switch to single and it's 4x the time and nearly 10x the allocations
julia> @time z=AF32\b;
  0.040336 seconds (9 allocations: 68.665 MiB, 31.04% gc time)

The allocation burden is much better if I convert b to Float32, but I gain nothing in compute time over double.

julia> c=Float32.(b);
julia> @time z=AF32\c;
  0.016362 seconds (4 allocations: 3.815 MiB)
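The workaround above can be wrapped in a small helper so the RHS demotion happens once, close to the solve. A sketch, assuming losing the extra RHS precision is acceptable (`solve_demoted` is an illustrative name, not part of any package):

```julia
using LinearAlgebra

# Demote the RHS to the factorization's element type before solving,
# avoiding the stdlib path that promotes the whole factorization to Float64.
function solve_demoted(F::Factorization{T}, b::AbstractVector) where T
    x = F \ convert(AbstractVector{T}, b)   # solve in the narrow precision
    return Float64.(x)                      # promote only the small result
end
```

This trades accuracy for allocations: the conversion is O(n) on the vector rather than O(n·bandwidth) on the factorization.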

@MikaelSlevinsky
Member

I think there are reasonable explanations:

  • mixing precisions uses a generic fallback.
  • narrow bandwidths interfere with maximizing throughput (i.e. getting close to peak flops) because SIMD might not be as efficient or fully invoked. To put it another way, 32- and 64-bit methods might be moving nearly the same number of registers.

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@MikaelSlevinsky
Member

As far as I know, LAPACK doesn't mix precisions.

I guess you need to follow the stack trace to see what's really going on.

@MikaelSlevinsky
Member

For starters, the factorization itself behaves as I'd expect: half the memory but nearly the same time.

julia> using BenchmarkTools

julia> @btime qr(A);
  72.829 ms (6 allocations: 76.29 MiB)

julia> @btime qr(A32);
  65.758 ms (6 allocations: 38.15 MiB)

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020

LAPACK does it in both Matlab and Julia and the timings/allocations are what one would expect.

@dlfivefifty
Member

This has nothing to do with BandedMatrices.jl: StdLib/LinearAlgebra.jl converts the factorization to the higher precision:

In \(A, B) at /Users/sheehanolver/Projects/julia-1.5/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/qr.jl:870
 870  function (\)(A::Union{QR{TA},QRCompactWY{TA},QRPivoted{TA}}, B::AbstractVecOrMat{TB}) where {TA,TB}
 871      require_one_based_indexing(B)
 872      S = promote_type(TA,TB)
 873      m, n = size(A)
 874      m == size(B,1) || throw(DimensionMismatch("Both inputs should have the same number of rows"))
 875
 876      AA = Factorization{S}(A)
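The promotion on line 876 can be observed directly, and it accounts for the large allocations in the Float32 solve above: the entire factorization is converted and copied before the solve (variable names below are mine):

```julia
using LinearAlgebra

F32 = qr(Float32[4 1; 1 3])
# This is what `\` does internally before solving against a Float64 RHS:
F64 = Factorization{Float64}(F32)   # converts (and copies) the whole factorization
```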

@ctkelley
Author

ctkelley commented Nov 4, 2020 via email

@ctkelley
Author

ctkelley commented Nov 4, 2020

Oops. It seems that it's a problem in the dense case as well. One should not have to cast b to Float32 to get this to work. That was certainly not the case with LINPACK, and I doubt it is with LAPACK.

I will ask on Discourse about this.
