
Linear solve in Float32 #196

Open
ctkelley opened this issue Aug 15, 2020 · 18 comments

Comments

@ctkelley

Hi

Using LinearAlgebra, if A is dense and Float32 and b is a Float64 vector, A\b returns a Float64 result.

However, if A is a BandedMatrix, A\b fails if b is Float64. All is well if b is Float32. I don't think I understand what you're doing well enough to formulate a useful PR.
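For reference, the dense promotion behaviour can be checked with a minimal snippet (matrix and variable names are mine, chosen for illustration):

```julia
using LinearAlgebra

A = Float32[2 0; 0 4]   # dense Float32 matrix
b = [1.0, 2.0]          # Float64 right-hand side
x = A \ b               # promotes internally: the result is a Float64 vector
```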

Here is an example

julia> P=BandedMatrix{Float32}(rand(8,8),(2,2));

julia> P
8×8 BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}:
 (Float32 entries within bandwidths (2,2); the banded display did not survive transcription)

julia> b=rand(8,);

julia> P\b
ERROR: MethodError: no method matching ldiv!(::BandedMatrices.BandedLU{Float32,BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}}, ::Array{Float64,1})
Closest candidates are:
  ldiv!(::Number, ::AbstractArray) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:251
  ldiv!(::LowerTriangular{T,var"#s796"} where var"#s796"<:(StridedArray{T, 2} where T), ::StridedVecOrMat{T}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/triangular.jl:767
  ldiv!(::Transpose{T,var"#s826"} where var"#s826"<:(LU{T,var"#s825"} where var"#s825"<:(StridedArray{T, 2} where T)), ::StridedVecOrMat{T}) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/lu.jl:399
  ...
Stacktrace:
 [1] ldiv!(::Array{Float64,1}, ::BandedMatrices.BandedLU{Float32,BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}}, ::Array{Float64,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/factorization.jl:139
 [2] _ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:74 [inlined]
 [3] copyto! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:92 [inlined]
 [4] ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:84 [inlined]
 [5] _ldiv! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:73 [inlined]
 [6] copyto! at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:92 [inlined]
 [7] copy at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:21 [inlined]
 [8] materialize at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:22 [inlined]
 [9] ldiv at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:78 [inlined]
 [10] \(::BandedMatrix{Float32,Array{Float32,2},Base.OneTo{Int64}}, ::Array{Float64,1}) at /Users/ctk/.julia/packages/ArrayLayouts/x9nhz/src/ldiv.jl:119
 [11] top-level scope at REPL[50]:1
@dlfivefifty
Member

Interestingly, in the stdlib it works almost by accident: it's not converting the RHS to Float32 or the LHS to Float64 and calling BLAS for the triangular solves, but rather falling back to the generic naivesub! routine.

This is a bit annoying to replicate as we do not yet have a Julia native BandedLU. We do have a Julia native QR though, and that works:

julia> qr(P) \ b
8-element Array{Float64,1}:
 -2.3678433850975633
  1.7256988915603115
  3.4832536067367204
  0.4916213305479896
  0.5015980815362554
 -8.826729518502406
  8.672489866500378
 13.338910315594969

Some options, from smallest change to biggest:

  1. Add overload
function ldiv!(A::BandedLU{T}, b::AbstractVector{V}) where {T,V}
    TV = promote_type(T, V)
    ldiv!(convert(BandedLU{TV}, A), convert(AbstractVector{TV}, b))
end

This will unfortunately allocate.
  2. Make qr the default factorisation for banded matrices. This will slow down \, though if I remember correctly which of the two is faster actually depends on the bandwidth.
  3. Write a Julia native BandedLU ldiv!. This is easier than it sounds, as one just needs to work out the pivoting: we already have Julia native banded triangular solves, and in this case we only need to implement the solve, not the computation of the factorisation. But it requires more effort than I'm willing to put in right now.
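For option 3, the solve phase reduces to applying the pivots followed by two triangular solves. A minimal dense sketch of that structure, assuming an LU factorisation with P*A = L*U (the banded version would swap in the banded triangular solves; `lu_solve!` is an illustrative name, not package code):

```julia
using LinearAlgebra

# Sketch: given L, U and the pivot vector p from an LU factorisation with
# P*A = L*U, solve A*x = b in place, overwriting b with the solution.
function lu_solve!(L, U, p, b::AbstractVector)
    permute!(b, p)                  # apply the row pivots: b ← b[p]
    ldiv!(LowerTriangular(L), b)    # forward substitution
    ldiv!(UpperTriangular(U), b)    # back substitution
    return b
end
```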

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@dlfivefifty
Member

dlfivefifty commented Aug 15, 2020

👍 I think you are right, a more sensible definition is

function ldiv!(A::BandedLU{T}, b::AbstractVector) where T
    c = ldiv!(A, convert(AbstractVector{T}, b))
    copyto!(b, c)
end

@dlfivefifty
Member

Your code, by the way, is faster and allocates far less than using SparseArrays and SuiteSparse

Good to hear. If you aren't already, I recommend using MKL: many of its banded implementations are 4x faster than OpenBLAS, last I checked.

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@dlfivefifty
Member

The best is to make a PR against BandedMatrices.jl, but if you want to do it at the REPL this should work:

julia> using BandedMatrices

julia> import BandedMatrices: BandedLU

julia> using LinearAlgebra

julia> LinearAlgebra.ldiv!(A::BandedLU{T}, b::AbstractVector) where T = copyto!(b, ldiv!(A, convert(AbstractVector{T}, b)))

julia> P=BandedMatrix{Float32}(rand(8,8),(2,2));

julia> b=rand(8,);

julia> P\b
8-element Array{Float64,1}:
  0.731564998626709
  0.7191325426101685
  0.6498026847839355
 -2.1065616607666016
  1.2697333097457886
  0.6517024040222168
 -0.6199814677238464
  0.9412044882774353

@ctkelley
Author

ctkelley commented Aug 15, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020

I've been using qr! and with larger problems I'm getting killed with allocations in the solve phase. Here's an example. The timings and allocations are very consistent over several trials.

julia> n=10^6; T=Float64
julia> A=brand(T,n,n,2,4);
julia> for ip = 3:6
           A[band(ip)] .= 0.0;
       end
julia> x=rand(n);
julia> b=A*x;
julia> A32=Float32.(A);
julia> @time AF=qr!(A);
  0.037117 seconds (3 allocations: 7.630 MiB)
julia> @time AF32=qr!(A32);
  0.033772 seconds (3 allocations: 3.815 MiB)
julia> @time y=AF\b;
  0.018406 seconds (4 allocations: 7.630 MiB)
# And y is correct
julia> norm(y-x)
6.39544e-08
# Switch to single and it's 4x the time and nearly 10x the allocations
julia> @time z=AF32\b;
  0.040336 seconds (9 allocations: 68.665 MiB, 31.04% gc time)

The allocation burden is much better if I convert b to Float32, but I gain nothing in compute time over double.

julia> c=Float32.(b);
julia> @time z=AF32\c;
  0.016362 seconds (4 allocations: 3.815 MiB)
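The workaround above can be wrapped in a small helper so the RHS demotion happens once, close to the solve. A sketch, assuming losing the extra RHS precision is acceptable (`solve_demoted` is an illustrative name, not part of any package):

```julia
using LinearAlgebra

# Demote the RHS to the factorization's element type before solving,
# avoiding the stdlib path that promotes the whole factorization to Float64.
function solve_demoted(F::Factorization{T}, b::AbstractVector) where T
    x = F \ convert(AbstractVector{T}, b)   # solve in the narrow precision
    return Float64.(x)                      # promote only the small result
end
```

This trades accuracy for allocations: the conversion is O(n) on the vector rather than O(n·bandwidth) on the factorization.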

@MikaelSlevinsky
Member

I think there are reasonable explanations:

  • mixing precisions uses a generic fallback.
  • narrow bandwidths interfere with maximizing throughput (i.e. getting close to peak flops) because SIMD might not be as efficient or fully invoked. To put it another way, 32- and 64-bit methods might be moving nearly the same number of registers.

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@MikaelSlevinsky
Member

As far as I know, LAPACK doesn't mix precisions.

I guess you need to follow the stack trace to see what's really going on.

@MikaelSlevinsky
Member

For starters, the factorization itself behaves as I'd expect: half the memory but nearly the same time.

julia> using BenchmarkTools

julia> @btime qr(A);
  72.829 ms (6 allocations: 76.29 MiB)

julia> @btime qr(A32);
  65.758 ms (6 allocations: 38.15 MiB)

@ctkelley
Author

ctkelley commented Nov 3, 2020 via email

@ctkelley
Author

ctkelley commented Nov 3, 2020

LAPACK does it in both Matlab and Julia and the timings/allocations are what one would expect.

@dlfivefifty
Member

This has nothing to do with BandedMatrices.jl: StdLib/LinearAlgebra.jl converts the factorization to the higher precision:

In \(A, B) at /Users/sheehanolver/Projects/julia-1.5/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/qr.jl:870
 870  function (\)(A::Union{QR{TA},QRCompactWY{TA},QRPivoted{TA}}, B::AbstractVecOrMat{TB}) where {TA,TB}
 871      require_one_based_indexing(B)
 872      S = promote_type(TA,TB)
 873      m, n = size(A)
 874      m == size(B,1) || throw(DimensionMismatch("Both inputs should have the same number of rows"))
 875
 876      AA = Factorization{S}(A)
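The promotion on line 876 can be observed directly, and it accounts for the large allocations in the Float32 solve above: the entire factorization is converted and copied before the solve (variable names below are mine):

```julia
using LinearAlgebra

F32 = qr(Float32[4 1; 1 3])
# This is what `\` does internally before solving against a Float64 RHS:
F64 = Factorization{Float64}(F32)   # converts (and copies) the whole factorization
```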

@ctkelley
Author

ctkelley commented Nov 4, 2020 via email

@ctkelley
Author

ctkelley commented Nov 4, 2020

Oops. It seems that it's a problem in the dense case as well. One should not have to cast b to Float32 to get this to work. That was certainly not the case with LINPACK, and I doubt it is with LAPACK.

I will ask on Discourse about this.
