
Memory allocations of 5-argument mul! when involving struct #566

Open

albertomercurio opened this issue Oct 13, 2024 · 3 comments · May be fixed by #567

@albertomercurio (Contributor)
Hello,

Here is a simple example where the 5-argument mul! allocates memory. In some cases the allocations appear only when the sparse matrix A is wrapped in a struct; in other cases they appear even without the struct.

julia> using LinearAlgebra

julia> using SparseArrays

julia> using BenchmarkTools

julia> T = Float64;

julia> N = 1000;

julia> A = sprand(T, N, N, 5 / N);

julia> x = rand(T, N);

julia> y = similar(x);

julia> α = rand(T);

julia> β = rand(T);

julia> @benchmark mul!($y, $A, $x, $α, false) # This is ok
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.092 μs …  3.589 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.146 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.184 μs ± 68.468 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▇█▄                                               
  ▂▂▂▂▂▂▂▂▄████▇▄▂▂▂▂▁▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▃▄▆██▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.09 μs        Histogram: frequency by time        2.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!($y, $A, $x, $α, $β) # This is not ok
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.164 μs …   5.236 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.193 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.261 μs ± 138.081 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▇▆▄▂▂▂▁           ▁▁    ▁  ▂▄▆▆▆▄▃                ▁▁    ▂
  ████████████▆▅▄▃▅▄▅▃▆██████▇███████████▆▆▅▅▄▄▃▁▅▄▃▅▅▆█████▆ █
  2.16 μs      Histogram: log(frequency) by time      2.54 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

Now involving a simple struct wrapper:

julia> struct my_MatrixOperator{MT}
           A::MT
       end

julia> function LinearAlgebra.mul!(v::AbstractVecOrMat, L::my_MatrixOperator, u::AbstractVecOrMat)
           mul!(v, L.A, u)
       end

julia> function LinearAlgebra.mul!(v::AbstractVecOrMat,
               L::my_MatrixOperator,
               u::AbstractVecOrMat,
               α,
               β)
           mul!(v, L.A, u, α, β)
       end

julia> A_op_my = my_MatrixOperator(A);

julia> @benchmark mul!($y, $A_op_my, $x, $α, false) # This is no longer ok
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.553 μs …   5.645 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.627 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.653 μs ± 116.667 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▆▇▇▆▃▁▂▅██▇▅▂              ▁ ▂▃▄▄▃▂   ▂▃▃▂            ▁▂▂▂ ▃
  ██████████████▇▄▅▄▃▇████▇▇▇█████████▇▆██████▅▅▄▁▄▁▄▅▆▅▆████ █
  2.55 μs      Histogram: log(frequency) by time         3 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark mul!($y, $A_op_my, $x, $α, $β) # This is not ok
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.160 μs …  12.129 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.190 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.254 μs ± 176.229 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▅██▆▄▂          ▁▅▆▆▅▃▂▁  ▁▃▄▃▂              ▁▂▁           ▂
  ████████▆▅▅▃▃▄▄▅▇████████▇▇██████▆▆▅▅▅▁▅▅▅▅▅▇█████▆▆▅▅▄▅▄▁▃ █
  2.16 μs      Histogram: log(frequency) by time       2.6 μs <

 Memory estimate: 80 bytes, allocs estimate: 2.

julia> @benchmark mul!($y, $A_op_my, $x) # This is ok
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.030 μs …  3.575 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.053 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.064 μs ± 54.170 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▆███▇▅▄▁                                        ▁▁       ▂
  ██████████▇▅▄▄▃▁▁▄▄▁▁▄▃▁▄▆▇███▇█▇▇▇▅▆▅▄▄▅▅▄▄▅▄▅▄▄▇███▇▅▄▄▆ █
  2.03 μs      Histogram: log(frequency) by time     2.31 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Possible partial fix

I've noticed that defining the functions above with the @inline or Base.@constprop :aggressive macros fixes the mul!($y, $A_op_my, $x, $α, false) case, but why that helps is still a mystery to me. A sketch of the annotated wrapper is shown below.
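For concreteness, a minimal sketch of the annotation just described, applied to the 5-argument wrapper (Base.@constprop :aggressive can be used in place of @inline; per the note above, either reportedly fixes the β = false case):

@inline function LinearAlgebra.mul!(v::AbstractVecOrMat,
           L::my_MatrixOperator,
           u::AbstractVecOrMat,
           α,
           β)
       # Forward to the sparse 5-argument mul!. The annotation lets the
       # compiler constant-fold the α/β checks at the call site, which
       # reportedly fixes the mul!(y, A_op_my, x, α, false) case.
       mul!(v, L.A, u, α, β)
   end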

@dkarrasch (Member)
The memory allocation does not scale with the size of the matrix/vector, and given that it depends on the scalar factors alpha and beta, I strongly suspect this is related to the MulAddMul construction. That, however, benefits from aggressive constant propagation or inlining.
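For context, a simplified paraphrase of that internal machinery (the real definition lives in LinearAlgebra's matmul code; only the shape matters here):

# MulAddMul records at the *type level* whether α == 1 and β == 0,
# so four concrete types are possible for a given TA/TB:
struct MulAddMul{ais1, bis0, TA, TB}
    alpha::TA
    beta::TB
end

# Its constructor branches on the runtime values of α and β, so without
# constant propagation the compiler cannot infer which of the four types
# is returned; passing the result on then requires boxing plus a dynamic
# dispatch, consistent with the 80 bytes / 2 allocations observed above.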

Without inlining (current):

julia> @code_typed mul!(y, A, x, α, β)
CodeInfo(
1 ─ %1  = Base.eq_float(alpha, 1.0)::Bool
└──       goto #5 if not %1
2 ─ %3  = Base.eq_float(beta, 0.0)::Bool
└──       goto #4 if not %3
3 ─ %5  = %new(LinearAlgebra.MulAddMul{true, true, Float64, Float64}, alpha, beta)::LinearAlgebra.MulAddMul{true, true, Float64, Float64}
└──       goto #8
4 ─ %7  = %new(LinearAlgebra.MulAddMul{true, false, Float64, Float64}, alpha, beta)::LinearAlgebra.MulAddMul{true, false, Float64, Float64}
└──       goto #8
5 ─ %9  = Base.eq_float(beta, 0.0)::Bool
└──       goto #7 if not %9
6 ─ %11 = %new(LinearAlgebra.MulAddMul{false, true, Float64, Float64}, alpha, beta)::LinearAlgebra.MulAddMul{false, true, Float64, Float64}
└──       goto #8
7 ─ %13 = %new(LinearAlgebra.MulAddMul{false, false, Float64, Float64}, alpha, beta)::LinearAlgebra.MulAddMul{false, false, Float64, Float64}
└──       goto #8
8 ┄ %15 = φ (#3 => %5, #4 => %7, #6 => %11, #7 => %13)::LinearAlgebra.MulAddMul{ais1, bis0, Float64, Float64} where {ais1, bis0}
│   %16 = LinearAlgebra.generic_matvecmul!(y, 'N', A, x, %15)::Vector{Float64}
└──       goto #9
9 ─       return %16
) => Vector{Float64}

With inlining:

julia> @code_typed mul!(y, A, x, α, β)
CodeInfo(
1 ─ %1 = Base.eq_float(alpha, 1.0)::Bool
└──      goto #5 if not %1
2 ─ %3 = Base.eq_float(beta, 0.0)::Bool
└──      goto #4 if not %3
3 ─      goto #8
4 ─      goto #8
5 ─ %7 = Base.eq_float(beta, 0.0)::Bool
└──      goto #7 if not %7
6 ─      goto #8
7 ─      goto #8
8 ┄      invoke SparseArrays._spmatmul!(y::Vector{Float64}, A::SparseMatrixCSC{Float64, Int64}, x::Vector{Float64}, alpha::Float64, beta::Float64)::Any
└──      goto #9
9 ─      return y
) => Vector{Float64}

You see, it doesn't even mess with the MulAddMul stuff! I think it's okay to inline everything up to _spmatmul!, because all the previous calls in the chain exist only for redirection, so we're not at risk of piling up more and more code.

So, if you have a minute, please submit a PR.
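For anyone following along, the dispatch chain can be traced from the REPL to find the methods in question (a sketch; the methods and files reported depend on the installed Julia and SparseArrays versions):

julia> @which mul!(y, A, x, α, β)   # the 5-argument entry point

julia> @which LinearAlgebra.generic_matvecmul!(y, 'N', A, x,
              LinearAlgebra.MulAddMul(α, β))   # the redirection method from the IR above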

@albertomercurio (Contributor, Author)
So which functions should I inline? Everything up to _spmatmul!? And does this fix the general case mul!(y, A_op_my, x, α, β), or just mul!(y, A_op_my, x, α, false)? It seems to be both.

@dkarrasch (Member)
For what I posted, annotating the generic_matvecmul! method is enough; the others further down the call chain may already have an annotation, or are inlined automatically. A quick check is sketched below.
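Once the annotation is in place, BenchmarkTools (already loaded above) gives a quick way to confirm both cases; @ballocated interpolates the globals so they don't add spurious allocations:

julia> @ballocated mul!($y, $A_op_my, $x, $α, $β)     # expect 0 after the fix

julia> @ballocated mul!($y, $A_op_my, $x, $α, false)  # expect 0 after the fix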

dkarrasch linked a pull request (#567) on Oct 14, 2024 that will close this issue.