Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dispatch routines for ldiv! with transposed LU and transposed or Hermitian/Symmetric B #55760

Conversation

aravindh-krishnamoorthy
Copy link
Contributor

@aravindh-krishnamoorthy aravindh-krishnamoorthy commented Sep 12, 2024

Introduction

This PR is a possible fix for the issue #55422, where it is observed that, with LU factorisation, rdiv! is much slower than ldiv!, see the issue for details and analysis.

This PR adds four new dispatch routines (methods) for ldiv! with LU: for Adjoint, Transpose, Hermitian, and Symmetric.

Test Suite

using LinearAlgebra, BenchmarkTools
id(A,T) = T(Matrix{eltype(A)}(I, size(A)...))

A = randn(1000,1000);

@btime inv($A);
@btime ldiv!(lu(A), $(id(A,Matrix)));
@btime rdiv!($(id(A,Matrix)), lu(A));
@btime rdiv!($(id(A,Symmetric)), lu(A));

inv(A)  ldiv!(lu(A), id(A,Matrix))  rdiv!(id(A,Matrix), lu(A))  rdiv!(id(A,Symmetric), lu(A))

A = complex.(randn(1000,1000),randn(1000,1000));

@btime inv($A);
@btime ldiv!(lu(A), $(id(A,Matrix)));
@btime rdiv!($(id(A,Matrix)), lu(A));
@btime rdiv!($(id(A,Hermitian)), lu(A));

inv(A)  ldiv!(lu(A), id(A,Matrix))  rdiv!(id(A,Matrix), lu(A))  rdiv!(id(A,Hermitian), lu(A))

Test Results

See #55760 (comment)

@aravindh-krishnamoorthy
Copy link
Contributor Author

aravindh-krishnamoorthy commented Sep 12, 2024

Test Results

julia> using LinearAlgebra, BenchmarkTools
julia> id(A,T) = T(Matrix{eltype(A)}(I, size(A)...))
id (generic function with 1 method)

julia> A = randn(1000,1000);

julia> @btime inv($A);
  29.407 ms (10 allocations: 8.13 MiB)
julia> @btime ldiv!(lu(A), $(id(A,Matrix)));
  35.406 ms (7 allocations: 7.64 MiB)
julia> @btime rdiv!($(id(A,Matrix)), lu(A));
  32.574 ms (11 allocations: 15.27 MiB)
julia> @btime rdiv!($(id(A,Symmetric)), lu(A));
  32.137 ms (12 allocations: 15.27 MiB)

julia> inv(A)  ldiv!(lu(A), id(A,Matrix))  rdiv!(id(A,Matrix), lu(A))  rdiv!(id(A,Symmetric), lu(A))
true

julia> A = complex.(randn(1000,1000),randn(1000,1000));

julia> @btime inv($A);
  66.822 ms (10 allocations: 16.24 MiB)
julia> @btime ldiv!(lu(A), $(id(A,Matrix)));
  96.932 ms (7 allocations: 15.27 MiB)
julia> @btime rdiv!($(id(A,Matrix)), lu(A));
  113.056 ms (11 allocations: 30.53 MiB)
julia> @btime rdiv!($(id(A,Hermitian)), lu(A));
  93.395 ms (12 allocations: 30.53 MiB)

julia> inv(A)  ldiv!(lu(A), id(A,Matrix))  rdiv!(id(A,Matrix), lu(A))  rdiv!(id(A,Hermitian), lu(A))
true

@dkarrasch
Copy link
Member

Wait, but these are not in-place, are they? The assumption is that ldiv! is acting on the second argument, without allocating new memory.

@dkarrasch dkarrasch added the linear algebra Linear algebra label Sep 12, 2024
@aravindh-krishnamoorthy
Copy link
Contributor Author

Wait, but these are not in-place, are they? The assumption is that ldiv! is acting on the second argument, without allocating new memory.

Good evening, @dkarrasch. You're right. Unfortunately, these are not in-place. The original implementation used generic matrix functions, which were about $50$ times slower. On the other hand, the LAPACK versions cannot apply transpose on the second matrix B and so require a copy. Hence, the choice is between a slower implementation and one that has to copy. I chose the latter.

If you have any suggestions for in-place expansion of B, I will be very happy to incorporate them. Unfortunately, I couldn't find any other way other than copy(B) and Matrix(B) to get rid of Transpose and Symmetric/Hermitian, respectively, so that B can be fed as the second matrix to LAPACK.

@martinholters
Copy link
Member

martinholters commented Sep 13, 2024

Could maybe e.g. replace the copy(B) for Transpose with _transpose!(B.parent) where

function _transpose!(M::AbstractMatrix)
    Mt = reshape(M, reverse(size(M)))
    for i in axes(M, 1), j in firstindex(M, 2)+i:lastindex(M, 2)
        (Mt[j,i], M[i,j]) = (transpose(M[i,j]), transpose(Mt[j,i]))
    end
    return Mt
end

And similar for Hermitian and maybe Symmetric.

Have't thought this through completely, but maybe its an idea to work with.

EDIT: The indexing is horribly wrong; this doesn't do a proper transpose.

@dkarrasch
Copy link
Member

dkarrasch commented Sep 13, 2024

The "promise" of ldiv! is not to be fast, but to operate in-place. The usage pattern is like this:

A = factorize(rand(3,3)) # for instance
B = rand(3, 5)
ldiv!(A, B)
# continue to work with B

If this wasn't in-place, you would need C = ldiv!(A, B), to have a handle for the result, which users don't expect. So they would miss the result, and believe the result is stored in B. It does happen already that in-place methods are slower than allocating ones: e.g. that is the case for mul! vs. * with different BLAS-eltypes.

@martinholters
Copy link
Member

I think my proposal should in principle achieve this, as the reshaped array shares memory with the original one. I'll try to fix the indexing to actually do a transposition and then give a more complete implementation of what I have in mind.

@aravindh-krishnamoorthy
Copy link
Contributor Author

The "promise" of ldiv! is not to be fast, but to operate in-place. The usage pattern is like this:

A = factorize(rand(3,3)) # for instance B = rand(3, 5) ldiv!(A, B)
# continue to work with B

If this wasn't in-place, you would need C = ldiv!(A, B), to have a handle for the result, which users don't expect. So they would miss the result, and believe the result is stored in B. It does happen already that in-place methods are slower than allocating ones: e.g. that is the case for mul! vs. * with different BLAS-eltypes.

Thank you, @dkarrasch. I fully agree that in the final version, B must be in-place. To this end, I will evaluate various options of in-place transposition and filling, including the proposal by @martinholters, and other options in generic_matmul.

I feel that it may be best to use LAPACK for rdiv! and ldiv! when the type is BlasComplex or BlasFloat. Would you agree with this assessment?

@dkarrasch
Copy link
Member

Would you agree with this assessment?

It depends. When there are different BLAS-types invovled, there are no such BLAS-methods. So we will need to fall back to generic methods. The point is that we have two orthogonal goals: performance vs memory allocation. When users don't mind the memory allocation, then they can use \, and that may in fact promote eltypes or materialize a wrapper, to finally reach a faster method.

@martinholters
Copy link
Member

So my _transpose! from #55760 (comment) is wrong for rectangular matrices. Indeed, doing an inplace transposition (such that the reshape that gives the correct transpose) is trickier then I though at first as it is more than just pairwise swapping. Not sure there exists a nice solution, but assume there does for the moment, then what I have in mind is something along these lines

function ldiv!(A::TransposeFactorization{T,<:LU{T,<:StridedMatrix}}, B::Transpose{T,<:StridedVecOrMat{T}}) where {T<:Union{BlasFloat,BlasComplex}}
    Bt = reshape(B.parent, size(B))
    copyto!(Bt, copy(B)) # see below
    LAPACK.getrs!('T', A.parent.factors, A.parent.ipiv, Bt)
    copyto!(Bt, copy(B)) # see below
    return B
end

If there is a way to express the copyto!(Bt, copy(B)) (noting the aliasing of Bt and B) as an inplace operation permuting the elements, this could be an attractive alternative.

@aravindh-krishnamoorthy
Copy link
Contributor Author

Would you agree with this assessment?

It depends. When there are different BLAS-types invovled, there are no such BLAS-methods. So we will need to fall back to generic methods. The point is that we have two orthogonal goals: performance vs memory allocation. When users don't mind the memory allocation, then they can use \, and that may in fact promote eltypes or materialize a wrapper, to finally reach a faster method.

Thank you, @dkarrasch. I agree with your response. If you feel it is not worth taking forward, perhaps we can make it clear in the rdiv! documentation that it might be up to $50\mathrm{x}$ slower than ldiv! even for BLAS types. Alternatively, we can move the copy to rdiv! and explain in the documentation about the allocation when A is not a transpose-type and eltypes are BLAS types.

I understand the surprise of OP and the poster on Discourse when rdiv! is $50\mathrm{x}$ slower than ldiv!. I feel that this will be a repeat issue that will be raised several times as users discover it. Hence, it may be better to fix this issue permanently.

@dkarrasch
Copy link
Member

It turns out this was a silly dispatch issue, so this PR should be superseded by #55764.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants