Make LOBPCG GPU-compatible #711
@@ -43,19 +43,96 @@
vprintln(args...) = nothing
using LinearAlgebra
using BlockArrays # used for the `mortar` command which makes block matrices
import Base: *
import Base.size, Base.adjoint, Base.Array
# when X or Y are BlockArrays, this makes the return value be a proper array (not a BlockArray)
function array_mul(X::AbstractArray{T}, Y) where {T}
    Z = Array{T}(undef, size(X, 1), size(Y, 2))
    mul!(Z, X, Y)
end
include("../workarounds/gpu_arrays.jl")
# For now, BlockMatrix can store arrays of different types (for example, an element
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?

Review comment: I think that's fine, why would it be an issue?
Reply: I don't really know how type conversion can affect the computations. The way I see it, we can either code the struct like a Tuple (can hold different types of arrays, possibly with different types of elements in them) or like a Vector (everything gets converted to a common type). I was wondering if mixing all the types together (ie coding the struct like a Tuple) wouldn't induce computational errors, as we keep converting things back and forth.
Reply: There's no conversion at all here. You just use the tuple elements in
struct BlockMatrix{T <: Number, D <: Tuple} <: AbstractMatrix{T}
    blocks::D
end

Review comment: Document it here instead of at the constructor.
Review comment: In particular it would be nice if it was more clear that it is column blocks, but I don't find a good name, so just make it clear from a comment that it is a horizontal concatenation of blocks.
Review comment: Note that these are relatively simple wrappers only supporting some multiplication routines, and in particular operations with them materialize to plain arrays.
Review comment: Also explain the relationship to BlockArrays (it's a lightweight subset of functionality, it materializes to plain arrays, it supports GPU).
Review comment: Re name: LazyHcat? (because it's basically equivalent to
Reply: I think LazyHcat is not too bad: it's clearer to have "hcat" than to just have the generic "block" which doesn't give a lot of information. And we probably never will have to implement a true block-matrix structure for LOBPCG.
"""
Build a BlockMatrix containing the given arrays, from left to right.
This function will fail (for now) if:
- the arrays do not all have the same "height" (ie size[1] must match).
"""

Review comment: You're beginning an enumeration with one item.
Review comment: height -> number of rows
Reply: I rewrote the documentation, all of this has been summed up when defining the struct.
function BlockMatrix(arrays::AbstractArray...)
    length(arrays) == 0 && error("Empty BlockMatrix is not currently implemented")
    n_ref = size(arrays[1], 1)
    for array in arrays
        n_i = size(array, 1)
        n_ref != n_i && error("The given arrays do not have matching 'height': " *
                              "cannot build a BlockMatrix out of them.")
    end

Review comment: This is very internal, no need to have a nice error message, just say @assert n_ref == n_i.
Review comment: Or just @assert all(size.(arrays, 1) .== n_ref).
    T = promote_type(map(eltype, arrays)...)
    BlockMatrix{T, typeof(arrays)}(arrays)
end
function Base.size(A::BlockMatrix)
    n = size(A.blocks[1], 1)
    m = sum(size(block, 2) for block in A.blocks)
    (n, m)
end
Base.Array(A::BlockMatrix) = hcat(A.blocks...)

Base.adjoint(A::BlockMatrix) = Adjoint(A)
"""
Given A and B as two BlockMatrices [A1, A2, A3], [B1, B2, B3] form the matrix
A'B. Return an array, not a BlockMatrix.
"""

Review comment: Remove docstring (* does what * does).
@views function Base.:*(Aadj::Adjoint{T, <: BlockMatrix}, B::BlockMatrix) where {T}
    A = Aadj.parent
    rows = size(A)[2]
    cols = size(B)[2]
    ret = similar(A.blocks[1], rows, cols)
    orow = 0  # row offset
    for (iA, blA) in enumerate(A.blocks)
        ocol = 0  # column offset
        for (iB, blB) in enumerate(B.blocks)
            ret[orow .+ (1:size(blA, 2)), ocol .+ (1:size(blB, 2))] = blA' * blB
            ocol += size(blB, 2)
        end
        orow += size(blA, 2)
    end
    ret
end

Review comment: Use .= for this type of stuff (possibly equivalent here, but better practice for this type of operations).
Base.:*(Aadj::Adjoint{T, <: BlockMatrix}, B::AbstractMatrix) where {T} = Aadj * BlockMatrix(B)
"""
Given A as a BlockMatrix [A1, A2, A3] and B a Matrix, compute the matrix-matrix product
A * B avoiding a concatenation of A's blocks to a dense array.
"""

Review comment: Remove docstring (see above).
@views function *(Ablock::BlockMatrix, B::AbstractMatrix)
    res = Ablock.blocks[1] * B[1:size(Ablock.blocks[1], 2), :]  # First multiplication
    offset = size(Ablock.blocks[1], 2)
    for block in Ablock.blocks[2:end]
        mul!(res, block, B[offset .+ (1:size(block, 2)), :], 1, 1)
        offset += size(block, 2)
    end
    res
end
function LinearAlgebra.mul!(res::AbstractMatrix, Ablock::BlockMatrix,
                            B::AbstractVecOrMat, α::Number, β::Number)
    # Has slightly better performance than a naive res = α*A*B - β*res
    mul!(res, Ablock*B, I, α, β)
end

Review comment: Really? Even with dots? I find that surprising.
Reply: I did some more testing, and you are right: it is indeed the same when using dots. I removed the comment.
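The two multiplication routines above rely on standard identities for a horizontal concatenation A = [A1 A2]: the adjoint product A'C decomposes into the blocks Ai'Cj, and A*B accumulates each block times the matching row slice of B. A minimal sketch checking both on plain dense arrays (the matrices below are illustrative, not taken from the PR):

```julia
using LinearAlgebra

# Column blocks of A = [A1 A2] and C = [C1 C2], all with 6 rows
A1, A2 = randn(6, 2), randn(6, 3)
C1, C2 = randn(6, 2), randn(6, 2)
A, C = hcat(A1, A2), hcat(C1, C2)

# Adjoint product assembled block-by-block, as in Base.:*(Aadj, B) above
AtC = [A1' * C1  A1' * C2;
       A2' * C1  A2' * C2]
@assert AtC ≈ A' * C

# Plain product: each block multiplies the matching row slice of B,
# accumulated as in the mul!(res, block, Bslice, 1, 1) loop above
B = randn(5, 4)                 # size(A, 2) == 5
AB = A1 * B[1:2, :] + A2 * B[2 .+ (1:3), :]
@assert AB ≈ A * B
```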
# Perform a Rayleigh-Ritz for the N first eigenvectors.
@timing function rayleigh_ritz(X, AX, N)
    XAX = array_mul(X', AX)
    @assert all(!isnan, XAX)
    F = eigen(Hermitian(XAX))
@timing function rayleigh_ritz(X::BlockMatrix, AX::BlockMatrix, N)
    # Multiplying two BlockMatrices yields an array, not a BlockMatrix
    F = eigen(Hermitian(X' * AX))
    F.vectors[:,1:N], F.values[1:N]
end

Review comment: Remove comment, remove typing of variables.
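For context, the Rayleigh-Ritz step diagonalizes the small projected matrix X'AX and keeps the N lowest eigenpairs. A self-contained sketch with plain arrays and B = I (sizes here are illustrative):

```julia
using LinearAlgebra

n, m, N = 20, 6, 3
A = Symmetric(randn(n, n))
X = Matrix(qr(randn(n, m)).Q)       # orthonormal columns, X'X = I
F = eigen(Hermitian(X' * (A * X)))  # small m-by-m eigenproblem
c, λ = F.vectors[:, 1:N], F.values[1:N]

# The Ritz pairs satisfy (X'AX) c = c Diagonal(λ) up to roundoff
@assert norm(X' * (A * (X * c)) - c * Diagonal(λ)) < 1e-10
```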
@@ -174,16 +251,16 @@ end
    niter = 1
    ninners = zeros(Int,0)
    while true
        BYX = BY'X
        # XXX the one(T) instead of plain old 1 is because of https://github.com/JuliaArrays/BlockArrays.jl/issues/176
        mul!(X, Y, BYX, -one(T), one(T))  # X -= Y*BY'X
        BYX = BY' * X
        mul!(X, Y, BYX, -one(T), one(T))  # X -= Y*BY'X
        # If the orthogonalization has produced results below 2eps, we drop them
        # This is to be able to orthogonalize eg [1;0] against [e^iθ;0],
        # as can happen in extreme cases in the ortho!(cP, cX)
        dropped = drop!(X)
        if dropped != []
            @views mul!(X[:, dropped], Y, BY' * (X[:, dropped]), -one(T), one(T))  # X -= Y*BY'X
            X[:, dropped] .-= Y * (BY' * X[:, dropped])  # X = X - Y*BY'*X
        end

Review comment: Remove comment.
Review comment: Technically here we should probably do as above, but who cares.

        if norm(BYX) < tol && niter > 1
            push!(ninners, 0)
            break
@@ -219,11 +296,9 @@ function final_retval(X, AX, resid_history, niter, n_matvec)
    residuals = AX .- X*Diagonal(λ)
    (λ=λ, X=X,
     residual_norms=[norm(residuals[:, i]) for i in 1:size(residuals, 2)],
     residual_history=resid_history[:, 1:niter+1],
     n_matvec=n_matvec)
     residual_history=resid_history[:, 1:niter+1], n_matvec=n_matvec)
end
### The algorithm is Xn+1 = rayleigh_ritz(hcat(Xn, A*Xn, Xn-Xn-1))
### We follow the strategy of Hetmaniuk and Lehoucq, and maintain a B-orthonormal basis Y = (X,R,P)
### After each rayleigh_ritz step, the B-orthonormal X and P are deduced by an orthogonal rotation from Y
@@ -234,6 +309,7 @@ end
                 miniter=1, ortho_tol=2eps(real(eltype(X))),
                 n_conv_check=nothing, display_progress=false)
    N, M = size(X)

    # If N is too small, we will likely get in trouble
    error_message(verb) = "The eigenproblem is too small, and the iterative " *
                          "eigensolver $verb fail; increase the number of " *
@@ -252,7 +328,7 @@ end
        B_ortho!(X, BX)
    end

    n_matvec = M  # Count number of matrix-vector products
    AX = similar(X)
    AX = mul!(AX, A, X)
    @assert all(!isnan, AX)
@@ -275,6 +351,7 @@ end
    nlocked = 0
    niter = 0  # the first iteration is fake
    λs = @views [(X[:,n]'*AX[:,n]) / (X[:,n]'BX[:,n]) for n=1:M]
    λs = oftype(X[:,1], λs)  # Offload to GPU if needed
    new_X = X
    new_AX = AX
    new_BX = BX
@@ -290,23 +367,23 @@ end

    # Form Rayleigh-Ritz subspace
    if niter > 1
        Y = mortar((X, R, P))
        AY = mortar((AX, AR, AP))
        BY = mortar((BX, BR, BP))  # data shared with (X, R, P) in non-general case
        Y = BlockMatrix(X, R, P)
        AY = BlockMatrix(AX, AR, AP)
        BY = BlockMatrix(BX, BR, BP)  # data shared with (X, R, P) in non-general case
    else
        Y = mortar((X, R))
        AY = mortar((AX, AR))
        BY = mortar((BX, BR))  # data shared with (X, R) in non-general case
        Y = BlockMatrix(X, R)
        AY = BlockMatrix(AX, AR)
        BY = BlockMatrix(BX, BR)  # data shared with (X, R) in non-general case
    end
    cX, λs = rayleigh_ritz(Y, AY, M-nlocked)
    # Update X. By contrast to some other implementations, we
    # wait on updating P because we have to know which vectors
    # to lock (and therefore the residuals) before computing P
    # only for the unlocked vectors. This results in better convergence.
    new_X = array_mul(Y, cX)
    new_AX = array_mul(AY, cX)  # no accuracy loss, since cX orthogonal
    new_BX = (B == I) ? new_X : array_mul(BY, cX)
    new_X = Y * cX
    new_AX = AY * cX  # no accuracy loss, since cX orthogonal
    new_BX = (B == I) ? new_X : BY * cX
end
    ### Compute new residuals

@@ -320,7 +397,7 @@ end
        vprintln(niter, " ", resid_history[:, niter+1])
        if precon !== I
            @timing "preconditioning" begin
                precondprep!(precon, X)  # update preconditioner if needed; defaults to noop
                ldiv!(precon, new_R)
            end
        end
@@ -360,20 +437,23 @@ end
    # orthogonalization, see Hetmaniuk & Lehoucq, and Duersch et. al.
    # cP = copy(cX)
    # cP[Xn_indices,:] .= 0
    e = zeros(eltype(X), size(cX, 1), M - prev_nlocked)
    for i in 1:length(Xn_indices)
        e[Xn_indices[i], i] = 1
    end

    lenXn = length(Xn_indices)
    e = zero(similar(X, size(cX, 1), M - prev_nlocked))
    lower_diag = one(similar(X, lenXn, lenXn))
    # e has zeros everywhere except on one of its lower diagonals
    e[Xn_indices[1]:last(Xn_indices), 1:lenXn] = lower_diag

Review comment: ping @vchuravy who might have an opinion
Reply: But yeah it's annoying that there is no
Reply: Hm. OK then let's just have our own
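The construction of e above avoids scalar indexing (slow or forbidden on GPU arrays): zero(similar(X, m, n)) yields an all-zero matrix and one(similar(X, k, k)) an identity matrix of the same array type as X. A CPU sketch of the trick (sizes and indices below are made up, and Xn_indices is assumed contiguous, as the block assignment requires):

```julia
using LinearAlgebra

X = randn(8, 4)            # stands in for a (possibly GPU) array
Xn_indices = 3:5           # assumed contiguous, as in the code above
lenXn = length(Xn_indices)

# Identity block placed at rows Xn_indices, without any scalar indexing
e = zero(similar(X, 7, lenXn))
e[Xn_indices[1]:last(Xn_indices), 1:lenXn] = one(similar(X, lenXn, lenXn))

# Same matrix built with the explicit loop from the pre-GPU version
e_ref = zeros(eltype(X), 7, lenXn)
for i in 1:lenXn
    e_ref[Xn_indices[i], i] = 1
end
@assert e == e_ref
```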
    cP = cX .- e
    cP = cP[:, Xn_indices]
    # orthogonalize against all Xn (including newly locked)
    ortho!(cP, cX, cX, tol=ortho_tol)

    # Get new P
    new_P = array_mul( Y, cP)
    new_AP = array_mul(AY, cP)
    new_P = Y * cP
    new_AP = AY * cP
    if B != I
        new_BP = array_mul(BY, cP)
        new_BP = BY * cP
    else
        new_BP = new_P
    end
@@ -418,8 +498,8 @@ end

    # Orthogonalize R wrt all X, newly active P
    if niter > 0
        Z = mortar((full_X, P))
        BZ = mortar((full_BX, BP))  # data shared with (full_X, P) in non-general case
        Z = BlockMatrix(full_X, P)
        BZ = BlockMatrix(full_BX, BP)  # data shared with (full_X, P) in non-general case
    else
        Z = full_X
        BZ = full_BX
New file: workarounds/gpu_arrays.jl
@@ -0,0 +1,19 @@
# TODO: remove this when it is implemented in GPUArrays and CUDA
import LinearAlgebra.dot, LinearAlgebra.eigen
using LinearAlgebra
using GPUArrays
using CUDA

# https://github.com/JuliaGPU/CUDA.jl/issues/1565
LinearAlgebra.dot(x::AbstractGPUArray, D::Diagonal, y::AbstractGPUArray) = x' * (D * y)

# https://github.com/JuliaGPU/CUDA.jl/issues/1572
function LinearAlgebra.eigen(A::Hermitian{T,AT}) where {T <: Complex, AT <: CuArray}
    vals, vects = CUDA.CUSOLVER.heevd!('V', 'U', A.data)
    (vectors=vects, values=vals)
end

Review comment: We should open a PR to CUDA.jl to implement this there.

function LinearAlgebra.eigen(A::Hermitian{T,AT}) where {T <: Real, AT <: CuArray}
    vals, vects = CUDA.CUSOLVER.syevd!('V', 'U', A.data)
    (vectors=vects, values=vals)
end
Review comment: Would be good if we can avoid this direct dependency.
Reply: Don't we need it because of the functions in workarounds/gpu_arrays.jl? For the eigen workaround for example, we explicitly use CUDA functions, so if I understand correctly, we need to have CUDA as a dependency for DFTK. As soon as the workarounds are implemented in CUDA, we could remove this.