Make LOBPCG GPU-compatible #711
Conversation
A few comments. Also check the formatting.
src/eigen/lobpcg_hyper_impl.jl
Outdated
"eigensolver $verb fail; increase the number of " *
"degrees of freedom, or use a dense eigensolver."
N > 3M || error(error_message("will"))
N >= 3M+5 || @warn error_message("might")
That looks like an unintentional whitespace commit.
src/workarounds/gpu_arrays.jl
Outdated
function LinearAlgebra.eigen(A::RealHermSymComplexHerm{T,AT}) where {T,AT <: CuArray}
    if eltype(A) <: Complex
Use multiple dispatch on element type of CuArray.
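The suggestion can be sketched as follows; `eigen_path` is a hypothetical helper name used only to illustrate letting dispatch pick the real vs. complex code path instead of branching on `eltype(A)` at runtime:

```julia
using LinearAlgebra

# Hypothetical sketch: one method per element type, selected by dispatch.
eigen_path(A::Hermitian{T}) where {T<:Real}    = :syevd   # real symmetric path
eigen_path(A::Hermitian{T}) where {T<:Complex} = :heevd   # complex Hermitian path
```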
src/eigen/lobpcg_hyper_impl.jl
Outdated
A'B (which is not a BlockMatrix). block_overlap also has compatible versions with two Arrays.
block_overlap always computes some form of adjoint, i.e. the product A'*B.
"""
@views function block_overlap(A::BlockMatrix, B::BlockMatrix)
Why is this function not covered by tests? Did you check you end up dispatching correctly?
Also not too big a fan of having this function. Why not define
Base.:*(A::Adjoint{T, <:BlockMatrix}, B::BlockMatrix) where {T}
instead?
At first I thought defining it this way would be clearer; as it turns out, it's just weird and confusing. I rewrote the block_overlap functions, so now we only overload Base.:*.
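The suggested overload can be sketched on a minimal stand-in type; `LazyHcatSketch` and `adjoint_mul` are invented names for illustration, not the PR's actual struct or method:

```julia
using LinearAlgebra

# Minimal stand-in for the PR's block-row type (hypothetical name).
struct LazyHcatSketch{D<:Tuple}
    blocks::D   # horizontal concatenation [A1 A2 ...]
end

# A'B for a block row A = [A1 A2 ...] stacks the blockwise products vertically
# and materializes to a plain matrix, as the PR's overload of * does.
adjoint_mul(A::LazyHcatSketch, B::AbstractMatrix) =
    reduce(vcat, map(Ak -> Ak' * B, A.blocks))
```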
src/eigen/lobpcg_hyper_impl.jl
Outdated
This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
"""
function make_block_vector(arrays::AbstractArray...)
Just make this a constructor?
You're right: no need for a function with a weird name.
src/eigen/lobpcg_hyper_impl.jl
Outdated
struct BlockMatrix
    blocks::Tuple
This is not type stable, which matters here because of performance. You have to make Tuple a concrete type.
Yes, I thought this would also be an issue. The thing is, we sometimes build a BlockMatrix from arrays which are views of other arrays, and taking a view of an array builds a SubArray, which derives from AbstractArray.
So a BlockMatrix can mix arrays and subarrays, which have no concrete type in common: the only thing they share is that they derive from AbstractArray. I may be wrong, but I can't think of an easy way to make this field concrete while keeping the code the way it is: maybe we shouldn't give a BlockMatrix views?
Like this?
struct BlockMatrix{T <: AbstractFloat, D <: Tuple} <: AbstractMatrix{T}
blocks::D
end
btw. Not sure we should have the size as a member. It's easily computed on the fly from the constituents of blocks.
You're right: I was trying to describe more precisely what was hidden behind the Tuple, but doing D <: Tuple will work.
As for the size, I created this member so we don't have to recompute the size of the BlockMatrix every time. As it turns out, we are not using it much and it can be computed very easily, so we can drop it.
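For illustration, computing the size on the fly needs only a one-line method; `SizeSketch` is a hypothetical stand-in for the PR's struct:

```julia
# Hypothetical stand-in; the real struct lives in lobpcg_hyper_impl.jl.
struct SizeSketch{T<:Number,D<:Tuple} <: AbstractMatrix{T}
    blocks::D
end
SizeSketch(blocks::AbstractMatrix...) =
    SizeSketch{eltype(first(blocks)),typeof(blocks)}(blocks)

# Row count from the first block, column count summed over all blocks.
Base.size(A::SizeSketch) = (size(first(A.blocks), 1), sum(b -> size(b, 2), A.blocks))
```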
Thanks for the feedback! I made some modifications to take into account what you said. I also added a few small tests for the BlockMatrix structure.
Super nice! Very cute little struct and nicer code than we had before.
src/eigen/lobpcg_hyper_impl.jl
Outdated
# For now, BlockMatrix can store arrays of different types (for example, an element
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?
I think that's fine, why would it be an issue?
I don't really know how type conversion can affect the computations. The way I see it, we can either code the struct like a Tuple (can hold different types of arrays, possibly with different types of elements in them) or like a Vector (everything gets converted to a common type). I was wondering if mixing all the types together (ie coding the struct like a Tuple) wouldn't induce computational errors, as we keep converting things back and forth.
There's no conversion at all here. You just use the tuple elements in mul! functions, which dispatch to the correct method. I don't think there's any need to change what you have now.
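The dispatch argument can be made concrete with a small sketch: multiplying a block row [A1 A2 ...] by a plain matrix B slices B row-wise and calls mul! block by block, and each call dispatches on the concrete block type (Matrix, SubArray, CuArray, ...) with no conversion. `blocks_times` is an invented helper name:

```julia
using LinearAlgebra

# res = [A1 A2 ...] * B, accumulated blockwise via 5-arg mul!.
function blocks_times(blocks::Tuple, B::AbstractMatrix)
    res = zeros(eltype(B), size(first(blocks), 1), size(B, 2))
    row = 0
    for A in blocks
        n = size(A, 2)
        mul!(res, A, view(B, row+1:row+n, :), true, true)  # res += A * B[rows, :]
        row += n
    end
    res
end
```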
src/eigen/lobpcg_hyper_impl.jl
Outdated
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?

struct BlockMatrix{T <: Number, D <: Tuple} <: AbstractMatrix{T}
Document it here instead of at the constructor
In particular it would be nice if it was more clear that it is column blocks, but I don't find a good name, so just make it clear from a comment that it is a horizontal concatenation of blocks
note that these are relatively simple wrappers only supporting some multiplication routines, and in particular operations with them materialize to plain arrays
also explain relationship to BlockArrays (it's a lightweight subset of functionality, it materializes to plain arrays, it supports GPU)
Re name: LazyHcat? (because it's basically equivalent to hcat but better optimized)
I think LazyHcat is not too bad: it's clearer to have "hcat" than to just have the generic "block" which doesn't give a lot of information. And we probably never will have to implement a true block-matrix structure for LOBPCG.
src/eigen/lobpcg_hyper_impl.jl
Outdated
"""
Build a BlockMatrix containing the given arrays, from left to right.
This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
You're beginning an enumeration with one item
height -> number of rows
I rewrote the documentation, all of this has been summed up when defining the struct.
src/eigen/lobpcg_hyper_impl.jl
Outdated
n_ref = size(arrays[1], 1)
for array in arrays
    n_i = size(array, 1)
    n_ref != n_i && error("The given arrays do not have matching 'height': "*
this is very internal, no need to have a nice error message, just say @assert n_ref == n_i
or just @assert all(size.(arrays, 1) .== n_ref)
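The suggested check can be sketched inside a constructor-like function; `make_blocks` is a hypothetical name standing in for the PR's constructor:

```julia
# All blocks of a horizontal concatenation must share the same row count.
function make_blocks(arrays::AbstractMatrix...)
    n_ref = size(first(arrays), 1)
    @assert all(size.(arrays, 1) .== n_ref)  # one-line internal sanity check
    arrays
end
```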
src/eigen/lobpcg_hyper_impl.jl
Outdated
"""
Given A and B as two BlockMatrices [A1, A2, A3], [B1, B2, B3] form the matrix
A'B. Return an array, not a BlockMatrix.
remove docstring (* does what * does)
src/eigen/lobpcg_hyper_impl.jl
Outdated
function LinearAlgebra.mul!(res::AbstractMatrix, Ablock::BlockMatrix,
                            B::AbstractVecOrMat, α::Number, β::Number)
    # Has slightly better performances than a naive res = α*A*B - β*res
really? even with dots? I find that surprising
I did some more testing, and you are right: it is indeed the same when using dots. I removed the comment.
src/eigen/lobpcg_hyper_impl.jl
Outdated
@timing function rayleigh_ritz(X::BlockMatrix, AX::BlockMatrix, N)
    # Multiplying two BlockMatrices yields an array, not a BlockMatrix
    @assert all(!isnan, XAX)
    F = eigen(Hermitian(XAX))
remove comment, remove typing of variables
src/eigen/lobpcg_hyper_impl.jl
Outdated
# If the orthogonalization has produced results below 2eps, we drop them
# This is to be able to orthogonalize eg [1;0] against [e^iθ;0],
# as can happen in extreme cases in the ortho!(cP, cX)
dropped = drop!(X)
if dropped != []
    @views mul!(X[:, dropped], Y, BY' * (X[:, dropped]), -one(T), one(T))  # X -= Y*BY'X
    X[:, dropped] .-= Y * (BY' * X[:, dropped])  # X = X - Y*BY'*X
remove comment
technically here we should probably do as above, but who cares
src/eigen/lobpcg_hyper_impl.jl
Outdated
end

lenXn = length(Xn_indices)
e = zero(similar(X, size(cX, 1), M - prev_nlocked))
zero(similar(...)) allocates twice. Should we have a zeros_like function or something, that could fall back to zero(similar(...)) but be better optimized in specific cases?
ping @vchuravy who might have an opinion
e = similar(X, size(cX, 1), M - prev_nlocked)
e .= 0
But yeah it's annoying that there is no similar(...; init=0)
Hm. OK then let's just have our own zeros_like function that does this, and later as we figure out the patterns we use we can design our own mini-API (eg something to replace comprehensions). Sucks there's no standard way to do this.
I've updated the PR following @antoine-levitt's comments. Is there anything you want to add @mfherbst?
Brillouin = "23470ee3-d0df-4052-8b1a-8cbd6363e7f0"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Would be good if we can avoid this direct dependency
Don't we need it because of the functions in workarounds/gpu_arrays.jl? For the eigen workaround for example, we explicitly use CUDA functions, so if I understand correctly, we need to have CUDA as a dependency for DFTK. As soon as the workarounds are implemented in CUDA, we could remove this.
LinearAlgebra.dot(x::AbstractGPUArray, D::Diagonal, y::AbstractGPUArray) = x'*(D*y)

# https://github.com/JuliaGPU/CUDA.jl/issues/1572
function LinearAlgebra.eigen(A::Hermitian{T,AT}) where {T <: Complex, AT <: CuArray}
We should open a PR to CUDA.jl to implement this there.
Only some nits with spaces left on my end.
src/common/zeros_like.jl
Outdated
function zeros_like(X::AT, n, m) where AT <: AbstractArray
    Z = similar(X, n, m)
    Z .= 0
    Z
end
I would make this more general directly:
function zeros_like(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
    Z = similar(X, T, dims...)
    Z .= 0
    Z
end
zeros_like(X::AbstractArray, dims::Integer...) = zeros_like(X, eltype(X), dims...)
zeros_like(X::Array, T::Type=eltype(X), dims::Integer...=size(X)...) = zeros(T, dims...)
This doesn't seem to work because of the default value for dims.
I'm a bit confused by the slurping operator. I understand its purpose, but I don't find the syntax very clear. From what I understand by reading the docs, writing dims::Integer... means we want n integers to be merged into a tuple called dims: so implicitly, dims is here a Tuple of Integers. Is this correct? If it is, then it would explain why we can't write dims::Integer...=size(X), as size(X) is a Tuple, not an Integer.
What syntax can I use to explicitly say that we want dims (the tuple) to be size(X) (a tuple) by default? Or is there a way to flatten size(X)?
This doesn't seem to work because of the default value for dims.
Indeed, I had some typos in my code, corrected now.
so implicitly, dims is here a Tuple of Integers. Is this correct?
yes
Or is there a way to flatten size(X)?
Yes, more ... ;), see the updated code snippet.
Oh, ok, makes sense!
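The slurp/splat pair discussed above can be demonstrated in a few lines; `zeros_like_sketch` is a hypothetical name mirroring the suggested `zeros_like`:

```julia
# `dims::Integer...` slurps the trailing arguments into a tuple, and
# `size(X)...` splats a tuple back into individual arguments, which is
# why it works as a vararg default.
function zeros_like_sketch(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
    Z = similar(X, T, dims...)
    Z .= 0
    Z
end
```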
I will merge this now, even though we don't yet have a way to test the GPU stuff. This will hopefully follow soon.
Thanks for your efforts @GVigne !
This draft PR implements GPU compatibility for LOBPCG.
At the highest level, the goal is to be able to call LOBPCG with either Arrays or any type of GPU Arrays. This means the user should be able to make the following calls (A is a Hermitian matrix and X is a vector):
LOBPCG(A, X)                     # CPU
LOBPCG(CuArray(A), CuArray(X))   # GPU
LOBPCG(ROCArray(A), ROCArray(X)) # GPU
LOBPCG returns a named tuple. One of its fields is an array of the eigenvalues found, and another holds the corresponding eigenvectors. These two fields have the same array type as the input: if the user calls LOBPCG with CuArrays, then the eigenvalues and eigenvectors will be stored in a CuArray.
I tried to make the code as abstract as possible so as not to rely on a hardware-specific library. The only part where I needed to call specific CUDA functions (I am using an NVIDIA GPU) is when computing eigenvalues. If I am not mistaken, there is no generic eigen function in CUDA.jl, so I had to manually call the eigendecomposition functions (CUDA.syevd! and CUDA.heevd!).