
Make LOBPCG GPU-compatible #711

Merged (19 commits) on Sep 28, 2022

Conversation

@GVigne (Contributor) commented Aug 23, 2022:

This draft PR implements GPU compatibility for LOBPCG.
At the highest level, the goal is to be able to call LOBPCG with either Arrays or any type of GPU array. This means the user should be able to make the following calls (A is a Hermitian matrix and X is a vector):
LOBPCG(A, X)                     # CPU
LOBPCG(CuArray(A), CuArray(X))   # GPU
LOBPCG(ROCArray(A), ROCArray(X)) # GPU
LOBPCG returns a named tuple. One of its fields is an array of the eigenvalues found, and another holds the corresponding eigenvectors. These two fields have the same array type as the input: if the user calls LOBPCG with CuArrays, then the eigenvalues and eigenvectors will be stored in a CuArray.

I tried to make the code as abstract as possible so as not to rely on a hardware-specific library. The only part where I needed to call specific CUDA functions (I am using an NVIDIA GPU) is when computing eigenvalues. If I am not mistaken, there is no generic eigen function in CUDA.jl, so I had to manually call the eigendecomposition functions (CUDA.syevd! and CUDA.heevd!).

@GVigne GVigne marked this pull request as ready for review September 7, 2022 07:49
@mfherbst (Member) left a comment:
A few comments. Also check the formatting.

Comment on lines 301 to 304
"eigensolver $verb fail; increase the number of " *
"degrees of freedom, or use a dense eigensolver."
N > 3M || error(error_message("will"))
N >= 3M+5 || @warn error_message("might")
Member:
That looks like an unintentional whitespace commit.

Comment on lines 9 to 10
function LinearAlgebra.eigen(A::RealHermSymComplexHerm{T,AT}) where {T,AT <: CuArray}
if eltype(A) <: Complex
Member:
Use multiple dispatch on element type of CuArray.

src/workarounds/gpu_arrays.jl (resolved review thread)
A'B (which is not a BlockMatrix). block_overlap also has compatible versions with two Arrays.
block_overlap always compute some form of adjoint, ie the product A'*B.
"""
@views function block_overlap(A::BlockMatrix, B::BlockMatrix)
Member:
Why is this function not covered by tests? Did you check you end up dispatching correctly?

Member:
Also not too big a fan of having this function. Why not define

Base.:*(A::Adjoint{T, <:BlockMatrix}, B::BlockMatrix) where {T}

instead?

Contributor Author:
At first I thought defining it this way would be clearer; as it turns out, it's just weird and confusing. I rewrote the block_overlap functions: now we only overload Base.:*.

This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
"""
function make_block_vector(arrays::AbstractArray...)
Member:
Just make this a constructor?

Contributor Author:
You're right: no need for a function with a weird name.

src/eigen/lobpcg_hyper_impl.jl (resolved review thread)
Comment on lines 53 to 54
struct BlockMatrix
blocks::Tuple
Member:
This is not type stable, which matters here because of performance. You have to make Tuple a concrete type.

Contributor Author:
Yes, I thought this would also be an issue. The thing is, we sometimes build a BlockMatrix from arrays that are views of other arrays, and taking a view of an array yields a SubArray, which derives from AbstractArray.
So a BlockMatrix can mix arrays and subarrays that have no concrete type in common: the only thing they share is that they derive from AbstractArray. I may be wrong, but I can't think of an easy way to make this field concrete while keeping the code the way it is. Maybe we shouldn't give a BlockMatrix views?

@mfherbst (Member) commented Sep 13, 2022:
Like this?

struct BlockMatrix{T <: AbstractFloat, D <: Tuple} <: AbstractMatrix{T}
    blocks::D
end

Member:
btw. Not sure we should have the size as a member. It's easily computed on the fly from the constituents of blocks.

Contributor Author:
You're right: I was trying to describe more precisely what was hidden behind the Tuple, but doing D <: Tuple will work.
As for the size, I created this member so we don't have to recompute the size of the BlockMatrix every time. As it turns out, we are not using it much and it can be computed very easily, so we can drop it.

@GVigne (Contributor) commented Sep 14, 2022:

Thanks for the feedback! I made some modifications to take into account what you said. I also added a few small tests for the BlockMatrix structure in test/lobpcg to check that nothing is broken.

@antoine-levitt (Member) left a comment:
Super nice! Very cute little struct and nicer code than we had before.


# For now, BlockMatrix can store arrays of different types (for example, an element
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?
Member:
I think that's fine, why would it be an issue?

Contributor Author:
I don't really know how type conversion can affect the computations. The way I see it, we can either code the struct like a Tuple (it can hold different types of arrays, possibly with different element types) or like a Vector (everything gets converted to a common type). I was wondering whether mixing all the types together (i.e. coding the struct like a Tuple) wouldn't introduce numerical errors, as we keep converting things back and forth.

Member:
There's no conversion at all here. You just use the tuple elements in mul! functions, which dispatch to the correct method. I don't think there's any need to change what you have now.

# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?

struct BlockMatrix{T <: Number, D <: Tuple} <: AbstractMatrix{T}
Member:
Document it here instead of at the constructor

Member:
In particular, it would be nice if it were clearer that these are column blocks. I can't find a good name for that, though, so just make it clear from a comment that it is a horizontal concatenation of blocks.

Member:
Note that these are relatively simple wrappers that only support some multiplication routines; in particular, operations with them materialize to plain arrays.

Member:
Also explain the relationship to BlockArrays (it's a lightweight subset of the functionality: it materializes to plain arrays, and it supports GPU arrays).

Member:
Re name: LazyHcat? (Because it's basically equivalent to hcat, but better optimized.)

Contributor Author:
I think LazyHcat is not too bad: it's clearer to have "hcat" than just the generic "block", which doesn't give a lot of information. And we will probably never have to implement a true block-matrix structure for LOBPCG.
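As a rough illustration of the renamed structure, here is a hedged, simplified sketch (not the exact DFTK implementation) of how a LazyHcat-style wrapper can make A' * B materialize to a plain array, matching what hcat of the blocks would give, while never forming the concatenation itself:

```julia
using LinearAlgebra

# Simplified sketch of the LazyHcat idea: a lazy horizontal concatenation
# whose adjoint product A' * B materializes to a plain array, as
# hcat(A...)' * hcat(B...) would.
struct LazyHcat{T <: Number, D <: Tuple} <: AbstractMatrix{T}
    blocks::D
end
LazyHcat(blocks::AbstractArray...) =
    LazyHcat{eltype(blocks[1]), typeof(blocks)}(blocks)

Base.size(A::LazyHcat) = (size(A.blocks[1], 1), sum(size(b, 2) for b in A.blocks))
Base.getindex(A::LazyHcat, i::Int, j::Int) = hcat(A.blocks...)[i, j]  # slow fallback

# Overload Base.:* instead of a custom block_overlap helper: compute the
# pairwise products Ai' * Bj and assemble them into a plain matrix.
function Base.:*(Aadj::Adjoint{T, <:LazyHcat}, B::LazyHcat) where {T}
    A = Aadj.parent
    reduce(vcat, [reduce(hcat, [a' * b for b in B.blocks]) for a in A.blocks])
end
```

So for X = LazyHcat(X1, X2), the product X' * X agrees with hcat(X1, X2)' * hcat(X1, X2), and dispatch goes through the ordinary * syntax rather than a specially named function.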

"""
Build a BlockMatrix containing the given arrays, from left to right.
This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
Member:
You're beginning an enumeration with one item

Member:
height -> number of rows

Contributor Author:
I rewrote the documentation; all of this is now summed up where the struct is defined.

n_ref= size(arrays[1], 1)
for array in arrays
n_i = size(array, 1)
n_ref != n_i && error("The given arrays do not have matching 'height': "*
Member:
this is very internal, no need for a nice error message, just say @assert n_ref == n_i

Member:
or just @assert all(size.(arrays, 1) .== n_ref)


"""
Given A and B as two BlockMatrices [A1, A2, A3], [B1, B2, B3] form the matrix
A'B. Return an array, not a BlockMatrix.
Member:
remove docstring (* does what * does)


function LinearAlgebra.mul!(res::AbstractMatrix, Ablock::BlockMatrix,
B::AbstractVecOrMat, α::Number, β::Number)
# Has slightly better performances than a naive res = α*A*B - β*res
Member:
really? even with dots? I find that surprising

Contributor Author:
I did some more testing, and you are right: it is indeed the same when using dots. I removed the comment.
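For reference, the equivalence being discussed can be checked directly: the 5-argument in-place mul!(res, A, B, α, β) computes res = α*A*B + β*res, and a dotted (fused-broadcast) version gives the same result while allocating only the A*B product. A small self-contained check, using plain Arrays as a stand-in for the BlockMatrix case:

```julia
using LinearAlgebra

# res = α*A*B + β*res, two ways: the 5-arg in-place mul!, and a fused
# broadcast over a temporary A*B. With the dots, both give the same result.
A = rand(10, 10)
B = rand(10, 4)
α, β = 2.0, 0.5

res1 = rand(10, 4)
res2 = copy(res1)

mul!(res1, A, B, α, β)              # in-place, no temporary beyond BLAS workspace
res2 .= α .* (A * B) .+ β .* res2   # fused broadcast; only A*B allocates

@assert res1 ≈ res2
```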

@assert all(!isnan, XAX)
F = eigen(Hermitian(XAX))
@timing function rayleigh_ritz(X::BlockMatrix, AX::BlockMatrix, N)
# Multiplying two BlockMatrices yields an array, not a BlockMatrix
Member:
remove comment, remove typing of variables

src/eigen/lobpcg_hyper_impl.jl (resolved review thread)
# If the orthogonalization has produced results below 2eps, we drop them
# This is to be able to orthogonalize eg [1;0] against [e^iθ;0],
# as can happen in extreme cases in the ortho!(cP, cX)
dropped = drop!(X)
if dropped != []
@views mul!(X[:, dropped], Y, BY' * (X[:, dropped]), -one(T), one(T)) # X -= Y*BY'X
X[:, dropped] .-= Y * (BY' * X[:, dropped]) # X = X - Y*BY'*X
Member:
remove comment

Member:
technically here we should probably do as above, but who cares

end

lenXn = length(Xn_indices)
e = zero(similar(X, size(cX, 1), M - prev_nlocked))
Member:
zero(similar(...)) allocates twice. Should we have a zeros_like function or something, which could fall back to zero(similar(...)) but be better optimized in specific cases?

Member:
ping @vchuravy who might have an opinion

Collaborator:
e = similar(X, size(cX, 1), M - prev_nlocked)
e .= 0

But yeah it's annoying that there is no similar(....; init=0)

Member:
Hm. OK, then let's just have our own zeros_like function that does this; later, as we figure out the patterns we use, we can design our own mini-API (e.g. something to replace comprehensions). Sucks that there's no standard way to do this.

@GVigne (Contributor) commented Sep 26, 2022:

I've updated the PR following @antoine-levitt's comments. Is there anything you want to add @mfherbst ?

Brillouin = "23470ee3-d0df-4052-8b1a-8cbd6363e7f0"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Collaborator:
Would be good if we can avoid this direct dependency

Contributor Author:
Don't we need it because of the functions in workarounds/gpu_arrays.jl? For the eigen workaround, for example, we explicitly use CUDA functions, so if I understand correctly, we need to have CUDA as a dependency for DFTK. As soon as the workarounds are implemented in CUDA.jl, we could remove this.

Project.toml (resolved review thread)
LinearAlgebra.dot(x::AbstractGPUArray, D::Diagonal,y::AbstractGPUArray) = x'*(D*y)

# https://github.com/JuliaGPU/CUDA.jl/issues/1572
function LinearAlgebra.eigen(A::Hermitian{T,AT}) where {T <: Complex,AT <: CuArray}
Collaborator:
We should open a PR to CUDA.jl to implement this there.

@mfherbst (Member) left a comment:
Only some nits with spaces left on my end.

Several further review threads (on src/eigen/lobpcg_hyper_impl.jl, test/lobpcg.jl, and src/workarounds/gpu_arrays.jl) were resolved.
Comment on lines 3 to 7
function zeros_like(X::AT, n, m) where AT <: AbstractArray
Z = similar(X, n, m)
Z .= 0
Z
end
@mfherbst (Member) commented Sep 27, 2022:
I would make this more general directly:

Suggested change
function zeros_like(X::AT, n, m) where AT <: AbstractArray
Z = similar(X, n, m)
Z .= 0
Z
end
function zeros_like(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
Z = similar(X, T, dims...)
Z .= 0
Z
end
zeros_like(X::AbstractArray, dims::Integer...) = zeros_like(X, eltype(X), dims...)
zeros_like(X::Array, T::Type=eltype(X), dims::Integer...=size(X)...) = zeros(T, dims...)

Contributor Author:
This doesn't seem to work because of the default value for dims.
I'm a bit confused by the slurping operator. I understand its purpose, but I don't find the syntax very clear. From what I understand from the docs, writing dims::Integer... means we want n integers to be merged into a tuple called dims: so implicitly, dims is a Tuple of Integers. Is this correct? If it is, that would explain why we can't write dims::Integer...=size(X), as size(X) is a Tuple, not an Integer.
What syntax can I use to explicitly say that we want dims (the tuple) to be size(X) (a tuple) by default? Or is there a way to flatten size(X)?

Member:
This doesn't seem to work because of the default value for dims.

Indeed, I had some typos in my code, corrected now.

so implicitly, dims is here a Tuple of Integers. Is this correct?

yes

Or is there a way to flatten size(X)?

Yes, more ... ;), see the updated code snippet.

Contributor Author:
Oh, ok, makes sense!
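Putting the corrected suggestion together, the splatting works as follows; this simply exercises the snippet from the suggestion above, showing how `size(X)...` splats the size tuple into the `dims::Integer...` slurp to give it a tuple-valued default:

```julia
# The corrected zeros_like from the suggestion above: `size(X)...` splats the
# size tuple into the `dims::Integer...` slurp, giving it a tuple default.
function zeros_like(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
    Z = similar(X, T, dims...)
    Z .= 0
    Z
end
zeros_like(X::AbstractArray, dims::Integer...) = zeros_like(X, eltype(X), dims...)
zeros_like(X::Array, T::Type=eltype(X), dims::Integer...=size(X)...) = zeros(T, dims...)

X = rand(3, 2)
zeros_like(X)            # all defaults: same eltype and size as X
zeros_like(X, 4, 4)      # override the dims only
zeros_like(X, Float32)   # override the element type only
```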

@mfherbst (Member):
I will merge this now, even though we don't yet have a way to test the GPU stuff. This will hopefully follow soon.

@mfherbst mfherbst enabled auto-merge (squash) September 28, 2022 09:27
@mfherbst (Member):
Thanks for your efforts @GVigne !

@mfherbst mfherbst merged commit d28391e into JuliaMolSim:master Sep 28, 2022
@mfherbst mfherbst mentioned this pull request Oct 4, 2022