Make LOBPCG GPU-compatible #711
Conversation
A few comments. Also check the formatting.
src/eigen/lobpcg_hyper_impl.jl
Outdated
"eigensolver $verb fail; increase the number of " *
"degrees of freedom, or use a dense eigensolver."
N > 3M || error(error_message("will"))
N >= 3M+5 || @warn error_message("might")
That looks like an unintentional whitespace commit.
src/workarounds/gpu_arrays.jl
Outdated
function LinearAlgebra.eigen(A::RealHermSymComplexHerm{T,AT}) where {T,AT <: CuArray}
    if eltype(A) <: Complex
Use multiple dispatch on element type of CuArray.
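The suggestion can be sketched as follows; `eigen_path` is a hypothetical helper name used only to illustrate letting dispatch pick the real vs. complex code path instead of branching on `eltype(A)` at runtime:

```julia
using LinearAlgebra

# Hypothetical sketch: one method per element type, selected by dispatch.
eigen_path(A::Hermitian{T}) where {T<:Real}    = :syevd   # real symmetric path
eigen_path(A::Hermitian{T}) where {T<:Complex} = :heevd   # complex Hermitian path
```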
src/eigen/lobpcg_hyper_impl.jl
Outdated
A'B (which is not a BlockMatrix). block_overlap also has compatible versions with two Arrays.
block_overlap always computes some form of adjoint, i.e. the product A'*B.
"""
@views function block_overlap(A::BlockMatrix, B::BlockMatrix)
Why is this function not covered by tests? Did you check you end up dispatching correctly?
Also not too big a fan of having this function. Why not define
Base.:*(A::Adjoint{T, <:BlockMatrix}, B::BlockMatrix) where {T}
instead?
At first I thought defining it this way would be clearer; as it turns out, it's just weird and confusing. I rewrote the block_overlap functions, so now we only overload Base.:*.
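The suggested overload can be sketched on a minimal stand-in type; `LazyHcatSketch` and `adjoint_mul` are invented names for illustration, not the PR's actual struct or method:

```julia
using LinearAlgebra

# Minimal stand-in for the PR's block-row type (hypothetical name).
struct LazyHcatSketch{D<:Tuple}
    blocks::D   # horizontal concatenation [A1 A2 ...]
end

# A'B for a block row A = [A1 A2 ...] stacks the blockwise products vertically
# and materializes to a plain matrix, as the PR's overload of * does.
adjoint_mul(A::LazyHcatSketch, B::AbstractMatrix) =
    reduce(vcat, map(Ak -> Ak' * B, A.blocks))
```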
src/eigen/lobpcg_hyper_impl.jl
Outdated
This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
"""
function make_block_vector(arrays::AbstractArray...)
Just make this a constructor?
You're right: no need for a function with a weird name.
src/eigen/lobpcg_hyper_impl.jl
Outdated
struct BlockMatrix
    blocks::Tuple
This is not type stable, which matters here because of performance. You have to make Tuple a concrete type.
Yes, I thought this would also be an issue. The thing is, we sometimes build a BlockMatrix from arrays which are views of other arrays, and taking a view of an array builds a SubArray, which derives from AbstractArray.
So a BlockMatrix can mix arrays and subarrays, which have no concrete type in common: the only thing they share is that they derive from AbstractArray. I may be wrong, but I can't think of an easy way to make this field concrete while keeping the code the way it is: maybe we shouldn't give a BlockMatrix views?
Like this?
struct BlockMatrix{T <: AbstractFloat, D <: Tuple} <: AbstractMatrix{T}
blocks::D
end
btw. Not sure we should have the size as a member. It's easily computed on the fly from the constituents of blocks.
You're right: I was trying to describe more precisely what was hidden behind the Tuple, but doing D <: Tuple will work.
As for the size, I created this member so we don't have to recompute the size of the BlockMatrix every time. As it turns out, we are not using it much and it can be computed very easily, so we can drop it.
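For illustration, computing the size on the fly needs only a one-line method; `SizeSketch` is a hypothetical stand-in for the PR's struct:

```julia
# Hypothetical stand-in; the real struct lives in lobpcg_hyper_impl.jl.
struct SizeSketch{T<:Number,D<:Tuple} <: AbstractMatrix{T}
    blocks::D
end
SizeSketch(blocks::AbstractMatrix...) =
    SizeSketch{eltype(first(blocks)),typeof(blocks)}(blocks)

# Row count from the first block, column count summed over all blocks.
Base.size(A::SizeSketch) = (size(first(A.blocks), 1), sum(b -> size(b, 2), A.blocks))
```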
Thanks for the feedback! I made some modifications to take into account what you said. I also added a few small tests for the BlockMatrix structure.
Super nice! Very cute little struct and nicer code than we had before.
src/eigen/lobpcg_hyper_impl.jl
Outdated
# For now, BlockMatrix can store arrays of different types (for example, an element
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?
I think that's fine, why would it be an issue?
I don't really know how type conversion can affect the computations. The way I see it, we can either code the struct like a Tuple (can hold different types of arrays, possibly with different types of elements in them) or like a Vector (everything gets converted to a common type). I was wondering if mixing all the types together (ie coding the struct like a Tuple) wouldn't induce computational errors, as we keep converting things back and forth.
There's no conversion at all here. You just use the tuple elements in mul! functions, which dispatch to the correct method. I don't think there's any need to change what you have now.
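The dispatch argument can be made concrete with a small sketch: multiplying a block row [A1 A2 ...] by a plain matrix B slices B row-wise and calls mul! block by block, and each call dispatches on the concrete block type (Matrix, SubArray, CuArray, ...) with no conversion. `blocks_times` is an invented helper name:

```julia
using LinearAlgebra

# res = [A1 A2 ...] * B, accumulated blockwise via 5-arg mul!.
function blocks_times(blocks::Tuple, B::AbstractMatrix)
    res = zeros(eltype(B), size(first(blocks), 1), size(B, 2))
    row = 0
    for A in blocks
        n = size(A, 2)
        mul!(res, A, view(B, row+1:row+n, :), true, true)  # res += A * B[rows, :]
        row += n
    end
    res
end
```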
src/eigen/lobpcg_hyper_impl.jl
Outdated
# of type views and one of type Matrix). Maybe for performance issues it should only
# store arrays of the same type?

struct BlockMatrix{T <: Number, D <: Tuple} <: AbstractMatrix{T}
Document it here instead of at the constructor
In particular it would be nice if it was more clear that it is column blocks, but I don't find a good name, so just make it clear from a comment that it is a horizontal concatenation of blocks
note that these are relatively simple wrappers only supporting some multiplication routines, and in particular operations with them materialize to plain arrays
also explain relationship to BlockArrays (it's a lightweight subset of functionality, it materializes to plain arrays, it supports GPU)
Re name: LazyHcat? (because it's basically equivalent to hcat but better optimized)
I think LazyHcat is not too bad: it's clearer to have "hcat" than to just have the generic "block" which doesn't give a lot of information. And we probably never will have to implement a true block-matrix structure for LOBPCG.
src/eigen/lobpcg_hyper_impl.jl
Outdated
"""
Build a BlockMatrix containing the given arrays, from left to right.
This function will fail (for now) if:
-the arrays do not all have the same "height" (ie size[1] must match).
You're beginning an enumeration with one item
height -> number of rows
I rewrote the documentation, all of this has been summed up when defining the struct.
src/eigen/lobpcg_hyper_impl.jl
Outdated
n_ref = size(arrays[1], 1)
for array in arrays
    n_i = size(array, 1)
    n_ref != n_i && error("The given arrays do not have matching 'height': "*
this is very internal, no need to have a nice error message, just say @assert n_ref == n_i
or just @assert all(size.(arrays, 1) .== n_ref)
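The suggested check can be sketched inside a constructor-like function; `make_blocks` is a hypothetical name standing in for the PR's constructor:

```julia
# All blocks of a horizontal concatenation must share the same row count.
function make_blocks(arrays::AbstractMatrix...)
    n_ref = size(first(arrays), 1)
    @assert all(size.(arrays, 1) .== n_ref)  # one-line internal sanity check
    arrays
end
```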
src/eigen/lobpcg_hyper_impl.jl
Outdated
"""
Given A and B as two BlockMatrices [A1, A2, A3], [B1, B2, B3] form the matrix
A'B. Return an array, not a BlockMatrix.
remove docstring (* does what * does)
src/eigen/lobpcg_hyper_impl.jl
Outdated
function LinearAlgebra.mul!(res::AbstractMatrix, Ablock::BlockMatrix,
                            B::AbstractVecOrMat, α::Number, β::Number)
    # Has slightly better performances than a naive res = α*A*B - β*res
really? even with dots? I find that surprising
I did some more testing, and you are right: it is indeed the same when using dots. I removed the comment.
src/eigen/lobpcg_hyper_impl.jl
Outdated
@timing function rayleigh_ritz(X::BlockMatrix, AX::BlockMatrix, N)
    # Multiplying two BlockMatrices yields an array, not a BlockMatrix
    @assert all(!isnan, XAX)
    F = eigen(Hermitian(XAX))
remove comment, remove typing of variables
src/eigen/lobpcg_hyper_impl.jl
Outdated
# If the orthogonalization has produced results below 2eps, we drop them
# This is to be able to orthogonalize eg [1;0] against [e^iθ;0],
# as can happen in extreme cases in the ortho!(cP, cX)
dropped = drop!(X)
if dropped != []
    @views mul!(X[:, dropped], Y, BY' * (X[:, dropped]), -one(T), one(T))  # X -= Y*BY'X
    X[:, dropped] .-= Y * (BY' * X[:, dropped])  # X = X - Y*BY'*X
remove comment
technically here we should probably do as above, but who cares
src/eigen/lobpcg_hyper_impl.jl
Outdated
end

lenXn = length(Xn_indices)
e = zero(similar(X, size(cX, 1), M - prev_nlocked))
zero(similar(...)) allocates twice. Should we have a zeros_like function or something, that could fall back to zero(similar(...)) but be better optimized in specific cases?
ping @vchuravy who might have an opinion
e = similar(X, size(cX, 1), M - prev_nlocked)
e .= 0
But yeah it's annoying that there is no similar(...; init=0)
Hm. OK then let's just have our own zeros_like function that does this, and later as we figure out the patterns we use we can design our own mini-API (eg something to replace comprehensions). Sucks there's no standard way to do this.
I've updated the PR following @antoine-levitt's comments. Is there anything you want to add @mfherbst?
Brillouin = "23470ee3-d0df-4052-8b1a-8cbd6363e7f0"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Would be good if we can avoid this direct dependency
Don't we need it because of the functions in workarounds/gpu_arrays.jl? For the eigen workaround for example, we explicitly use CUDA functions, so if I understand correctly, we need to have CUDA as a dependency for DFTK. As soon as the workarounds are implemented in CUDA, we could remove this.
LinearAlgebra.dot(x::AbstractGPUArray, D::Diagonal, y::AbstractGPUArray) = x'*(D*y)

# https://github.com/JuliaGPU/CUDA.jl/issues/1572
function LinearAlgebra.eigen(A::Hermitian{T,AT}) where {T <: Complex, AT <: CuArray}
We should open a PR to CUDA.jl to implement this there.
Only some nits with spaces left on my end.
src/common/zeros_like.jl
Outdated
function zeros_like(X::AT, n, m) where AT <: AbstractArray
    Z = similar(X, n, m)
    Z .= 0
    Z
end
I would make this more general directly:
function zeros_like(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
    Z = similar(X, T, dims...)
    Z .= 0
    Z
end
zeros_like(X::AbstractArray, dims::Integer...) = zeros_like(X, eltype(X), dims...)
zeros_like(X::Array, T::Type=eltype(X), dims::Integer...=size(X)...) = zeros(T, dims...)
This doesn't seem to work because of the default value for dims.
I'm a bit confused by the slurping operator. I understand its purpose, but I don't find the syntax very clear. From what I understand by reading the docs, writing dims::Integer... means we want n integers to be merged into a tuple called dims: so implicitly, dims is here a Tuple of Integers. Is this correct? If it is, then it would explain why we can't write dims::Integer...=size(X), as size(X) is a Tuple, not an Integer.
What syntax can I use to explicitly say that we want dims (the tuple) to be size(X) (a tuple) by default? Or is there a way to flatten size(X)?
This doesn't seem to work because of the default value for dims.
Indeed, I had some typos in my code, corrected now.
so implicitly, dims is here a Tuple of Integers. Is this correct?
yes
Or is there a way to flatten size(X)?
Yes, more ... ;), see the updated code snippet.
Oh, ok, makes sense!
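The slurp/splat pair discussed above can be demonstrated in a few lines; `zeros_like_sketch` is a hypothetical name mirroring the suggested `zeros_like`:

```julia
# `dims::Integer...` slurps the trailing arguments into a tuple, and
# `size(X)...` splats a tuple back into individual arguments, which is
# why it works as a vararg default.
function zeros_like_sketch(X::AbstractArray, T::Type=eltype(X), dims::Integer...=size(X)...)
    Z = similar(X, T, dims...)
    Z .= 0
    Z
end
```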
I will merge this now, even though we don't yet have a way to test the GPU stuff. This will hopefully follow soon.
Thanks for your efforts @GVigne !
This draft PR implements GPU compatibility for LOBPCG.
At the highest level, the goal is to be able to call LOBPCG with either Arrays or any type of GPU Arrays. This means the user should be able to make the following calls (A is a Hermitian matrix and X is a vector):
LOBPCG(A, X)                     # CPU
LOBPCG(CuArray(A), CuArray(X))   # GPU
LOBPCG(ROCArray(A), ROCArray(X)) # GPU
LOBPCG returns a named tuple. One of its fields is an array of the eigenvalues found, and another holds the corresponding eigenvectors. These two fields have the same array type as the input: if the user calls LOBPCG with CuArrays, then the eigenvalues and eigenvectors will be stored in a CuArray.
I tried to make the code as abstract as possible so as not to rely on a hardware-specific library. The only part where I needed to call specific CUDA functions (I am using an NVIDIA GPU) is when computing eigenvalues. If I am not mistaken, there is no generic eigen function in CUDA.jl, so I had to manually call the eigendecomposition functions (CUDA.syevd! and CUDA.heevd!).