GPU preparation work #652

jvermaas · 2020-07-21T22:18:00Z

jvermaas
Jul 21, 2020
Collaborator

COVID work has slowed down a bit, so now I have time to think about how to better integrate colvars onto GPUs to avoid data transfers across the relatively slow PCI bus. As a discussion point, there are a few parts to colvars that aren't ideal from a GPU perspective as it has currently been designed.

Right now, quantities like the atomic coordinates are stored as arrays of rvectors, so the data is stored xyzxyzxyz, etc. On the GPU, the coordinates are stored as independent arrays, (xxxyyyzzz), since this better exposes the parallelism on the hardware. rvector seems to be used everywhere within the module, so in the short term it may be worth rearranging the data within the GPU every timestep into an array of double3 (or float3 depending on the cvm::real typedef).
There are some code patterns that would probably vectorize well, but instead need to resort to simple for loop patterns because std::transform and their thrust equivalents are verboten until VMD allows us to use C++11.

My inclination is to use as much of the existing code-base as is practical at the expense of potential GPU performance, so long as no memory needs to move across the PCI-bus, but this is still at the beginning stages so it would be useful to gather feedback. Thoughts and opinions welcome!

jdmaia · 2020-07-21T23:15:42Z

jdmaia
Jul 21, 2020

This would be a great addition. I'd be glad to collaborate in any way to make it happen and get it integrated with NAMD 3.

Your plan sounds fine to me. Avoiding the latency copying the values over and having the CPU perform the relevant math is where most of the performance will come from, so I wouldn't care about GPU performance of the code right now.

0 replies

jhenin · 2020-07-22T10:07:31Z

jhenin
Jul 22, 2020
Maintainer

This is very much connected to issue #61 - the plan described there to move to flat arrays of shape xyzxyz could be amended.

0 replies

jcphill · 2020-07-22T18:23:52Z

jcphill
Jul 22, 2020

I'd suggest just getting colvars running on the GPU in whatever way is possible with the fewest code changes using whatever bleeding-edge features might help, even if it only runs on Volta and Ampere under CUDA 11. Don't try to optimize performance before you know what the actual bottlenecks are.

0 replies

giacomofiorin · 2020-07-22T22:30:29Z

giacomofiorin
Jul 22, 2020
Maintainer

As @jhenin pointed out there is already a discussion on the matter of refactoring. In some part, it reflects also conversations with another UIUC programmer (John). Regarding details, I am of course open to changes. The only plans that shouldn't be changed are those that are so good that they are carried out quickly... And clearly that plan isn't (see the date on issue #61).

One thing that stands out is that the choice between xyzxyz and xxyyzz is dictated by how the Cartesian coordinates are arranged in NAMD. I put xyzxyz there because that is the arrangement in VMD, which is was the only platform so far where the entire system is stored contiguously (haven't tried yet interfacing with other single-node codes). As far as many collective variables are concerned (e.g. the RMSD function) it is a linear loop over 3N numbers anyway.

We all agree that ruling out C++11 or other modern resources is not useful any more. Eventually even VMD will get there. But then the question is how much of the Colvars GPU code, which will most likely be written with NAMD in mind, could be reused. If it needs to be tied to NAMD, colvarproxy_namd is the right class to handle the GPU-optimized memory. I am happy to help make (some of) the variables use this alternative storage instead of the sub-optimal builtin one.

Alternatively, there is also the option to just wrap NAMD objects, like we do already for centers of mass or GridForces maps, both of which are computed entirely by NAMD code (Colvars is just a wrapper to those features). This could give you guys more flexibility, e.g. to handle the transition from the current single-GPU scheme to a (new and better than before) multi-GPU scheme. Not to mention that you could probably also reuse said variables with tclForces based restraints, Plumed, etc.

0 replies

giacomofiorin · 2020-07-22T22:32:50Z

giacomofiorin
Jul 22, 2020
Maintainer

I guess I put stuff down without actually engaging the conversation, so with that in mind here is a question. What are the variables that NAMD 3 most sorely needs to have on the GPU?

0 replies

HanatoK · 2020-07-22T22:48:45Z

HanatoK
Jul 22, 2020
Collaborator

I think the final goal may be totally avoiding the data transfer between CPU and GPU, so the variables include all the atomic coordinates, colvar grids and biases. Is that possible?

I guess I put stuff down without actually engaging the conversation, so with that in mind here is a question. What are the variables that NAMD 3 most sorely needs to have on the GPU?

0 replies

jvermaas · 2020-07-22T22:50:40Z

jvermaas
Jul 22, 2020
Collaborator Author

All of them I think would be ideal, to make colvars properly GPU-resident the whole time the simulation is running (except for file I/O). Otherwise you still incur a memory transfer across the PCI-bus or whatever its successor is, and the simulation would need to wait for that step to complete. Right now its those memory transfers that appear to be the bottleneck, since calculating relatively simple colvars substantially reduced performance.

0 replies

HanatoK · 2020-07-24T02:11:08Z

HanatoK
Jul 24, 2020
Collaborator

I have put an example code of computing optimal RMSD with a reference frame using CUDA in https://github.com/HanatoK/RMSD_CUDA . It does all kinds of calculations (COM, matrix F, eigenvalues and reduction) on GPU, but of course the code is not optimal.
I don't think the choice of xxxyyyzzz or xyzxyzxyz is a problem. Actually double3 and cvm::rvector are the same, so an array of double3 is still stored in xyzxyzxyz.

0 replies

giacomofiorin · 2021-02-12T18:59:03Z

giacomofiorin
Feb 12, 2021
Maintainer

Hi, here are a few comments that I hope will contribute constructively to this conversation.

To mitigate the additional latency and slowdown generated by a Colvars computation, a general workaround is multiple time-stepping, which is already implemented. The framework is the same as the one used in Plumed, where it is intended to mitigate a similar problem.
Some applications may be reformulated by using better-performing configurations. For example, @jvermaas pointed out privately that the example input that I gave him and Abhi's group as a basis for this application paper is not very scalable. As I noted already to him, that input contains both a center-of-mass restraint (parallelized) and an orientation restraint (not parallelized). However, because he and Abhi's group ended up adding many extraBonds restraints anyway, one could safely convert the orientional restraint into multiple centers-of-mass restraints for individual protein domains. This approach will help immediately for multi-node parallelism (Charm++ in NAMD-speak), but at a later stage also for the GPU-resident parallelism (NAMD 3).
There is a bit of redundancy between this issue, @HanatoK's issue (which at least identifies more specific targets) and private conversations including NAMD Slack. Ultimately this topic greatly exceeds the scope of one issue on this repository, and should probably involve also the new NAMD GitLab repository. For example, I have pointed out on the NAMD Slack that because Colvars is already using NAMD code to compute centers-of-mass over multiple nodes, the natural approach would be doing the same on GPUs. Similarly, when I added parallel communication for GridForces to develop the Multi-Map variable, the most appropriate way to handle it was to get the code reviewed and introduced directly into NAMD.
In general, I encourage everybody interested in working toward this issue to narrow down the problem by ruling out all other options (points 1 and 2 above), and identify which code sections to attack first (point 3). Sound plans of action will include both innovative code solutions and continuous integration with the complete package, which is NAMD in this case.

To those of you who are endeavoring to make complex plans, we will need much more frequent and accurate communication than there ever was in the NAMD project. Whether you are making such plans or are reviewing them, do keep this in mind.

0 replies

HanatoK · 2021-02-13T02:04:24Z

HanatoK
Feb 13, 2021
Collaborator

@giacomofiorin
The issue in my repository is for bookkeeping myself, and it is not a final plan. I would not progress too much with the actual coding for I am busy on writing the doctoral thesis and preparing for defense.
There is a merge request on the NAMD repository (https://gitlab.com/tcbgUIUC/namd/-/merge_requests/22) for supporting GlobalMaster on the single-node GPU version, but it is not merged. As for the MTS feature, it should help to improve the performance, but I think that should also be done in the NAMD side to avoid GPU-CPU copying and context switch every step.

0 replies

giacomofiorin · 2021-02-14T16:52:26Z

giacomofiorin
Feb 14, 2021
Maintainer

Good luck with your defense!

I'll take a closer look again at Julio's draft changes to GlobalMaster, but from the short discussion with him on the NAMD Slack, this is still very early stage. For completeness, I'll copy here what I said there, which is that I had recently very good multi-node scaling by dynamically adapting the list of atoms requested by Colvars to NAMD via the GlobalMaster mechanism. When I'm sure that it doesn't break anything else, I'll post it as a PR here in the coming weeks.

If the GPU-aware implementation of GlobalMaster works the same way as it did in the past, that would immediately also disable the GPU-CPU copying that you refer to. But it also seems the right moment for some streamlining of GlobalMaster, which has been discussed quite a few times at UIUC and I'm totally in support of.

0 replies

HanatoK · 2024-05-02T20:55:59Z

HanatoK
May 2, 2024
Collaborator

I am now a NAMD developer so I can do more on this aspect. For the NAMD side, I have proposed the new interface for GPU-resident NAMD, which avoids the CPU-GPU copy (see https://gitlab.com/tcbgUIUC/namd/-/wikis/Developer-Notes/CudaGlobalMaster). PytorchForces serves as an example of the new interface (see https://gitlab.com/tcbgUIUC/namd/-/wikis/Developer-Notes/PytorchForces). Code are ready (see the feature_cuda_global branch of NAMD) but I can change the interface to fit Colvars' needs.

Any updates from the Colvars side? I am willing to help, but I don't know how to get the ball rolling. @giacomofiorin Do you have any ideas? Thanks!

2 replies

giacomofiorin May 2, 2024
Maintainer

Hi @HanatoK , what you did looks great, as I and others have already commented on the (private) NAMD repository!

What needs to be done on the Colvars side is fairly similar to what was done on the NAMD side over past 5-6 years, i.e. refactoring the computational kernels in the classes derived by colvar::cvc to make them not reliant on a fixed sequential CPU-only loop. The early stages of this are #643 and #644. More refactoring is needed, of course.

Here are a couple of questions:

The development of NAMD 3.0 benefited massively, in my opinion, from allowing people to directly download ready-to-use builds from here. A lot of bugs in NAMD 3.0 were fixed that way. Is this something that UIUC could allow to do here as well, i.e. attaching NAMD builds to releases on this (publicly accessible) repository? The terms of agreement could specify a limited scope, to comply with the UIUC restrictive license. A similar arrangement (though not an identical purpose) is in place for VMD-XPLOR.
Migrating variables to GPU is clearly necessary, but for biases it seems less practical (their code is logically more complex). Can you confirm that your preliminary interface supports calls to CPU code with minimal overhead when the data passed to them is very small (i.e. passing the values of the CVs)?

HanatoK May 2, 2024
Collaborator

Sorry, I missed your comments! I just moved the markdown file in the developer-doc directory in my branch to the wiki page. Maybe the comments are lost. I checked the history (https://gitlab.com/tcbgUIUC/namd/-/blob/8766300507db82cb8a884484eab634fa2f69a484/developer-doc/CudaGlobalMaster.md) but they are still missing.

I am not sure about the license issue, will ask David if that is possible.
I have indeed made a 2D metadynamics example using PytorchForces here. Although the RMSD test case is fast, the problem of that metadynamics example is the performance is hit (slower than Colvars by 50%) by many small pytorch tensor operations, which pytorch is unable to fuse these small tensor operations into a single CUDA kernel. But I think calculating biasing forces (projecting hills) is still efficient. As for the interface, the idea of the new interface is basically passing the atom information to GPU buffers of the client, and then reading biasing forces back from the GPU buffers managed by the client. It is unaware of how CVs are passed inside the client, so I think it is up to Colvars to pass the CVs calculated on GPU to biases on CPU, but that should be easily done by a memcpy (or calling APIs if 3rd libraries such as Kokkos or arrayfire are used). As for Colvars, I suspect that we can add a new item in features_biases describing the support of GPU (maybe f_cvb_gpu), and copy the CVs to CPU if it is not true.

By the way, the SSAGES team starts the PySAGES to migrate to GPU by using JAX, but I think there is a similar performance problem as PytorchForces using pytorch (from their paper):

For the unbiased, pure OpenMM
simulation we achieve a performance of ≈136 ns per day. For the
PySAGES biased simulation, we achieve a performance of ≈75 ns
per day, equating to a biasing overhead of approximately 50%. Figure
9b shows a similar time series analysis as for HOOMD-blue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU preparation work #652

{{title}}

Replies: 12 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

GPU preparation work #652

jvermaas Jul 21, 2020 Collaborator

Replies: 12 comments · 2 replies

jdmaia Jul 21, 2020

jhenin Jul 22, 2020 Maintainer

jcphill Jul 22, 2020

giacomofiorin Jul 22, 2020 Maintainer

giacomofiorin Jul 22, 2020 Maintainer

HanatoK Jul 22, 2020 Collaborator

jvermaas Jul 22, 2020 Collaborator Author

HanatoK Jul 24, 2020 Collaborator

giacomofiorin Feb 12, 2021 Maintainer

HanatoK Feb 13, 2021 Collaborator

giacomofiorin Feb 14, 2021 Maintainer

HanatoK May 2, 2024 Collaborator

giacomofiorin May 2, 2024 Maintainer

HanatoK May 2, 2024 Collaborator

jvermaas
Jul 21, 2020
Collaborator

Replies: 12 comments 2 replies

jdmaia
Jul 21, 2020

jhenin
Jul 22, 2020
Maintainer

jcphill
Jul 22, 2020

giacomofiorin
Jul 22, 2020
Maintainer

giacomofiorin
Jul 22, 2020
Maintainer

HanatoK
Jul 22, 2020
Collaborator

jvermaas
Jul 22, 2020
Collaborator Author

HanatoK
Jul 24, 2020
Collaborator

giacomofiorin
Feb 12, 2021
Maintainer

HanatoK
Feb 13, 2021
Collaborator

giacomofiorin
Feb 14, 2021
Maintainer

HanatoK
May 2, 2024
Collaborator

giacomofiorin May 2, 2024
Maintainer

HanatoK May 2, 2024
Collaborator