
Implementation of Parallelization to analysis.hydrogenbonds.hbond_analysis #4718

Merged: 12 commits merged into MDAnalysis:develop on Oct 7, 2024

Conversation

@talagayev (Member) commented Sep 25, 2024

Fixes #4664

Changes made in this Pull Request:

  • Parallelization of the backend support for the HydrogenBondAnalysis class in hbond_analysis.py (see the usage sketch below)
  • Addition of parallelization tests in test_hydrogenbonds_analysis.py and fixtures in conftest.py
  • Updated Changelog
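For illustration, a minimal usage sketch (not part of the PR diff itself) of what the new backend support enables, using the small waterPSF/waterDCD test files that come up later in this thread; the selections follow the MDAnalysis user guide water example:

```python
# Hedged usage sketch: the backend/n_workers keywords follow the
# MDAnalysis 2.8+ AnalysisBase.run() interface this PR hooks into.
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
from MDAnalysisTests.datafiles import waterPSF, waterDCD

u = mda.Universe(waterPSF, waterDCD)
hbonds = HydrogenBondAnalysis(
    universe=u,
    hydrogens_sel="name H1 H2",   # water hydrogens (TIP3P naming)
    acceptors_sel="name OH2",     # water oxygen
)
# split the frame loop over 4 worker processes
hbonds.run(backend="multiprocessing", n_workers=4)
print(hbonds.results.hbonds.shape)  # (n_hbonds, 6): frame, indices, distance, angle
```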

One of the pytests still raises a failure while the remainder pass, and I am currently not sure what the reason is.

The failure is:
AttributeError: 'HydrogenBondAnalysis' object has no attribute '_hydrogens', and it appears only in test_no_hydrogens.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developer's Certificate of Origin


📚 Documentation preview 📚: https://mdanalysis--4718.org.readthedocs.build/en/4718/

Added ResultsGroup for hbond and the supported backends
Added the client for HydrogenBondAnalysis
Added client_HydrogenBondAnalysis to the tests
Added to changelog entry about hbond parallelization
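Schematically, the class-level hooks these commits refer to look like the following in the MDAnalysis 2.8+ parallelization scheme. This is a sketch rather than the PR's exact diff, and the specific aggregation function shown is an assumption:

```python
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup

class HydrogenBondAnalysis(AnalysisBase):
    # advertise that the per-frame algorithm can be split across workers
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        # backends the split-apply-combine machinery may dispatch to
        return ("serial", "multiprocessing", "dask")

    def _get_aggregator(self):
        # how per-worker partial results are merged into one Results object;
        # the exact aggregation function depends on how hbonds are stored
        return ResultsGroup(lookup={"hbonds": ResultsGroup.ndarray_hstack})
```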
@pep8speaks commented Sep 25, 2024

Hello @talagayev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 345:80: E501 line too long (88 > 79 characters)

Line 131:1: W293 blank line contains whitespace
Line 132:1: E302 expected 2 blank lines, found 1
Line 134:25: W292 no newline at end of file

Comment last updated at 2024-10-06 15:23:19 UTC

@github-actions (bot) commented Sep 25, 2024

Linter Bot Results:

Hi @talagayev! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

| Code Location | Outcome |
| --- | --- |
| main package | ⚠️ Possible failure |
| testsuite | ⚠️ Possible failure |

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11203036680/job/31139825891


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@orbeckst self-assigned this on Sep 26, 2024
@orbeckst (Member)

Exciting! Did you happen to benchmark?

@talagayev (Member, Author)

> Exciting! Did you happen to benchmark?

Hey @orbeckst, sadly not yet, since I wasn't sure the implementation is correct because of the one failing pytest, but I can check out the performance of the parallel hbond_analysis :)

@talagayev (Member, Author) commented Sep 27, 2024

I ran some benchmarking on a local system and it performed quite well; I tested with 3, 5, and 10 CPUs. A system that needed 10 minutes without parallelization and one that needed 20 minutes showed the following times:

| System (serial runtime) | Backend | 3 CPUs | 5 CPUs | 10 CPUs |
| --- | --- | --- | --- | --- |
| 10 min | multiprocessing | 4 min 23 s | 2 min 56 s | 2 min 6 s |
| 10 min | dask | 4 min 19 s | 2 min 49 s | 1 min 55 s |
| 20 min | multiprocessing | 9 min 13 s | 6 min 26 s | 3 min 39 s |
| 20 min | dask | 8 min 45 s | 5 min 58 s | 3 min 38 s |

The times look reasonable, and dask performs a little faster, at least for the cases I tested :)
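A hedged sketch of how timings like these could be collected; the topology and trajectory file names are placeholders for the concatenated system described below:

```python
import time
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis

u = mda.Universe("system.psf", "long_trajectory.dcd")  # hypothetical files

for backend, n_workers in [("serial", None), ("multiprocessing", 3),
                           ("multiprocessing", 5), ("multiprocessing", 10),
                           ("dask", 3), ("dask", 5), ("dask", 10)]:
    hbonds = HydrogenBondAnalysis(u, hydrogens_sel="name H1 H2",
                                  acceptors_sel="name OH2")
    t0 = time.perf_counter()
    if backend == "serial":
        hbonds.run()
    else:
        hbonds.run(backend=backend, n_workers=n_workers)
    print(f"{backend:16s} n_workers={n_workers}: "
          f"{time.perf_counter() - t0:8.1f} s")
```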

@orbeckst (Member)

These are very encouraging results; it looks like pretty good scaling! Nice. (Now we just have to ensure that the results are also correct...)

What kind of systems were you testing, i.e., number of atoms, number of trajectory frames, type (e.g., water only, protein in water, ...)?

What machine did you test on (CPU?) and where did you store the trajectory (SSD?)?

cc @RMeli @marinegor

@talagayev (Member, Author)

> These are very encouraging results, it looks like pretty good scaling! Nice. (Now we just have to ensure that the results are also correct...)
>
> What kind of systems were you testing, i.e., number of atoms, number of trajectory frames, type (e.g., water only, protein in water, ...)?
>
> What machine did you test on (CPU?) and where did you store the trajectory (SSD?)?

@orbeckst I mainly tried to concatenate the MDAnalysisTests files to get something reproducible. For the tests I mentioned, I ran the PSF and DCD files (only noticing afterwards that no hydrogen bonds were found with the settings I used 🙈). I would instead use the waterPSF and waterDCD files with the HydrogenBondAnalysis options shown in the user guide example below, so that it is easier to reproduce. As for the number of frames, I concatenated 1,000,000 frames so that the analysis takes 5 minutes 48 seconds on my PC on one processor. Checking could be done by comparing the hbonds.results of the non-parallelized and the parallelized analysis; I checked the multiprocessing results with

hbonds_non_parallelized.results == hbonds_multiprocessing_3.results

and the arrays were identical.

https://userguide.mdanalysis.org/stable/examples/analysis/hydrogen_bonds/hbonds.html

As for my system:
CPU: AMD Ryzen 5 5600H (6 cores / 12 threads)
GPU: Nvidia GeForce RTX 3060

I ran everything through a Jupyter notebook to check at each step that it works, and saved the topology and the trajectory with the 1 million frames on my SSD :)
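The correctness check described above could look like this sketch, using the small test files and comparing the full results arrays from a serial and a 3-worker multiprocessing run:

```python
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
from MDAnalysisTests.datafiles import waterPSF, waterDCD

u = mda.Universe(waterPSF, waterDCD)

serial = HydrogenBondAnalysis(
    u, hydrogens_sel="name H1 H2", acceptors_sel="name OH2").run()
parallel = HydrogenBondAnalysis(
    u, hydrogens_sel="name H1 H2", acceptors_sel="name OH2").run(
        backend="multiprocessing", n_workers=3)

# element-wise comparison, as in the check quoted above
assert np.array_equal(serial.results.hbonds, parallel.results.hbonds)
```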

@RMeli (Member) commented Sep 30, 2024

Very nice results @talagayev, thank you for sharing. It's really helpful to have small benchmarks for these PRs, so that we can see that the parallelization indeed speeds things up.

@marinegor (Contributor)

Thanks for sharing indeed, @talagayev! Interesting to see that dask performs slightly better (though I'm not sure it would hold up if you ran the tests multiple times). Also, did you reset the filesystem cache when doing the tests? I assume that might influence the results. As a test, you could either change the order of the tested backends (I assume you ran dask first, multiprocessing second?) or simply copy the file to a new place every time before running the benchmark on it.

@talagayev (Member, Author)

Happy to help :)

@marinegor I didn't reset the cache, so that would influence the tests, and I only ran them once. I could check how it performs over multiple runs and also reset the filesystem cache between runs. As for the order, it was first without parallelization, then multiprocessing with the different numbers of workers, and then dask with the different numbers of workers :)

@marinegor @RMeli for the benchmarks, should I directly use the files that are present in MDAnalysisTests, or what would be the best approach for adding parallelization benchmarks?

@yuxuanzhuang (Contributor)

The reason it was failing is that _donors and _hydrogens were defined in _prepare, which only runs for each computation group, and therefore they were not stored in the original instance.
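A toy illustration (not the actual MDAnalysis code) of why attributes assigned in _prepare never appear on the instance the user holds: with split-apply-combine, each worker operates on its own copy, and only __init__ state (or results merged by the aggregator) survives on the original.

```python
import copy

class ToyAnalysis:
    def __init__(self):
        self.in_init = "set in __init__"        # lives on the original

    def _prepare(self):
        self.in_prepare = "set in _prepare"     # lives only on the copy

    def run_parallel(self):
        worker = copy.copy(self)                # stand-in for a worker process
        worker._prepare()
        # ... per-frame work would happen on `worker` here ...

original = ToyAnalysis()
original.run_parallel()
print(hasattr(original, "in_init"))     # True
print(hasattr(original, "in_prepare"))  # False -> the AttributeError above
```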

@RMeli (Member) commented Oct 3, 2024

> for the benchmarks, should I directly use the files that are present in MDAnalysisTests, or what would be the best approach for adding parallelization benchmarks?

I think it is OK to use the provided files and report the results here, so other people can easily run the benchmark on their systems too if they wish. However, if you have real systems you are interested in, it would definitely be good to see those results too.

I don't think we have discussed automated benchmarks using Airspeed Velocity (or another framework) at this stage; maybe that is something to keep in mind and discuss.
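For reference, an Airspeed Velocity benchmark for this could be sketched as follows; class and method names are hypothetical, and ASV discovers time_* methods and runs them over the params grid:

```python
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
from MDAnalysisTests.datafiles import waterPSF, waterDCD

class HydrogenBondsParallelBench:
    # ASV runs the cartesian product of these parameter lists
    params = (["multiprocessing", "dask"], [1, 2, 4])
    param_names = ["backend", "n_workers"]

    def setup(self, backend, n_workers):
        self.u = mda.Universe(waterPSF, waterDCD)

    def time_hbond_analysis(self, backend, n_workers):
        HydrogenBondAnalysis(
            self.u, hydrogens_sel="name H1 H2", acceptors_sel="name OH2"
        ).run(backend=backend, n_workers=n_workers)
```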

@codecov (bot) commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.53%. Comparing base (84ee67b) to head (e36ac7b).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4718      +/-   ##
===========================================
- Coverage    93.55%   93.53%   -0.02%     
===========================================
  Files          173      185      +12     
  Lines        21451    22523    +1072     
  Branches      3985     3986       +1     
===========================================
+ Hits         20068    21067     +999     
- Misses         929     1002      +73     
  Partials       454      454              


@marinegor (Contributor)

Hi @talagayev, I don't think it's actually necessary to run the benchmarks on the MDAnalysis test files. Though if you still decide to, here's how I ran my benchmarks earlier: https://gist.github.com/marinegor/17558d1685cd2f24a6de65aa99cf5c9e

It also contains a Zenodo link for the trajectory itself.

@yuxuanzhuang (Contributor) left a review:

lgtm!

@orbeckst (Member) left a review:

lgtm, just waiting for CI (after having added the versionchanged)

@orbeckst (Member) commented Oct 6, 2024

@yuxuanzhuang @marinegor @talagayev can you briefly explain why the definitions of donors and acceptors needed to be moved from _prepare to __init__? I am particularly asking because this may be something that we need to add to the docs: what can you do in _prepare and what needs to be elsewhere.

@orbeckst merged commit 474be5b into MDAnalysis:develop on Oct 7, 2024 (22 of 23 checks passed)
@talagayev (Member, Author)

@RMeli @marinegor Ah, I see, thanks for the information; that makes sense. And thanks @marinegor for the link showing how you ran the benchmarks, I can try to apply it in the future :)

@marinegor (Contributor)

@orbeckst

> can you briefly explain why the definitions of donors and acceptors needed to be moved from _prepare to __init__?

I think in general it's OK to define attributes in either of those, unless you'll need them after you run _conclude -- then you certainly want to do it in __init__. In this particular case, the tests are written so that they check the "private" fields _donors, _acceptors and _hydrogens (link), hence the failures.

A general rule might be: if an attribute is small and you'll need it later, put it in __init__; if it's of either size and you won't need it later, put it in _prepare; if it's big and you'll need it later, check again, or be prepared for a memory overhead.
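A small sketch of the memory trade-off in that rule (toy sizes, hypothetical names): anything created in __init__ travels with the instance to every worker, while _prepare state is built per worker and discarded unless the aggregator merges it.

```python
import numpy as np

class BigStateAnalysis:
    def __init__(self):
        # needed after _conclude -> must live here, but every worker
        # receives a copy of this ~80 MB array when the instance is shipped
        self.lookup_table = np.zeros((10_000, 1_000))

    def _prepare(self):
        # scratch space only needed during the frame loop -> per worker,
        # never shipped back to the original instance
        self.scratch = np.empty(1_000)
```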
