
[WIP] ENH: add ability to spawn MPI jobs from parent job #506

Open
wants to merge 13 commits into master

Conversation

@rythorpe (Contributor) commented Jul 17, 2022

The goal here is to allow MPIBackend to spawn child jobs from an existing MPI job/communicator. A user could then run any simulation script (e.g., plot_simulate_mpi_backend.py) on a computing cluster with the following command, and it will spawn as many sub-processes as are specified by MPIBackend(n_procs=...) to complete the simulation job:

$ mpiexec -np 1 --oversubscribe python -m mpi4py /path/to/script.py
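For context, here is a minimal mpi4py sketch of the parent-side spawning pattern this enables (illustrative only; the worker script path, n_procs value, and parameter names are placeholders, not the actual hnn-core API):

```python
from mpi4py import MPI
import sys

n_procs = 4  # stands in for MPIBackend(n_procs=...) in this sketch

# Spawn child worker processes from the single parent rank.
child_comm = MPI.COMM_SELF.Spawn(
    sys.executable,
    args=['-m', 'mpi4py', '/path/to/child_worker.py'],
    maxprocs=n_procs,
)

# Send inputs to all children, then gather their results. On the parent side
# of the intercommunicator, the root passes root=MPI.ROOT.
params = {'tstop': 170.0}
child_comm.bcast(params, root=MPI.ROOT)
results = child_comm.gather(None, root=MPI.ROOT)
child_comm.Disconnect()
```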

I'm currently interested in this use-case for two reasons:

  1. It will allow a user to implement more complex yet faster MPI configurations on a computing cluster, e.g., parallelizing across both simulations and neurons, whereas currently only one or the other is possible.
  2. It solves (I think) some of the security concerns with allowing a process to instantiate its own MPI parent job, which is what MPIBackend currently does.

closes #477

@rythorpe (Contributor, Author)

For the time being, I've created a new file called mpi_comm_spawn_child.py that more-or-less mirrors the style of mpi_child.py. Eventually, these files should be combined into one. It also currently runs, so give it a try!
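For readers unfamiliar with the spawn pattern, the child side typically looks something like the following (a hedged sketch in the spirit of mpi_comm_spawn_child.py, not its actual contents):

```python
from mpi4py import MPI

# The spawned child connects back to the parent through an intercommunicator.
comm = MPI.Comm.Get_parent()
rank = comm.Get_rank()

params = comm.bcast(None, root=0)  # receive simulation parameters from the parent
result = {'rank': rank}            # placeholder for this rank's share of the work
comm.gather(result, root=0)        # return results to the parent
comm.Disconnect()
```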

@rythorpe added the enhancement (New feature or request) and parallelization (Parallel processing) labels on Jul 17, 2022
Comment on lines +12 to +25
"""The MPISimulation class.

Parameters
----------
skip_mpi_import : bool | None
Skip importing MPI. Only useful for testing with pytest.

Attributes
----------
comm : mpi4py.Comm object
The handle used for communicating among MPI processes
rank : int
The rank for each processor part of the MPI communicator
"""
@rythorpe (Contributor, Author)

Update docstring

Comment on lines 745 to 746
if (not self.mpi_comm_spawn) and self.n_procs > 1:
kill_proc_name('nrniv')
@rythorpe (Contributor, Author)

Note that kill_proc_name only works when a subprocess is started with Popen and appended to proc_queue.
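To illustrate the constraint (an assumed sketch, not the hnn-core implementation): only processes the backend itself launched via Popen and tracked in proc_queue can be cleaned up this way, so there is nothing to kill when the children are spawned through MPI instead.

```python
import subprocess

proc_queue = []  # processes this backend launched and is responsible for

def launch(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc_queue.append(proc)
    return proc

def kill_tracked():
    # Children spawned via MPI_Comm_spawn never enter proc_queue,
    # so this loop has nothing to do when mpi_comm_spawn=True.
    while proc_queue:
        proc = proc_queue.pop()
        if proc.poll() is None:  # still running
            proc.terminate()
```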

@codecov-commenter commented Jul 17, 2022

Codecov Report

Merging #506 (c616558) into master (c7460ba) will decrease coverage by 0.87%.
The diff coverage is 34.78%.

❗ Current head c616558 differs from pull request most recent head 99c7baf. Consider uploading reports for the commit 99c7baf to get more accurate results

@@            Coverage Diff             @@
##           master     #506      +/-   ##
==========================================
- Coverage   87.63%   86.75%   -0.88%     
==========================================
  Files          19       20       +1     
  Lines        3792     3843      +51     
==========================================
+ Hits         3323     3334      +11     
- Misses        469      509      +40     
Impacted Files Coverage Δ
hnn_core/mpi_child.py 96.47% <ø> (ø)
hnn_core/mpi_comm_spawn_child.py 0.00% <0.00%> (ø)
hnn_core/parallel_backends.py 79.68% <60.00%> (-1.81%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c7460ba...99c7baf.

@jasmainak (Collaborator)

I'm interested! You'll need to educate me a little over a video call about what the motivation was and what the pros/cons are of spawning jobs through MPI vs. Python, etc.

@rythorpe (Contributor, Author)

Let's set up a time to chat more about this!

@ntolley (Contributor) commented Jul 18, 2022

I'd definitely like to be a part of the meeting as well. I'm wondering if this is also the reason why MPI doesn't work in WSL?

@jasmainak (Collaborator)

Propose a time! I'm okay with Wed/Thu afternoon and anytime Friday.

@jasmainak (Collaborator)

@rythorpe I added the comment that it closes #477

@rythorpe (Contributor, Author)

FYI, I think the reason MPIBackend is set up the way it is currently (i.e., prior to this PR) is specifically to provide timeouts + tracebacks when errors occur prior to a blocking MPI call. (See mpi4py/mpi4py#53 for a discussion on this topic.) An MPIBackend subprocess created via subprocess.Popen enforces a timeout when waiting for communications to/from child MPI processes; when the timeout monitored by the parent process is exceeded, the children are terminated and their stderr and stdout are captured on the way out.
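A minimal sketch of that Popen-based pattern (assumed, not the exact MPIBackend code): run the MPI job as a subprocess, enforce the timeout from the parent, and capture stderr/stdout so the child's traceback survives.

```python
import subprocess
import sys

cmd = ['mpiexec', '-np', '2', sys.executable, '-m', 'mpi4py', '/path/to/script.py']
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
try:
    out, err = proc.communicate(timeout=300)  # timeout monitored by the parent
except subprocess.TimeoutExpired:
    proc.kill()                                # terminate the hung MPI job
    out, err = proc.communicate()              # collect whatever was written
print(err.decode(), file=sys.stderr)           # surface the child's traceback
```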

Correspondingly, unit tests that explicitly test the guts of MPIBackend should use the original code path (i.e., calling mpiexec internally when mpi_comm_spawn=False) so that tracebacks can be preserved and logged. Tests for the new functionality introduced here will also be written; however, they should not replace the previous tests.

@jasmainak (Collaborator) commented Jul 23, 2022

Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html

using an -m flag? Try to avoid peppering the codebase with try/except blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...

@rythorpe (Contributor, Author)

> Seems like this might be a way to go: https://mpi4py.readthedocs.io/en/stable/mpi4py.run.html using an -m flag? Try to avoid peppering the codebase with try/except blocks ... it's really an anti-pattern. But if that's the only way to get MPI to work gracefully, you can put it in one function ...

Yes, this is what I've been doing when calling mpiexec; however, I haven't really explored using try/except statements. For this to be a robust and dependable solution for handling errors, we'd have to put try/except statements before every blocking MPI call in the codebase. This might work, though, since there aren't many MPI calls... I'll need to look into this more.
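One way to keep the try/except logic in a single place, as suggested above (a hedged sketch; safe_mpi_call is a hypothetical helper, not existing code): wrap each blocking MPI call so an exception on any rank aborts the whole job instead of leaving the other ranks blocked, which mirrors what python -m mpi4py does globally.

```python
import traceback
from mpi4py import MPI

def safe_mpi_call(func, *args, **kwargs):
    """Run a blocking MPI call; abort all ranks if it raises."""
    try:
        return func(*args, **kwargs)
    except Exception:
        traceback.print_exc()
        MPI.COMM_WORLD.Abort(1)  # prevents the other ranks from hanging forever

# e.g., safe_mpi_call(comm.bcast, params, root=0)
```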

@rythorpe (Contributor, Author)

Minor update: I finally got this branch to run off of a master MPI process on Oscar (Brown's HPC). See here for a demo.

@jasmainak jasmainak mentioned this pull request Feb 9, 2023
Labels: enhancement (New feature or request), parallelization (Parallel processing)
Projects: None yet
Development
Successfully merging this pull request may close these issues: run mpi in main process
4 participants