
MPI on HPC #870

Open · gtdang opened this issue Aug 22, 2024 · 3 comments · May be fixed by #871

Comments

@gtdang (Collaborator) commented Aug 22, 2024

I've been testing the GUI on Brown's HPC system (OSCAR). Running simulations with the MPI backend fails because it requests more processors than my instance is allowed to use. The node my instance runs on has 48 cores, but my instance is not allotted access to all of them.

hnn-core/hnn_core/gui/gui.py

Lines 1916 to 1918 in 18830b5

if backend_selection.value == "MPI":
    backend = MPIBackend(
        n_procs=multiprocessing.cpu_count() - 1, mpi_cmd=mpi_cmd.value)

The GUI initializes the backend at the lines above using multiprocessing.cpu_count(), which returns the node's total core count rather than my instance's allotment.

The joblib backend lets you specify the number of cores in the GUI. Is there a reason this option is not exposed for MPI?

[Screenshot: joblib options]

[Screenshot: MPI options]

This Stack Overflow answer also describes a way to get the number of available cores instead of the total.
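For reference, a minimal sketch of the distinction described above (not the Stack Overflow snippet itself); os.sched_getaffinity is Linux-only, so a fallback is included:

```python
import multiprocessing
import os

# Total cores on the node -- what the GUI currently bases n_procs on.
total_cores = multiprocessing.cpu_count()

# Cores this process is actually allowed to use (e.g. under a SLURM
# allocation on OSCAR). sched_getaffinity is Linux-only, so fall back
# to cpu_count() elsewhere.
try:
    available_cores = len(os.sched_getaffinity(0))
except AttributeError:
    available_cores = multiprocessing.cpu_count()

print(f"node total: {total_cores}, available to this job: {available_cores}")
```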

gtdang linked a pull request (#871) on Aug 23, 2024 that will close this issue
@rythorpe (Contributor)

This shouldn't be an issue if mpiexec is called with oversubscription allowed. Any idea why it's still failing?

If I recall correctly, the goal with the GUI was to expose as little of the parallel backend API as possible while still running the tutorials in a timely manner. Since almost all of the tutorials run single trials, JoblibBackend isn't very useful and MPIBackend can easily be run under the hood by defaulting to the maximal number of parallel jobs.

We can still convert to len(os.sched_getaffinity(0)), but I personally don't think it's necessary to expose the n_procs argument in the GUI. Big picture, I think there's something to be said for the GUI running without too much technical bloat that will confuse new users.
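
A minimal sketch of that swap in the gui.py snippet above, assuming the same surrounding imports (multiprocessing, os) and the existing backend_selection / mpi_cmd widgets; the fallback is only needed on platforms without sched_getaffinity:

```python
if backend_selection.value == "MPI":
    # Use the cores actually allotted to this process rather than the
    # node's total; fall back where sched_getaffinity is unavailable.
    try:
        n_procs = len(os.sched_getaffinity(0))
    except AttributeError:
        n_procs = multiprocessing.cpu_count() - 1
    backend = MPIBackend(n_procs=n_procs, mpi_cmd=mpi_cmd.value)
```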

@dylansdaniels (Collaborator)

@rythorpe Are you suggesting we also remove the Cores: option from the GUI when using the JoblibBackend? If the goal is to remove technical bloat, should we also remove the MPI cmd: option from the GUI? Not sure we expect users to actually change this from the default.

For what it's worth, I personally don't find it too technical to expose the number of cores. It can be nice to see what you have access to on your machine, as long as the max is set to the number of cores available on the instance to prevent user input error.
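
A hedged sketch of that cap, assuming an ipywidgets BoundedIntText like the GUI's existing Cores field (the widget name below is hypothetical):

```python
import multiprocessing
import os

from ipywidgets import BoundedIntText

# Cap the selectable core count at what this instance can actually use.
try:
    max_cores = len(os.sched_getaffinity(0))
except AttributeError:
    max_cores = multiprocessing.cpu_count()

# 'n_cores' is a hypothetical name for the GUI's Cores field.
n_cores = BoundedIntText(value=max_cores, min=1, max=max_cores,
                         description="Cores:")
```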

@rythorpe (Contributor)

> @rythorpe Are you suggesting we also remove the Cores: option from the GUI when using the JoblibBackend? If the goal is to remove technical bloat, should we also remove the MPI cmd: option from the GUI? Not sure we expect users to actually change this from the default.
>
> For what it's worth, I personally don't find it too technical to expose the number of cores. It can be nice to see what you have access to on your machine, as long as the max is set to the number of cores available on the instance to prevent user input error.

See the conversation on the PR for more details. I see the GUI primarily as an educational tool, so while I agree that adding the number of cores as a simulation parameter in the GUI isn't, in itself, a deal breaker, I think there's something to be said for a GUI simulation that "just runs" without users having to sift through a myriad of parameters that don't directly relate to the scientific discourse of our workshops.
