Description
Hi, I'm looking for some thoughts on how best to set up SLURM and use `targets` on local hardware. We have a handful of large machines, and users are used to logging directly into them to run jobs. We're setting up SLURM so that people don't collide with each other's jobs so much, and we would like to make good use of it with `targets`.
In general, our larger pipelines have many parallel jobs (e.g., 1000+ Stan model fits), and we have used both `tar_make_future()` and, more recently, `crew_controller_local()` for this, running on the 40-128 cores available on each machine. This works great, especially because a small number of jobs may fail and we can re-run only those, with the rest cached.

For now we expect to have only a handful of large, multi-CPU SLURM nodes - there are plenty of large, non-`targets` jobs that make use of the RAM and many CPUs, and setting up many nodes as containers on each machine seems to involve a lot of overhead and rigidity. In this case, can any of the back-ends make use of multiple cores within a single SLURM node to run multiple targets? Or should we just plan to submit our whole `targets` pipeline as a single job on a single node?
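To make the single-node option concrete, here is a minimal sketch of what I have in mind, assuming the whole pipeline is submitted as one multi-CPU SLURM allocation and the crew workers run as ordinary local processes inside it (`fit_stan_model()` and the worker count are placeholders, not from a real setup):

```r
# _targets.R -- a sketch for running the whole pipeline inside one SLURM job,
# e.g. submitted with:
#   sbatch --nodes=1 --cpus-per-task=64 --wrap 'Rscript -e "targets::tar_make()"'
library(targets)
library(crew)

tar_option_set(
  controller = crew_controller_local(
    workers = 64L,      # match --cpus-per-task in the SLURM allocation
    seconds_idle = 60   # let idle workers exit between waves of targets
  )
)

list(
  tar_target(model_index, seq_len(1000)),
  tar_target(
    fit,
    fit_stan_model(model_index),  # hypothetical per-model fitting function
    pattern = map(model_index)    # one dynamic branch (and one core) per model
  )
)
```

The alternative I'm aware of is `crew.cluster::crew_controller_slurm()`, which submits each crew worker as its own SLURM job; with only a few big nodes, several such workers could still share one node if SLURM schedules by CPU rather than by whole node, though I'm not sure whether that's the intended usage here.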