Description
Hi, I'm looking for some thoughts on how best to set up SLURM and use `targets` on local hardware. We have a handful of large machines, and users are used to logging directly into them to run jobs. We're setting up SLURM so that people don't collide with each other's jobs so much, and we would like to make good use of it with `targets`.
In general, our larger pipelines have many parallel jobs (e.g., 1000+ Stan model fits), and we have used both `tar_make_future()` and, more recently, `crew_controller_local()` for this, running on the 40-128 cores available on each machine. This works great, especially because a small number of jobs may fail and we can re-run only those, with the rest cached.

For now we expect to have only a handful of large, multi-CPU SLURM nodes - there are plenty of large, non-`targets` jobs that make use of the RAM and many CPUs, and setting up many nodes as containers on each machine seems to involve a lot of overhead and rigidity. In this case, can any of the back-ends make use of multiple cores within a single SLURM node to run multiple targets? Or should we just plan to submit our whole `targets` pipeline as a single job on a single node?
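To make the single-node option concrete, here is a minimal sketch of what I have in mind, assuming the whole pipeline is submitted as one multi-CPU SLURM allocation and the crew workers run as ordinary local processes inside it (`fit_stan_model()` and the worker count are placeholders, not from a real setup):

```r
# _targets.R -- a sketch for running the whole pipeline inside one SLURM job,
# e.g. submitted with:
#   sbatch --nodes=1 --cpus-per-task=64 --wrap 'Rscript -e "targets::tar_make()"'
library(targets)
library(crew)

tar_option_set(
  controller = crew_controller_local(
    workers = 64L,      # match --cpus-per-task in the SLURM allocation
    seconds_idle = 60   # let idle workers exit between waves of targets
  )
)

list(
  tar_target(model_index, seq_len(1000)),
  tar_target(
    fit,
    fit_stan_model(model_index),  # hypothetical per-model fitting function
    pattern = map(model_index)    # one dynamic branch (and one core) per model
  )
)
```

The alternative I'm aware of is `crew.cluster::crew_controller_slurm()`, which submits each crew worker as its own SLURM job; with only a few big nodes, several such workers could still share one node if SLURM schedules by CPU rather than by whole node, though I'm not sure whether that's the intended usage here.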