This library provides some useful decorators for dask_jobqueue. It also expands its scope to include MPI workloads, with extended configuration options for such workloads and for heterogeneous resources.
The documentation for the library is a work in progress, but we have a tutorial repository that covers the scope and capabilities of the package in detail.
To help people try out this library, we have created a set of Docker containers that allow you to test the usage from within a notebook on a (toy) SLURM cluster.
The commands below will start a set of docker containers consisting of a login node and two compute nodes. The SLURM resource manager and a JupyterLab instance are linked to the login node. Feel free to try, learn and explore using the tutorial notebooks you find in JupyterLab.
Requirements:
- The setup requires ~5GB of disk space and will download ~1GB of data over the internet.
- Docker
- docker-compose
- You may also need to manage docker as a non-root user, depending on the permissions you have on the system where you are executing the commands.
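A quick way to check these requirements before starting is sketched below. The helper name missing_tools is our own and is not part of the repository; it relies only on the standard command -v and df utilities. Note that df reports free space for the current directory's filesystem, while Docker may store its images on a different one.

```shell
# Hypothetical helper (not part of jobqueue_features): print every tool
# from the argument list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools docker docker-compose)
if [ -n "$missing" ]; then
  echo "Missing requirements:" $missing
else
  echo "docker and docker-compose were found"
fi

# Show free space on the current filesystem; the setup needs roughly 5GB.
df -h .
```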
The Docker containers can be found in the tutorial subdirectory and leverage the continuous integration infrastructure used by the project.
We maintain a separate tutorial repository which is home to the notebooks used within the tutorial setup.
We have packaged the various commands necessary to set up the infrastructure into a set of bash functions. In order to use these bash functions you need to source the script that defines them. This requires you to first clone the repository:
# Clone the repository
git clone https://github.com/E-CAM/jobqueue_features.git
# Enter the directory
cd jobqueue_features/tutorial
# Configure our commands to start/stop/clean our containers
source jupyter.sh
The bash functions hide away the details of what is done to start, stop and clean up the infrastructure. If you are curious as to what is actually happening, you can look into the file tutorial/jupyter.sh or use the type command to see how the function is defined (e.g., type start_slurm).
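For example, after sourcing the script you could confirm that the functions named in this document are available in your current shell. This is only a sketch: is_defined is a hypothetical helper of our own, and type also succeeds for regular commands, not only for functions.

```shell
# Hypothetical helper: succeeds if the name resolves to something the
# shell can run (a function, builtin, alias or executable).
is_defined() {
  type "$1" >/dev/null 2>&1
}

# Function names as they appear in this document.
for fn in start_tutorial stop_slurm clean_tutorials clean_slurm; do
  if is_defined "$fn"; then
    echo "$fn: available"
  else
    echo "$fn: not found (has jupyter.sh been sourced in this shell?)"
  fi
done
```

A function sourced in one terminal is not visible in another, so this check needs to run in the same shell session where you ran source jupyter.sh.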
There are three bash functions, the first of which sets the environment up:
# Start and configure the cluster
start_tutorial
This step includes cloning the tutorial (which can be found at https://github.com/E-CAM/jobqueue_features_workshop_materials) inside the cluster.
You should now be able to access the JupyterLab instance from your browser via the link that is printed when the command has completed (something similar to http://localhost:SOME_PORT/lab/workspaces/lab). There you will find a number of notebooks to work through.
If you would like to stop the tutorial, you can use:
# Stop containers
stop_slurm
This command also erases the containers, but does not remove the docker images (which means you won't need to make a big download again) or user data (which means any changes you made to notebooks or files will still be available).
If you want to reset the tutorials, you can do that with:
# Reset the tutorial data (docker images are kept)
clean_tutorials
This will remove all data related to the tutorial, but will leave the docker images on your machine (so you will not need to do major downloads to run start_tutorial).
The infrastructure consumes quite a bit of space and once you have completed the tutorial, you will probably want to reclaim this. The following command will completely remove all traces of the infrastructure from your system:
# Do a complete clean up of the infrastructure
clean_slurm
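If you want to confirm what is left on your machine after stopping or cleaning up, Docker's own reporting commands can help. The sketch below wraps them in a hypothetical helper, show_docker_usage, which is not part of the repository; it uses only the standard docker system df, docker ps -a and docker images commands and does nothing destructive.

```shell
# Hypothetical helper: print Docker's view of disk usage, if the docker
# CLI and daemon are available. It only reads state, it removes nothing.
show_docker_usage() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH"
    return 0
  fi
  # Summary of space used by images, containers, volumes and build cache.
  docker system df || echo "could not query the docker daemon"
  # Containers (including stopped ones) and images still present.
  docker ps -a || true
  docker images || true
}

show_docker_usage
```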
- Please be aware that each docker image uses quite a lot of disk space; as a rule of thumb, you should have at least 5GB of disk space available.
- This setup is designed only for your local machine and tutorial usage; it is not intended to be used for a heavy workload.
- If your configuration does not allow you to start docker/docker-compose without sudo, you can work around this with sudo bash -c "$(declare -f start_slurm); start_slurm" instead of simply start_slurm (with the same approach for stop_slurm and clean_slurm).
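The sudo workaround relies on a general bash behaviour: a newly started shell does not inherit functions defined in your current one, but declare -f prints a function's source so that the new shell can define it before calling it. The sketch below demonstrates this without sudo, using a toy function greet that stands in for start_slurm (the name is ours, purely for illustration). It needs to be run in bash, since declare is a bash builtin.

```shell
# Toy function standing in for start_slurm (hypothetical, for illustration).
greet() { echo "hello from $1"; }

# A new shell does not inherit functions defined in the current one...
bash -c 'greet child' 2>/dev/null || echo "greet is not defined in the child shell"

# ...but declare -f prints the function's source, which the new shell
# evaluates before calling it.
bash -c "$(declare -f greet); greet child"
```

Only the named function is passed along this way; any other function it calls would need to be listed in the declare -f call as well.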