chore: add make slurmcluster for single-node dev slurm clusters (#824)
* chore: add `make slurmcluster` for dev slurm clusters

* gen ssh keys for user, add better docs

* Update tools/slurm/README.md

* Update slurm-ci image configuration

- Update to launcher 3.3.0 required by -ee master
- Set the proper protocol in master.yaml
- Specify JVM options to reduce memory consumption
- Update slurm memory config to enable node launch
- Increase disk space to enable multiple image download (32g -> 48g)

---------

Co-authored-by: Jerry J. Harrow <[email protected]>
2 people authored and determined-ci committed Feb 2, 2024
1 parent 9d76adc commit 14af14d
Showing 22 changed files with 818 additions and 0 deletions.
8 changes: 8 additions & 0 deletions Makefile
@@ -100,3 +100,11 @@ local: build-bindings get-deps-webui
.PHONY: devcluster
devcluster:
	devcluster -c tools/devcluster.yaml

.PHONY: slurmcluster
slurmcluster:
	$(MAKE) -C tools/slurm slurmcluster

.PHONY: unslurmcluster
unslurmcluster:
	$(MAKE) -C tools/slurm unslurmcluster
7 changes: 7 additions & 0 deletions tools/slurm/.gitignore
@@ -0,0 +1,7 @@
.terraform
.terraform.lock.hcl
*.tfstate*
*.tfvars*
tf.plan

*.pem
9 changes: 9 additions & 0 deletions tools/slurm/Makefile
@@ -0,0 +1,9 @@
.PHONY: slurmcluster
slurmcluster:
	mkdir -p ~/.slurmcluster
	$(MAKE) -C terraform build
	./scripts/slurmcluster.sh

.PHONY: unslurmcluster
unslurmcluster:
	$(MAKE) -C terraform clean
33 changes: 33 additions & 0 deletions tools/slurm/README.md
@@ -0,0 +1,33 @@
## Quick start

1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads).
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials.
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start.
4. Step 3 will ultimately launch a local devcluster. Use this as you typically would [1] (see the sketch below).
5. Release the resources with `make unslurmcluster` when you are done.

[1] It is fine to exit the devcluster and restart it with `make slurmcluster` again as you please.
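
Putting the steps together, a minimal sketch of the loop (assuming `terraform` and `gcloud` are already installed):

```sh
# One-time: credentials so Terraform can create GCP resources.
gcloud auth application-default login

# From the root of the repo: build the cluster and start a local devcluster.
make slurmcluster

# Tear the GCP resources back down when you are done.
make unslurmcluster
```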

## Alternatives

The `make slurmcluster` flow is fast and convenient. Alternatively, you can run
`make -C terraform build` and just use the resulting instance as a dev box: after installing
Determined and VS Code Remote on it, the experience is arguably better once it is set up
(though this is somewhat a matter of preference).
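
A hedged sketch of that flow, run from `tools/slurm` (the instance name and zone below are hypothetical placeholders; use whatever your Terraform configuration actually creates):

```sh
# Create just the GCP instance, without tunnels or a local devcluster.
make -C terraform build

# SSH in and use it as a dev box (name and zone are placeholders, not real values).
gcloud compute ssh --zone=us-central1-c my-slurm-dev-box
```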

## `make slurmcluster` notes

To run Determined + Slurm, you have a few options:

- Spin up a Linux development machine, install all the prerequisite software and run a cluster as a
  customer would, following our publicly available documentation.
- Use `tools/slurmcluster.sh` by following the usage documentation for the script after getting
  access to one of the systems it supports.
- Or use `make slurmcluster` (the code contained within this directory and its children).

Under the hood, this launches a compute instance with Slurm, Singularity (Apptainer), the Cray
Launcher component, and many other dependencies pre-installed. Then, SSH tunnels are opened so that
`localhost:8081` on your machine points at `localhost:8081` on the compute instance and
`localhost:8080` on the compute instance points at `localhost:8080` on your machine. Finally,
`devcluster` is started with the Slurm RM pointed at the remote instance, and local development
with `devcluster` works from here as always.
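
The tunnelling itself lives in `scripts/slurmcluster.sh`; conceptually it amounts to something like the following (the instance address is a placeholder):

```sh
# -L: your localhost:8081 -> the instance's localhost:8081 (launcher side).
# -R: the instance's localhost:8080 -> your localhost:8080 (the master you run locally).
ssh -L 8081:localhost:8081 -R 8080:localhost:8080 <instance-address>
```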
19 changes: 19 additions & 0 deletions tools/slurm/packer/Makefile
@@ -0,0 +1,19 @@
.PHONY: init
init:
	packer init .

.PHONY: fmt
fmt:
	packer fmt .

.PHONY: check
check:
	packer fmt -check .

.PHONY: build-debug
build-debug: main.pkr.hcl
	PACKER_LOG=10 packer build -debug main.pkr.hcl

.PHONY: build
build: main.pkr.hcl
	packer build main.pkr.hcl
21 changes: 21 additions & 0 deletions tools/slurm/packer/README.md
@@ -0,0 +1,21 @@
## Building the image

This sub-repository builds the image `make slurmcluster` uses. The Makefile is the best
documentation for interacting with and building this code. To build, you will need `packer`
installed and `gcloud` available; then run `make build`.

It was last built with `Packer v1.8.6`.
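
A minimal sketch of the build, run from `tools/slurm/packer` (the targets just wrap `packer`, as the Makefile above shows):

```sh
make init    # packer init .  -- installs the required googlecompute plugin
make build   # packer build main.pkr.hcl
```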

## 'Publishing' the updated image

To update the image `make slurmcluster` uses, after building, change the default value for
`vars.boot_disk` in `../terraform/variables.tf` to the new image and commit the change.

`make slurmcluster` is pinned to a specific image, not the image family, so building alone will
not cause (potentially destructive) updates for anyone using it. If you do publish the change
by committing it and someone picks up your change, note that by default `make slurmcluster` does
not `--auto-approve` its Terraform plans, so others will get a warning if the change affects them.
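
To find the name of the image you just built (so you can pin it in `../terraform/variables.tf`), something along these lines should work, assuming the image family set in `main.pkr.hcl`:

```sh
# List images in the family, newest first; copy the top name into vars.boot_disk.
gcloud compute images list \
    --filter="family=det-environments-slurm-ci" \
    --sort-by=~creationTimestamp
```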

## When to do this

Whenever it breaks, or you want to add something.
95 changes: 95 additions & 0 deletions tools/slurm/packer/ansible-playbook.yml
@@ -0,0 +1,95 @@
---
- hosts: all
  become: true

  tasks:
    # Our base image comes preinstalled with slurm, but if you are following this to set up
    # a cluster from scratch, you will need to install slurm.

    - name: apt-get update, apt-get upgrade.
      apt:
        update_cache: yes
        upgrade: yes

    - name: Install utility packages.
      apt:
        name:
          - curl
          - default-jre
          - git
          - htop
          - hwloc
          - iftop
          - iotop
          - jq
          - lsof
          - net-tools
          - screen
          - tmux
          - tree
          - unzip
          - wget
          - zip
          - nfs-common
          - apt-transport-https
          - ca-certificates
          - curl
          - software-properties-common
          - python3-pip
          - virtualenv
          - python3-setuptools
        state: latest

    - name: Add Docker GPG apt Key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present
    - name: Add Docker Repository
      apt_repository:
        repo: deb https://download.docker.com/linux/ubuntu focal stable
        state: present
    - name: Update apt and install docker-ce
      apt:
        name: docker-ce
        state: latest
        update_cache: true

    - name: Install Podman
      apt:
        name: podman
        state: latest

    - name: Install Singularity
      apt:
        deb: https://github.com/apptainer/apptainer/releases/download/v1.1.6/apptainer_1.1.6_amd64.deb

    - name: Install Launcher
      apt:
        deb: "{{ launcher_deb }}"
    - name: Enable launcher.service
      systemd:
        name: launcher.service
        enabled: yes

    - name: Reinstall Munge (uninstall)
      apt:
        name: munge
        state: absent
    - name: Reinstall Munge (install)
      apt:
        name: munge
        state: latest

    - name: Enable slurmctld.service
      systemd:
        name: slurmctld.service
        enabled: yes
    - name: Enable slurmd.service
      systemd:
        name: slurmd.service
        enabled: yes

    - name: Restore motd (Slurm image adds a 'slurm not setup' warning if their scripts haven't run)
      ansible.builtin.copy:
        src: /etc/motd.bak
        dest: /etc/motd
91 changes: 91 additions & 0 deletions tools/slurm/packer/main.pkr.hcl
@@ -0,0 +1,91 @@
packer {
  required_plugins {
    googlecompute = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/googlecompute"
    }
  }
}

variables {
  ssh_username = "packer2"
}

locals {
  static_source_path = "static"
  static_dest_path   = "/tmp/static"
  det_conf_dir       = "/etc/determined"
  slurm_sysconfdir   = "/usr/local/etc/slurm"
  launcher_job_root  = "/var/tmp/launcher"

  launcher_deb_name      = "hpe-hpc-launcher_3.3.0-0_amd64.deb"
  launcher_deb_dest_path = "${local.static_dest_path}/${local.launcher_deb_name}"

  slurm_conf_name      = "slurm.conf"
  slurm_conf_tmp_path  = "${local.static_dest_path}/${local.slurm_conf_name}"
  slurm_conf_dest_path = "${local.slurm_sysconfdir}/${local.slurm_conf_name}"

  slurm_cgroup_conf_name      = "cgroup.conf"
  slurm_cgroup_conf_tmp_path  = "${local.static_dest_path}/${local.slurm_cgroup_conf_name}"
  slurm_cgroup_conf_dest_path = "${local.slurm_sysconfdir}/${local.slurm_cgroup_conf_name}"

  det_master_conf_name      = "master.yaml"
  det_master_conf_tmp_path  = "${local.static_dest_path}/${local.det_master_conf_name}"
  det_master_conf_dest_path = "${local.det_conf_dir}/${local.det_master_conf_name}"
}

source "googlecompute" "determined-hpc-image" {
  project_id              = "determined-ai"
  source_image_project_id = ["schedmd-slurm-public"]
  source_image_family     = "schedmd-v5-slurm-22-05-8-ubuntu-2204-lts"

  image_family      = "det-environments-slurm-ci"
  image_name        = "det-environments-slurm-ci-{{timestamp}}"
  image_description = "det environments with hpc tools to test hpc deployments"

  machine_type = "n1-standard-1"
  disk_size    = "48"
  // us-central1-c seems to be much faster/more reliable. had intermittent failures in us-west1-b
  // with IAP Tunnels being slow to come up.
  zone             = "us-central1-c"
  subnetwork       = "default"
  metadata         = { "block-project-ssh-keys" : "true" }
  omit_external_ip = true
  use_internal_ip  = true
  use_iap          = true
  // ssh_username cannot be 'packer' due to issues with nested packer builds (schedmd-slurm-public
  // images are all built with packer), ssh_clear_authorized_keys and how GCP metadata based
  // ssh-keys are provisioned.
  ssh_username              = var.ssh_username
  temporary_key_pair_type   = "ed25519"
  ssh_clear_authorized_keys = true
}

build {
  name    = "determined-hpc-image"
  sources = ["sources.googlecompute.determined-hpc-image"]

  provisioner "file" {
    source      = local.static_source_path
    destination = local.static_dest_path
  }

  provisioner "shell" {
    inline = [
      "sudo mv ${local.slurm_conf_tmp_path} ${local.slurm_conf_dest_path}",
      "sudo mv ${local.slurm_cgroup_conf_tmp_path} ${local.slurm_cgroup_conf_dest_path}",
      "sudo mkdir -p ${local.det_conf_dir}",
      "sudo mv ${local.det_master_conf_tmp_path} ${local.det_master_conf_dest_path}",
      "sudo mkdir -p ${local.launcher_job_root}"
    ]
  }

  provisioner "shell" {
    script = "scripts/install-ansible.sh"
  }

  provisioner "ansible-local" {
    playbook_file   = "ansible-playbook.yml"
    extra_arguments = ["--verbose", "--extra-vars \"launcher_deb=${local.launcher_deb_dest_path}\""]
  }
}
4 changes: 4 additions & 0 deletions tools/slurm/packer/scripts/install-ansible.sh
@@ -0,0 +1,4 @@
sudo apt update -y
sudo apt install -y software-properties-common
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install -y ansible
1 change: 1 addition & 0 deletions tools/slurm/packer/static/cgroup.conf
@@ -0,0 +1 @@
CgroupAutomount=yes
7 changes: 7 additions & 0 deletions tools/slurm/packer/static/master.yaml
@@ -0,0 +1,7 @@
resource_manager:
  type: slurm
  port: 8081
  protocol: http
  job_storage_root: /var/tmp
  container_run_type: singularity
  launcher_jvm_args: -Xms1024m -Xmx2048m