Commit
chore: add `make slurmcluster` for single-node dev slurm clusters (#824)
* chore: add `make slurmcluster` for dev slurm clusters
* gen ssh keys for user, add better docs
* Update tools/slurm/README.md
* Update slurm-ci image configuration
  - Update to launcher 3.3.0 required by -ee master
  - Set the proper protocol in master.yaml
  - Specify JVM options to reduce memory consumption
  - Update slurm memory config to enable node launch
  - Increase disk space to enable multiple image download (32g -> 48g)

---------

Co-authored-by: Jerry J. Harrow <[email protected]>
1 parent 9d76adc, commit 14af14d
Showing 22 changed files with 818 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
.terraform | ||
.terraform.lock.hcl | ||
*.tfstate* | ||
*.tfvars* | ||
tf.plan | ||
|
||
*.pem |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
.PHONY: slurmcluster | ||
slurmcluster: | ||
mkdir -p ~/.slurmcluster | ||
$(MAKE) -C terraform build | ||
./scripts/slurmcluster.sh | ||
|
||
.PHONY: unslurmcluster | ||
unslurmcluster: | ||
$(MAKE) -C terraform clean |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
## Quick start | ||
|
||
1. Install Terraform following [these instructions](https://developer.hashicorp.com/terraform/downloads). | ||
2. Download the [GCP CLI](https://cloud.google.com/sdk/docs/install-sdk) and run `gcloud auth application-default login` to get credentials. | ||
3. Run `make slurmcluster` from the root of the repo and wait (up to 10 minutes) for it to start. | ||
4. Step 3 will ultimately launch a local devcluster. Use this as you typically would [1]. | |
5. Release the resources with `make unslurmcluster` when you are done. | ||
|
||
[1] It is fine to exit this and restart it with `make slurmcluster` again as you please. | ||
|
||
## Alternatives | ||
|
||
The `make slurmcluster` flow is fast and convenient. Alternatively, you can run | |
`make -C terraform build` and use the resulting instance directly as a dev box after | |
installing Determined and VS Code Remote on it. Once it is set up, that experience can be | |
better (though this is somewhat a matter of preference). | |
|
||
## `make slurmcluster` notes | ||
|
||
To run Determined + Slurm, you have a few options: | ||
|
||
- Spin up a Linux development machine, install all the prerequisite software and run a cluster as a | ||
customer would, following our publicly available documentation. | |
- Use `tools/slurmcluster.sh` by following the usage documentation for the script after getting | ||
access to one of the systems it supports. | ||
- Or use `make slurmcluster` (the code contained within this directory and its children). | ||
|
||
Under the hood, this launches a compute instance with Slurm, Singularity (Apptainer), the Cray | ||
Launcher component and many other dependencies pre-installed. Then, SSH tunnels are opened so that | ||
`localhost:8081` on your machine points at `localhost:8081` on the compute instance and | |
`localhost:8080` on the compute instance points at `localhost:8080` on your machine. Last, | ||
`devcluster` is started with the Slurm RM pointed at the remote instance, and local development | ||
with `devcluster` works from here as always. |
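
The tunnel arrangement described in the README above can be sketched roughly as follows. This is only an illustration using plain OpenSSH port-forwarding flags, not the contents of `scripts/slurmcluster.sh` (which is not shown in this part of the diff and may work differently, for example by going through `gcloud`). The function name `tunnel_cmd` and the `user@instance` target are hypothetical, and the snippet prints the command instead of opening a connection.

```shell
#!/usr/bin/env bash
# Sketch only: prints an ssh command rather than running it.
# tunnel_cmd and its argument are hypothetical names, not taken from the repo.
tunnel_cmd() {
  local target="$1"  # placeholder such as user@instance (assumption)
  # -N: forward ports only, do not run a remote command.
  # -L: localhost:8081 on your machine -> localhost:8081 on the instance.
  # -R: localhost:8080 on the instance -> localhost:8080 on your machine.
  printf 'ssh -N -L 8081:localhost:8081 -R 8080:localhost:8080 %s\n' "$target"
}

tunnel_cmd "user@instance"
```

The `-L` forward matches the `port: 8081` the Slurm resource manager is configured with, while the `-R` forward presumably lets processes on the instance reach whatever the local `devcluster` serves on port 8080.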
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
.PHONY: init | ||
init: | ||
packer init . | ||
|
||
.PHONY: fmt | ||
fmt: | ||
packer fmt . | ||
|
||
.PHONY: check | ||
check: | ||
packer fmt -check . | ||
|
||
.PHONY: build-debug | ||
build-debug: main.pkr.hcl | ||
PACKER_LOG=10 packer build -debug main.pkr.hcl | ||
|
||
.PHONY: build | ||
build: main.pkr.hcl | ||
packer build main.pkr.hcl |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
## Building the image | ||
|
||
This sub-repository builds the image `make slurmcluster` uses. The Makefile is the best | ||
documentation for interacting with and building this code. To build, you will need to install | ||
`packer` and have `gcloud`, then run `make build`. | ||
|
||
It was last built with `Packer v1.8.6`. | ||
|
||
## 'Publishing' the updated image | |
|
||
To update the image `make slurmcluster` uses, after building, change the default value for | ||
`vars.boot_disk` in `../terraform/variables.tf` to the new image and commit the change. | ||
|
||
`make slurmcluster` is pinned to a specific image, not the image family, so just building will | ||
not cause (potentially destructive) updates to anyone using it. If you do publish the change | ||
by committing it and someone picks up your change, by default, `make slurmcluster` does not | ||
`--auto-approve` its Terraform plans so others will get a warning if it affects them. | ||
|
||
## When to do this | |
|
||
Whenever it breaks, or you want to add something. |
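
As a hedged sketch of the 'publishing' step: one way to look up the newest image in the family after a build is `gcloud compute images describe-from-family`. The family `det-environments-slurm-ci` and project `determined-ai` are the values set in `main.pkr.hcl`; the helper name `latest_image_cmd` is hypothetical, and the snippet only prints the command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Sketch only: prints a gcloud command rather than running it.
# latest_image_cmd is a hypothetical helper, not part of the repo.
latest_image_cmd() {
  local family="$1" project="$2"
  # describe-from-family resolves the most recent non-deprecated image in a family;
  # --format "value(name)" limits the output to the image name.
  printf 'gcloud compute images describe-from-family %s --project %s --format "value(name)"\n' \
    "$family" "$project"
}

latest_image_cmd det-environments-slurm-ci determined-ai
```

The resulting image name would then go into the default for `vars.boot_disk` in `../terraform/variables.tf`, in whatever form that variable expects. Because Terraform stays pinned to that explicit name, building a new image alone changes nothing for other users.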
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,95 @@ | ||
--- | ||
- hosts: all | ||
become: true | ||
|
||
tasks: | ||
# Our base image comes preinstalled with slurm, but if you are following this to set up | |
# a cluster from scratch, you will need to install slurm. | ||
|
||
- name: apt-get update, apt-get upgrade. | ||
apt: | ||
update_cache: yes | ||
upgrade: yes | ||
|
||
- name: Install utility packages. | ||
apt: | ||
name: | ||
- curl | ||
- default-jre | ||
- git | ||
- htop | ||
- hwloc | ||
- iftop | ||
- iotop | ||
- jq | ||
- lsof | ||
- net-tools | ||
- screen | ||
- tmux | ||
- tree | ||
- unzip | ||
- wget | ||
- zip | ||
- nfs-common | ||
- apt-transport-https | ||
- ca-certificates | ||
- curl | ||
- software-properties-common | ||
- python3-pip | ||
- virtualenv | ||
- python3-setuptools | ||
state: latest | ||
|
||
- name: Add Docker GPG apt Key | ||
apt_key: | ||
url: https://download.docker.com/linux/ubuntu/gpg | ||
state: present | ||
- name: Add Docker Repository | ||
apt_repository: | ||
repo: deb https://download.docker.com/linux/ubuntu focal stable | ||
state: present | ||
- name: Update apt and install docker-ce | ||
apt: | ||
name: docker-ce | ||
state: latest | ||
update_cache: true | ||
|
||
- name: Install Podman | ||
apt: | ||
name: podman | ||
state: latest | ||
|
||
- name: Install Singularity | ||
apt: | ||
deb: https://github.com/apptainer/apptainer/releases/download/v1.1.6/apptainer_1.1.6_amd64.deb | ||
|
||
- name: Install Launcher | ||
apt: | ||
deb: "{{ launcher_deb }}" | ||
- name: Enable launcher.service | ||
systemd: | ||
name: launcher.service | ||
enabled: yes | ||
|
||
- name: Reinstall Munge (uninstall) | ||
apt: | ||
name: munge | ||
state: absent | ||
- name: Reinstall Munge (install) | ||
apt: | ||
name: munge | ||
state: latest | ||
|
||
- name: Enable slurmctld.service | ||
systemd: | ||
name: slurmctld.service | ||
enabled: yes | ||
- name: Enable slurmd.service | ||
systemd: | ||
name: slurmd.service | ||
enabled: yes | ||
|
||
- name: Restore motd (Slurm image adds a 'slurm not setup' warning if their scripts haven't run) | ||
ansible.builtin.copy: | ||
src: /etc/motd.bak | ||
dest: /etc/motd |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
packer { | ||
required_plugins { | ||
googlecompute = { | ||
version = ">= 1.0.0" | ||
source = "github.com/hashicorp/googlecompute" | ||
} | ||
} | ||
} | ||
|
||
variables { | ||
ssh_username = "packer2" | ||
} | ||
|
||
locals { | ||
static_source_path = "static" | ||
static_dest_path = "/tmp/static" | ||
det_conf_dir = "/etc/determined" | ||
slurm_sysconfdir = "/usr/local/etc/slurm" | ||
launcher_job_root = "/var/tmp/launcher" | ||
|
||
launcher_deb_name = "hpe-hpc-launcher_3.3.0-0_amd64.deb" | ||
launcher_deb_dest_path = "${local.static_dest_path}/${local.launcher_deb_name}" | ||
|
||
slurm_conf_name = "slurm.conf" | ||
slurm_conf_tmp_path = "${local.static_dest_path}/${local.slurm_conf_name}" | ||
slurm_conf_dest_path = "${local.slurm_sysconfdir}/${local.slurm_conf_name}" | ||
|
||
slurm_cgroup_conf_name = "cgroup.conf" | ||
slurm_cgroup_conf_tmp_path = "${local.static_dest_path}/${local.slurm_cgroup_conf_name}" | ||
slurm_cgroup_conf_dest_path = "${local.slurm_sysconfdir}/${local.slurm_cgroup_conf_name}" | ||
|
||
det_master_conf_name = "master.yaml" | ||
det_master_conf_tmp_path = "${local.static_dest_path}/${local.det_master_conf_name}" | ||
det_master_conf_dest_path = "${local.det_conf_dir}/${local.det_master_conf_name}" | ||
} | ||
|
||
source "googlecompute" "determined-hpc-image" { | ||
project_id = "determined-ai" | ||
source_image_project_id = ["schedmd-slurm-public"] | ||
source_image_family = "schedmd-v5-slurm-22-05-8-ubuntu-2204-lts" | ||
|
||
image_family = "det-environments-slurm-ci" | ||
image_name = "det-environments-slurm-ci-{{timestamp}}" | ||
image_description = "det environments with hpc tools to test hpc deployments" | ||
|
||
machine_type = "n1-standard-1" | ||
disk_size = "48" | ||
// us-central1-c seems to be much faster/more reliable. had intermittent failures in us-west1-b | ||
// with IAP Tunnels being slow to come up. | ||
zone = "us-central1-c" | ||
subnetwork = "default" | ||
metadata = { "block-project-ssh-keys" : "true" } | ||
omit_external_ip = true | ||
use_internal_ip = true | ||
use_iap = true | ||
// ssh_username cannot be 'packer' due to issues with nested packer builds (schedmd-slurm-public | ||
// images are all built with packer), ssh_clear_authorized_keys and how GCP metadata based | ||
// ssh-keys are provisioned. | ||
ssh_username = var.ssh_username | ||
temporary_key_pair_type = "ed25519" | ||
ssh_clear_authorized_keys = true | ||
} | ||
|
||
build { | ||
name = "determined-hpc-image" | ||
sources = ["sources.googlecompute.determined-hpc-image"] | ||
|
||
provisioner "file" { | ||
source = local.static_source_path | ||
destination = local.static_dest_path | ||
} | ||
|
||
provisioner "shell" { | ||
inline = [ | ||
"sudo mv ${local.slurm_conf_tmp_path} ${local.slurm_conf_dest_path}", | ||
"sudo mv ${local.slurm_cgroup_conf_tmp_path} ${local.slurm_cgroup_conf_dest_path}", | ||
"sudo mkdir -p ${local.det_conf_dir}", | ||
"sudo mv ${local.det_master_conf_tmp_path} ${local.det_master_conf_dest_path}", | ||
"sudo mkdir -p ${local.launcher_job_root}" | ||
] | ||
} | ||
|
||
provisioner "shell" { | ||
script = "scripts/install-ansible.sh" | ||
} | ||
|
||
provisioner "ansible-local" { | ||
playbook_file = "ansible-playbook.yml" | ||
extra_arguments = ["--verbose", "--extra-vars \"launcher_deb=${local.launcher_deb_dest_path}\""] | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
sudo apt update -y | ||
sudo apt install -y software-properties-common | ||
sudo add-apt-repository --yes --update ppa:ansible/ansible | ||
sudo apt install -y ansible |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
CgroupAutomount=yes |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
resource_manager: | ||
type: slurm | ||
port: 8081 | ||
protocol: http | ||
job_storage_root: /var/tmp | ||
container_run_type: singularity | ||
launcher_jvm_args: -Xms1024m -Xmx2048m |