Slurm upgrade fixes (#473)
* removed slurm errors

* added bibilog to show the output log of the most recent worker start. Tried fixing the Slurm 23.11 bug.

* fixed a few vpnwkr -> vpngtw remnants. Excluded vpngtw from slurm setup

* improved comments regarding changes and versions

* removed CgroupAutomount as it is defunct

* Moved explicit slurm start to avoid errors caused by resume and suspend programs not being copied to their final location yet

* added word for clarification

* Fixed a non-fatal bug that led to non-zero exits on runs without any error.

* changed slurm apt package to slurm-bibigrid

* set version to 23.11.*

* added a few more checks to make sure everything is set up before installing packages

* Added configuration pinning

* changed ignore_error to failed_when false

* fixed or ignored lint fatals
XaverStiensmeier authored Apr 19, 2024
1 parent 7185ca1 commit ca6cab5
Showing 19 changed files with 155 additions and 59 deletions.
7 changes: 5 additions & 2 deletions bibigrid/core/actions/terminate.py
@@ -148,10 +148,12 @@ def delete_security_groups(provider, cluster_id, security_groups, log, timeout=5
tmp_success = False
while not tmp_success:
try:
# TODO: Check if security group exists at all
not_found = not provider.get_security_group(security_group_name)
tmp_success = provider.delete_security_group(security_group_name)
except ConflictException:
tmp_success = False
if tmp_success:
if tmp_success or not_found:
break
if attempts < timeout:
attempts += 1
Expand All @@ -162,7 +164,8 @@ def delete_security_groups(provider, cluster_id, security_groups, log, timeout=5
log.error(f"Attempt to delete security group {security_group_name} on "
f"{provider.cloud_specification['identifier']} failed.")
break
log.info(f"Delete security_group {security_group_name} -> {tmp_success}")
log.info(f"Delete security_group {security_group_name} -> {tmp_success or not_found} on "
f"{provider.cloud_specification['identifier']}.")
success = success and tmp_success
return success

8 changes: 8 additions & 0 deletions bibigrid/core/provider.py
@@ -265,6 +265,14 @@ def append_rules_to_security_group(self, name_or_id, rules):
@return:
"""

@abstractmethod
def get_security_group(self, name_or_id):
"""
Returns security group if found else None.
@param name_or_id:
@return:
"""

def get_mount_info_from_server(self, server):
volumes = []
for server_volume in server["volumes"]:
8 changes: 8 additions & 0 deletions bibigrid/openstack/openstack_provider.py
@@ -320,3 +320,11 @@ def append_rules_to_security_group(self, name_or_id, rules):
port_range_max=rule["port_range_max"],
remote_ip_prefix=rule["remote_ip_prefix"],
remote_group_id=rule["remote_group_id"])

def get_security_group(self, name_or_id):
"""
Returns security group if found else None.
@param name_or_id:
@return:
"""
return self.conn.get_security_group(name_or_id)
1 change: 1 addition & 0 deletions documentation/markdown/bibigrid_feature_list.md
@@ -13,5 +13,6 @@
| [Configuration](features/configuration.md) | Contains all data regarding cluster setup for all providers. |
| [Command Line Interface](features/CLI.md) | What command line arguments can be passed into BiBiGrid. |
| [Multi Cloud](features/multi_cloud.md) | Explanation of how BiBiGrid's multi-cloud approach works |
| [BiBiGrid Cluster Commands](features/cluster_commands.md) | Short, useful commands to get information on the cluster |

![](../images/actions.jpg)
54 changes: 54 additions & 0 deletions documentation/markdown/features/cluster_commands.md
@@ -0,0 +1,54 @@
# BiBiGrid Cluster Commands

## [bibiinfo](../../../resources/bin/bibiinfo)
Similar to `sinfo` but shows detailed information regarding node features.
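If `bibiinfo` is not at hand, plain `sinfo` can produce a rough equivalent; the format flags below are standard `sinfo` options, not necessarily what `bibiinfo` uses internally:

```sh
# partition, availability, time limit, node count, state, features, nodelist
sinfo -o "%P %a %l %D %T %f %N"
```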

## [bibilog](../../../resources/bin/bibilog)
`bibilog` executes `tail -f` on the output log of the most recent worker creation.
This helps you understand any worker startup issues.
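Going by the arguments handled in the `bibilog` script added in this commit, usage looks like this:

```sh
bibilog            # follow the stdout log of the most recent worker creation
bibilog err        # follow the stderr log instead
bibilog err fail   # follow the stderr log of the most recent failed worker start
```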

## [bibiplay](../../../resources/bin/bibiplay)
`bibiplay` is mainly a shortcut for `ansible-playbook /opt/playbook/site.yml -i /opt/playbook/ansible_hosts`,
which lets you execute the Ansible playbook more conveniently.

### Examples
You have changed something in the common configuration and want to propagate this change to the master.
```sh
bibiplay -l master
# executes the playbook only for the master
```

You have changed something in the slurm configuration and want to propagate this change to the master.
```sh
bibiplay -l master -t slurm
```

## [bibiname](../../../resources/playbook/roles/bibigrid/templates/bin/bibiname.j2) [m|v|w (default)] [number]

This command generates node names so you don't have to copy the cluster-id manually.
It takes two arguments: the first selects whether a master, vpngtw or worker name is generated (worker is the default).
The second argument - only relevant for vpngtw and worker - selects which vpngtw or worker is meant.

### Examples
Assume the cluster-id `20ozebsutekrjj4`.

```sh
bibiname m
# bibigrid-master-20ozebsutekrjj4
```

```sh
bibiname v 0
# bibigrid-vpngtw-20ozebsutekrjj4-0
```

```sh
bibiname 0 # or bibiname w 0
# bibigrid-worker-20ozebsutekrjj4-0
```

A more advanced use is to log in to a worker using the generated name:
```sh
ssh $(bibiname 0) # or bibiname w 0
# ssh bibigrid-worker-20ozebsutekrjj4-0
```
Binary file not shown.
File renamed without changes.
16 changes: 16 additions & 0 deletions resources/bin/bibilog
@@ -0,0 +1,16 @@
#!/bin/bash
# Select stdout ("out", default) or stderr ("err") logs via the first argument.
if [ "$1" == "err" ]; then
    err_out="err"
else
    err_out="out"
fi

# Select logs of failed ("fail") or regular ("create", default) worker starts via the second argument.
if [ "$2" == "fail" ]; then
    fail_create="fail"
else
    fail_create="create"
fi

# Follow the most recently written log in the selected directory.
LOG="/var/log/slurm/worker_logs/$fail_create/$err_out"
RECENT=$(ls -1rt "$LOG" | tail -n1)
tail -f "$LOG/$RECENT"
13 changes: 5 additions & 8 deletions resources/defaults/slurm/slurm.conf
@@ -9,17 +9,15 @@ AuthAltParameters=jwt_key=/etc/slurm/jwt-secret.key
ClusterName=bibigrid

MpiDefault=none
ProctrackType=proctrack/linuxproc
ProctrackType=proctrack/cgroup # linuxproc # changed for 23.11.0
ReturnToService=2
SwitchType=switch/none
TaskPlugin=task/none
#TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/linux

# see https://slurm.schedmd.com/slurm.conf.html#OPT_cloud_dns:~:text=for%20additional%20details.-,cloud_dns,-By%20default%2C%20Slurm
# SlurmctldParameters=cloud_dns
# Doesn't work as assumed: with this parameter slurmctld already tries to resolve all clients at startup,
# which of course fails.
# didn't work as expected: slurmctld tries to resolve all clients on startup, which obviously doesn't work

# PRIORITY
PriorityType=priority/multifactor
@@ -37,7 +35,6 @@ SlurmctldPort=6817
SlurmdPort=6818

# DIRECTORIES
#JobCheckpointDir=/var/lib/slurm/job_checkpoint
SlurmdSpoolDir=/var/lib/slurm/slurmd
StateSaveLocation=/var/lib/slurm/state_checkpoint

@@ -61,7 +58,7 @@ AccountingStorageHost={{ hostvars[groups.master.0].name | lower }}
AccountingStorageUser={{ slurm_conf.db_user }}

# LOGGING
SlurmctldDebug=info
SlurmctldDebug=debug # info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
@@ -102,7 +99,7 @@ SuspendExcNodes={{ hostvars[groups.master.0].name }}
# Maximum number of nodes
TreeWidth= {{ slurm_conf.elastic_scheduling.TreeWidth }}
# Do not cache dns names
CommunicationParameters=NoAddrCache
# CommunicationParameters=NoAddrCache # REMOVED for 23.11.0
# Mark node status idle on suspend so DOWN is removed
SlurmctldParameters=idle_on_node_suspend
# Show slurm nodes all the time
@@ -113,4 +110,4 @@ ResumeFailProgram=/opt/slurm/fail.sh
# job container
# TO BE TESTED
JobContainerType=job_container/tmpfs
PrologFlags=Contain
PrologFlags=Contain
2 changes: 1 addition & 1 deletion resources/playbook/roles/bibigrid/files/slurm/cgroup.conf
@@ -1,6 +1,6 @@
# maybe this causes errors when using 23.11 https://slurm.schedmd.com/faq.html#cgroupv2
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
# CgroupAutomount=yes # REMOVED 23.11.0
ConstrainCores=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
2 changes: 1 addition & 1 deletion resources/playbook/roles/bibigrid/files/slurm/create.sh
@@ -10,7 +10,7 @@ process_string() {
fifth=${elements[4]}

# Replace undesired characters in the second element
second=$(echo "$second" | sed -E 's/worker-/worker_/; s/vpnwkr-/vpnwkr_/')
second=$(echo "$second" | sed -E 's/worker-/worker_/; s/vpngtw-/vpngtw_/')

# Check if the fifth element is not empty
if [[ ! -z $fifth ]]; then
@@ -273,13 +273,6 @@ def _run_playbook(cmdline_args):
sys.exit(1)
else:
logging.info(ansible_execution_data)
server_start_data = {"started_servers": [], "other_openstack_exceptions": [], "connection_exceptions": [],
"available_servers": [], "openstack_wait_exceptions": []}
if [key for key in server_start_data if "exception" in key]:
logging.warning(server_start_data)
sys.exit(1)
else:
logging.info(server_start_data)

logging.info("Successful create_server.py execution!")
time_in_s = time.time() - start_time
2 changes: 1 addition & 1 deletion resources/playbook/roles/bibigrid/files/slurm/fail.sh
@@ -10,7 +10,7 @@ process_string() {
fifth=${elements[4]}

# Replace undesired characters in the second element
second=$(echo "$second" | sed -E 's/worker-/worker_/; s/vpnwkr-/vpnwkr_/')
second=$(echo "$second" | sed -E 's/worker-/worker_/; s/vpngtw-/vpngtw_/')

# Check if the fifth element is not empty
if [[ ! -z $fifth ]]; then
1 change: 1 addition & 0 deletions resources/playbook/roles/bibigrid/handlers/main.yml
@@ -28,6 +28,7 @@
systemd:
name: slurmctld
state: restarted
when: "'master' in group_names"

- name: slurmd
systemd:
9 changes: 9 additions & 0 deletions resources/playbook/roles/bibigrid/tasks/001-apt.yml
@@ -10,6 +10,15 @@
group: root
mode: 0644

- name: Wait for cloud-init / user-data to finish
command: cloud-init status --wait
changed_when: false

- name: Wait for /var/lib/dpkg/lock-frontend to be released
shell: while lsof /var/lib/dpkg/lock-frontend ; do sleep 10; done;
tags:
- skip_ansible_lint

- name: Wait for post-launch services to stop
service_facts:
register: result
7 changes: 4 additions & 3 deletions resources/playbook/roles/bibigrid/tasks/020-disk-server.yml
@@ -17,7 +17,9 @@
- "{{ master.disks }}"
when: master.disks is defined

- block:
- when: volumes is defined and auto_mount
failed_when: false
block:
- name: Make sure disks are available
filesystem:
fstype: ext4
@@ -36,10 +38,9 @@
with_items: "{{ volumes }}"

- name: Mount disks

mount:
path: "{{ item.name }}"
src: "{{ item.device }}"
state: mounted
with_items: "{{ volumes }}"
when: volumes is defined and auto_mount
ignore_errors: true
50 changes: 25 additions & 25 deletions resources/playbook/roles/bibigrid/tasks/042-slurm-server.yml
@@ -70,22 +70,6 @@
- slurmdbd
- slurmrestd

- name: Enable slurmdbd and slurmrestd services
systemd:
name: "{{ item }}"
enabled: true
masked: false
state: started
daemon_reload: true
with_items:
- slurmdbd
- slurmrestd

- name: Start slurm explicit after all dependencies are configured
systemd:
name: slurmctld
state: started

- name: Register Slurm users home dir
shell: "set -o pipefail && grep slurm /etc/passwd | cut -d ':' -f 6"
register: slurm_home
@@ -180,6 +164,31 @@
groups:
- ansible

- name: Generate location specific worker userdata
template:
src: slurm/worker_userdata.j2
dest: "/opt/slurm/userdata_{{ hostvars[item].cloud_identifier }}.txt"
owner: slurm
group: ansible
mode: "0640"
with_items: "{{ groups.vpngtw + groups.master }}"

- name: Enable slurmdbd and slurmrestd services
systemd:
name: "{{ item }}"
enabled: true
masked: false
state: started
daemon_reload: true
with_items:
- slurmdbd
- slurmrestd

- name: Start slurm explicit after all dependencies are configured
systemd:
name: slurmctld
state: started


- when: slurm_home.stdout != '/opt/slurm'
block:
@@ -210,12 +219,3 @@
- slurmd
- slurmdbd
- slurmrestd

- name: Generate location specific worker userdata
template:
src: slurm/worker_userdata.j2
dest: "/opt/slurm/userdata_{{ hostvars[item].cloud_identifier }}.txt"
owner: slurm
group: ansible
mode: "0640"
with_items: "{{ groups.vpngtw + groups.master }}"
26 changes: 15 additions & 11 deletions resources/playbook/roles/bibigrid/tasks/042-slurm.yml
@@ -9,21 +9,25 @@
uid: 64030
group: slurm

- name: Install Slurm package (and dependencies)
- name: Create pinning configuration for slurm-bibigrid version 23.11.*
copy:
content: |
Package: slurm-bibigrid
Pin: version 23.11.*
Pin-Priority: 1001
dest: /etc/apt/preferences.d/slurm-bibigrid
mode: '0311'

- name: Install slurm-bibigrid package
apt:
name: slurm-bibigrid
state: present

- name: Install Slurm package dependencies
apt:
name:
- slurm-full
- munge

# - name: Download Slurm (TEMPORARY)
# get_url:
# url: "https://docs.cebitec.uni-bielefeld.de/s/FjCP3xQPPnBwSy9/download?path=%2F&files=slurm-full_23.11.0-0_amd64.deb" # Replace with your package link
# dest: "/tmp/package.deb" # Destination where the package will be saved
# - name: Install Slurm package
# apt:
# deb: "/tmp/package.deb"
# state: present # Install the package if not already installed

- name: Create new secret (Munge)
copy:
content: '{{ slurm_conf.munge_key }}'
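A Pin-Priority above 1000 makes apt install the pinned 23.11.* version of `slurm-bibigrid` even if that means downgrading. Whether the pin is active on a node can be checked with a standard apt query (a generic check, not something this playbook runs):

```sh
apt-cache policy slurm-bibigrid
# the candidate should be a 23.11.* version and the version table
# should list priority 1001 for it
```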
1 change: 1 addition & 0 deletions resources/playbook/roles/bibigrid/tasks/main.yml
@@ -136,6 +136,7 @@
- debug:
msg: "[BIBIGRID] Setup Slurm"
- import_tasks: 042-slurm.yml
when: "'vpngtw' not in group_names"
- import_tasks: 042-slurm-server.yml
when: "'master' in group_names"
