Fail early if any locally attached physical disks are not ok
Flash memory storage can also fail, typically when erase cycles have
exceeded a threshold.

In this commit we add a fixture that checks all locally attached
physical disks across all nodes, via smartctl, and fails
if the self-assessment result is not OK.

This check can be skipped through the FIXTURES env var.

Relates: elastic#153
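
For reference, the check boils down to inspecting each drive's SMART self-assessment. A healthy ATA drive reports a PASSED result (the device name below is illustrative):

    $ sudo smartctl -a /dev/sda | grep 'overall-health'
    SMART overall-health self-assessment test result: PASSED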
dliappis authored Dec 4, 2019
1 parent 0e6452f commit 6eb0203
Showing 6 changed files with 48 additions and 1 deletion.
3 changes: 2 additions & 1 deletion night_rally.sh
@@ -32,7 +32,7 @@ if [[ -n $VAULT_SECRET_ID && -n $VAULT_ROLE_ID ]]; then
fi
export RALLY_METRICS_STORE_CREDENTIAL_PATH=${RALLY_METRICS_STORE_CREDENTIAL_PATH:-"/secret/rally/cloud/nightly-rally-metrics"}

-ANSIBLE_ALL_TAGS=(encryption-at-rest initialize-data-disk trim drop-caches)
+ANSIBLE_ALL_TAGS=(check-drive-health encryption-at-rest initialize-data-disk trim drop-caches)
ANSIBLE_SKIP_TAGS=( )
ANSIBLE_SKIP_TAGS_STRING=""
# Don't update night-rally by default, unless specified by env var
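
The hunk above only registers the new check-drive-health tag; the derivation of skip tags from FIXTURES is not part of this diff. A minimal sketch of how such a mapping could work, assuming FIXTURES is a comma-separated list of fixture names to run:

    # Hypothetical sketch; the real logic lives elsewhere in night_rally.sh.
    for tag in "${ANSIBLE_ALL_TAGS[@]}"; do
        if [[ ",${FIXTURES}," != *",${tag},"* ]]; then
            ANSIBLE_SKIP_TAGS+=("$tag")   # fixtures not requested get skipped
        fi
    done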
@@ -206,6 +206,7 @@ then

cd ${NIGHT_RALLY_HOME}/night_rally/fixtures/ansible
ansible-playbook -i inventory/production -u rally playbooks/update-rally.yml --extra-vars="rally_environment=${RALLY_ENVIRONMENT} in_vagrant=${IN_VAGRANT} skip_rally_update=${SKIP_RALLY_UPDATE}"
+ansible-playbook -i inventory/production -u rally playbooks/check-drive-health.yml ${ANSIBLE_SKIP_TAGS_STRING} --extra-vars="in_vagrant=${IN_VAGRANT}"
ansible-playbook -i inventory/production -u rally playbooks/setup.yml ${ANSIBLE_SKIP_TAGS_STRING}

popd >/dev/null 2>&1
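
When check-drive-health is excluded via FIXTURES, ${ANSIBLE_SKIP_TAGS_STRING} presumably expands to a --skip-tags argument, making the new invocation a no-op, roughly:

    ansible-playbook -i inventory/production -u rally playbooks/check-drive-health.yml \
        --skip-tags check-drive-health --extra-vars="in_vagrant=${IN_VAGRANT}"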
9 changes: 9 additions & 0 deletions night_rally/fixtures/ansible/playbooks/check-drive-health.yml
@@ -0,0 +1,9 @@
---

# NOTE: This is also a fixture, but needs to target all hosts, so doesn't live under setup.yml
- name: check disk health on all hosts
# ====================================
hosts: all
gather_facts: true
roles:
- { role: check-drive-health, tags: check-drive-health }
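
The playbook can also be run standalone, for example to check a single node; --limit is standard ansible-playbook, and the host pattern below is illustrative:

    ansible-playbook -i inventory/production -u rally \
        playbooks/check-drive-health.yml --limit some-target-host \
        --extra-vars="in_vagrant=false"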
@@ -0,0 +1 @@
Role to check health of locally attached physical disks via smartctl
@@ -0,0 +1,9 @@
---

- name: Fail if any locally attached disks don't pass the smartctl health check
command: smartctl -a /dev/{{ item.name }}
register: smartctloutput
failed_when: "'SMART overall-health self-assessment test result: PASSED' not in smartctloutput.stdout"
changed_when: false
loop: "{{ (phys_disks.stdout | from_json).blockdevices | flatten(levels=1) }}"
when: not in_vagrant | default("false") | bool
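
Each loop iteration effectively runs the following shell equivalent (device name illustrative); the task fails unless the exact PASSED line appears in the output:

    smartctl -a /dev/sda | grep -F 'SMART overall-health self-assessment test result: PASSED'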
@@ -0,0 +1,13 @@
---

- name: Install system packages for lsblk and smartctl
package:
name: "{{ item }}"
state: present
loop:
- util-linux
- smartmontools

- name: Retrieve list of locally attached physical disks
shell: lsblk --json --nodeps
register: phys_disks
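
The registered phys_disks variable holds lsblk's JSON output, whose blockdevices array drives the loop in the health-check task. An abridged example (field formatting varies by util-linux version):

    $ lsblk --json --nodeps
    {
       "blockdevices": [
          {"name": "sda", "maj:min": "8:0", "rm": false, "size": "1.8T",
           "ro": false, "type": "disk", "mountpoint": null}
       ]
    }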
@@ -0,0 +1,14 @@
---

- set_fact:
system_tasks: "{{ lookup('first_found', params) }}"
vars:
params:
files:
- "{{ ansible_os_family | lower }}/main.yml"
- unsupported.yml

- block:
- include_tasks: "{{ system_tasks }}"
- include_tasks: "common/check_health.yml"
become: true
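
The first_found lookup implies a per-OS-family task layout roughly like the following; the directory and file names below are assumptions based on the lookup paths and are not shown in this diff:

    roles/check-drive-health/tasks/
    ├── main.yml                 # the file above
    ├── debian/main.yml          # e.g. for ansible_os_family == "Debian"
    ├── redhat/main.yml          # e.g. for ansible_os_family == "RedHat"
    ├── unsupported.yml          # fallback when no OS-family file matches
    └── common/check_health.yml  # the smartctl check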
