Troubleshooting Tasks in Codecollection: 128
Codebundles in Codecollection: 45
This repository is a codecollection for use within the RunWhen platform. It contains codebundles that can be used in SLIs and TaskSets.
Please see the contributing guide and the code of conduct for details on adding your contributions to this project.
Documentation for each codebundle is maintained in the README.md alongside the robot code and is published at https://docs.runwhen.com/public/v/codebundles/. Please see the README how-to for details on crafting a codebundle README that can be indexed.
Head over to our centralized documentation for detailed information on getting started.
File Structure overview of devcontainer:
-/app/
|- auth/ # store secrets here, it should already be properly gitignored for you
|- codecollection/
| |- codebundles/ # stores codebundles that can be run during development
| |- libraries/ # stores python keyword libraries used by codebundles
|- dev_facade/ # provides interfaces equivalent to those used on the platform, but just dry runs the keywords to assist with development
...
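To make the layout above concrete, here is a minimal Python sketch that lists the codebundles found under a codebundles directory. It assumes each codebundle is a subdirectory holding a `runbook.robot` (as in the `curl-http-ok` example in this README) or an `sli.robot`; the `sli.robot` filename and the `find_codebundles` helper are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def find_codebundles(root, robot_files=("runbook.robot", "sli.robot")):
    """Map each codebundle directory name to the robot files it contains.

    `robot_files` is an assumption about how codebundles are laid out;
    adjust it to match the files actually present in your checkout.
    """
    bundles = {}
    root = Path(root)
    if not root.is_dir():
        return bundles
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # skip stray files next to the codebundle directories
        present = [name for name in robot_files if (entry / name).is_file()]
        if present:
            bundles[entry.name] = present
    return bundles
```

Inside the devcontainer, calling `find_codebundles("/app/codecollection/codebundles")` would return a dictionary keyed by codebundle directory name, or an empty dictionary if the path does not exist.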
The included script `ro` wraps the `robot` Robot Framework binary and adds some extra functionality: it writes output files to a consistent location so they can be viewed through an HTTP server at http://localhost:3000/ that is always running as part of the devcontainer.
Navigate to the codebundle directory
cd codecollection/codebundles/curl-http-ok/
Run the codebundle
ro runbook.robot
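The two steps above can also be driven from a script, for example when running several codebundles in a row. The sketch below is one possible way to do that, assuming `ro` is on the `PATH` inside the devcontainer; the helper names `build_ro_command` and `run_codebundle` are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path


def build_ro_command(robot_file="runbook.robot"):
    """Return the argument list for the `ro` wrapper script."""
    return ["ro", robot_file]


def run_codebundle(bundle_dir, robot_file="runbook.robot"):
    """Run `ro <robot_file>` from inside bundle_dir and return its exit code."""
    bundle_dir = Path(bundle_dir)
    if not (bundle_dir / robot_file).is_file():
        raise FileNotFoundError(f"{robot_file} not found in {bundle_dir}")
    # cwd= plays the role of the `cd` step shown above.
    completed = subprocess.run(
        build_ro_command(robot_file), cwd=bundle_dir, check=False
    )
    return completed.returncode
```

For example, `run_codebundle("codecollection/codebundles/curl-http-ok")` mirrors the manual commands. If `ro` is not installed, `subprocess.run` itself raises `FileNotFoundError`.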
Name | Supported Integrations | Tasks | Documentation |
---|---|---|---|
AWS CloudWatch Overutilized EC2 Inspection | AWS, CloudWatch | Check For Overutilized Ec2 Instances | Queries AWS CloudWatch for a list of EC2 instances with a high amount of resource utilization, raising issues when overutilized instances are found. Docs |
AWS EKS Nodegroup Status Check | AWS, EKS | Check EKS Nodegroup Status | Queries a node group within an EKS cluster to check whether the nodegroup has degraded service, indicating ongoing reboots or other issues. Docs |
Azure Internal LoadBalancer Triage | Kubernetes, AKS, Azure | Health Check Internal Azure Load Balancer | Triages issues related to an Azure Load Balancer and its activity logs. Docs |
Azure Monitor Activity Log SLI | Kubernetes, AKS, Azure | Run Azure Monitor Activity Log Triage | Measures the count of error activity log entries as an SLI metric for the Azure tenancy. Docs |
Azure Monitor Event Triage | Kubernetes, AKS, Azure | Run Azure Monitor Activity Log Triage | Triages issues related to Azure Load Balancers, Kubernetes ingress objects, and services. Docs |
GCP Gcloud Log Inspection | GCP, Gcloud, Google Monitoring | Inspect GCP Logs For Common Errors | Fetches logs from GCP using a configurable query and raises an issue with details on the most common issues. Docs |
GCP Node Preempt List | GCP, GKE | Count the number of nodes in active preempt operation | Checks whether any GCP nodes have an active preempt operation. Docs |
GKE Kong Ingress Host Triage | GCP, GMP, Ingress, Kong, Metrics | Check If Kong Ingress HTTP Error Rate Violates HTTP Error Threshold, Check If Kong Ingress HTTP Request Latency Violates Threshold, Check If Kong Ingress Controller Reports Upstream Errors | Collects Kong ingress host metrics from GMP on GCP, inspects the results for ingresses with an HTTP error code rate greater than zero over a configurable duration, and raises issues based on the number of ingresses with error codes. Docs |
GKE Nginx Ingress Host Triage | GCP, GMP, Ingress, Nginx, Metrics | Fetch Nginx HTTP Errors From GMP for Ingress ${INGRESS_OBJECT_NAME}, Find Owner and Service Health for Ingress ${INGRESS_OBJECT_NAME} | Collects Nginx ingress host controller metrics from GMP on GCP, inspects the results for ingresses with an HTTP error code rate greater than zero over a configurable duration, and raises issues based on the number of ingresses with error codes. Docs |
Kubeprometheus Operator Troubleshoot | Kubernetes, AKS, EKS, GKE, OpenShift, Prometheus | Check Prometheus Service Monitors, Check For Successful Rule Setup, Verify Prometheus RBAC Can Access ServiceMonitors, Identify Endpoint Scraping Errors, Check Prometheus API Healthy | This taskset investigates the logs, state, and health of the Kubernetes Prometheus operator. Docs |
Kubernetes Application Monitor | Kubernetes, AKS, EKS, GKE, OpenShift | Measure Application Exceptions | Measures the number of exception stacktraces present in an application's logs over a time period. Docs |
Kubernetes Application Troubleshoot | Kubernetes, AKS, EKS, GKE, OpenShift | Get Workload Logs, Scan For Misconfigured Environment, Troubleshoot Application Logs | Triages issues related to a deployment and its replicas. Docs |
Kubernetes ArgoCD Application Health & Troubleshoot | Kubernetes, AKS, EKS, GKE, OpenShift, ArgoCD | Fetch ArgoCD Application Sync Status & Health, Fetch ArgoCD Application Last Sync Operation Details, Fetch Unhealthy ArgoCD Application Resources, Scan For Errors in Pod Logs Related to ArgoCD Application Deployments, Fully Describe ArgoCD Application | This taskset collects information and runs general troubleshooting checks against ArgoCD application objects within a namespace. Docs |
Kubernetes ArgoCD HelmRelease TaskSet | Kubernetes, AKS, EKS, GKE, OpenShift, ArgoCD | Fetch all available ArgoCD Helm releases, Fetch Installed ArgoCD Helm release versions | This codebundle runs a series of tasks to identify potential Helm release issues related to ArgoCD-managed Helm objects. Docs |
Kubernetes Artifactory Triage | Kubernetes, AKS, EKS, GKE, OpenShift, Artifactory | Check Artifactory Liveness and Readiness Endpoints | Performs a triage on the Open Source version of Artifactory in a Kubernetes cluster. Docs |
Kubernetes CertManager Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift | Get Health Score of CertManager Workloads | Checks the health of pods deployed by cert-manager. Docs |
Kubernetes Daemonset Triage | Kubernetes, AKS, EKS, GKE, OpenShift | Get DaemonSet Log Details For Report, Get Related Daemonset Events, Check Daemonset Replicas | Triages issues related to a Daemonset and its available replicas. Docs |
Kubernetes Deployment Triage | Kubernetes, AKS, EKS, GKE, OpenShift | Check Deployment Log For Issues with ${DEPLOYMENT_NAME}, Check Liveness Probe Configuration for Deployment ${DEPLOYMENT_NAME}, Check Readiness Probe Configuration for Deployment ${DEPLOYMENT_NAME}, Troubleshoot Deployment Warning Events for ${DEPLOYMENT_NAME}, Get Deployment Workload Details For ${DEPLOYMENT_NAME} and Add to Report, Troubleshoot Deployment Replicas for ${DEPLOYMENT_NAME}, Check Deployment Event Anomalies for ${DEPLOYMENT_NAME} | Triages issues related to a deployment and its replicas. Docs |
Kubernetes Flux Chaos Testing | Kubernetes, AKS, EKS, GKE, OpenShift | Suspend the Flux Resource Reconciliation, Find Random FluxCD Workload as Chaos Target, Execute Chaos Command, Execute Additional Chaos Command, Resume Flux Resource Reconciliation | This taskset is used to suspend a Flux resource for the purposes of executing chaos tasks. Docs |
Kubernetes FluxCD HelmRelease TaskSet | Kubernetes, AKS, EKS, GKE, OpenShift, FluxCD | List all available FluxCD Helmreleases, Fetch Installed FluxCD Helmrelease Versions, Fetch Mismatched FluxCD HelmRelease Version, Fetch FluxCD HelmRelease Error Messages, Check for Available Helm Chart Updates | This codebundle runs a series of tasks to identify potential Helm release issues related to Flux-managed Helm objects. Docs |
Kubernetes FluxCD Kustomization TaskSet | Kubernetes, AKS, EKS, GKE, OpenShift, FluxCD | List all available Kustomization objects, Get details for unready Kustomizations | This codebundle runs a series of tasks to identify potential Kustomization issues related to Flux-managed Kustomization objects. Docs |
Kubernetes Grafana Loki Health Check | k8s | Check Loki Ring API, Check Loki API Ready | This taskset checks the health of Grafana Loki and its hash ring. Docs |
Kubernetes Image Check | Kubernetes, AKS, EKS, GKE, OpenShift | Check Image Rollover Times for Namespace ${NAMESPACE}, List Images and Tags for Every Container in Running Pods for Namespace ${NAMESPACE}, List Images and Tags for Every Container in Failed Pods for Namespace ${NAMESPACE}, List ImagePullBackOff Events and Test Path and Tags for Namespace ${NAMESPACE} | This taskset provides detailed information about the images used in a Kubernetes namespace. Docs |
Kubernetes Ingress GCE & GCP HTTP Load Balancer Healthcheck | Kubernetes, GKE, GCE, GCP | Search For GCE Ingress Warnings in GKE, Identify Unhealthy GCE HTTP Ingress Backends, Validate GCP HTTP Load Balancer Configurations, Fetch Network Error Logs from GCP Operations Manager for Ingress Backends, Review GCP Operations Logging Dashboard | Troubleshoots GCE Ingress resources related to the GCP HTTP Load Balancer in GKE. Docs |
Kubernetes Ingress Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift | Fetch Ingress Object Health in Namespace ${NAMESPACE}, Check for Ingress and Service Conflicts in Namespace ${NAMESPACE} | Triages issues related to ingress objects and services. Docs |
Kubernetes Jenkins Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift, Jenkins | Query The Jenkins Kubernetes Workload HTTP Endpoint, Query For Stuck Jenkins Jobs | This taskset collects information about persistent volumes and persistent volume claims to validate health or help troubleshoot potential issues. Docs |
Kubernetes Labeled Pod Count | Kubernetes, AKS, EKS, GKE, OpenShift | Measure Number of Running Pods with Label | This codebundle fetches the number of running pods with the set of provided labels, letting you measure the number of running pods. Docs |
Kubernetes Namespace Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift | Get Event Count and Score, Get Container Restarts and Score, Get NotReady Pods, Generate Namespace Score | This SLI uses kubectl to score namespace health. It produces a value between 0 (completely failing the test) and 1 (fully passing the test), and looks for container restarts, events, and pods that are not ready. Docs |
Kubernetes Namespace Troubleshoot | Kubernetes, AKS, EKS, GKE, OpenShift | Troubleshoot Warning Events in Namespace ${NAMESPACE}, Troubleshoot Container Restarts In Namespace ${NAMESPACE}, Troubleshoot Pending Pods In Namespace ${NAMESPACE}, Troubleshoot Failed Pods In Namespace ${NAMESPACE}, Troubleshoot Workload Status Conditions In Namespace ${NAMESPACE}, Get Listing Of Resources In Namespace ${NAMESPACE}, Check Event Anomalies in Namespace ${NAMESPACE}, Troubleshoot Services And Application Workloads in Namespace ${NAMESPACE}, Check Missing or Risky PodDisruptionBudget Policies in Namespace ${NAMESPACE} | This taskset runs general troubleshooting checks against all applicable objects in a namespace. It looks for warning events, odd or frequent normal events, restarting containers, and failed or pending pods. Docs |
Kubernetes Persistent Volume Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift | Fetch Events for Unhealthy Kubernetes PersistentVolumeClaims in Namespace ${NAMESPACE}, List PersistentVolumeClaims in Terminating State in Namespace ${NAMESPACE}, List PersistentVolumes in Terminating State in Namespace ${NAMESPACE}, List Pods with Attached Volumes and Related PersistentVolume Details in Namespace ${NAMESPACE}, Fetch the Storage Utilization for PVC Mounts in Namespace ${NAMESPACE}, Check for RWO Persistent Volume Node Attachment Issues in Namespace ${NAMESPACE} | This taskset collects information about storage such as PersistentVolumes and PersistentVolumeClaims to validate health or help troubleshoot potential storage issues. Docs |
Kubernetes Pod Resources Scan | Kubernetes, AKS, EKS, GKE, OpenShift | Show Pods Without Resource Limit or Resource Requests Set in Namespace ${NAMESPACE}, Get Pod Resource Utilization with Top in Namespace ${NAMESPACE} | Inspects the resources provisioned for a given set of pods, selected by their labels, and raises issues if no resources were specified. Docs |
Kubernetes Postgres Triage | AKS, EKS, GKE, Kubernetes, Patroni, Postgres | Get Standard Postgres Resource Information, Describe Postgres Custom Resources, Get Postgres Pod Logs & Events, Get Postgres Pod Resource Utilization, Get Running Postgres Configuration, Get Patroni Output, Run DB Queries | Runs multiple Kubernetes and psql commands to report on the health of a Postgres cluster. Docs |
Kubernetes Redis Healthcheck | Kubernetes, AKS, EKS, GKE, OpenShift, Redis | Ping ${DEPLOYMENT_NAME} Redis Workload, Verify ${DEPLOYMENT_NAME} Redis Read Write Operation | This taskset collects information on the Redis workload in your Kubernetes cluster and raises issues if any health checks fail. Docs |
Kubernetes Restart resource | Kubernetes, AKS, EKS, GKE, OpenShift | Get Current Resource State, Get Resource Logs, Restart Resource | This taskset restarts a resource with a given set of labels, typically used with other tasksets. Docs |
Kubernetes Service Account Check | Kubernetes, AKS, EKS, GKE, OpenShift, Redis | Test Service Account Access to Kubernetes API Server | This taskset provides tasks to troubleshoot service accounts in a Kubernetes namespace. Docs |
Kubernetes StatefulSet Triage | Kubernetes, AKS, EKS, GKE, OpenShift | Fetch StatefulSet ${STATEFULSET_NAME} Logs, Get Related StatefulSet ${STATEFULSET_NAME} Events, Fetch StatefulSet ${STATEFULSET_NAME} Manifest Details, List StatefulSets with Unhealthy Replica Counts In Namespace ${NAMESPACE} | Triages issues related to a StatefulSet and its replicas. Docs |
Kubernetes Vault Triage | AKS, EKS, GKE, Kubernetes, Vault | Fetch Vault CSI Driver Logs, Get Vault CSI Driver Warning Events, Check Vault CSI Driver Replicas, Fetch Vault Logs, Get Related Vault Events, Fetch Vault StatefulSet Manifest Details, Fetch Vault DaemonSet Manifest Details, Verify Vault Availability, Check Vault StatefulSet Replicas | A suite of tasks that can be used to triage potential issues in your Vault namespace. Docs |
Terraform Cloud Workspace Lock Check | Terraform Cloud | Checking whether the Terraform Cloud Workspace is in a locked state | Checks whether the Terraform Cloud Workspace is in a locked state. Docs |
Test Issues | Test | Raise Full Issue | A codebundle for testing the issues feature. Purely for testing flow. Docs |
cURL HTTP OK | Linux, macOS, Windows, HTTP | Checking HTTP URL Is Available And Timely | This taskset uses curl to validate the response code of the endpoint. Returns a score of 1 if healthy and 0 if unhealthy. Docs |
cli-test-taskset | cli | Run CLI and Parse Output For Issues, Exec Test, Local Process Test | This taskset smoketests the CLI codebundle setup and run process. Docs |
cmd-test-taskset | cmd | Run CLI Command, Run Bash File, Log Suggestion | This taskset smoketests the CLI codebundle setup and run process by running a bare command. Docs |
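
Several entries in the table reduce a check to a binary score. As an illustration of the cURL HTTP OK entry ("Available And Timely", 1 if healthy, 0 if unhealthy), the following Python sketch scores a single HTTP response. It is not the codebundle's implementation, which uses curl; the function names, the expected status of 200, and the one-second latency threshold are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request


def score_response(status_code, elapsed_seconds,
                   expected_status=200, max_seconds=1.0):
    """Return 1 when the response is both correct and timely, else 0.

    expected_status and max_seconds are hypothetical defaults.
    """
    available = status_code == expected_status
    timely = elapsed_seconds <= max_seconds
    return 1 if available and timely else 0


def check_url(url, expected_status=200, max_seconds=1.0):
    """Fetch url once and score it; a connection failure scores 0."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds * 5) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, or timeout
    elapsed = time.monotonic() - start
    return score_response(status, elapsed, expected_status, max_seconds)
```

Keeping the scoring rule in a pure function separate from the network call makes the threshold logic easy to test without a live endpoint.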