Commit 2988515: Updated metrics readme and added dashboard info (#171)
Dave Storey, authored Mar 3, 2020. 1 parent (0978e28). 3 changed files with 55 additions and 50 deletions.
# Azure-Databricks-Operator Metrics

To help diagnose issues, the operator exposes a set of [Prometheus metrics](https://prometheus.io/). This repo also includes a ServiceMonitor definition (`yaml`) that can be deployed so that an existing (or new) Prometheus deployment can scrape these metrics.
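As a rough sketch of what such a definition looks like (the resource name and labels below are illustrative assumptions following kubebuilder defaults; the file shipped in this repo is authoritative):

```yaml
# Illustrative ServiceMonitor; field values are assumptions, not the repo's exact file.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: controller-manager-metrics-monitor
  labels:
    control-plane: controller-manager
spec:
  endpoints:
    - path: /metrics
      port: https
  selector:
    matchLabels:
      control-plane: controller-manager
```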

## Operator metrics

- The Operator's Prometheus metrics output is enabled by customizing `config/default/kustomization.yaml` (see the sketch after this list)
- If you don't want Prometheus-Operator configuration generated, disable it by commenting out the line indicated in `config/default/kustomization.yaml`
> *NOTE:* If you don't have Prometheus-Operator installed, the ServiceMonitor CRD will not be available to you. See the section below for more information about installation.
- All custom operator metrics exposed on the metrics endpoint are prefixed with `databricks_`
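As an illustrative sketch only (the exact file contents depend on your kubebuilder version), the relevant part of `config/default/kustomization.yaml` looks roughly like this:

```yaml
# config/default/kustomization.yaml (illustrative excerpt, not the repo's exact file)
bases:
- ../crd
- ../rbac
- ../manager
# [PROMETHEUS] Comment out the following line if you do not want the
# Prometheus-Operator ServiceMonitor configuration to be generated.
- ../prometheus
```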

In addition to the standard metrics that kubebuilder provides, the following custom metrics have been added.

The `databricks_request_duration_seconds` histogram records the duration of calls made via the Databricks SDK and has the following labels:

|Name|Description|
|-|-|
|`object_type`|The type of CRD that the call relates to, e.g. `dcluster`|
|`action`|The action being performed, e.g. `get`, `create`|
|`outcome`|`success` or `failure`|
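For example, a query along these lines (assuming standard Prometheus histogram conventions; the label names come from the table above) plots the 95th-percentile Databricks call duration per action:

```promql
# 95th-percentile request duration, broken down by action (illustrative query)
histogram_quantile(0.95,
  sum by (le, action) (
    rate(databricks_request_duration_seconds_bucket[5m])
  )
)
```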

## Accessing Prometheus

- [Prometheus-Operator](https://github.com/coreos/prometheus-operator) can be installed in your cluster easily via Helm
> This repo provides `make install-prometheus` to perform the Helm installation
- Determine the name of the Prometheus service running in your cluster (if you used our `make` command, this defaults to `prom-azure-databricks-oper-prometheus`)
- Port forward localhost:9090 to your service: `kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090`
> If using VSCode and a Dev Container, you may need to expose the internal port to your host machine (Command Palette > Remote-Containers: Forward Port from Container)
- Using a browser, navigate to `http://localhost:9090` to view the Prometheus dashboard
- For more information on using Prometheus, see the [docs here](https://prometheus.io/)
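Put together, the flow looks like this (a sketch; the service name shown is the default produced by the `make` target and may differ in your cluster):

```sh
# Install Prometheus-Operator via the repo's Helm-based make target
make install-prometheus

# Confirm the name of the Prometheus service that was created
kubectl get services | grep prometheus

# Forward the Prometheus UI to localhost:9090, then browse to http://localhost:9090
kubectl port-forward service/prom-azure-databricks-oper-prometheus 9090:9090
```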

## Grafana Dashboard

This repo also includes a Grafana dashboard named `Databricks Operator` that can be installed:
- If Prometheus-Operator is being used, then by default a sidecar is available that automatically installs dashboards supplied via a `configmap` (see the sketch after this list):
  - Update `config/prometheus/grafana-dashboard-configmap.yaml` so its namespace matches that of your Grafana service
  - Apply the `configmap` into the same namespace as the Grafana service running the sidecar: `kubectl apply -f ./config/prometheus/grafana-dashboard-configmap.yaml`
- If you are not using Grafana/Prometheus-Operator, the dashboard JSON can be extracted and imported manually
- The dashboard provides general metrics on the health of your operator (see below for information about interpreting the chart data)
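As a sketch of the shape of that `configmap` (the metadata values and the label key below are assumptions based on the Grafana sidecar's common defaults; the repo's actual file is authoritative):

```yaml
# Illustrative shape of config/prometheus/grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: databricks-operator-dashboard   # hypothetical name
  namespace: monitoring                 # change to match your Grafana service's namespace
  labels:
    grafana_dashboard: "1"              # label the sidecar watches for
data:
  # Truncated here; the repo's file carries the full dashboard JSON.
  databricks-operator.json: |
    { "title": "Databricks Operator" }
```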

## Dashboard Charts

| Panel Name | Description | Usage |
|---|---|---|
| **Reconciliations Per Controller** | Increase/decrease in the total count of reconcile loops being performed | Useful for determining the number of reconcile loops that result in Error vs Success. <br /><br />A spike in errors can indicate something wrong inside the operator logic, such as a missing config Secret containing the Databricks URI. |
| **Controller Reconcile Time** | Median, 95th percentile and mean time taken to perform a reconciliation loop | Useful for seeing how long reconciliations take to complete; this is the complete lifecycle time and includes execution time in addition to upstream Databricks calls |
| **Workqueue Adds** | Increase/decrease of new work for the Operator to perform | Useful as it shows the incoming rate of Operator work requests to create CRDs. <br /><br />*Note:* The Operator also re-queues items for re-processing (for example, polling runs for completion status), so the graph shows a rate increase even when there is not strictly new work to be performed |
| **Workqueue Depth** | Increase/decrease of the Operator work queue depth | The work queue shows the number of reconcile loops currently awaiting an opportunity to run. <br /><br />Useful for seeing whether the Operator is struggling to cope with incoming demands for work |
| **Average Databricks Request Duration** | Average and 95th-percentile request duration when the Operator calls Databricks via its REST API | Useful for seeing how long Databricks takes to respond to requests from the Operator; can help diagnose network issues from the K8s cluster and potential timeout issues |
| **Databricks REST endpoint calls - Success** | Increase/decrease of successful calls to Databricks REST endpoints | Useful for identifying the throughput rate of Operator calls to Databricks |
| **Databricks REST endpoint calls - Failure** | Increase/decrease of failed calls to Databricks REST endpoints | Useful for identifying the error rate of external Databricks calls; a sudden spike could indicate a Databricks outage, or a breaking change to the Databricks REST services causing all requests to a specific endpoint to fail |
| **Workqueue - Work Duration** | Median and 95th percentile of how long, in seconds, processing an item from the workqueue takes | Useful for measuring whether one type of CRD request takes longer than others to complete. <br /><br />*Note:* This metric differs from Controller Reconcile Time because it includes overhead execution time, not just the time spent executing within the Controller |
| **Workqueue - Queue Duration** | Median and 95th percentile of how long, in seconds, an item stays in the workqueue before being requested | Useful for measuring whether the work queue is backing up; can indicate that something is starving the Operator of CPU |
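For instance, the Reconciliations Per Controller panel is driven by the standard controller-runtime reconcile counter; a query along these lines (illustrative, assuming kubebuilder's default metric names) breaks the rate down by controller and outcome:

```promql
# Reconcile rate per controller, split by result (success/error)
sum by (controller, result) (
  rate(controller_runtime_reconcile_total[5m])
)
```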