Add missing documentation
Signed-off-by: Kemal Akkoyun <[email protected]>
kakkoyun committed Nov 19, 2019
1 parent d14e955 commit 3273f7a
Showing 4 changed files with 207 additions and 49 deletions.
14 changes: 7 additions & 7 deletions docs/getting-started.md
@@ -8,10 +8,10 @@ slug: /getting-started.md

# Getting started

Thanos provides a global query view, high availability, and data backup with cheap access to historical data as its core features, all in a single binary.

Those features can be deployed independently of each other. This allows you to have a subset of Thanos features ready
for immediate benefit or testing, while also keeping it flexible for gradual rollouts in more complex environments.

In this quick-start guide, we will explain:

@@ -33,7 +33,7 @@ Thanos aims for a simple deployment and maintenance model. The only dependencies

You can find the latest Thanos release [here](https://github.com/thanos-io/thanos/releases).

Master should be stable and usable. Every commit to master builds a Docker image named `master-<date>-<sha>` in
[quay.io/thanos/thanos](https://quay.io/repository/thanos/thanos) and [thanosio/thanos dockerhub (mirror)](https://hub.docker.com/r/thanosio/thanos).

We also perform minor releases every 6 weeks.
@@ -44,7 +44,7 @@ See [release process docs](release-process.md) for details.

## Building from source:

Thanos is built purely in [Golang](https://golang.org/), which allows it to run on various x64 operating systems.

If you want to build Thanos from source, you need a working installation of the Go 1.12+ [toolchain](https://github.com/golang/tools) (`GOPATH`, `PATH=${GOPATH}/bin:${PATH}`).

@@ -91,8 +91,8 @@ If you want to add yourself to this list, let us know!

## Operating

See up to date [jsonnet mixins](https://github.com/thanos-io/kube-thanos/tree/master/jsonnet/thanos-mixin)
We also have example Grafana dashboards [here](/examples/grafana/monitoring.md) and some [alerts](/examples/alerts/alerts.md) to get you started.
See up to date [jsonnet mixins](https://github.com/thanos-io/thanos/tree/master/jsonnet/thanos-mixin)
We also have example Grafana dashboards [here](/examples/dashboards/dashboards.md) and some [alerts](/examples/alerts/alerts.md) to get you started.

## Talks

180 changes: 180 additions & 0 deletions examples/alerts/alerts.md
@@ -0,0 +1,180 @@
[//]: # "TODO(kakkoyun): Generate this file using embedmd."

# Alerts

Here are some example alerts configured for a Kubernetes environment.
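The alert lists in this file are fragments: Prometheus (and Thanos Rule) expect rules to be wrapped in a rule group inside a rule file. A minimal sketch of that wrapper, reusing one of the compaction alerts, could look like the following. The group name `thanos-compact.rules` is a hypothetical choice, and `TEAM` and `COMPACTION_URL` remain placeholders to replace with your own values.

```yaml
groups:
  - name: thanos-compact.rules  # hypothetical group name, pick your own
    rules:
      - alert: ThanosCompactHalted
        expr: thanos_compactor_halted{app="thanos-compact"} == 1
        for: 5m
        labels:
          team: TEAM  # placeholder
        annotations:
          summary: Thanos compaction has failed to run and now is halted
          impact: Long term storage queries will be slower
          dashboard: COMPACTION_URL  # placeholder
```

A file like this can be checked with `promtool check rules <file>` before it is referenced from `rule_files` in the Prometheus configuration.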

## Compaction

```yaml
- alert: ThanosCompactHalted
  expr: thanos_compactor_halted{app="thanos-compact"} == 1
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos compaction has failed to run and now is halted
    impact: Long term storage queries will be slower
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
- alert: ThanosCompactCompactionsFailed
  expr: rate(prometheus_tsdb_compactions_failed_total{app="thanos-compact"}[5m]) > 0
  labels:
    team: TEAM
  annotations:
    summary: Thanos Compact is failing compaction
    impact: Long term storage queries will be slower
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
- alert: ThanosCompactBucketOperationsFailed
  expr: rate(thanos_objstore_bucket_operation_failures_total{app="thanos-compact"}[5m]) > 0
  labels:
    team: TEAM
  annotations:
    summary: Thanos Compact bucket operations are failing
    impact: Long term storage queries will be slower
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
- alert: ThanosCompactNotRunIn24Hours
  expr: (time() - max(thanos_objstore_bucket_last_successful_upload_time{app="thanos-compact"}) ) /60/60 > 24
  labels:
    team: TEAM
  annotations:
    summary: Thanos Compaction has not been run in 24 hours
    impact: Long term storage queries will be slower
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
- alert: ThanosCompactionIsNotRunning
  expr: up{app="thanos-compact"} == 0 or absent({app="thanos-compact"})
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Compaction is not running
    impact: Long term storage queries will be slower
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
- alert: ThanosCompactionMultipleCompactionsAreRunning
  expr: sum(up{app="thanos-compact"}) > 1
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Multiple replicas of Thanos compaction shouldn't be running.
    impact: Metrics in long term storage may be corrupted
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: COMPACTION_URL
```

## Ruler

For Thanos Rule, we run some alerts in the local Prometheus to make sure that Thanos Rule is working:

```yaml
- alert: ThanosRuleIsDown
  expr: up{app="thanos-rule"} == 0 or absent(up{app="thanos-rule"})
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Rule is down
    impact: Alerts are not working
    action: 'check {{ $labels.kubernetes_pod_name }} pod in {{ $labels.kubernetes_namespace}} namespace'
    dashboard: RULE_DASHBOARD
- alert: ThanosRuleIsDroppingAlerts
  expr: rate(thanos_alert_queue_alerts_dropped_total{app="thanos-rule"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Rule is dropping alerts
    impact: Alerts are not working
    action: 'check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace'
    dashboard: RULE_DASHBOARD
- alert: ThanosRuleGrpcErrorRate
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",app="thanos-rule"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Rule is returning Internal/Unavailable errors
    impact: Recording Rules are not working
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: RULE_DASHBOARD
```

## Store Gateway

```yaml
- alert: ThanosStoreGrpcErrorRate
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",app="thanos-store"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Store is returning Internal/Unavailable errors
    impact: Long Term Storage Prometheus queries are failing
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: GATEWAY_URL
- alert: ThanosStoreBucketOperationsFailed
  expr: rate(thanos_objstore_bucket_operation_failures_total{app="thanos-store"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Store is failing to do bucket operations
    impact: Long Term Storage Prometheus queries are failing
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: GATEWAY_URL
```

## Sidecar

```yaml
- alert: ThanosSidecarPrometheusDown
  expr: thanos_sidecar_prometheus_up{name="prometheus"} == 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Sidecar cannot connect to Prometheus
    impact: Prometheus configuration is not being refreshed
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: SIDECAR_URL
- alert: ThanosSidecarBucketOperationsFailed
  expr: rate(thanos_objstore_bucket_operation_failures_total{name="prometheus"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Sidecar bucket operations are failing
    impact: We will lose metrics data if not fixed in 24h
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: SIDECAR_URL
- alert: ThanosSidecarGrpcErrorRate
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",name="prometheus"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Sidecar is returning Internal/Unavailable errors
    impact: Prometheus queries are failing
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: SIDECAR_URL
```

## Query

```yaml
- alert: ThanosQueryGrpcErrorRate
  expr: rate(grpc_server_handled_total{grpc_code=~"Unknown|ResourceExhausted|Internal|Unavailable",name="prometheus"}[5m]) > 0
  for: 5m
  labels:
    team: TEAM
  annotations:
    summary: Thanos Query is returning Internal/Unavailable errors
    impact: Grafana is not showing metrics
    action: Check {{ $labels.kubernetes_pod_name }} pod logs in {{ $labels.kubernetes_namespace}} namespace
    dashboard: QUERY_URL
```
20 changes: 20 additions & 0 deletions examples/dashboards/dashboards.md
@@ -0,0 +1,20 @@
[//]: # "TODO(kakkoyun): Improve documentation."

# Dashboards

There are Grafana dashboards for each component (not all of them complete), targeted at environments running Kubernetes:

- [Thanos Overview](thanos-overview.json)
- [Thanos Compact](thanos-compact.json)
- [Thanos Query](thanos-querier.json)
- [Thanos Store](thanos-store.json)
- [Thanos Receive](thanos-receive.json)
- [Thanos Sidecar](thanos-sidecar.json)
- [Thanos Rule](thanos-rule.json)

You can import them via `Import -> Paste JSON` in Grafana.
These dashboards require Grafana 5 or above; importing them into older versions is known not to work.
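
As an alternative to pasting JSON by hand, Grafana 5 and above can also load dashboard files from disk through its provisioning mechanism. The sketch below is a dashboard provider definition, assuming the JSON files listed above have been copied to a directory of your choosing. The provider name `thanos` and the path are hypothetical and not part of this repository.

```yaml
apiVersion: 1
providers:
  - name: thanos  # hypothetical provider name
    type: file
    options:
      # Assumed location of the thanos-*.json files on the Grafana host.
      path: /var/lib/grafana/dashboards/thanos
```

Grafana reads such a file from the `dashboards` directory under its provisioning path and loads every dashboard JSON file it finds at `path`.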

## Configuration

All dashboards are generated using [`thanos-mixin`](../../jsonnet/thanos-mixin) and can be configured by editing the [jsonnet configuration file](../../jsonnet/thanos-mixin/config.libsonnet), whose values are used to pinpoint Thanos components.
42 changes: 0 additions & 42 deletions examples/monitoring.md

This file was deleted.
