K8s cluster robustness features (#414)
This commit adds the standard for K8s robustness features, including Kube-API rate limiting, ETCD compaction as well as CA expiration avoidance.

Signed-off-by: Hannes Baum <[email protected]>
cah-hbaum committed Nov 9, 2023
1 parent 442b6a6 commit e6583e9
---
title: Robustness features for K8s clusters
type: Standard
status: Draft
track: KaaS
---

## Introduction

Kubernetes clusters in a production environment are expected to perform reliably and without major interruptions.
However, external or unforeseen influences can disrupt a cluster's normal operation, leading to slow responsiveness
or even malfunctions.
To mitigate such problems, robustness features should be introduced into the SCS standards. These harden the cluster
infrastructure against several classes of problems, making failures less likely.

## Motivation

A typical production Kubernetes cluster can be hardened in many different ways, and many of these measures overlap
or target similar weaknesses of a cluster.
This version of the standard addresses the following points:

* Kube-API rate limiting
* etcd compaction/defragmentation
* etcd backup
* CA expiration avoidance

These robustness features should mainly increase the uptime of the Kubernetes cluster by avoiding downtime caused
either by internal problems or by external threats like "Denial of Service" attacks.
Additionally, the etcd database should be strengthened by these features in order to provide a secure and robust
backend for the Kubernetes cluster.

## Design Considerations

In order to provide a conclusive standard, some design considerations need to be set beforehand:

### Kube-API rate limiting

Rate limiting is the practice of limiting the number of requests a server accepts within a given time frame. This can help prevent
service interruptions due to congestion, and therefore slow responsiveness or even a service shutdown.
Kubernetes offers multiple ways to integrate such a rate limit for its API server, a few of which are mentioned here.
In order to provide a useful rate limit for a Kubernetes cluster, a combination of these methods should be considered.

#### API server flags

The Kubernetes API server provides flags to limit the number of incoming requests it accepts,
which should prevent the API server from crashing. This nevertheless shouldn't be the only measure used to
introduce a rate limit, since important requests could be blocked during high-traffic periods (at least according to
the official documentation).
The following controls are available to tune the server:

* max-requests-inflight
* max-mutating-requests-inflight
* min-request-timeout

More details can be found in the following documents:
[Flow Control](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
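For illustration, these flags could be set in a kube-apiserver static pod manifest like the following excerpt; the values shown are the upstream defaults and should be tuned to the expected load:

```yaml
# Excerpt from a kube-apiserver static pod manifest
# (typically /etc/kubernetes/manifests/kube-apiserver.yaml on a kubeadm node)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.0
    command:
    - kube-apiserver
    # limit concurrent non-mutating (read) requests
    - --max-requests-inflight=400
    # limit concurrent mutating (write) requests
    - --max-mutating-requests-inflight=200
    # lower bound (in seconds) for the timeout of watch request handlers
    - --min-request-timeout=1800
```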

#### Ratelimit Admission Controller

From version 1.13 onwards, Kubernetes includes an EventRateLimit admission controller, which mitigates rate-limit
problems for the API server by enforcing per-second request limits, either for specific resources or
for the whole API server. Depending on the EventRateLimit configuration, requests exceeding the limit are either
cached for later processing or rejected outright and need to be retried.
More details can be found in the following documents:
[Rancher rate limiting](https://rke.docs.rancher.com/config-options/rate-limiting)
[EventRateLimit](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#eventratelimit)
It is important to note that this only protects the API server against event overloads and not necessarily the network
in front of it, which could still be overwhelmed.
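To activate the admission controller, the API server must list `EventRateLimit` in `--enable-admission-plugins` and reference a configuration file via `--admission-control-config-file`; the referenced file name below is an example:

```yaml
# AdmissionConfiguration pointing the EventRateLimit plugin at its own
# configuration file (path is relative to the AdmissionConfiguration file)
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: EventRateLimit
  path: eventconfig.yaml
```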

#### Flow control

Flow control for the Kubernetes API server is provided by the API Priority and Fairness feature, which classifies
and isolates requests in a fine-grained way in order to prevent an overload of the API server.
Instead of denying requests outright, the feature places them into queues and dequeues them using fair-queuing techniques.
Overall, it introduces many different capabilities like request queues, rule-based flow control,
different priority levels and rate-limit maximums.
The concept documentation offers a more in-depth explanation of the feature:
[Flow Control](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
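As a sketch of how API Priority and Fairness is configured, the following illustrative pair of resources assigns all requests from a particular service account to a low-priority, queued level; all names and values here are examples, and the API version depends on the Kubernetes release in use:

```yaml
# Illustrative priority level with queuing instead of outright rejection
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: low-priority
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 20
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
---
# Illustrative flow schema routing one service account to that level
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: restrict-example-service-account
spec:
  priorityLevelConfiguration:
    name: low-priority
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
    subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: default
        namespace: default
```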

### etcd compaction/defragmentation

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be
accessed by a distributed system or cluster of machines. For these reasons, etcd was chosen as the default database
for Kubernetes.
In order to remain reliable, an etcd cluster needs periodic maintenance. This is necessary to maintain the etcd keyspace;
failure to do so could lead to a cluster-wide alarm, which would put the cluster into a limited-operation mode.
To mitigate this scenario, the etcd keyspace can be compacted. Additionally, an etcd cluster can be defragmented, which
gives back disk space to the underlying file system and can help bring the cluster back into an operable state, if it
ran out of space earlier.

This can be achieved by providing the necessary flags/parameters to etcd, either via the KubeadmControlPlane or in the
configuration file of the etcd cluster, if it is managed independently from the Kubernetes cluster.
Possible flags that can be set for this feature are:

* auto-compaction-mode
* auto-compaction-retention
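For illustration, setting these flags via a Cluster API `KubeadmControlPlane` could look like the following sketch; the resource name is an example and the API version depends on the Cluster API release in use:

```yaml
# Excerpt from a Cluster API KubeadmControlPlane resource passing the etcd
# auto-compaction flags through kubeadm's ClusterConfiguration
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: example-control-plane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      etcd:
        local:
          extraArgs:
            # compact the keyspace periodically instead of by revision count
            auto-compaction-mode: periodic
            # keep 8 hours of history before compaction
            auto-compaction-retention: "8h"
```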

etcd cluster defragmentation unfortunately can't be done automatically. Instead, the user needs to call
the defrag command on the cluster manually. To mitigate this, a systemd (or similar) job can be created that
periodically calls the defragmentation procedure. Since defragmenting all members of an etcd cluster
simultaneously would block read and write operations, a preferable strategy is the following:

* defragment the non-leader etcd members first
* transfer leadership to a randomly selected, already defragmented etcd member
* defragment the local (ex-leader) etcd member

This example was taken from the [Maintenance and Troubleshooting](https://github.com/SovereignCloudStack/k8s-cluster-api-provider/blob/main/doc/Maintenance_and_Troubleshooting.md#defragmentation-and-backup)
page of the SCS documentation, which was derived in part from the [OpenShift recommended host practices](https://docs.openshift.com/container-platform/4.9/scalability_and_performance/recommended-host-practices.html#automatic-defrag-etcd-data_recommended-host-practices).

An example of a defragmentation job, implemented as a systemd service with an accompanying timer, could look like the following (the file names `etcd-defrag.service` and `etcd-defrag.timer` are illustrative):

```ini
# etcd-defrag.service
[Unit]
Description=Run etcdctl defrag
Documentation=https://etcd.io/docs/v3.3.12/op-guide/maintenance/#defragmentation
After=network.target

[Service]
Type=oneshot
Environment="LOG_DIR=/var/log"
Environment="ETCDCTL_API=3"
ExecStart=/usr/local/sbin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt defrag

[Install]
WantedBy=multi-user.target
```

```ini
# etcd-defrag.timer
[Unit]
Description=Run etcd-defrag.service every day
After=network.target

[Timer]
OnCalendar=*-*-* 02:00:00
RandomizedDelaySec=10m

[Install]
WantedBy=timers.target
```

More information about compaction and defragmentation can be found in the etcd documentation:
[etcd maintenance](https://etcd.io/docs/v3.3/op-guide/maintenance/)

### etcd backup

An etcd cluster should be backed up regularly in order to be able to restore the cluster to a known good state at an
earlier point in time if a failure or an incorrect state occurs.
The cluster should be backed up multiple times in order to have different states to go back to. This is especially
useful if data in the newer backups has already been corrupted or important data has been deleted from them.
For this reason, a backup strategy should keep a decreasing number of backups over increasing periods of time,
meaning, for example, that the previous year keeps only one backup, while the current week keeps multiple.
Information about the backup process can be found in the etcd documentation:
[Upgrade etcd](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
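One way to automate such backups is a Kubernetes CronJob that runs `etcdctl snapshot save` on a control-plane node. The following is a sketch only: the image, schedule, certificate locations and backup path are assumptions that must be adapted to the actual cluster setup.

```yaml
# Hypothetical CronJob taking a daily etcd snapshot on a control-plane node
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 3 * * *"          # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true       # reach etcd on the node's loopback address
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: registry.k8s.io/etcd:3.5.9-0
            command:
            - /bin/sh
            - -c
            - >
              etcdctl --endpoints=https://127.0.0.1:2379
              --cert=/etc/kubernetes/pki/etcd/peer.crt
              --key=/etc/kubernetes/pki/etcd/peer.key
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              snapshot save /backup/etcd-snapshot.db
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
```

A rotation scheme (keeping fewer snapshots as they age) and off-node replication of the snapshot files would still need to be added around this job.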

### CA expiration avoidance

In order to secure the communication of a Kubernetes cluster, (TLS) certificates signed by a controlled
Certificate Authority (CA) can be used.
Normally, these certificates expire after a set period of time. In order to avoid expiration and failure of a cluster,
these certificates need to be rotated regularly, ideally automatically.
It is important to either set `--rotate-server-certificates` as a command-line parameter or to set `rotateCertificates: true`
in the kubelet config or the `kubeletExtraArgs` of the cluster-template.yaml file. This activates the rotation of the
kubelet server certificate. It is also recommended to set `serverTLSBootstrap: true`, which, according to the
documentation, additionally enables the request and rotation of the kubelet's serving certificate.
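The relevant kubelet settings can be sketched as a `KubeletConfiguration` excerpt; the same values can also be supplied via `kubeletExtraArgs` in a Cluster API cluster template:

```yaml
# KubeletConfiguration excerpt enabling certificate rotation
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# rotate the kubelet client certificate automatically before expiry
rotateCertificates: true
# request the serving certificate from the cluster and keep it rotated
serverTLSBootstrap: true
```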

A cluster's certificates can be rotated either by updating the cluster, which according to the Kubernetes documentation
automatically renews the certificates, or by running the `kubeadm certs renew` command for the certificates that need to
be updated, as shown in the following inline code

```bash
kubeadm certs renew all
```

Since clusters conformant with the SCS standards will probably be updated within a 14-month period, this
rotation can generally be assumed to happen. Nevertheless, the alternative can still be mentioned in the standard.
Additionally, for security reasons, the resulting CSRs need to be approved manually with the commands

```bash
kubectl get csr
kubectl certificate approve <CSR>
```

Another option to approve the CSRs would be to use a third-party controller that automates the process. One example for
this would be the [Kubelet CSR approver](https://github.com/postfinance/kubelet-csr-approver), which can be deployed on
a K8s cluster and requires `serverTLSBootstrap` to be set to true. Other controllers with a similar functionality might
have other specific requirements, which won't be explored in this document.

Another problem is that the Certificate Authority (CA) might expire. Unfortunately, kubeadm doesn't have any tooling
at the moment to renew the CA. Instead, there is documentation for manually rotating the CA, which can be found
under [Manual rotation of ca certificate](https://kubernetes.io/docs/tasks/tls/manual-rotation-of-ca-certificates/).

Further information can be found in the Kubernetes documentation:
[Kubeadm certs](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/)
[Kubelet TLS bootstrapping](https://kubernetes.io/docs/reference/access-authn-authz/kubelet-tls-bootstrapping/)

## Decision

Robustness features combine multiple aspects of increasing the security, hardening and
longevity of a Kubernetes cluster. The decisions below are separated into their respective
areas.

### Kube-API rate limiting

The number of requests sent to the kube-apiserver (Kubernetes API server) SHOULD be limited
in order to protect the server against outages, slowdowns or malfunctions due to an
overload of requests.
In order to do so, at least the following parameters SHOULD be set on a Kubernetes cluster:

* max-requests-inflight
* max-mutating-requests-inflight
* min-request-timeout

Values for these flags/parameters SHOULD be adapted to the needs of the environment and
the expected load.

A cluster MUST also activate and configure a Ratelimit admission controller.
This requires an `EventRateLimit` resource to be deployed on the Kubernetes cluster.
The following settings are RECOMMENDED for a cluster-wide deployment, but more
fine-grained rate limiting can also be applied, if this is necessary.

```yaml
kind: Configuration
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
limits:
- burst: 20000
qps: 5000
type: Server
```

It is also RECOMMENDED to activate the Kubernetes API Priority and Fairness feature,
which also uses the aforementioned cluster parameters to better queue, schedule and
prioritize incoming requests.

### etcd compaction/defragmentation

etcd needs to be cleaned up regularly, so that it functions correctly and doesn't take
up too much space as its keyspace grows.
To compact the etcd keyspace, the following flags/parameters MUST be set for etcd:

* auto-compaction-mode = periodic
* auto-compaction-retention = 8h

OPTIONALLY, a cluster defragmentation can be carried out regularly.
To do this, it is RECOMMENDED to create a systemd (or similar automatic) job that
executes the defragmentation regularly within a fixed timeframe.
An example of such a systemd job can be found in the [Design Considerations](#design-considerations) section.
It is important to note that such a defragmentation could lead to service interruptions;
therefore, it should preferably be carried out during times of low traffic in order
not to disrupt normal workflows.

### etcd backup

An etcd cluster MUST be backed up regularly. It is RECOMMENDED to adopt
a strategy of decreasing backup density over longer time periods, e.g. keeping snapshots every
10 minutes for the last 120 minutes, then one hourly for 1 day, then one daily for 2 weeks,
then one weekly for 3 months, then one monthly for 2 years, and after that a yearly backup.
These numbers can be adapted to the security setup and to concerns like storage or network
usage. It is also RECOMMENDED to encrypt the backups in order to secure them further.
How this is done is up to the operator.

### CA expiration avoidance

It should be avoided that certificates expire, either for the whole cluster or for single components.
To avoid this scenario, certificates SHOULD be rotated regularly; in the
case of SCS, we REQUIRE at least a yearly certificate rotation.
To achieve a complete certificate rotation, the parameters `serverTLSBootstrap` and `rotateCertificates` MUST be set.

The certificates can be rotated by either updating the Kubernetes cluster, which automatically
renews certificates, or by manually renewing them with the command

```bash
kubeadm certs renew all
```

After this, new CSRs MUST be approved manually, normally done with

```bash
kubectl get csr
kubectl certificate approve <CSR>
```

or be approved with a third-party controller, e.g. the [kubelet-csr-approver](https://github.com/postfinance/kubelet-csr-approver).

It is also RECOMMENDED to renew the certificate authority (CA) regularly
to avoid an expiration of the CA. This standard doesn't set a timeline
for this, since it is dependent on the CA.

## Related Documents

[Flow Control](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/)
[Rate limiting](https://rke.docs.rancher.com/config-options/rate-limiting)
[EventRateLimit](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#eventratelimit)
[etcd maintenance](https://etcd.io/docs/v3.3/op-guide/maintenance/)
[Upgrade etcd](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
[Kubeadm certs](https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/)
[Kubelet TLS bootstrapping](https://kubernetes.io/docs/reference/access-authn-authz/kubelet-tls-bootstrapping/)

## Conformance Tests

Conformance Tests, OPTIONAL
