From 7196db6980b71e8c4db2ba616ebfb052277c66b6 Mon Sep 17 00:00:00 2001 From: Tim Bannister Date: Thu, 27 Aug 2020 23:08:56 +0100 Subject: [PATCH] Revise page about multiple zones Drop provider-specific details, in line with current content guide. Plus general rewording. --- .../setup/best-practices/multiple-zones.md | 480 ++++-------------- 1 file changed, 110 insertions(+), 370 deletions(-) diff --git a/content/en/docs/setup/best-practices/multiple-zones.md b/content/en/docs/setup/best-practices/multiple-zones.md index 7c2622641b865..501e9546428bf 100644 --- a/content/en/docs/setup/best-practices/multiple-zones.md +++ b/content/en/docs/setup/best-practices/multiple-zones.md @@ -4,401 +4,141 @@ reviewers: - justinsb - quinton-hoole title: Running in multiple zones -weight: 10 +weight: 20 content_type: concept --- -This page describes how to run a cluster in multiple zones. - - +This page describes running Kubernetes across multiple zones. -## Introduction - -Kubernetes 1.2 adds support for running a single cluster in multiple failure zones -(GCE calls them simply "zones", AWS calls them "availability zones", here we'll refer to them as "zones"). -This is a lightweight version of a broader Cluster Federation feature (previously referred to by the affectionate -nickname ["Ubernetes"](https://github.com/kubernetes/community/blob/{{< param "githubbranch" >}}/contributors/design-proposals/multicluster/federation.md)). -Full Cluster Federation allows combining separate -Kubernetes clusters running in different regions or cloud providers -(or on-premises data centers). However, many -users simply want to run a more available Kubernetes cluster in multiple zones -of their single cloud provider, and this is what the multizone support in 1.2 allows -(this previously went by the nickname "Ubernetes Lite"). - -Multizone support is deliberately limited: a single Kubernetes cluster can run -in multiple zones, but only within the same region (and cloud provider). Only -GCE and AWS are currently supported automatically (though it is easy to -add similar support for other clouds or even bare metal, by simply arranging -for the appropriate labels to be added to nodes and volumes). - - -## Functionality - -When nodes are started, the kubelet automatically adds labels to them with -zone information. - -Kubernetes will automatically spread the pods in a replication controller -or service across nodes in a single-zone cluster (to reduce the impact of -failures.) With multiple-zone clusters, this spreading behavior is -extended across zones (to reduce the impact of zone failures.) (This is -achieved via `SelectorSpreadPriority`). This is a best-effort -placement, and so if the zones in your cluster are heterogeneous -(e.g. different numbers of nodes, different types of nodes, or -different pod resource requirements), this might prevent perfectly -even spreading of your pods across zones. If desired, you can use -homogeneous zones (same number and types of nodes) to reduce the -probability of unequal spreading. - -When persistent volumes are created, the `PersistentVolumeLabel` -admission controller automatically adds zone labels to them. The scheduler (via the -`VolumeZonePredicate` predicate) will then ensure that pods that claim a -given volume are only placed into the same zone as that volume, as volumes -cannot be attached across zones. 
- -## Limitations - -There are some important limitations of the multizone support: - -* We assume that the different zones are located close to each other in the -network, so we don't perform any zone-aware routing. In particular, traffic -that goes via services might cross zones (even if some pods backing that service -exist in the same zone as the client), and this may incur additional latency and cost. - -* Volume zone-affinity will only work with a `PersistentVolume`, and will not -work if you directly specify an EBS volume in the pod spec (for example). - -* Clusters cannot span clouds or regions (this functionality will require full -federation support). - -* Although your nodes are in multiple zones, kube-up currently builds -a single master node by default. While services are highly -available and can tolerate the loss of a zone, the control plane is -located in a single zone. Users that want a highly available control -plane should follow the [high availability](/docs/setup/production-environment/tools/kubeadm/high-availability/) instructions. - -### Volume limitations -The following limitations are addressed with [topology-aware volume binding](/docs/concepts/storage/storage-classes/#volume-binding-mode). - -* StatefulSet volume zone spreading when using dynamic provisioning is currently not compatible with - pod affinity or anti-affinity policies. - -* If the name of the StatefulSet contains dashes ("-"), volume zone spreading - may not provide a uniform distribution of storage across zones. - -* When specifying multiple PVCs in a Deployment or Pod spec, the StorageClass - needs to be configured for a specific single zone, or the PVs need to be - statically provisioned in a specific zone. Another workaround is to use a - StatefulSet, which will ensure that all the volumes for a replica are - provisioned in the same zone. - -## Walkthrough - -We're now going to walk through setting up and using a multi-zone -cluster on both GCE & AWS. To do so, you bring up a full cluster -(specifying `MULTIZONE=true`), and then you add nodes in additional zones -by running `kube-up` again (specifying `KUBE_USE_EXISTING_MASTER=true`). - -### Bringing up your cluster - -Create the cluster as normal, but pass MULTIZONE to tell the cluster to manage multiple zones; creating nodes in us-central1-a. - -GCE: - -```shell -curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-a NUM_NODES=3 bash -``` - -AWS: - -```shell -curl -sS https://get.k8s.io | MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a NUM_NODES=3 bash -``` - -This step brings up a cluster as normal, still running in a single zone -(but `MULTIZONE=true` has enabled multi-zone capabilities). - -### Nodes are labeled - -View the nodes; you can see that they are labeled with zone information. -They are all in `us-central1-a` (GCE) or `us-west-2a` (AWS) so far. 
The -labels are `failure-domain.beta.kubernetes.io/region` for the region, -and `failure-domain.beta.kubernetes.io/zone` for the zone: - -```shell -kubectl get nodes --show-labels -``` - -The output is similar to this: - -```shell -NAME STATUS ROLES AGE VERSION LABELS -kubernetes-master Ready,SchedulingDisabled 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-master -kubernetes-minion-87j9 Ready 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-87j9 -kubernetes-minion-9vlv Ready 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv -kubernetes-minion-a12q Ready 6m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-a12q -``` - -### Add more nodes in a second zone - -Let's add another set of nodes to the existing cluster, reusing the -existing master, running in a different zone (us-central1-b or us-west-2b). -We run kube-up again, but by specifying `KUBE_USE_EXISTING_MASTER=true` -kube-up will not create a new master, but will reuse one that was previously -created instead. - -GCE: - -```shell -KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-b NUM_NODES=3 kubernetes/cluster/kube-up.sh -``` - -On AWS we also need to specify the network CIDR for the additional -subnet, along with the master internal IP address: - -```shell -KUBE_USE_EXISTING_MASTER=true MULTIZONE=true KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2b NUM_NODES=3 KUBE_SUBNET_CIDR=172.20.1.0/24 MASTER_INTERNAL_IP=172.20.0.9 kubernetes/cluster/kube-up.sh -``` - - -View the nodes again; 3 more nodes should have launched and be tagged -in us-central1-b: +## Background -```shell -kubectl get nodes --show-labels -``` +Kubernetes is designed so that a single Kubernetes cluster can run +across multiple failure zones, typically where these zones fit within +a logical grouping called a _region_. Major cloud providers define a region +as a set of failure zones (also called _availability zones_) that provide +a consistent set of features: within a region, each zone offers the same +APIs and services. -The output is similar to this: +Typical cloud architectures aim to minimize the chance that a failure in +one zone also impairs services in another zone. 
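If you are not sure whether your cluster already spans more than one zone, you can
check the well-known topology labels on its nodes. The following is a sketch:
`--label-columns` (`-L`) is a standard `kubectl get` flag, but whether the
`topology.kubernetes.io/*` labels are present (or whether nodes still carry the
older `failure-domain.beta.kubernetes.io/*` labels instead) depends on your
Kubernetes version and on your cloud provider integration.

```shell
kubectl get nodes --label-columns topology.kubernetes.io/region,topology.kubernetes.io/zone
```

A node that shows empty columns here was not registered with zone information
under these labels.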
-```shell -NAME STATUS ROLES AGE VERSION LABELS -kubernetes-master Ready,SchedulingDisabled 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-1,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-master -kubernetes-minion-281d Ready 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-281d -kubernetes-minion-87j9 Ready 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-87j9 -kubernetes-minion-9vlv Ready 16m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-9vlv -kubernetes-minion-a12q Ready 17m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a,kubernetes.io/hostname=kubernetes-minion-a12q -kubernetes-minion-pp2f Ready 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-pp2f -kubernetes-minion-wf8i Ready 2m v1.13.0 beta.kubernetes.io/instance-type=n1-standard-2,failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-b,kubernetes.io/hostname=kubernetes-minion-wf8i -``` +## Control plane behavior -### Volume affinity +All [control plane components](/docs/concepts/overview/components/#control-plane-components) +support running as a pool of interchangable resources, replicated per +component. -Create a volume using the dynamic volume creation (only PersistentVolumes are supported for zone affinity): - -```bash -kubectl apply -f - <}} -For version 1.3+ Kubernetes will distribute dynamic PV claims across -the configured zones. For version 1.2, dynamic persistent volumes were -always created in the zone of the cluster master -(here us-central1-a / us-west-2a); that issue -([#23330](https://github.com/kubernetes/kubernetes/issues/23330)) -was addressed in 1.3+. +Kubernetes does not provide cross-zone resilience for the API server +endpoints. You can use various techniques to improve availability for +the cluster API server, including DNS round-robin, SRV records, or +a third-party load balancing solution with health checking. {{< /note >}} -Now let's validate that Kubernetes automatically labeled the zone & region the PV was created in. - -```shell -kubectl get pv --show-labels -``` - -The output is similar to this: - -```shell -NAME CAPACITY ACCESSMODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE LABELS -pv-gce-mj4gm 5Gi RWO Retain Bound default/claim1 manual 46s failure-domain.beta.kubernetes.io/region=us-central1,failure-domain.beta.kubernetes.io/zone=us-central1-a -``` - -So now we will create a pod that uses the persistent volume claim. -Because GCE PDs / AWS EBS volumes cannot be attached across zones, -this means that this pod can only be created in the same zone as the volume: - -```yaml -kubectl apply -f - <}} +or {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}}) +across different nodes in a cluster. 
This spreading helps
reduce the impact of failures.

When nodes start up, the kubelet automatically adds
{{< glossary_tooltip text="labels" term_id="label" >}} to the Node object
that represents that specific kubelet in the Kubernetes API.
These labels can include
[zone information](/docs/reference/kubernetes-api/labels-annotations-taints/#topologykubernetesiozone).

If your cluster spans multiple zones or regions, you can use node labels
in conjunction with
[Pod topology spread constraints](/docs/concepts/workloads/pods/pod-topology-spread-constraints/)
to control how Pods are spread across your cluster among fault domains:
regions, zones, and even specific nodes.
These hints enable the
{{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} to place
Pods for better expected availability, reducing the risk that a correlated
failure affects your whole workload.

For example, you can set a constraint to make sure that the
3 replicas of a StatefulSet are all running in different zones from each
other, whenever that is feasible. You can define this declaratively
without explicitly defining which availability zones are in use for
each workload; a sketch of such a constraint appears at the end of this section.

### Distributing nodes across zones

Kubernetes' core does not create nodes for you; you need to do that yourself,
or use a tool such as the [Cluster API](https://cluster-api.sigs.k8s.io/) to
manage nodes on your behalf.

Using tools such as the Cluster API you can define sets of machines to run as
worker nodes for your cluster across multiple failure domains, and rules to
automatically heal the cluster in case of whole-zone service disruption.
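As a minimal sketch of the kind of constraint described earlier in this section
(the resource name, labels, and container image below are placeholders rather than
anything defined elsewhere on this page), a StatefulSet that asks the scheduler to
keep its replicas in different zones where feasible could look like this:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-db              # placeholder name
spec:
  serviceName: example-db
  replicas: 3
  selector:
    matchLabels:
      app: example-db
  template:
    metadata:
      labels:
        app: example-db
    spec:
      # Prefer to keep the replicas in different zones; if that is not
      # feasible, schedule them anyway rather than leaving them Pending.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: example-db
      containers:
      - name: db
        image: registry.example/db:1.0   # placeholder image
```

Changing `whenUnsatisfiable` to `DoNotSchedule` turns the preference into a hard
requirement, at the cost of leaving replicas Pending when no suitable zone is
available.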
-Load-balancers span all zones in a cluster; the guestbook-go example -includes an example load-balanced service: +## Manual zone assignment for Pods -```shell -kubectl describe service guestbook | grep LoadBalancer.Ingress -``` - -The output is similar to this: - -```shell -LoadBalancer Ingress: 130.211.126.21 -``` - -Set the above IP: - -```shell -export IP=130.211.126.21 -``` - -Explore with curl via IP: - -```shell -curl -s http://${IP}:3000/env | grep HOSTNAME -``` - -The output is similar to this: - -```shell - "HOSTNAME": "guestbook-44sep", -``` - -Again, explore multiple times: - -```shell -(for i in `seq 20`; do curl -s http://${IP}:3000/env | grep HOSTNAME; done) | sort | uniq -``` - -The output is similar to this: - -```shell - "HOSTNAME": "guestbook-44sep", - "HOSTNAME": "guestbook-hum5n", - "HOSTNAME": "guestbook-ppm40", -``` - -The load balancer correctly targets all the pods, even though they are in multiple zones. - -### Shutting down the cluster - -When you're done, clean up: - -GCE: - -```shell -KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-f kubernetes/cluster/kube-down.sh -KUBERNETES_PROVIDER=gce KUBE_USE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh -KUBERNETES_PROVIDER=gce KUBE_GCE_ZONE=us-central1-a kubernetes/cluster/kube-down.sh -``` - -AWS: - -```shell -KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2c kubernetes/cluster/kube-down.sh -KUBERNETES_PROVIDER=aws KUBE_USE_EXISTING_MASTER=true KUBE_AWS_ZONE=us-west-2b kubernetes/cluster/kube-down.sh -KUBERNETES_PROVIDER=aws KUBE_AWS_ZONE=us-west-2a kubernetes/cluster/kube-down.sh -``` +You can apply [node selector constraints](/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) +to Pods that you create, as well as to Pod templates in workload resources +such as Deployment, StatefulSet, or Job. +## Storage access for zones +When persistent volumes are created, the `PersistentVolumeLabel` +[admission controller](/docs/reference/access-authn-authz/admission-controllers/) +automatically adds zone labels to any PersistentVolumes that are linked to a specific +zone. The {{< glossary_tooltip text="scheduler" term_id="kube-scheduler" >}} then ensures, +through its `NoVolumeZoneConflict` predicate, that pods which claim a given PersistentVolume +are only placed into the same zone as that volume. + +You can specify a {{< glossary_tooltip text="StorageClass" term_id="storage-class" >}} +for PersistentVolumeClaims that specifies the failure domains (zones) that the +storage in that class may use. +To learn about configuring a StorageClass that is aware of failure domains or zones, +see [Allowed topologies](/docs/concepts/storage/storage-classes/#allowed-topologies). + +## Networking + +By itself, Kubernetes does not include zone-aware networking. You can use a +[network plugin](docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/) +to configure cluster networking, and that network solution might have zone-specific +elements. For example, if your cloud provider supports Services with +`type=LoadBalancer`, the load balancer might only send traffic to Pods running in the +same zone as the load balancer element processing a given connection. +Check your cloud provider's documentation for details. + +For custom or on-premises deployments, similar considerations apply. 
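For reference, a minimal sketch of a Service that requests such a load balancer is
shown below. The name, selector, and ports are placeholder values; note that nothing
in the manifest itself describes zones — any zone awareness comes from the component
that satisfies the request.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-frontend        # placeholder name
spec:
  type: LoadBalancer            # ask the environment for an external load balancer
  selector:
    app: example-frontend       # placeholder label; match the Pods you want to expose
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
```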
+{{< glossary_tooltip text="Service" term_id="service" >}} and +{{< glossary_tooltip text="Ingress" term_id="ingress" >}} behavior, including handling +of different failure zones, does vary depending on exactly how your cluster is set up. + +## Fault recovery + +When you set up your cluster, you might also need to consider whether and how +your setup can restore service if all of the failure zones in a region go +off-line at the same time. For example, do you rely on there being at least +one node able to run Pods in a zone? +Make sure that any cluster-critical repair work does not rely +on there being at least one healthy node in your cluster. For example: if all nodes +are unhealthy, you might need to run a repair Job with a special +{{< glossary_tooltip text="toleration" term_id="toleration" >}} so that the repair +can complete enough to bring at least one node into service. + +Kubernetes doesn't come with an answer for this challenge; however, it's +something to consider. + +## {{% heading "whatsnext" %}} + +To learn how the scheduler places Pods in a cluster, honoring the configured constraints, +visit [Scheduling and Eviction](/docs/concepts/scheduling-eviction/).