Three different manifests are provided as templates based on different uses cases for a ZooKeeper ensemble.
- zookeeper.yaml provides a manifest that is close to production readiness.
- It provides 5 servers with a disruption budget of 1 planned disruption. This ensemble will tolerate 1 planned and 1 unplanned failure.
- Each server will consume 4 GiB of memory, 3 Gib of which will be dedicated to the ZooKeeper JVM heap.
- Each server will consume 2 CPUs.
- Each server will consume 1 Persistent Volume with 250 GiB of storage.
- You can tune the parameters as nessecary to suite the needs of your deployment.
- The total footprint is 5 Nodes, 10 CPUs, 20 GiB memory, 1250 GiB disk
- zookeeper_mini.yaml provides a manifest that is suitable for a demos, testing, or
development use cases where a single zookeeper server is not desirable.
- It provides 3 servers with a disruption budget of 1 planned disruption. This ensemble will not tolerate any concurrent unplanned failures during a planned disruption.
- Each server will consume 1 GiB of memory, 512 MiB of which will be dedicated to the ZooKeeper JVM heap.
- Each server will consume 0.5 CPUs.
- Each server will consume 1 Persistent Volume with 10 GiB of storage.
- You can, again, tune this manifest to your specific needs.
- The total footprint is 3 Nodes, 1.5 CPUs, 3 GiB memory, 30 GiB disk
- zookeeper_micro.yaml provides a manifest that is suitable for demos, testing, or development
use cases where a single zookeeper server will suffice.
- It provides 1 server with no disruption budget.
- The server will consume 1 GiB of memory, 512 Mib of which will be dedicated to the ZooKeeper JVM heap.
- The server will consume 0.5 CPUs.
- The server will consume 1 Persistent Volume with 10 GiB of storage.
- The total footprint is 1 Node, 0.5 CPUs, 1 GiB memory, 10 GiB disk
This section goes over the basics of creating and updating a ZooKeeper ensemble on a Kubernetes 1.7 cluster. You can clone this repo, or simply download one or more of the manifests. You will need a working cluster and a 1.7 version of kubectl installed. The size of the cluster depends on which manifest you choose to deploy. Below, we deploy zookeeper_mini.yaml.
To create an ensemble use kubectl apply
to create the components in the
zookeeper_mini.yaml manifest.
kubectl apply -f zookeeper_mini.yaml
service "zk-hs" configured
service "zk-cs" configured
poddisruptionbudget "zk-pdb" configured
statefulset "zk" created
You can watch the creation of the Pods containing the ZooKeeper servers as below.
kubectl get po -lapp=zk -w
NAME READY STATUS RESTARTS AGE
zk-0 1/1 Running 0 32s
zk-1 0/1 ContainerCreating 0 12s
zk-1 0/1 Running 0 13s
zk-1 1/1 Running 0 30s
zk-2 0/1 Pending 0 0s
zk-2 0/1 Pending 0 0s
zk-2 0/1 ContainerCreating 0 0s
zk-2 0/1 Running 0 11s
zk-2 1/1 Running 0 30s
After all Pods are Running and Ready you can exit the terminal.
This creates four objects to manage the ZooKeeper ensemble.
- A Headless Service,
zk-hs
, is created to control the network domain of the ZooKeeper ensemble. The leader election and server ports for each ZooKeeper server are configured for the Service. - A Service,
zk-cs
, is created to load balance client connections to ZooKeeper servers that are Running and Ready. You should consider configuring your client to connect directly to the Service instead of connecting to the Pods individually. However, both approaches will work. - A PodDisruptionBudget,
zk-pdb
, is created to manage planned disruptions. Planned disruptions can occur during drains, evictions, and managed node or image upgrades.zk-pdb
specifies that the ensemble will tolerate only 1 planned disruption. Note that, even if you deploy an ensemble of 5 or more ZooKeeper servers, you should only plan for one planned disruption. With 5 or more servers you can tolerate an unplanned disruption that occurs concurrently with a planned disruption. - A StatefulSet,
zk
, is created to launch the ZooKeeper servers, each in its own Pod, and each with unique and stable network identities and storage.
You can use kubectl exec
to run the zkCli.sh script in a bash shell on one of
the Pods.
$ kubectl exec -ti zk-0 -- bash
zookeeper@zk-0:/$ zkCli.
[zk: localhost:2181(CONNECTED) 1] create /hello world
Created /hello
From another Pod you can use the same technique to retrieve the value.
$ kubectl exec -ti zk-1 -- bash
zookeeper@zk-1:/$ zkCli.
zkCli.cmd zkCli.sh
[zk: localhost:2181(CONNECTED) 0] get /hello
world
cZxid = 0x800000003
ctime = Mon Jun 12 21:25:25 UTC 2017
mZxid = 0x800000003
mtime = Mon Jun 12 21:25:25 UTC 2017
pZxid = 0x800000003
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 5
numChildren = 0
You can't scale a ZooKeeper ensemble. However, You can update all of the configuration and resource requests. Edit the zookeeper_mini.yaml manifest to decrease the memory resource request and the jvm heap size.
name: kubernetes-zookeeper
imagePullPolicy: Always
image: "gcr.io/google_samples/kubernetes-zookeeper:1.0-3.4.10"
resources:
requests:
memory: "512Gi"
cpu: "0.5"
ports:
- containerPort: 2181
name: client
- containerPort: 2888
name: server
- containerPort: 3888
name: leader-election
command:
- sh
- -c
- "start-zookeeper \
--servers=3 \
--data_dir=/var/lib/zookeeper/data \
--data_log_dir=/var/lib/zookeeper/data/log \
--conf_dir=/opt/zookeeper/conf \
--client_port=2181 \
--election_port=3888 \
--server_port=2888 \
--tick_time=2000 \
--init_limit=10 \
--sync_limit=5 \
--heap=256M \
--max_client_cnxns=60 \
--snap_retain_count=3 \
--purge_interval=12 \
--max_session_timeout=40000 \
--min_session_timeout=4000 \
--log_level=INFO"
Apply the manifest using kubectl apply
.
$ kubectl apply -f manifests/zookeeper.yaml
service "zk-hs" configured
service "zk-cs" configured
poddisruptionbudget "zk-pdb" configured
statefulset "zk" configured
Use kubectl rollout status
to watch the update progress.
kubectl rollout status sts/zk
waiting for statefulset rolling update to complete 0 pods at revision zk-1312265925...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 1 pods at revision zk-1312265925...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
waiting for statefulset rolling update to complete 2 pods at revision zk-1312265925...
Waiting for 1 pods to be ready...
Waiting for 1 pods to be ready...
statefulset rolling update complete 3 pods at revision zk-1312265925.
The StatefulSet controller will update each Pod in the StatefulSet, one at a time, and it will wait for the updated Pod to be Running and Ready before updating the next Pod.
Each manifest contains a Headless Service to control the network domain of the ensemble, a client Service to load balance client connections to available servers, and a StatefulSet to provide the required number of unique Pods. zookeeper.yaml and zookeeper_mini.yaml additionally contain a PodDisruptionBudget to manage planned disruptions.
The zk-hs
Headless Service will manage the domain of the ensemble. It has named ports, one for leader election and
one for inter-server communication. These ports must correspond to the container ports specified in the StatefulSet
and the parameters passed to the StatefulSet's Pods' command.
apiVersion: v1
kind: Service
metadata:
name: zk-hs
labels:
app: zk
spec:
ports:
- port: 2888
name: server
- port: 3888
name: leader-election
clusterIP: None
selector:
app: zk
The zk-cs
Service provides a load balanced Service allowing clients to connect to active ZooKeeper instances. You
should consider allowing clients to connect directly to this service instead of using the FQDNs of the individual Pods
in your connection string. The client port on the service must correspond to the container port specified in the
StatefulSet and the parameters passed to the StatefulSet's Pods' command.
apiVersion: v1
kind: Service
metadata:
name: zk-cs
labels:
app: zk
spec:
ports:
- port: 2181
name: client
selector:
app: zk
The zk
StatefulSet creates the requested number of replicas. You can modify the StatefulSet to change the resource
requests and the number of Pods hosting ZooKeeper servers. You have to ensure that the container ports are consistent
with the Services defined above and the parameters passed to the start-zookeeper script. You also have to ensure that
the requested number of replicas corresponds to the --servers
parameter.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: zk
spec:
serviceName: zk-hs
replicas: 3
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app: zk
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: "app"
operator: In
values:
- zk
topologyKey: "kubernetes.io/hostname"
containers:
- name: kubernetes-zookeeper
imagePullPolicy: Always
image: "gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10"
resources:
requests:
memory: "4Gi"
cpu: "2"
ports:
- containerPort: 2181
name: client
- containerPort: 2888
name: server
- containerPort: 3888
name: leader-election
command:
- sh
- -c
- "start-zookeeper \
--servers=3 \
--data_dir=/var/lib/zookeeper/data \
--data_log_dir=/var/lib/zookeeper/data/log \
--conf_dir=/opt/zookeeper/conf \
--client_port=2181 \
--election_port=3888 \
--server_port=2888 \
--tick_time=2000 \
--init_limit=10 \
--sync_limit=5 \
--heap=3G \
--max_client_cnxns=60 \
--snap_retain_count=3 \
--purge_interval=12 \
--max_session_timeout=40000 \
--min_session_timeout=4000 \
--log_level=INFO"
readinessProbe:
exec:
command:
- sh
- -c
- "zookeeper-ready 2181"
initialDelaySeconds: 10
timeoutSeconds: 5
livenessProbe:
exec:
command:
- sh
- -c
- "zookeeper-ready 2181"
initialDelaySeconds: 10
timeoutSeconds: 5
volumeMounts:
- name: datadir
mountPath: /var/lib/zookeeper
securityContext:
runAsUser: 1000
fsGroup: 1000
volumeClaimTemplates:
- metadata:
name: datadir
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 250Gi
A PodDisruptionBudget is created, except in the zookeeper_micro.yaml manifest, to advertise the number of Pods that can be safely drained or evicted. When administrators apply managed node or base image updates, if they cordon and drain the node appropriately, this will ensure that the ensemble remains available throughout the procedure. For ZooKeeper, the correct MaxUnavailable setting is always 1. Even with the 5 server ensemble deployed by zookeeper.yaml, we want to ensure that we only allow one node to fail due to a planned disruption. This allows the ensemble to tolerate a concurrent unplanned disruption.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: zk-pdb
spec:
selector:
matchLabels:
app: zk
maxUnavailable: 1