
Dedicated etcd instances (in multiple AZs) #525

Closed · wants to merge 2 commits

Conversation

@pdressel commented Jun 3, 2016

This PR enables dedicated and fault tolerant deployments of etcd with kube-aws.

A new optional config property in cluster.yaml (etcdIPs) controls the number of dedicated etcd instances. They are matched to the subnets that can already be configured for high-availability workers, so they can be created in separate AZs.

If etcdIPs is not specified, etcd runs on the controller instance as before. Dedicated etcd instances use CloudWatch instance recovery and EBS to survive machine failures.

The following section of cluster.yaml runs etcd in three separate AZs, allowing for the complete loss of one AZ without an outage:

subnets:
  - availabilityZone: us-west-1a
    instanceCIDR: "10.0.0.0/24"
  - availabilityZone: us-west-1b
    instanceCIDR: "10.0.1.0/24"
  - availabilityZone: us-west-1c
    instanceCIDR: "10.0.2.0/24"

etcdIPs:
  - 10.0.0.20
  - 10.0.1.20
  - 10.0.2.20
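
(For illustration only, not part of the PR: with the layout above and etcd's default ports, the endpoint list handed to the apiserver and flanneld would look roughly like the following; the member names etcd0..etcd2 are assumed here, and the exact rendering depends on the cloud-config templates.)

    ETCD_ENDPOINTS=http://10.0.0.20:2379,http://10.0.1.20:2379,http://10.0.2.20:2379
    ETCD_INITIAL_CLUSTER=etcd0=http://10.0.0.20:2380,etcd1=http://10.0.1.20:2380,etcd2=http://10.0.2.20:2380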

After experimenting a lot with etcd in autoscaling groups and the associated complexity, I find this to be a clean, simple solution to running fault tolerant etcd. From here putting the controllers in an ASG and behind an ELB should be very simple.

(I originally implemented this with full TLS support and client authentication, but since these assets have to be in each of the etcd* cloud-configs, the total stack size quickly exceeds the 52k limit on direct uploads to CloudFormation. When the template is first uploaded to S3 and then referenced, there is enough space; any thoughts on this?)

If etcdIPs are specified in cluster.yaml, they will be matched to configured subnets, and etcd will run outside of the kubernetes controller.
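
As a sketch of the instance-recovery piece mentioned above (illustrative only; the resource names are hypothetical and the generated stack template may differ), each dedicated etcd instance can be paired with a CloudWatch alarm whose action is EC2 auto-recovery, so a failed instance comes back with the same private IP and EBS volume:

    # Hypothetical CloudFormation resource: recover the etcd instance when the
    # EC2 system status check fails for 5 consecutive minutes.
    Etcd0RecoveryAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        Namespace: AWS/EC2
        MetricName: StatusCheckFailed_System
        Dimensions:
          - Name: InstanceId
            Value: !Ref Etcd0Instance
        Statistic: Minimum
        Period: 60
        EvaluationPeriods: 5
        ComparisonOperator: GreaterThanThreshold
        Threshold: 0
        AlarmActions:
          - !Sub "arn:aws:automate:${AWS::Region}:ec2:recover"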

cmd: added cloud-config-etcd to stackTemplateOptions and runCmdRender
config: added cloud-config-etcd
config: added EtcdInstanceType, EtcdRootVolumeSize and EtcdIPs to Cluster template
config: modified ETCDEndpoints depending on etcd configuration
config: added subnet matching of dedicated etcd ips
config: added DedicatedEtcd flag for use in templates/cloud-config
config: modified cloud-config-controller to only start etcd when no dedicated nodes are configured
config: modified cloud-config-controller to respect ETCDEndpoints in kube-apiserver
config: added cloud-config-etcd to template generator
@pdressel mentioned this pull request Jun 7, 2016
@@ -31,7 +34,9 @@ coreos:
     [Service]
     ExecStartPre=/usr/bin/curl --silent -X PUT -d \
       "value={\"Network\" : \"{{.PodCIDR}}\", \"Backend\" : {\"Type\" : \"vxlan\"}}" \
-      http://localhost:2379/v2/keys/coreos.com/network/config?prevExist=false
+      {{if not .DedicatedEtcd}}http://localhost:2379/v2/keys/coreos.com/network/config?prevExist=false
+      {{else}}http://{{index .EtcdIPs 0}}:2379/v2/keys/coreos.com/network/config?prevExist=false
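
Rendered with the etcdIPs example from the description (and assuming the default podCIDR of 10.2.0.0/16), the dedicated-etcd branch of this template comes out roughly as:

    ExecStartPre=/usr/bin/curl --silent -X PUT -d \
      "value={\"Network\" : \"10.2.0.0/16\", \"Backend\" : {\"Type\" : \"vxlan\"}}" \
      http://10.0.0.20:2379/v2/keys/coreos.com/network/config?prevExist=false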

This PR should allow the controller to die and be recreated, right? What if on re-creation the first etcdIP was not reachable?

@pdressel (Author)

Good point. We should probably use etcdctl instead of curl and have it use all configured etcd endpoints.

… to set flannel network config, etcdctl with all configured peers is used. This should allow booting of controllers to succeed even if the first etcd node is down.

config: changed cloud-config-controller flanneld drop-in to use etcdctl and all configured peers to set network config
Environment="ETCDCTL_PEERS={{.ETCDEndpoints}}"
ExecStartPre=/usr/bin/etcdctl set coreos.com/network/config \
"{\"Network\" : \"{{.PodCIDR}}\", \"Backend\" : {\"Type\" : \"vxlan\"}}"
{{end}}
@pdressel (Author)

@kalbasit This should allow the network config to succeed as long as the cluster is healthy


LGTM

@mumoshu (Contributor) commented Jun 10, 2016

@pdressel Wow, you did a very good job. Thanks for sharing this!

I'm afraid I may be throwing cold water on your great work, but let me share my thoughts.

I'd like to split subnets into two: one for workers+masters, and another for etcd.

Then it would look something like:

subnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: "10.0.0.0/24"
  - availabilityZone: ap-northeast-1b
    instanceCIDR: "10.0.1.0/24"

etcdSubnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: "10.0.2.0/24"

etcdIPs:
  - 10.0.2.20
  - 10.0.2.21
  - 10.0.2.22

and I'll have 2 k8s clusters, one with etcd in 1a and the other with etcd in 1b.

Reasons behind this are:

  • To achieve resilience in the face of a single AZ failure when your region has only 2 AZs rather than 3 (e.g. Seoul, Beijing, Singapore, and Tokyo), when...
  • You want to deploy a multi-AZ-aware app on each k8s cluster (if that isn't a requirement for you, you can choose to deploy a single-AZ k8s cluster in each AZ for hosting single-AZ-aware apps, instead of 1 multi-AZ k8s cluster for hosting multi-AZ-aware apps; no changes are needed on the user side once this PR is merged then)
  • You can have 2 k8s clusters to limit the failure domain. E.g. when 2 AZs 1a and 1b are available in your region, place etcd in 1a for the first k8s cluster and in 1b for the second, while spreading k8s masters and nodes over the 2 AZs. In the face of a single AZ failure, one of the two k8s clusters will be fully functioning, and you can buy time to migrate (maybe scale up/out the nodes in the functioning cluster?).

Maybe I'm overthinking, and one of:

  • a 2-AZ k8s cluster containing 2 etcds and a tie-breaker etcd node in another region,
  • or a single-AZ k8s cluster for each zone,

will suffice?

Btw, regardless of whether this PR or kube-aws itself supports H/A for regions with fewer than 3 AZs, this PR is GREAT!!
I'm really looking forward to it being merged.

@mumoshu (Contributor) commented Jun 10, 2016

@colhom @aaronlevy @philips Sorry for bothering you, but I believe you're working on landing dedicated etcd clusters in kube-aws too!
Would you mind looking into this?

@pdressel (Author)

@mumoshu I agree, we should also have a solution for regions with < 3 AZs. I checked the AWS docs; the following regions have only 2 AZs:

  • AWS GovCloud
  • Frankfurt
  • Singapore
  • Beijing
  • Seoul

There are also 5 new regions being built right now, for which the number of AZs is not published yet.

Since it is fundamentally not possible to safely keep a quorum-based service like etcd online with just two failure domains, having at least a soft-failure mode with two clusters seems like a good option. Using a separate region for an etcd arbiter node would work, but AFAIK CloudFormation stacks are limited to a single region, so automatic deployment of this would be complicated. Using the current implementation, couldn't we achieve the two-cluster solution you described with the following configuration?

Cluster A

subnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: "10.0.0.0/24"
  - availabilityZone: ap-northeast-1b
    instanceCIDR: "10.0.1.0/24"

etcdIPs:
  - 10.0.0.20
  - 10.0.0.21
  - 10.0.1.20

Cluster B

subnets:
  - availabilityZone: ap-northeast-1a
    instanceCIDR: "10.0.0.0/24"
  - availabilityZone: ap-northeast-1b
    instanceCIDR: "10.0.1.0/24"

etcdIPs:
  - 10.0.1.20
  - 10.0.1.21
  - 10.0.0.20

Let me know what you think and thank you for the praise!

@mumoshu (Contributor) commented Jun 13, 2016

@pdressel Great. Having 2 k8s clusters and, for each cluster, differentiating the subnet in which the majority of etcd instances are placed, as you've shown above, would work!

FYI for someone reading these comments in the future: a majority in this case means 2 of the 3 etcdIPs.

In cluster A, we see 2 10.0.0.x IPs in 1a. In cluster B, we see 1 10.0.0.x IP in 1a and 2 10.0.1.x IPs in 1b. A single AZ failure will take down one of those two clusters, but the whole system will keep functioning.
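
Spelled out with the example IPs above (quorum for a 3-member etcd cluster is 2):

    Cluster A: 2 members in 1a (10.0.0.20, 10.0.0.21), 1 member in 1b (10.0.1.20)
    Cluster B: 2 members in 1b (10.0.1.20, 10.0.1.21), 1 member in 1a (10.0.0.20)

    1a fails: cluster A loses quorum (1/3 left), cluster B keeps quorum (2/3 left)
    1b fails: cluster A keeps quorum (2/3 left), cluster B loses quorum (1/3 left)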

Depending on your workload, the system may suffer degraded performance until the surviving cluster scales, but I believe that is not something we can fix completely.

AFAIK, while the "dead" cluster has e.g. only 1 of 3 etcds alive, etcd won't accept writes, and therefore k8s is unable to (re)schedule pods, (re)configure services, auto-scale pods, etc. Already running pods will keep running, but I don't know whether a single AZ failure affects the network between the internet and the pods. So how much/how fast the surviving cluster and the stateless pods must scale, and how much/how fast the stateful pods (those with EBS volumes, because EBS volumes can't move between AZs) must be migrated to the surviving cluster, will vary. Still, for me, this is much better than your k8s having a chance of fully stopping until the AZ failure is fixed. And this PR seems to allow us to prepare for single AZ failures regardless of the number of AZs in your region.
Great.

Now this PR is LGTM also in that regard!
Thanks @pdressel

P.S. to folks using the ap-northeast-1 or us-west-1 region: AWS actually has 3 AZs there (1a, 1b, 1c), but usually they limit you to 2. I asked my SAs if they could expose the third AZ but had no luck. So you may be interested in having 2 k8s clusters like we discussed above, too.

@colhom (Contributor) commented Jun 13, 2016

@pdressel I was travelling last week and working on the exact same thing :(. I just put a WIP PR up: #544.

With respect to the etcd/AZ issue, I simply round-robin etcdCount across all the specified availability zones. Given that there are a fixed number of AZs per region, we can't do any better than this as far as availability goes (until we figure out dynamic etcd cluster management). ref https://github.com/coreos/coreos-kubernetes/pull/544/files#diff-e5ad8b828a5de3fe2229bee5c64ab2a0R187
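
For comparison with the etcdIPs approach in this PR, a rough sketch of what the #544 round-robin placement might look like in cluster.yaml (based only on the description above; the option names and exact semantics are defined in that PR):

    subnets:
      - availabilityZone: us-west-1a
        instanceCIDR: "10.0.0.0/24"
      - availabilityZone: us-west-1b
        instanceCIDR: "10.0.1.0/24"

    # 3 etcd instances round-robined across the subnets above, i.e.
    # instance 0 -> us-west-1a, instance 1 -> us-west-1b, instance 2 -> us-west-1a
    etcdCount: 3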

\cc @mumoshu

@pdressel (Author)

@colhom no worries. I think your implementation using the same cloud-config-etcd for all instances is great; it nicely gets around the template-size limitation I ran into when implementing TLS. I am happy to close this PR and contribute to yours if you like; maybe for future work we can sync beforehand :) I would really like to help advance this project and can dedicate some resources to it.

I am not sure about the round-robin vs. predefined fixed IP approach for the instances. Using predefined IPs might make instance replacement for cluster upgrades easier down the road: a replacement etcd instance could re-use the IP (and thus the etcd name and data dir) without runtime reconfiguration. What do you think?
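
To make that concrete, a sketch of the member configuration a replacement instance could come up with after reclaiming the fixed IP 10.0.0.20 (values are illustrative; these are standard etcd v2 environment variables, not necessarily what this PR renders):

    # Identity derived from the fixed IP; an instance that re-attaches the same
    # EBS volume and IP rejoins the cluster with the same name and data dir.
    Environment="ETCD_NAME=etcd0"
    Environment="ETCD_DATA_DIR=/var/lib/etcd2"
    Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379"
    Environment="ETCD_ADVERTISE_CLIENT_URLS=http://10.0.0.20:2379"
    Environment="ETCD_LISTEN_PEER_URLS=http://10.0.0.20:2380"
    Environment="ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.0.20:2380"
    Environment="ETCD_INITIAL_CLUSTER=etcd0=http://10.0.0.20:2380,etcd1=http://10.0.1.20:2380,etcd2=http://10.0.2.20:2380"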

@colhom (Contributor) commented Jun 16, 2016

@pdressel what I really want to do is set up a private hosted zone and refer to the nodes by DNS name: etcd1.xxx, etcd2.xxx, etcd3.xxx, kube-controller1.xxx, kube-worker1.xxx, kube-worker2.xxx. It definitely complicates the CloudFormation, adding cost and complexity.

This way, though, we can just put all the interfaces (etcd, controllers and workers!) on DHCP and not worry about it. It would greatly simplify TLS signing, and we would offload address allocation entirely. The problem is I haven't figured out how to assign these addresses to members of an autoscaling group without a lambda hook on autoscaling add/remove events (I want to avoid adding more bells & whistles if possible).
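
For reference, the hosted-zone piece itself is straightforward (an illustrative CloudFormation sketch with hypothetical names; the record below uses a fixed IP only for illustration, whereas in the DHCP scenario records would have to be updated as instances come and go, which is exactly the lambda-hook problem mentioned above):

    EtcdHostedZone:
      Type: AWS::Route53::HostedZone
      Properties:
        Name: cluster.internal
        VPCs:
          - VPCId: !Ref VPC
            VPCRegion: !Ref "AWS::Region"
    Etcd1Record:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneId: !Ref EtcdHostedZone
        Name: etcd1.cluster.internal
        Type: A
        TTL: "60"
        ResourceRecords:
          - 10.0.0.20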

\cc @cgag @brianredbeard

@aaronlevy (Contributor)

I'm going to close this now if the consensus is to continue work on this in #544 and #608

Please let me know if this should be re-opened

@aaronlevy aaronlevy closed this Aug 19, 2016