Existing VPC with custom DHCP Option Set - etcd cluster won't start #189
Thanks @rbellamy for taking the time to open and document this issue. It has cost us a lot of time and frustration and has blocked our Kubernetes progress in our EC2 VPC environment for some time. When we continued to run into this as a blocking issue, we reached out on the various Kubernetes-related IRC channels, including kubernetes-users, sig-aws, etc.
Unfortunately, none of these are viable options for us, and they also don't resolve the issue. Regarding the point that @rbellamy made above (deploying Kubernetes into an existing VPC with a custom DHCP option set), I too am confused as to why we were not able to find more (any, actually) people having similar or related issues. @mumoshu, you have been very helpful with our kube-aws efforts and the issues we've faced thus far, and I'm hoping that you might be able to comment on this issue and provide recommendations.
Thank you.
@rbellamy we just worked around this a few days ago; it's kind of a hack. Clarifying note: part of the problem for us was that we configured the DHCP options to use our own internal DNS server hosted in our on-prem datacenter (which doesn't know anything about AWS's private DNS). Our solution:
Now, since the kube-aws machines are using our on-prem DNS server, they can (through two intermediaries now) query for AWS private DNS records.
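For concreteness, the forwarding arrangement described above could be sketched as a dnsmasq fragment on the on-prem resolver. This is an assumption on my part: the resolver software, zone names, and the VPC DNS address `10.1.0.2` (the VPC CIDR base plus two) are all illustrative, not the actual setup.

```
# Forward AWS-internal names to the VPC-provided resolver so that
# on-prem clients (and kube-aws nodes pointing at this server) can
# resolve EC2 private DNS records.
server=/us-west-1.compute.internal/10.1.0.2

# Route 53 private hosted zones resolve through the same VPC resolver:
server=/internal.example.com/10.1.0.2
```

Queries for any other domain continue to follow the resolver's normal upstream configuration.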
@rbellamy, can you clarify a bit why a different hostname is a problem? As long as it is resolvable by hosts it should be fine, provided that you issue etcd SSL certificates for both
Hi @rbellamy and @cmcconnell1, thanks for your efforts on this! Could you confirm that this exact use-case would be supported without the workaround once:
? If so, my assumptions are:
Are my assumptions correct?
@rbellamy @cmcconnell1 @redbaron and I discussed situations related to your cases in #226 (comment). Would you mind taking a look at it? In short, I believe "providing resolvable hostnames to etcd nodes according to your requirements, and using these hostnames for etcd peer discovery and peer communication" is the core of the fix. Are you sure that you have correctly registered each etcd node's hostname as a record set in a private hosted zone? The private hosted zone would be named
Almost certainly, I believe you need to do something like what is described at https://cantina.co/automated-dns-for-aws-instances-using-route-53/ to provide each etcd node a resolvable hostname by manually updating
Hello @mumoshu and @redbaron. Regarding the above link: these scripts fetch the EC2 instances tagged Name. In our scripts, all of our deployed instances register and manage their own DNS records in the requisite private/public zones, based upon their (sub)domains in our Route 53 DNS. So it sounds like we're on the same page and are clear on that process. However, we have tried to keep our Kubernetes (and in this case our kube-aws) nodes and processes as clean and untainted as possible, outside of the designated provisioning/deployment framework and processes (kube-aws). Therefore, we have not been modifying our kube-aws nodes outside of the documented kube-aws process. If I understand you correctly, it sounds like you are suggesting that we either:
In either case, is it correct to assume that we will need to modify our CoreOS Kubernetes instances with cloud-init (until Ignition phases cloud-init out)? Thank you
@mumoshu - sorry for taking so long to get back to you.
Yes, your assumptions are correct. Regarding your further comments, we have a custom script which can be used to dynamically register the hostname in the Route 53 hosted zone. However, the challenge is that dealing with dynamic naming for the workers and controllers is a logistical problem. In other words, significant effort is necessary for private DNS to overcome dynamic hostname allocation, even with a working hostname registration process.
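For reference, the kind of self-registration script mentioned above might look roughly like the sketch below. This is not the actual script; the hosted zone ID, domain, node name, and TTL are placeholders, and the real values would come from the EC2 instance metadata service.

```shell
#!/bin/bash
set -euo pipefail
# Sketch: a node registering its own A record in a Route 53 private
# hosted zone. HOSTED_ZONE_ID and DOMAIN are illustrative placeholders.
HOSTED_ZONE_ID="${HOSTED_ZONE_ID:-ZEXAMPLE123}"
DOMAIN="${DOMAIN:-internal.example.com}"

# On a real instance these would come from the metadata service, e.g.:
#   ip="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"
ip="${NODE_IP:-10.1.2.3}"
name="${NODE_NAME:-etcd0}"

# Build the UPSERT change batch that
# `aws route53 change-resource-record-sets` expects.
change_batch="$(printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s.%s","Type":"A","TTL":60,"ResourceRecords":[{"Value":"%s"}]}}]}' \
  "$name" "$DOMAIN" "$ip")"
echo "$change_batch"

# The actual call (commented out so the sketch is side-effect free):
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch "$change_batch"
```

Because the action is `UPSERT`, rerunning the script after an instance replacement simply overwrites the old record, which is what makes it usable from a boot-time unit.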
@mumoshu regarding #226 (comment), yes, I agree that's the core fix for dealing with this issue. @cmcconnell1 I know you have recently forked and built this repo, and I'm wondering if you've had a chance to test this?
@cmcconnell1 Hi, thanks for the reply! Did you have a chance to take a look into #226 (comment)? Basically, the idea is to create a universal etcd endpoint used for (1) etcd peer discovery and (2) etcd cluster discovery from etcd clients (like worker and controller nodes), so that we don't need to know all the resolvable hostnames for etcd nodes at the time of running
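As an aside, one standard mechanism for such a universal endpoint is etcd's DNS SRV discovery. This fragment is an illustration of the general idea, not necessarily what kube-aws ended up implementing; the domain is a placeholder.

```
# With SRV records published for each member, e.g.
#   _etcd-server._tcp.internal.example.com. -> one SRV entry per member
#   (etcd0.internal.example.com, etcd1..., etcd2...)
# each member can discover its peers from a single domain:
etcd --name etcd0 --discovery-srv internal.example.com

# Clients likewise need only one stable name if a round-robin A record
# or a load balancer fronts the members (etcd v2 syntax):
etcdctl --endpoints https://etcd.internal.example.com:2379 cluster-health
```

The appeal is exactly what's described above: neither peers nor clients need to know every node's hostname up front, only the shared domain.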
I'm not sure if I'm following you correctly, but I guess not? I'm going to improve kube-aws to add support for a custom domain name, altering our etcd peer discovery and cert generation to support it, i.e. you can keep using hostnames like
However, if you are going to change your nodes' externally resolvable hostnames to ones other than something like the default
In short, after adopting the possible improvement in my mind, etcd will just work without customizing cloud-init as you do now, but the custom naming itself should be done on your side (for now!). Would you mind confirming that we are on the same page?
Hello @mumoshu and @redbaron
Regarding @rbellamy's question above about testing the recent fix noted in #226: we seem to have internet connectivity issues with the new kube-aws pre-release version 0.9.3 (we built our own binaries from the forked master branch) inside our VPC. I merged our configurations (from our previous rc.5 and prior cluster.yaml config file) with the new pre-release 0.9.3 generated file, and note that I did get burned by missing a new config option that appeared when building from the master branch:
But getting past that (totally my bad for missing it), the Kubernetes cluster does get provisioned, but after giving it ample time I can't connect via kubectl. A quick check on the etcd nodes: DNS queries work, but we see internet connectivity issues
And on the controller we see the same issues: DNS works, but ping/internet access isn't working correctly.
So we know that, given the internet connectivity issues we are seeing, we'd expect pretty much everything (kube, etc.) to fail, but just in case I'll include:
So we're rolling back to kube-aws rc.5 (which unfortunately has a blocker affecting Deis Workflow; it seems that rc.3 also has the same problem, which I'll note below).
However, as far as I can tell our subnets appear to be tagged correctly; please let me know if I'm missing something here (where myapp-kube is the cluster name):
Also noting an issue which could very well be related to our infrastructure, routing, etc., but it seems odd that we just started seeing this with the 0.9.3 pre-release (again, this is with us keeping all previous CIDR/subnet configs in the current cluster.yaml file that we've been using for many versions, and with a kube-aws binary built from the forked master branch). Hopefully this information is useful.
@cmcconnell1 Thanks as always! Just a quick thought but
and
seems to indicate that
Let me also add that the only difference between v0.9.3 and v0.9.3-rc.5 is the self-hosted Calico. Are you using Calico?
Excuse me if I'm repeating what you've already considered, but the error seems to indicate that, if you're deploying to a VPC created by kube-aws, you've probably used
If you've tried to deploy to an existing VPC, I guess either the route table or the internet gateway has its ID specified so as to use an existing one in another VPC. If there's any such chance, would you mind checking that you've chosen the correct internetGatewayId and routeTableId?
Hello @mumoshu. The above comments were based on consistent observations through two (2) complete kube-aws deploy/destroy cycles into an existing VPC, using internal (non-public) network deployments. As far as using
Re: your comment above
Yes, we are definitely having a new issue with access. On that note, we are using all of the same values (in our cluster.yaml) that we've been using successfully up until this pre-release version 0.9.3, including:
As noted, we have been deploying internally into a private network (and thus using the specified route table, which uses NAT). Given this scenario, if this is our desired topology,
From the cluster.yaml auto-generated template:
Note that if we are in fact using a private subnet (with NAT), I would expect that our existing (previous versions') config options would be sufficient to configure the requisite Kubernetes network access, as long as we continued to abide by the comments/warnings, etc.:
Perhaps it's my misunderstanding and the correct question should be:
On this note, and to make sure we're on the same page, I just wanted to quickly summarize our AWS VPC environment. We use an existing Internet Gateway in our VPC.
With that said, note that there is also an AWS NAT Gateway service available now (for about a year, if I recall correctly); this NAT Gateway service can take the place of an actual EC2 NAT instance (which used to be the only option). Also, from the new pre-release kube-aws version's generated cluster.yaml file, we now see config options for specifying an existing
So to summarize, perhaps we just need to know how to correctly configure our cluster.yaml files moving forward if we:
As always, thanks again for your feedback!
Hi @cmcconnell1 @rbellamy, I've submitted a WIP PR #332 to achieve H/A of etcd clusters with support for various etcd peer discovery strategies. According to the
If you're interested, would you mind reading through the description of #332 and testing it?
This change is basically for achieving a "Managed HA etcd cluster" with private IPs resolved via public EC2 hostnames, stabilized with a pool of EBS and EIP pairs for etcd nodes. After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:

* Automatic recovery from temporary etcd node failures
  * Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
* Rolling-update of the instance type for etcd nodes without downtime
* Scaling-out of etcd nodes, NOT by modifying the ASG directly BUT indirectly via CloudFormation stack updates
* Other use-cases implied by the fact that the nodes are managed by ASGs

You can choose "eip" or "eni" for etcd node (= etcd member) identity via the `etcd.memberIdentityProvider` key in cluster.yaml:

* `"eip"`, which is the default setting, is recommended
* If you want, choose `"eni"`
  * If you choose `"eni"` and your region has fewer than 3 AZs, setting `etcd.internalDomainName` to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery
  * It is an advanced option, but DNS other than Amazon DNS could be used (when `memberIdentityProvider` is `"eni"`, `internalDomainName` is set, `manageRecordSets` is `false`, and every EC2 instance has a custom DNS capable of resolving FQDNs under `internalDomainName`)

Unsupported use-cases:

* Automatic recovery from more than `(N-1)/2` permanent etcd node failures
  * Requires etcd backups and automatic determination, via `ETCD_INITIAL_CLUSTER_STATE`, of whether a new etcd cluster should be created or not
* Scaling-in of etcd nodes
  * Remains untested because it isn't my primary focus in this area. Contributions are welcome

Relevant issues to be (partly) resolved via this PR:

* Part(s) of kubernetes-retired#27
  * Wait signal for etcd nodes. See kubernetes-retired#49
* Probably kubernetes-retired#189 and kubernetes-retired#260, as this relies on stable EC2 public hostnames and AWS DNS for peer communication and discovery, regardless of whether an EC2 instance relies on a custom domain/hostname or not

The general idea is to make etcd nodes "virtual" by retaining the state and the identity of an etcd node in an EBS volume and an EIP or an ENI, respectively. This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without other moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc. Unlike well-known etcd HA solutions such as crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions.

* If you rely on Route 53 record sets, don't modify ones initially created by CloudFormation
  * Doing so breaks CloudFormation stack deletions, because CloudFormation has no way to know about modified record sets and therefore can't cleanly remove them
* To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
  * ENIs and EBS volumes can't be moved to another AZ
  * An EBS volume can, however, be transferred via a snapshot
* Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under control of a single ASG
  * ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until the AZ comes back up! That isn't the degree of H/A I wish to have at all!
* Dynamic private IPs via stable hostnames using a pool of EIP&EBS pairs and a single ASG
  * EBS is required in order to achieve "locking" of the pair associated with an etcd instance
  * First of all, identify the "free" pair by filtering available EBS volumes and try to associate it with the EC2 instance
  * Successful association of an EBS volume means that the paired EIP can also be associated with the instance without race conditions
  * EBS volumes can't move across AZs. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2? Once AZ 2 goes down, the options you can take are (1) manually altering AZ 2 to have 3 etcd nodes and then manually electing a new leader, or (2) recreating the etcd cluster within AZ 2 by modifying `etcd.subnets[]` to point to AZ 2 in cluster.yaml and running `kube-aws update`, then ssh-ing into one of the nodes and restoring etcd state from a backup. Neither is automatic.
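The "free pair" selection described above can be sketched as pure selection logic. This is a sketch under assumptions: in reality the `volume-id allocation-id state` tuples would come from `aws ec2 describe-volumes` / `describe-addresses` filtered by tag; here they are inlined so the logic stands alone.

```shell
#!/bin/bash
set -euo pipefail

# Reads "volume-id allocation-id state" lines and echoes the first pair
# whose EBS volume is still unattached ("available"). Attaching that
# volume first acts as the lock: only the instance that wins the attach
# goes on to associate the paired EIP, avoiding a race.
pick_free_pair() {
  while read -r vol eip state; do
    if [ "$state" = "available" ]; then
      echo "$vol $eip"
      return 0
    fi
  done
  return 1
}

# Inlined stand-in for the describe-volumes/describe-addresses output:
pairs='vol-0aaa eipalloc-111 in-use
vol-0bbb eipalloc-222 available
vol-0ccc eipalloc-333 available'

free_pair="$(printf '%s\n' "$pairs" | pick_free_pair)"
echo "$free_pair"  # -> vol-0bbb eipalloc-222
```

A real node would then run `aws ec2 attach-volume` on the chosen volume and, only on success, `aws ec2 associate-address` with the paired allocation ID.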
#332 is merged and has been included since v0.9.5-rc.1
Hello,
I'm really glad to know that it worked for you!
Overview
Existing VPC in us-west-1 with custom DHCP Option Set - etcd cluster won't start. See coreos/bugs#1272.
DHCP Option Set:
This causes the systemd specifier `%H` (used by the `etcd2.service` unit) to return `ip-10-1-Y-Z.terradatum.com` rather than what kube-aws expects.

TL;DR

Skip to the bottom to see the currently viable workaround.
Ideally, kube-aws would support this out-of-the-box. I find it hard to believe that we're the only folk that are trying to get kubernetes running in an existing VPC with a custom DHCP Option Set.
Finally, I'm very concerned about the move away from coreos-cloudinit, given that we've only now realized a working environment - what's the likelihood of Ignition and/or coreos-metadata addressing what is obviously considered an edge condition by the CoreOS + kube-aws teams?
Detail
List of etcd cluster instances
Generated by `config.go` lines 542, 549 and finally line 558.

You can see that these values are hard-coded for either `ec2.internal` (us-east-1) or `<region>.compute.internal` (everywhere else). So no, it's not a viable option for us to expect to use the `terradatum.com` suffix for our kube-aws-controlled cluster.

And of course, the machinery that sets the hostname is tightly coupled with the EC2 launch configuration and is supplied via calls to the EC2 instance metadata at http://169.254.169.254/latest/meta-data/hostname.
Before altering the hostname
So, the only viable solution is to ensure the instances receive the correct short and fully qualified names.
What we want
Setting the `hostname` in cloud-init won't work without altering the kube-aws code.

However, if you alter the `hostname` (using any of the various methods for effecting that change) via a systemd service unit, you're making the change after the `etcd2.service` unit configuration has already been loaded.

You would think that these two methods would produce the same value. They don't: because the `etcd2.service` unit is loaded by the time the hostname is altered by another unit, it always uses the value available before the change.

Workaround
1. Create a script that updates `/proc/sys/kernel/hostname`, `/proc/sys/kernel/domainname`, `/etc/hostname` and `/etc/hosts` accordingly, and writes the result to an environment file.
2. Create a `sethostname.service` unit to execute that script, and ensure that unit is fired any time `etcd2.service` is restarted.
3. Modify the `etcd2.service` unit to use the created `EnvironmentFile` rather than relying on the `%H` specifier.
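A minimal sketch of the environment-file half of this workaround is below. The paths and variable values are illustrative, not the exact ones used here; `/tmp` is chosen only so the sketch runs unprivileged, whereas a real unit would write somewhere like `/etc/sysconfig/`.

```shell
#!/bin/bash
set -euo pipefail
# Sketch of steps 1 and 3 of the workaround: capture the corrected
# hostname and expose it to etcd2.service via EnvironmentFile=,
# sidestepping the stale value the %H specifier would return.
ENV_FILE="${ENV_FILE:-/tmp/etcd-hostname.env}"

# The short hostname we actually want etcd to advertise, instead of
# the FQDN handed out by the custom DHCP option set.
short_host="$(hostname -s)"

# etcd2 is configured entirely through ETCD_* environment variables,
# so this file can feed etcd2.service directly.
{
  printf 'ETCD_NAME=%s\n' "$short_host"
  printf 'ETCD_INITIAL_ADVERTISE_PEER_URLS=http://%s:2380\n' "$short_host"
} > "$ENV_FILE"
```

A drop-in for `etcd2.service` would then load the file instead of `%H`, along the lines of:

```
[Service]
EnvironmentFile=/etc/sysconfig/etcd-hostname.env
```

Because `sethostname.service` regenerates the file before every `etcd2.service` (re)start, the unit always sees the post-change hostname.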