This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Existing VPC with custom DHCP Option Set - etcd cluster won't start #189

Closed
rbellamy opened this issue Jan 1, 2017 · 18 comments

Comments

@rbellamy

rbellamy commented Jan 1, 2017

Overview

Existing VPC in us-west-1 with custom DHCP Option Set - etcd cluster won't start. See coreos/bugs#1272.

DHCP Option Set:

domain-name = terradatum.com
domain-name-servers = 10.1.0.2

This causes the systemd specifier %H (used by the etcd2.service unit) to return ip-10-1-Y-Z.terradatum.com rather than what kube-aws expects.

TL;DR

Skip to the bottom to see the currently viable workaround.

Ideally, kube-aws would support this out of the box. I find it hard to believe that we're the only folks trying to get Kubernetes running in an existing VPC with a custom DHCP Option Set.

Finally, I'm very concerned about the move away from coreos-cloudinit, given that we've only just now arrived at a working environment. What's the likelihood of Ignition and/or coreos-metadata addressing what is evidently considered an edge case by the CoreOS and kube-aws teams?

Detail

List of etcd cluster instances

Environment=ETCD_INITIAL_CLUSTER={{.EtcdInitialCluster}}

Generated by config.go lines 542, 549 and finally line 558

You can see that these values are hard coded for either ec2.internal (us-east-1) or <region>.compute.internal (everywhere else). So no, it's not a viable option for us to expect to use the terradatum.com suffix for our kube-aws-controlled cluster.
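The hard-coded convention can be sketched as follows (a minimal illustration of the mapping described above, not the actual config.go code):

```shell
# Sketch of the internal-domain convention kube-aws assumes per region
# (illustrative only; the real logic lives in config.go).
region="us-west-1"
if [ "$region" = "us-east-1" ]; then
  internal_domain="ec2.internal"
else
  internal_domain="${region}.compute.internal"
fi
echo "$internal_domain"
```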

And of course, the machinery that sets the hostname is tightly coupled with the EC2 launch configuration and is supplied via calls to the EC2 instance metadata at http://169.254.169.254/latest/meta-data/hostname.

Before altering the hostname

$ hostname
ip-10-1-10-4.terradatum.com
$ hostname -s
ip-10-1-10-4
$ hostname -f
ip-10-1-10-4.terradatum.com
$ hostname -d
terradatum.com
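For reference, the short name and domain reported above are just the two halves of the FQDN, which can be split with plain shell parameter expansion (a sketch mirroring what hostname -s and hostname -d report):

```shell
# Split an FQDN into its short-name and domain parts
# (mirrors hostname -s / hostname -d above).
fqdn="ip-10-1-10-4.terradatum.com"
short="${fqdn%%.*}"    # everything before the first dot
domain="${fqdn#*.}"    # everything after the first dot
echo "$short $domain"
```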

So, the only viable solution is to ensure the instances receive the correct short and fully qualified names.

What we want

$ hostname
ip-10-1-10-4
$ hostname -s
ip-10-1-10-4
$ hostname -f
ip-10-1-10-4.us-west-1.compute.internal
$ hostname -d
us-west-1.compute.internal

Setting the hostname in the cloud-init won't work without altering the kube-aws code:

#cloud-config
hostname= # what to put here???

However, if you alter the hostname after boot (using a systemd service unit, or any of the various other methods for effecting that change), you're making the change after the etcd2.service unit configuration has already been loaded.

You would think that these two methods would produce the same value

$(hostname) # child shell
%H # systemd

They don't. Because the etcd2.service unit has already been loaded by the time the hostname is altered by another unit, %H always expands to the value that was in effect before the change.

Workaround

  1. Create a script to set the hostname and prepare the etcd environment for startup. Make it resilient enough to alter both the short and fully qualified names, and to adjust /proc/sys/kernel/hostname, /proc/sys/kernel/domainname, /etc/hostname, and /etc/hosts accordingly.
write_files:

  - path: /opt/bin/set-hostname
    permissions: 0700
    owner: root:root
    content: |
      #!/bin/bash -e
      TEMP_IPV4=$(ip route get 8.8.8.8 | awk 'NR==1 { print $7; exit }')
      TEMP_HOSTNAME="ip-${TEMP_IPV4//./-}"
      /usr/bin/hostnamectl --static set-hostname "$TEMP_HOSTNAME"
      echo "$TEMP_IPV4 $TEMP_HOSTNAME.us-west-1.compute.internal $TEMP_HOSTNAME" >> /etc/hosts
      cat >/etc/etcd-environment << EOL
      ETCD_NAME=$TEMP_HOSTNAME.us-west-1.compute.internal
      ETCD_LISTEN_CLIENT_URLS=https://$TEMP_HOSTNAME.us-west-1.compute.internal:2379
      ETCD_ADVERTISE_CLIENT_URLS=https://$TEMP_HOSTNAME.us-west-1.compute.internal:2379
      ETCD_LISTEN_PEER_URLS=https://$TEMP_HOSTNAME.us-west-1.compute.internal:2380
      ETCD_INITIAL_ADVERTISE_PEER_URLS=https://$TEMP_HOSTNAME.us-west-1.compute.internal:2380
      EOL
  2. Create a custom systemd sethostname.service unit to execute that script, and ensure that unit is fired any time the etcd2.service is restarted.
units:
    - name: sethostname.service
      command: start
      content: |
        [Unit]
        Description=Set Hostname Workaround https://github.com/coreos/bugs/issues/1272
        Before=etcd2.service

        [Service]
        Type=oneshot
        ExecStart=/opt/bin/set-hostname

        [Install]
        WantedBy=multi-user.target
        RequiredBy=etcd2.service
  3. Modify the etcd2.service unit to use the created EnvironmentFile rather than relying on the %H specifier.
    - name: etcd2.service
      drop-ins:
        - name: 20-etcd2-aws-cluster.conf
          content: |
            [Unit]
            Requires=decrypt-tls-assets.service sethostname.service
            After=decrypt-tls-assets.service sethostname.service

            [Service]
            EnvironmentFile=/etc/etcd-environment

            Environment=ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
            Environment=ETCD_PEER_CERT_FILE=/etc/etcd2/ssl/etcd.pem
            Environment=ETCD_PEER_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

            Environment=ETCD_CLIENT_CERT_AUTH=true
            Environment=ETCD_TRUSTED_CA_FILE=/etc/etcd2/ssl/ca.pem
            Environment=ETCD_CERT_FILE=/etc/etcd2/ssl/etcd.pem
            Environment=ETCD_KEY_FILE=/etc/etcd2/ssl/etcd-key.pem

            Environment=ETCD_INITIAL_CLUSTER_STATE=new
            Environment=ETCD_INITIAL_CLUSTER={{.EtcdInitialCluster}}
            Environment=ETCD_DATA_DIR=/var/lib/etcd2
            PermissionsStartOnly=true
            ExecStartPre=/usr/bin/bash -c "sed -i \"s/^ETCDCTL_ENDPOINT.*$/ETCDCTL_ENDPOINT=https:\/\/$(hostname).us-west-1.compute.internal:2379/\" /etc/environment"
            ExecStartPre=/usr/bin/chown -R etcd:etcd /var/lib/etcd2
      enable: true
      command: start
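Two pieces of the workaround above can be sanity-checked locally (a sketch, with hard-coded values standing in for the ip route lookup and $(hostname)): the dot-to-dash hostname derivation from set-hostname, and the ETCDCTL_ENDPOINT rewrite done by ExecStartPre, exercised here against a scratch file instead of the real /etc/environment.

```shell
# 1) The hostname derivation from /opt/bin/set-hostname, with the IP
#    hard-coded instead of taken from `ip route get`.
TEMP_IPV4="10.1.10.4"
TEMP_HOSTNAME="ip-${TEMP_IPV4//./-}"
echo "$TEMP_HOSTNAME"

# 2) The ExecStartPre ETCDCTL_ENDPOINT rewrite, run against a temp file
#    rather than the real /etc/environment.
tmp="$(mktemp)"
printf 'ETCDCTL_ENDPOINT=https://old-host:2379\n' > "$tmp"
sed -i "s#^ETCDCTL_ENDPOINT.*#ETCDCTL_ENDPOINT=https://${TEMP_HOSTNAME}.us-west-1.compute.internal:2379#" "$tmp"
result="$(cat "$tmp")"
echo "$result"
rm -f "$tmp"
```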
@cmcconnell1
Contributor

Thanks @rbellamy for taking the time to open and document this issue. It has cost us a lot of time and frustration, and has blocked our Kubernetes progress into our EC2 VPC environment for some time.

When we continued to run into this as a blocking issue, we reached out on the various kubernetes-related IRC channels, including: kubernetes-users, sig-aws, etc.
The most common responses (proposed solutions/workarounds) we received from folks on those channels were essentially:

  • "Don't use kube-aws use Terraform and Kops"
  • "If you use GCE (and not AWS) it should just work."
  • "Manually modify the CloudFormation binary data using base64 encode/decode for each distinct userdata section/excerpt..."
  • "Kubernetes (in the cloud) is very hard, just hire our consulting company and we'll do it for you."

Unfortunately, none of these are viable options for us and they also don't resolve the issue.

Regarding the point that @rbellamy made above (deploying Kubernetes into an existing VPC with a custom DHCP option set), I too am confused as to why we were not able to find more (any, actually) people having similar/related issues.

@mumoshu you have been very helpful with our kube-aws efforts and issues we faced thus far, and I'm hoping that you might be able to comment on this issue and provide recommendations.

  • Are you aware of others that might have run into this issue with their AWS VPC, etc.?
  • Will the proposed workaround (as noted above) create any (future/current) unknown issues/complications for us, etc.?

Thank you.

@gmcquillan

@rbellamy we just worked around this a few days ago: it's kind of a hack.

Clarifying note: Part of the problem for us was that we configured the DHCP options to use our own internal DNS server hosted in our on-prem datacenter (which doesn't know anything about AWS's private DNS).

Our solution:

  • We ended up setting up a machine in our VPC and configuring dnsmasq to forward all DNS queries to the AWS private DNS address in our VPC (hard-coded to 10.x.x.2 for your VPC network).
  • Then we added a zone in our on-prem datacenter which forwards all queries for *.ec2.internal to this dnsmasq forwarder's IP in the VPC. (Note: we're also using Direct Connect, so these two have routable private IPs to each other.)

Now, since the kube-aws machines are using our on-prem DNS server, they can (through two intermediaries now) query for AWS private DNS records.
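A dnsmasq configuration for such a forwarder might look roughly like this (a sketch; the 10.0.0.2 resolver address is illustrative of the "VPC CIDR base + 2" convention, and restricting forwarding to the AWS zones is an assumption):

```
# /etc/dnsmasq.conf on the forwarder instance in the VPC (sketch).
# Send AWS private-DNS zones to the VPC-provided resolver
# (VPC CIDR base + 2; the address below is illustrative).
server=/ec2.internal/10.0.0.2
server=/compute.internal/10.0.0.2
```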

@redbaron
Contributor

redbaron commented Jan 9, 2017

@rbellamy, can you clarify a bit why a different hostname is a problem? As long as it is resolvable by hosts it should be fine, provided that you issue etcd SSL certificates for *.terradatum.com, *.compute.internal, and *.ec2.internal.

@mumoshu
Contributor

mumoshu commented Jan 16, 2017

Hi @rbellamy and @cmcconnell1, thanks for your efforts on this!
Sorry for leaving this in silence for a long time.
Several things:

Could you confirm that this exact use-case would be supported without the workaround once:

  1. We provide the way to customize the domain name kube-aws assumes for EC2 instances in an existing VPC via e.g. internalDomainName: terradatum.com and
  2. We ensure "correct" hostnames are used at least for etcd peer discovery and
  3. You ensure that the DHCP options include the domain name hosted by the private route53 hosted zone attached to the VPC you're deploying your kube-aws cluster to

?

If so, my assumptions are:

  • You've already done 3. correctly (an FQDN like <host>.terradatum.com is already resolvable from each node inside the etcd cluster)
  • What should be done in kube-aws are only 1. and 2.

Are my assumptions correct?

@mumoshu
Contributor

mumoshu commented Jan 16, 2017

@rbellamy @cmcconnell1 @redbaron and I discussed situations related to your cases in #226 (comment). Would you mind taking a look?

In short, I believe "providing resolvable hostnames to etcd nodes according to your requirements, and using these hostnames for etcd peer discovery and peer communication" is the core of the fix.

Are you sure that you have correctly registered each etcd node's hostname as a record set in a private hosted zone? In @rbellamy's case, for example, the private hosted zone would be named terradatum.com and associated with the existing VPC in which you're trying to create your Kubernetes cluster.

I believe you almost certainly need to do something like what is described at https://cantina.co/automated-dns-for-aws-instances-using-route-53/ to provide each etcd node a resolvable hostname, by manually updating userdata/cloud-config-*.
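The linked approach boils down to UPSERTing an A record for each node into the private hosted zone via the AWS CLI. Building the change batch can be sketched as follows (the zone ID and IP are hypothetical placeholders, and the final aws call is left commented out since it needs credentials and a real zone):

```shell
# Build a Route 53 UPSERT change batch for this node's A record.
# ZONE_ID and IP are hypothetical placeholders; in practice the IP
# would come from instance metadata or `ip route get`.
ZONE_ID="Z1EXAMPLE"
IP="10.1.10.4"
NAME="ip-${IP//./-}.terradatum.com"
CHANGE_BATCH=$(printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":60,"ResourceRecords":[{"Value":"%s"}]}}]}' "$NAME" "$IP")
echo "$CHANGE_BATCH"
# aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" --change-batch "$CHANGE_BATCH"
```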

@cmcconnell1
Contributor

cmcconnell1 commented Jan 23, 2017

Hello @mumoshu and @redbaron
Thanks for following-up on this!

Regarding the above link:
We do something very similar with all of our AWS nodes during their bootstrap process, wherein these instances receive customized route53 systemd processes/scripts.

These scripts fetch the EC2 instance's Name tag upon systemctl start route53 of the systemd service.

In our scripts, all of our deployed instances register and manage their own DNS records in the requisite private/public zones, based upon their (sub)domains in our route53 DNS. So it sounds like we're on the same page and are clear on that process.

However, we have tried to keep our kubernetes (and in this case our kube-aws) nodes and processes as clean and untainted as possible, outside of the designated provisioning/deployment framework/processes (kube-aws). Therefore, we have not been modifying our kube-aws nodes outside of the documented kube-aws process.

If I understand you correctly, it sounds like you are suggesting that we either:

  • Continue to apply our existing workaround as noted above in the post by @rbellamy (via cloud-init).
    or
  • Test/refine bootstrapping our CoreOS kube-aws kubernetes instances with something similar to our above noted standard AWS route53 bootstrap process for any/all kubernetes nodes (also likely via cloud-init).

In either case, is it correct to assume that we will need to modify our CoreOS Kubernetes instances with cloud-init (until Ignition phases cloud-init out)?

Thank you

@rbellamy
Author

@mumoshu - sorry for taking so long to get back to you.

  1. We provide the way to customize the domain name kube-aws assumes for EC2 instances in an existing VPC via e.g. internalDomainName: terradatum.com and
  2. We ensure "correct" hostnames are used at least for etcd peer discovery and
  3. You ensure that the DHCP options include the domain name hosted by the private route53 hosted zone attached to the VPC you're deploying your kube-aws cluster to

If so, my assumptions are:

  • You've already done 3. correctly (an FQDN like <host>.terradatum.com is already resolvable from each node inside the etcd cluster)
  • What should be done in kube-aws are only 1. and 2.
    Are my assumptions correct?

Yes, your assumptions are correct.

Regarding your further comments, we have a custom script which can be used to dynamically register the hostname in the Route53 hosted zone.

HOWEVER, the challenge here is that dealing with dynamic naming for the workers and controllers is a logistical challenge. In other words, significant effort is necessary for private DNS to overcome dynamic hostname allocation, even with a working hostname registration process.

@rbellamy
Author

@mumoshu regarding #226 (comment), yes I agree that's the core fix for dealing with this issue. @cmcconnell1 I know you have recently forked and built this repo and I'm wondering if you've had a chance to test this?

@mumoshu
Contributor

mumoshu commented Jan 27, 2017

@cmcconnell1 Hi, thanks for the reply!

Did you have a chance to take a look into #226 (comment)?

Basically, the idea is to create a universal etcd endpoint used for (1) etcd peer discovery and (2) etcd cluster discovery from etcd clients (like worker and controller nodes), so that we don't need to know all the resolvable hostnames for etcd nodes at the time of running kube-aws up.
AFAIK, it does nothing for etcd peer "communication".

Continue to apply our existing workaround as noted above in the post by @rbellamy (via cloud-init).
Test/refine bootstrapping our CoreOS kube-aws kubernetes instances with something similar to our above noted standard AWS route53 bootstrap process for any/all kubernetes nodes (also likely via cloud-init).

I'm not sure I'm following you correctly, but I guess not?

I'm going to improve kube-aws to add support for a custom domain name, altering our etcd peer discovery and cert generation to support it, i.e. you can keep using hostnames like ip-10-1-10-4.terradatum.com (or whatever under terradatum.com) for etcd peer discovery and communication, possibly without modifying kube-aws or cloud-init, as long as they are externally resolvable.

However, if you are going to change your nodes' externally resolvable hostnames to something other than the default (like ip-10-1-10-4.terradatum.com), utilizing EC2 tags, Route 53, and systemd units, I'm afraid you have to do that on your own in cloud-config-* (for now, until someone sends pull requests to improve the situation!), as custom host naming itself has nothing to do with Kubernetes (am I correct?).

In short, after adopting the possible improvement I have in mind, etcd will just work without customizing cloud-init as you do now, but custom naming itself must still be done on your side (for now!).

Would you mind confirming if we are on the same page?

@cmcconnell1
Contributor

Hello @mumoshu and @redbaron
Yes I/we think that the referenced comment/proposal #226 (comment) would be a (much appreciated) fix/resolution.
Sorry for the delay on that feedback; I just realized that's the second time you asked me to review and comment on that issue.
Thanks again, we appreciate the project's rapid pulse, your quick responses, and the developers being so receptive and proactive regarding our feedback, issues, etc.

@cmcconnell1
Contributor

Regarding @rbellamy's question above about testing the recent fix noted in #226:

We seem to have internet connectivity issues with the new kube-aws pre-release version 0.9.3 (we built our own binaries from the forked master branch) inside our VPC.
Specifically, our kube-aws-provisioned Kubernetes instances have external internet access issues, using the exact same CIDR/subnet settings we've used in every previous release up to and including rc.5 without issue.

I merged our configurations (from our previous rc.5 and prior cluster.yaml config file) with the new pre-release 0.9.3 generated file, and note that I did get burned by missing a new config option that appeared when building from the master branch: internetGatewayId. Missing it caused some fresh new errors I hadn't seen before:

CREATE_FAILED AWS::EC2::Route PublicRouteToInternet route table rtb-71db6915 and network gateway igw-a45707c1 belong to different networks

where both the rtb and the igw are kube-aws/CloudFormation stack artifacts/infrastructure being created.

But getting past that (totally my bad for missing it), the Kubernetes cluster does get provisioned; however, after giving it ample time, I can't connect via kubectl.
Looking at the kube-aws-provisioned ELB status shows that the controller instance is out of service.

A quick check on the etcd nodes:

$ uname -a
Linux ip-10-1-10-202.terradatum.com 4.7.3-coreos-r2 #1 SMP Sun Jan 8 00:32:25 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz GenuineIntel GNU/Linux

DNS queries work but we see internet connectivity issues

google.com has address 216.58.193.206
google.com has IPv6 address 2607:f8b0:4005:805::200e
google.com mail is handled by 50 alt4.aspmx.l.google.com.
google.com mail is handled by 10 aspmx.l.google.com.
google.com mail is handled by 20 alt1.aspmx.l.google.com.
google.com mail is handled by 30 alt2.aspmx.l.google.com.
google.com mail is handled by 40 alt3.aspmx.l.google.com.

ip-10-1-10-91 bee2edb3ef084690abc89bbdd3fe4e3d # ping google.com
PING google.com (216.58.193.206) 56(84) bytes of data.
^C
--- google.com ping statistics ---
19 packets transmitted, 0 received, 100% packet loss, time 17999ms

ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
^C
--- 8.8.8.8 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5000ms

And on controller we see the same issues with DNS working, but ping/internet access isn't working correctly.

Last login: Fri Jan 27 19:20:19 UTC 2017 from 10.1.0.178 on pts/0
Container Linux by CoreOS stable (1235.6.0)
Failed Units: 1
  coreos-cloudinit-958058535.service

So we know that, given the internet connectivity issues we are seeing, we'd expect pretty much everything (kube, etc.) to fail, but just in case I'll include:

-- Logs begin at Fri 2017-01-27 02:24:44 UTC, end at Fri 2017-01-27 20:35:36 UTC. --
Jan 27 02:24:56 ip-10-1-10-91.terradatum.com systemd[1]: Started Unit generated and executed by coreos-cloudinit on behalf of user.
Jan 27 02:26:07 ip-10-1-10-91.terradatum.com bash[1094]: run: discovery failed
Jan 27 02:26:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:26:07 Checking availability of "local-file"
Jan 27 02:26:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:26:07 Checking availability of "local-file"
Jan 27 02:26:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:26:07 Checking availability of "local-file"
Jan 27 02:26:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:26:07 Checking availability of "local-file"
Jan 27 02:26:08 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:26:08 Checking availability of "local-file"
<SNIP lots more here>
Jan 27 02:31:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:31:07 No datasources available in time
Jan 27 02:31:07 ip-10-1-10-91.terradatum.com systemd[1]: coreos-cloudinit-958058535.service: Main process exited, code=exited, status=1/FAILURE
Jan 27 02:31:07 ip-10-1-10-91.terradatum.com systemd[1]: coreos-cloudinit-958058535.service: Unit entered failed state.
Jan 27 02:31:07 ip-10-1-10-91.terradatum.com systemd[1]: coreos-cloudinit-958058535.service: Failed with result 'exit-code'.

So we're rolling back to kube-aws rc.5 (which, unfortunately, has a blocker affecting Deis Workflow; it seems that rc.3 also has the same problem, which I'll note below).
I've run this issue past the Deis IRC channel a few times, and the consensus was that this is very similar to a previous issue where the Kubernetes subnet tags were missing:

kubectl --namespace=deis describe svc deis-router | egrep LoadBalancer
Type:			LoadBalancer
  2m		14s		6	{service-controller }			Normal		CreatingLoadBalancer		Creating load balancer
  2m		14s		6	{service-controller }			Warning		CreatingLoadBalancerFailed	Error creating load balancer (will retry): Failed to create load balancer for service deis/deis-router: could not find any suitable subnets for creating the ELB

However, we can see that our subnets appear to be tagged correctly AFAIK--please let me know if I'm missing something here (where myapp-kube is the cluster name):

aws ec2 describe-tags --filters "Name=key,Values=KubernetesCluster" --output text | grep subnet
TAGS	KubernetesCluster	subnet-1b832443	subnet	myapp-kube
TAGS	KubernetesCluster	subnet-4e24b22a	subnet	myapp-kube

Also noting an issue which could very well be related to our infrastructure/routing, etc., but it seems odd that we only started seeing this with the 0.9.3 pre-release (again, this is with us keeping all previous CIDR/subnet configs in the current cluster.yaml file that we've been using for many versions, and with a kube-aws binary that we built from the forked master branch).
The new problem I noticed today with the pre-release v0.9.3 is that I couldn't connect via ssh, etc. to any instances from outside EC2 (but could from an inside EC2 instance). So, for a quick test, I added all the new kube instances to security groups which definitely allow all requisite access, but still couldn't connect from anything outside EC2. Again, this could be our VPC and its config, but it's strange that it only affects us now with this pre-release version 0.9.3 and never before with any other version.

Hopefully this information is useful.
Thanks

@mumoshu
Contributor

mumoshu commented Jan 30, 2017

@cmcconnell1 Thanks as always!

Just a quick thought but

Jan 27 02:31:07 ip-10-1-10-91.terradatum.com bash[1094]: 2017/01/27 02:31:07 No datasources available in time

and

So, for a quick test, I added all the new kube instances to security groups which definitely allow all requisite access, but still couldn't connect from anything outside EC2

seems to indicate that

  • an internet gateway is missing in the VPC or
  • a route to the internet gateway is missing in route table(s)
  • or a security group or a network ACL is blocking all the packets to the internet(I'm not sure this could happen in practice).

Let me also add that the only difference between v0.9.3 and v0.9.3-rc.5 is the self-hosted calico.

v0.9.3-rc.5...v0.9.3

Are you using calico?
Could it produce such a disaster to your cluster? Hmm, no idea. I'll keep thinking about your case but I'd really appreciate more information to consider.

@mumoshu
Contributor

mumoshu commented Jan 30, 2017

@cmcconnell1

internetGatewayId: missing that caused some fresh new errors I hadn't seen before:
CREATE_FAILED AWS::EC2::Route PublicRouteToInternet route table rtb-71db6915 and network gateway igw-a45707c1 belong to different networks
where both the rtb and the igw are kube-aws/CloudFormation stack artifacts/infrastructure being created.

Excuse me if I'm repeating what you've already considered, but the error seems to indicate that, if you're deploying to a VPC created by kube-aws, you've probably used kube-aws update?
If that's the case, would you mind recreating your cluster with kube-aws destroy and then kube-aws up?

If you've tried to deploy to an existing VPC, I guess either the route table or the internet gateway is specified by id to use an existing one in another VPC. If there's any such chance, would you mind checking whether you've chosen the correct internetGatewayId and routeTableId?

@cmcconnell1
Contributor

Hello @mumoshu

The above noted comments were based on consistent observations through two (2) complete deploy/destroy kube-aws cycles into an existing VPC, using internal (non-public) network deployments.
We always run through multiple, complete deploy and destroy cycles for any/each release, to ensure we have predictable/dependable IAC, that behaves according to our expectations, etc.

As far as using kube-aws update
Honestly, I'm sadly not very comfortable with that process (yet), mostly due to my own lack of experience with the update process, coupled with the notes/READMEs, which have often stated that we must destroy and redeploy (with certain version updates, due to significant changes, etc.). This is on our list, however, and will become a high priority as soon as we get our Kubernetes and Deis Workflow deployments and IaC dialed in/settled.

RE: your comment above

an internet gateway is missing in the VPC or
a route to the internet gateway is missing in route table(s)
or a security group or a network ACL is blocking all the packets to the internet (I'm not sure this could happen in practice).

Yes, we are definitely having a new issue with access. On that note, we are using all of the same values (in our cluster.yaml) that we've been using successfully up until this pre-release version 0.9.3, including:

  • vpcId
  • routeTableId
  • vpcCIDR
  • subnets and their associated instanceCIDR ranges
  • serviceCIDR
  • podCIDR

As noted, we have been deploying internally into a private network (and thus using the specified route table, which uses NAT). Given this scenario, and that this is our desired topology,
I am not sure about some new cluster.yaml configuration options (new to the 0.9.3 pre-release version, one of which bit me as noted above when it was not initially specified; my thinking was that I didn't need it because we use private subnets and NAT?). But when it is left blank, kube-aws tried to create a new internet gateway, if I recall correctly from late last week.

From the cluster.yaml auto-generated template:

# ID of existing Internet Gateway to associate subnet with. Leave blank to create a new Internet Gateway
internetGatewayId: igw-abc12345 

Note that if we are in fact using a private subnet (with NAT), I would expect that our existing (previous versions') config options would be sufficient to configure the requisite Kubernetes network access, as long as we continued to abide by the comments/warnings, etc.:

# Uncomment to provision nodes without a public IP. This assumes your VPC route table is setup to route to the internet via a NAT gateway.
# If you did not set vpcId and routeTableId the cluster will not bootstrap.
mapPublicIPs: false

Perhaps it's my misunderstanding, and the correct question should be:
are the new configuration options (and related/supporting code) only applicable to the (perhaps newer) AWS NAT Gateway service and NOT applicable to (older) infrastructures which may utilize an actual NAT EC2 instance? That would be us here ;-) So perhaps here be our dragon(s)?

On this note and to make sure we're on the same page, I just wanted to quickly summarize our AWS VPC ENV. We use an existing Internet Gateway in our VPC.
Since we're using the private network topology and require NAT, etc., the below description is accurate for our desired kube-aws Development deployment ENV/requirements.

  • Our instances that need access to the Internet have their Internet-bound traffic forwarded to a NAT Instance via the specified Route Table (in the cluster.yaml).
  • Our NAT Instance then makes the request to the Internet (since it is on a Public Subnet) and the responses will be forwarded back to the private instances.

With that said, note that there is also an AWS NAT Gateway service now available (for about a year now, IIRC), and this NAT Gateway service can now take the place of an actual EC2 NAT instance (which used to be the only option).
Regarding the use of an actual EC2 NAT instance, note that one must disable the Source/Destination Check option on the NAT instance, otherwise the traffic will be blocked. We do have this option disabled as well.

Also, from the new pre-release kube-aws version's generated cluster.yaml file, we now see config options for specifying an existing natGateway. It appears as though a very important distinction should/must be made here: this new configuration option, and its related code, appear designed to work with the newer AWS NAT Gateway service, and perhaps not with (older) EC2 NAT instances? Because if it did support actual EC2 NAT instances, we'd need to specify the actual EC2 instance-id of the current NAT instance, right? I don't see this noted in the comments, and perhaps I erred by not noting this distinction when I tested building and deploying the pre-release binary?

# Kubernetes subnets with their CIDRs and availability zones. Differentiating availability zone for 2 or more subnets result in high-availability (failures of a single availability zone won't result in immediate downtimes)
# subnets:
#   - availabilityZone: us-west-1a
#     instanceCIDR: "10.0.0.0/24"
#     subnetId: "subnet-xxxxxxxx" #optional
#   - availabilityZone: us-west-1b
#     instanceCIDR: "10.0.1.0/24"
#     # natGateway:
#     #   # Pre-allocated NAT Gateway. Used with private subnets.
#     #   id: "ngw-abcdef12"
#     #   # Pre-allocated EIP for NAT Gateways. Used with private subnets.
#     #   eipAllocationId: "eipalloc-abcdef12"
#     subnetId: "subnet-xxxxxxxx" #optional

So to summarize, perhaps we just need to know how to correctly configure our cluster.yaml files moving forward if we:

  • have an existing VPC and igw
  • deploy into private subnets which use an actual EC2 instance and not the NAT Service for all NAT traffic (ensuring the requisite route table is specified in cluster.yaml)
  • want multi-AZ H/A
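For reference, a cluster.yaml fragment for that topology might look roughly like this (a sketch; all IDs and CIDRs are placeholders, and key support varies by kube-aws version):

```yaml
# Existing VPC, private subnets, NAT via an EC2 instance (sketch; IDs are placeholders).
vpcId: vpc-abc12345
# Route table whose default route points at the NAT instance:
routeTableId: rtb-abc12345
mapPublicIPs: false
subnets:
  - availabilityZone: us-west-1a
    instanceCIDR: "10.1.10.0/24"
  - availabilityZone: us-west-1b
    instanceCIDR: "10.1.11.0/24"
```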

As always, Thanks again for your feedback, etc!

@mumoshu
Contributor

mumoshu commented Feb 24, 2017

Hi @cmcconnell1 @rbellamy, I've submitted a WIP PR #332 to achieve H/A of etcd clusters with support for various etcd peer discovery strategies.

According to the domain-name-servers = 10.1.0.2 setting you've shown me in the very beginning of this thread, you rely on the default DNS - Amazon DNS - provided by AWS, right?
If that's the case, I believe that #332 would solve your issue.
More concretely, discovery strategies available in #332 allow you to freely change hostname and domain name per etcd node without affecting etcd peer discovery, as long as you rely on Amazon DNS.

If you're interested, would you mind reading through the description of #332 and test it?

mumoshu added a commit to mumoshu/kube-aws that referenced this issue Mar 1, 2017
This change is basically for achieving "Managed HA etcd cluster" with private IPs resolved via public EC2 hostnames stabilized with a pool of EBS and EIP pairs for etcd nodes.

After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:

* Automatic recovery from temporary Etcd node failures
  * Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
* Rolling-update of the instance type for etcd nodes without downtime
  * = Scaling-out of Etcd nodes via NOT modifying the ASG directly BUT indirectly via CloudFormation stack updates
* Other use-cases implied by the fact that the nodes are managed by ASGs
* You can choose `"eip"` or `"eni"` for etcd node (= etcd member) identity via the `etcd.memberIdentityProvider` key in cluster.yaml
  * `"eip"`, the default setting, is recommended
  * `"eni"` is also available if you prefer it
  * If you choose `"eni"` and your region has fewer than 3 AZs, setting `etcd.internalDomainName` to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery
  * As an advanced option, DNS other than Amazon DNS can be used (when `memberIdentityProvider` is `"eni"`, `internalDomainName` is set, `manageRecordSets` is `false`, and every EC2 instance has a custom DNS capable of resolving FQDNs under `internalDomainName`)
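For illustration, a minimal cluster.yaml fragment selecting these options might look like the sketch below. Only the key names `etcd.memberIdentityProvider`, `etcd.internalDomainName`, and `manageRecordSets` come from the description above; the surrounding layout and values are assumptions, not copied from a real kube-aws config:

```yaml
# Illustrative fragment only; key nesting and values are assumed.
etcd:
  count: 3
  memberIdentityProvider: eip      # default; "eni" is the alternative
  # The following are only meaningful with "eni"; internalDomainName is
  # strongly recommended in regions with fewer than 3 AZs:
  # internalDomainName: etcd.internal.example.com
  # manageRecordSets: false        # set false when bringing your own DNS
```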

Unsupported use-cases:

* Automatic recovery from the permanent failure of more than `(N-1)/2` Etcd nodes.
  * Requires etcd backups and automatic determination of whether a new etcd cluster should be created, via `ETCD_INITIAL_CLUSTER_STATE`
* Scaling-in of Etcd nodes
  * This just remains untested because it isn't my primary focus in this area. Contributions are welcome
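The `(N-1)/2` figure above is just etcd's quorum arithmetic: a cluster of N members needs a majority to operate, so it tolerates the permanent loss of at most `floor((N-1)/2)` members. A quick sketch of the arithmetic (plain math, not kube-aws code):

```python
# etcd quorum arithmetic: a cluster of n members needs a majority
# (n // 2 + 1) to keep operating, so it tolerates (n - 1) // 2 failures.
def fault_tolerance(n: int) -> int:
    return (n - 1) // 2

for n in (1, 3, 5):
    print(f"{n} members tolerate {fault_tolerance(n)} failure(s)")
```

This is why the 2-pairs-in-one-AZ layouts discussed later are fragile: with N=3, losing the AZ that holds 2 members drops the cluster below quorum.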

Relevant issues to be (partly) resolved via this PR:

* Part(s) of kubernetes-retired#27
* Wait signal for etcd nodes. See kubernetes-retired#49
* Probably kubernetes-retired#189 kubernetes-retired#260 as this relies on stable EC2 public hostnames and AWS DNS for peer communication and discovery regardless of whether an EC2 instance relies on a custom domain/hostname or not

The general idea is to make Etcd nodes "virtual" by retaining the state and the identity of an etcd node in a pair of an EBS volume and an EIP or an ENI, respectively.
This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without additional moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc.

Unlike well-known etcd HA solutions like crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions.

* If you rely on Route 53 record sets, don't modify ones initially created by CloudFormation
   * Doing so breaks CloudFormation stack deletions because it has no way to know about modified record sets and therefore can't cleanly remove them.
* To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
   * ENIs and EBS volumes can't be moved to another AZ
   * An EBS volume can, however, be transferred using a snapshot

* Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under the control of a single ASG
  * ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until the AZ comes back up! That isn't the degree of H/A I wish to have at all!
* Dynamic private IPs via stable hostnames, using a pool of EIP & EBS pairs and a single ASG
  * An EBS volume is required in order to achieve "locking" of the pair associated to an etcd instance
    * First, identify a "free" pair by filtering available EBS volumes, and try to attach one to the EC2 instance
    * Successful attachment of an EBS volume means that the paired EIP can also be associated to the instance without race conditions
  * EBS volumes can't move across AZs either. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2? Once AZ 1 goes down, the options you can take are (1) manually alter AZ 2 to have 3 etcd nodes and then manually elect a new leader, or (2) recreate the etcd cluster within AZ 2 by modifying `etcd.subnets[]` to point to AZ 2 in cluster.yaml and running `kube-aws update`, then ssh into one of the nodes and restore the etcd state from a backup. Neither is automatic.
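The "free pair" selection described above can be sketched as pure logic. The function and data shapes below are hypothetical illustrations, not kube-aws code; the real mechanism works against the EC2 API (filtering available volumes, attaching one, then associating the paired EIP):

```python
# Hypothetical sketch of the EBS/EIP pair "locking" described above.
# Each entry pairs an EBS volume with an EIP allocation; only a volume
# in the "available" state (unattached) may be claimed, and a
# successful attach grants exclusive rights to the paired EIP.
def pick_free_pair(pairs, az):
    """Return the first (volume_id, eip_allocation_id) whose EBS volume
    is unattached and lives in the instance's AZ, else None."""
    for p in pairs:
        if p["state"] == "available" and p["az"] == az:
            return p["volume_id"], p["eip_allocation_id"]
    return None

pairs = [
    {"volume_id": "vol-a", "state": "in-use",
     "az": "us-west-1a", "eip_allocation_id": "eipalloc-a"},
    {"volume_id": "vol-b", "state": "available",
     "az": "us-west-1a", "eip_allocation_id": "eipalloc-b"},
]
print(pick_free_pair(pairs, "us-west-1a"))  # -> ('vol-b', 'eipalloc-b')
```

The EBS attach acts as the lock because EC2 rejects a second attach of the same volume, so two instances racing for the same pair can't both win.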
@mumoshu
Copy link
Contributor

mumoshu commented Mar 10, 2017

#332 has been merged and is included since v0.9.5-rc.1

@cmcconnell1
Copy link
Contributor

Hello,
Apologies for the delay; we were stuck on older versions and have started upgrading and testing. With the most recent version, v0.9.6-rc.2, we no longer require any workarounds and etcd deploys fine without any modifications. I believe this issue can be closed.
Thanks @mumoshu

@mumoshu
Copy link
Contributor

mumoshu commented Apr 19, 2017

I'm really glad to know that it worked for you!
Thanks for your efforts and the confirmation, @cmcconnell1

@mumoshu mumoshu closed this as completed Apr 19, 2017
kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018