This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Calico enabled cluster produces non-functional network? #289

Closed
redbaron opened this issue Jan 30, 2017 · 3 comments

Comments

@redbaron
Contributor

This is a preliminary bug report in case somebody else is pulling their hair out like me; maybe we can join forces.

Long story short: pods on worker nodes can't reach the apiserver.

Symptoms:

$ docker logs <kube-dns-container id>

E0130 18:37:25.125145       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://192.168.128.1:443/api/v1/services?resourceVersion=0: dial tcp 192.168.128.1:443: getsockopt: no route to host
E0130 18:37:25.125256       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://192.168.128.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 192.168.128.1:443: getsockopt: no route to host
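
For anyone cross-checking: 192.168.128.1 here is the kubernetes service VIP that kube-dns uses to reach the apiserver. As a quick sanity check of what that address should be (assuming you have kubectl access from a machine that can still reach the cluster):

$ kubectl get svc kubernetes
# the CLUSTER-IP column should match the 192.168.128.1 address kube-dns is failing to reach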

Flanneld can't find a route to a LOCAL IP address:

$ journalctl -f -u flanneld
Jan 30 19:15:20 ip-10-29-188-39.us-west-2.compute.internal flannel-wrapper[4109]: I0130 19:15:20.709140 04109 network.go:225] L3 miss: 192.168.9.2
Jan 30 19:15:20 ip-10-29-188-39.us-west-2.compute.internal flannel-wrapper[4109]: I0130 19:15:20.709190 04109 network.go:229] Route for 192.168.9.2 not found
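
These are the host-side commands I'd use to chase the "L3 miss" / "Route ... not found" messages, as a sketch (flannel.1 is the default VXLAN interface name; adjust if your backend differs):

$ ip -d link show flannel.1       # confirm the VXLAN device, its VNI and MTU
$ ip route                        # each remote node's pod subnet should be routed via flannel.1
$ bridge fdb show dev flannel.1   # VXLAN forwarding entries that flanneld is supposed to maintain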

Weird routing table inside the pod:

# docker top <ID OF KUBE-DNS /pause container>
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                29465               29450               0                   18:08               ?                   00:00:00            /pause

# nsenter -t 29465 -n ip a  # use the /pause PID from docker top above as the -t arg
...
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default 
    link/ether 4e:51:56:b7:a1:e8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.9.2/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4c51:56ff:feb7:a1e8/64 scope link 
       valid_lft forever preferred_lft forever

# nsenter -t 29465 -n ip r
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0  scope link 
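
To make the failure reproducible from inside the pod's network namespace, the check below is what I'd run (a sketch; it assumes curl is available on the host and reuses the /pause PID and the apiserver VIP from above):

# nsenter -t 29465 -n curl -k --connect-timeout 5 https://192.168.128.1:443/version
# on the affected worker nodes this fails with "no route to host", matching the kube-dns errors above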

Should the default gateway really be 169.254.1.1, given that the pod's own address is 192.168.9.2/32?

If anybody is running a recently created Calico-enabled cluster and it works for you, please be so kind as to run the commands above and paste the results here.

Thank you

@cmcconnell1
Contributor

Hello @redbaron
During my initial testing last Friday, building the binary from a forked kube-aws master, we began experiencing internet connectivity issues on all of the kube-aws-provisioned Kubernetes nodes.

Note that we hadn't seen this problem before, and this is with all of our cluster.yaml configuration/settings kept the same as in all previous releases (up to and including rc.5): using our cluster.yaml from git, merging it with the latest version's default file and the new config options, etc.

As noted in the comments linked below, the gist is that we seem to have hit some connectivity issues (in our case, internet connectivity) with the current pre-release version 0.9.3. It is perhaps unrelated to Calico specifically, but if we simply roll back to rc.5 (and apply our cloud-init hostname fix for our custom DHCP option set), we get a working, deployable cluster again.

Below are a couple of my comments that describe our observed symptoms/issues with the latest 0.9.3 pre-release (in our existing VPC with internal private subnets using NAT):
#189 (comment)
#189 (comment)

@mumoshu
Contributor

mumoshu commented Feb 6, 2017

Hi @heschlie, could you share your insight on this? 🙇

@redbaron
Contributor Author

redbaron commented Feb 6, 2017

What seems to be happening is that both calico/node and calico/kube-controller use the Python etcd client, which can't handle CA chain certs. This doesn't manifest when kube-aws creates a cluster with `manageCertificates: true` (the default), because the certs generated by kube-aws use a single self-signed CA cert, so the CA chain is just one cert.
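
If you want to check whether your own cluster is in the affected configuration, counting the certificates in the CA bundle that Calico's etcd client is pointed at is enough (a sketch; /etc/kubernetes/ssl/ca.pem is just the usual kube-aws location, substitute your own path):

$ grep -c 'BEGIN CERTIFICATE' /etc/kubernetes/ssl/ca.pem
# 1  -> single self-signed CA, the python etcd client copes
# >1 -> CA chain, calico/node and calico/kube-controller fail to talk to etcd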

When using the v1.1.0-rc calico/node images the problem goes away, as the Python bits were finally replaced with a Go implementation, which fixes it for that image; but calico/kube-controller still uses Python, so it is not usable for now.
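
For reference, this is how I'd check which calico/node image a cluster is actually running before and after bumping the tag (a sketch; it assumes Calico is deployed as a DaemonSet named calico-node in kube-system, which is the usual layout):

$ kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'
# bump the tag to a v1.1.0-rc calico/node image to pick up the Go etcd client;
# calico/kube-controller stays python-based, so it remains broken either way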

Overall it is not a kube-aws problem; an upstream issue has been opened: https://github.com/projectcalico/k8s-policy/issues/67

redbaron closed this as completed on Feb 6, 2017