This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Calico enabled cluster produces non-functional network? #289

Closed
redbaron opened this issue Jan 30, 2017 · 3 comments

Comments

@redbaron
Contributor

This is a preliminary bug report in case somebody else is pulling their hair out like me; maybe we can join forces.

Long story short: pods on worker nodes can't reach the apiserver.

Symptoms:

$ docker logs <kube-dns-container id>

E0130 18:37:25.125145       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://192.168.128.1:443/api/v1/services?resourceVersion=0: dial tcp 192.168.128.1:443: getsockopt: no route to host
E0130 18:37:25.125256       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://192.168.128.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 192.168.128.1:443: getsockopt: no route to host
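
For anyone cross-checking: 192.168.128.1 here is the kubernetes service VIP that kube-dns uses to reach the apiserver. As a quick sanity check of what that address should be (assuming you have kubectl access from a machine that can still reach the cluster):

$ kubectl get svc kubernetes
# the CLUSTER-IP column should match the 192.168.128.1 address kube-dns is failing to reach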

Flanneld can't find a route to a LOCAL IP address:

$ journalctl -f -u flanneld
Jan 30 19:15:20 ip-10-29-188-39.us-west-2.compute.internal flannel-wrapper[4109]: I0130 19:15:20.709140 04109 network.go:225] L3 miss: 192.168.9.2
Jan 30 19:15:20 ip-10-29-188-39.us-west-2.compute.internal flannel-wrapper[4109]: I0130 19:15:20.709190 04109 network.go:229] Route for 192.168.9.2 not found
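
These are the host-side commands I'd use to chase the "L3 miss" / "Route ... not found" messages, as a sketch (flannel.1 is the default VXLAN interface name; adjust if your backend differs):

$ ip -d link show flannel.1       # confirm the VXLAN device, its VNI and MTU
$ ip route                        # each remote node's pod subnet should be routed via flannel.1
$ bridge fdb show dev flannel.1   # VXLAN forwarding entries that flanneld is supposed to maintain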

Weird routing table inside the pod:

# docker top <ID OF KUBE-DNS /pause container>
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                29465               29450               0                   18:08               ?                   00:00:00            /pause

# nsenter -t 29465 -n ip a  # use the /pause PID from docker top above as the -t arg
...
3: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default 
    link/ether 4e:51:56:b7:a1:e8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.9.2/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::4c51:56ff:feb7:a1e8/64 scope link 
       valid_lft forever preferred_lft forever

# nsenter -t 29465 -n ip r
default via 169.254.1.1 dev eth0 
169.254.1.1 dev eth0  scope link 
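
To make the failure reproducible from inside the pod's network namespace, the check below is what I'd run (a sketch; it assumes curl is available on the host and reuses the /pause PID and the apiserver VIP from above):

# nsenter -t 29465 -n curl -k --connect-timeout 5 https://192.168.128.1:443/version
# on the affected worker nodes this fails with "no route to host", matching the kube-dns errors above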

Should the default gateway really be 169.254.1.1, given that the pod's own address is 192.168.9.2/32?

If anybody is running a recently created Calico-enabled cluster and it works for you, please be so kind as to run the commands above and paste the results here.

Thank you

@cmcconnell1
Contributor

Hello @redbaron
During my initial testing last Friday, building the binary from a forked kube-aws master, we began experiencing internet connectivity issues on all of the kube-aws-provisioned Kubernetes nodes.

Note that we hadn't seen this problem before, and this is with all of our cluster.yaml configuration/settings kept the same as in all previous releases (up to and including rc.5): using our cluster.yaml from git, merging it with the latest version's default file and the new config options, etc.

As noted in the comments linked below, the gist is that we seem to have hit some connectivity issues (in our case, internet connectivity) with the current pre-release version 0.9.3. It is perhaps unrelated to Calico specifically, but if we simply roll back to rc.5 (and apply our cloud-init hostname fix for our custom DHCP option set), we get a working, deployable cluster again.

Below are a couple of my comments that describe our observed symptoms/issues with the latest 0.9.3 pre-release (in our existing VPC with internal private subnets using NAT):
#189 (comment)
#189 (comment)

@mumoshu
Contributor

mumoshu commented Feb 6, 2017

Hi @heschlie, could you share your insight on this? 🙇

@redbaron
Contributor Author

redbaron commented Feb 6, 2017

What seems to be happening is that both calico/node and calico/kube-controller use the Python etcd client, which can't handle CA chain certs. This doesn't manifest when kube-aws creates a cluster with `manageCertificates: true` (the default), because the certs generated by kube-aws use a single self-signed CA cert, so the CA chain is just one cert.
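
If you want to check whether your own cluster is in the affected configuration, counting the certificates in the CA bundle that Calico's etcd client is pointed at is enough (a sketch; /etc/kubernetes/ssl/ca.pem is just the usual kube-aws location, substitute your own path):

$ grep -c 'BEGIN CERTIFICATE' /etc/kubernetes/ssl/ca.pem
# 1  -> single self-signed CA, the python etcd client copes
# >1 -> CA chain, calico/node and calico/kube-controller fail to talk to etcd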

When using the v1.1.0-rc calico/node images the problem goes away, as the Python bits were finally replaced with a Go implementation, which fixes it for that image; but calico/kube-controller still uses Python, so it is not usable for now.
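
For reference, this is how I'd check which calico/node image a cluster is actually running before and after bumping the tag (a sketch; it assumes Calico is deployed as a DaemonSet named calico-node in kube-system, which is the usual layout):

$ kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.containers[0].image}'
# bump the tag to a v1.1.0-rc calico/node image to pick up the Go etcd client;
# calico/kube-controller stays python-based, so it remains broken either way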

Overall it is not a kube-aws problem; an upstream issue has been opened: https://github.com/projectcalico/k8s-policy/issues/67

redbaron closed this as completed on Feb 6, 2017