kube-flannel: host->remote-pod networking does not work on first node #533

Closed
aaronlevy opened this issue Oct 19, 2016 · 26 comments

@aaronlevy
Contributor

aaronlevy commented Oct 19, 2016

I've been trying to use kube-flannel as part of bootkube, but I'm still seeing an issue on bootstrap.

What I'm seeing is that on some nodes, host-network --> remote-pod traffic (seemingly) hits some kind of race condition when bootstrapping with kube-flannel.

Pod-to-pod traffic (both on the same node and to remote nodes) works fine; this only affects host-to-remote-pod traffic, and it starts working after restarting kube-flannel on the remote node.

Repro steps:

Launch a cluster:

go get github.com/kubernetes-incubator/bootkube
git remote add aaronlevy https://github.com/aaronlevy/bootkube
cd $GOPATH/src/github.com/kubernetes-incubator/bootkube
git fetch aaronlevy
git checkout -b v1.4.0+flannel aaronlevy/v1.4.0+flannel
make clean all
cd hack/multi-node
./bootkube-up

Start nginx pods for testing:

kubectl --kubeconfig=cluster/auth/kubeconfig run n1 --image=nginx --replicas=2

You should have one nginx pod on each node; you can verify by checking the node IP of each pod:

kubectl --kubeconfig=cluster/auth/kubeconfig get pods -owide

Make note of the podIP for each of the pods above, and which node they are assigned to. E.g.:

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE       IP             NODE
default       n1-1110676790-e0w0c                        1/1       Running   0          12m       10.2.1.2       172.17.4.201
default       n1-1110676790-s3qfi                        1/1       Running   0          12m       10.2.0.5       172.17.4.101
  • 10.2.0.5 is assigned to the "controller" (c1) node (172.17.4.101)
  • 10.2.1.2 is assigned to the "worker" (w1) node (172.17.4.201)

Test the routability from each host:

  • note: This sometimes fails in different ways (e.g. both directions, or just one). From what I've seen, the most common case is that master (host) --> worker (pod) fails.

From controller node (route to remote pod likely doesn't work):

vagrant ssh c1
$ curl 10.2.0.5 # same node, should work
$ curl 10.2.1.2 # different node, probably doesn't work

From worker node (all routes should work):

vagrant ssh w1
$ curl 10.2.0.5 # different node, probably works
$ curl 10.2.1.2 # same node, should work

Now to resolve the issue:

Kill the kube-flannel pod on the failing "remote destination" side (in this case, worker)

kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system delete pod kube-flannel-abcde

Wait until the kube-flannel daemonset re-launches a replacement pod, then re-run the tests above; all networking should now work.

@aaronlevy
Contributor Author

aaronlevy commented Nov 2, 2016

To add some more details.

I ran an audit script while broken and again after the fix (restarting the kube-flannel pod on the worker node):

#!/bin/bash
CMD="ip link show && ip addr && ip route && arp -an && bridge fdb show && ip route show table local && sudo iptables-save"
for SERVER in w1 c1; do echo "Server: $SERVER"; vagrant ssh $SERVER -- "${CMD}"; done

Then, diffing the output, I saw that:

In the broken state, ip route showed a flannel.1 route with scope link and src 10.2.1.0:

10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0

In the working state, ip route showed a flannel.1 route with global scope and no src:

10.2.0.0/16 dev flannel.1

In the broken state, ip route list table local had two extra broadcast entries:

broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0

So here is a manual resolution that worked.

On the worker machine:

sudo ip route del 10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route add 10.2.0.0/16 dev flannel.1 scope global
sudo ip route del table local broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route del table local broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0

@tomdee
Contributor

tomdee commented Nov 2, 2016

Thanks @aaronlevy. Does the flannel.1 device persist across restarts of flanneld but not across reboots? Maybe there's a code path that creates routes differently when the device already exists.
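For illustration, roughly the kind of path I mean, as a hypothetical helper: the vishvananda/netlink calls are real, but ensureVxlanLink and its arguments are just a sketch, not flannel's actual device setup.

package sketch

import (
        "fmt"

        "github.com/vishvananda/netlink"
)

// ensureVxlanLink returns an existing flannel.1-style link if one survived a
// flanneld restart, and only creates the vxlan device when it is missing.
// Hypothetical helper, purely to illustrate the "device already exists" path.
func ensureVxlanLink(name string, vni int, parent netlink.Link) (netlink.Link, error) {
        if link, err := netlink.LinkByName(name); err == nil {
                // The device (and any routes attached to it) persisted across the
                // daemon restart, so a different setup path runs than on first boot.
                return link, nil
        }
        vxlan := &netlink.Vxlan{
                LinkAttrs:    netlink.LinkAttrs{Name: name},
                VxlanId:      vni,
                VtepDevIndex: parent.Attrs().Index,
        }
        if err := netlink.LinkAdd(vxlan); err != nil {
                return nil, fmt.Errorf("failed to create %s: %v", name, err)
        }
        return netlink.LinkByName(name)
}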

@philips
Contributor

philips commented Nov 22, 2016

@tomdee Hrm, good idea. I will dig through that.

@philips
Contributor

philips commented Nov 23, 2016

@tomdee @aaronlevy the instructions in the original posting were wrong, I fixed them and I am investigating deeper now.

aaronlevy changed the title from "kube-flannel seems to hit race condition on bootstrap" to "kube-flannel: host->remote-pod networking does not work on first node" on Nov 23, 2016
@philips
Contributor

philips commented Nov 23, 2016

Changing the backend to UDP causes the same behavior! This is useful information!

It means we can likely point our fingers at @mikedanese :-P (love you mike)

@bison

bison commented Nov 24, 2016

I looked at this for a bit this afternoon. When Flannel is picking subnets, it skips the first, i.e. the network address: config.go. The allocator in Kubernetes doesn't: cidr_set.go.

Using that first subnet may still work with the routes set up correctly, but Flannel only adds a route with the UNIVERSE scope if the default link-scoped route wasn't added automatically: device.go.

I have clusters up and running fine with the default link-scoped routing because the network address isn't being used as a subnet somewhere. Anyway, it seems like either Flannel needs to ensure the routing is definitely set up correctly, or Kubernetes needs to not allocate that first subnet.
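To make the difference concrete, a small illustrative sketch (not the actual config.go or cidr_set.go code) of the first /24 each allocator would hand out of 10.2.0.0/16:

package main

import (
        "fmt"
        "net"
)

// nthSubnet returns the nth /24 inside a /16, purely to show the allocation
// order described above; it is not how either allocator is implemented.
func nthSubnet(base net.IP, n int) *net.IPNet {
        ip := base.To4()
        return &net.IPNet{
                IP:   net.IPv4(ip[0], ip[1], ip[2]+byte(n), 0),
                Mask: net.CIDRMask(24, 32),
        }
}

func main() {
        cluster := net.ParseIP("10.2.0.0")

        // Kubernetes' cidr_set starts at index 0, so the first node gets
        // 10.2.0.0/24 -- the network address of the cluster CIDR.
        fmt.Println("kubernetes first subnet:", nthSubnet(cluster, 0))

        // flannel skips that first subnet and starts at 10.2.1.0/24.
        fmt.Println("flannel first subnet:   ", nthSubnet(cluster, 1))
}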

@aaronlevy
Contributor Author

Related: #535

The proposed change in that issue is to force flannel to skip to the next subnet - but I don't think that is a viable option, as we would have all nodes essentially off-by-one from what kubernetes thinks is the subnet assigned to the node.

I'll try and make a change to stomp on the route even if it already exists, and see how well that works.

Another suggestion from @philips is to make sure the cni* interfaces are ignored by networkd: https://github.com/coreos/coreos-overlay/blob/22746580d4ac75ddbbd0a1330c98eb9273d0d699/app-admin/flannel/files/50-flannel.network

@aaronlevy
Contributor Author

I gave this patch a try (remove all routes, then explicitly re-add them), but it does not seem to help:

diff --git a/backend/vxlan/device.go b/backend/vxlan/device.go
index a8c6a92..17c18cf 100644
--- a/backend/vxlan/device.go
+++ b/backend/vxlan/device.go
@@ -135,7 +135,18 @@ func (dev *vxlanDevice) Configure(ipn ip.IP4Net) error {
                Scope:     netlink.SCOPE_UNIVERSE,
                Dst:       ipn.Network().ToIPNet(),
        }
-       if err := netlink.RouteAdd(&route); err != nil && err != syscall.EEXIST {
+
+       existingRoutes, err := netlink.RouteList(dev.link, netlink.FAMILY_ALL)
+       if err != nil {
+               return fmt.Errorf("failed to get route list for %s: %v", dev.link.Attrs().Name, err)
+       }
+
+       for _, er := range existingRoutes {
+               log.Infof("Removing route: %s", er.String())
+               netlink.RouteDel(&er)
+       }
+
+       if err := netlink.RouteAdd(&route); err != nil {
                return fmt.Errorf("failed to add route (%s -> %s): %v", ipn.Network().String(), dev.link.Attrs().Name, err)
        }

The resolution is still, on the destination node, to delete the local route table entry:

broadcast 10.2.0.0 dev flannel.1 proto kernel scope link src 10.2.1.0

@bison

bison commented Nov 29, 2016

RouteList() only lists the main table. I just tested, and this does work if you delete the routes from the local table as well:

// Match only routes via the flannel device, one filter per routing table.
mainFilter := &netlink.Route{
        LinkIndex: dev.link.Attrs().Index,
        Table:     syscall.RT_TABLE_MAIN,
}

localFilter := &netlink.Route{
        LinkIndex: dev.link.Attrs().Index,
        Table:     syscall.RT_TABLE_LOCAL,
}

mainRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, mainFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
        return fmt.Errorf("Failed to list routes: %v", err)
}

localRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, localFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
        return fmt.Errorf("Failed to list routes: %v", err)
}

// Flush every existing route on the device from both tables before
// re-adding the single UNIVERSE-scoped route.
for _, er := range append(mainRoutes, localRoutes...) {
        log.Infof("Removing route: %s", er.String())
        if err := netlink.RouteDel(&er); err != nil {
                return fmt.Errorf("Failed to delete route: %v", err)
        }
}

if err := netlink.RouteAdd(&route); err != nil {
        return fmt.Errorf("Failed to add route: %v", err)
}

@bison

bison commented Nov 29, 2016

It might also be an option to have the netlink library support some way of adding the link without the routes. Haven't looked to see if that's a thing that would be doable / desirable.

@aaronlevy
Contributor Author

Awesome, thanks. I was just looking into copying RouteList but removing the "ignore all but main table" segment -- this looks much cleaner.

My naive understanding is that we shouldn't need to muck directly with the local route table - and I'm still not really understanding what about restarting the kube-flannel process would change this behavior.

@bison

bison commented Nov 29, 2016

That's true. I'm still not sure what the deal is there. If you delete the daemonset, you can see that the device and routes are still there. If you recreate it, when Flannel comes back, it adds the correct routes. No idea why.

@aaronlevy
Contributor Author

With your patch it works after bootstrap, but not after a machine reboot. My one test was rebooting the worker machine. After that:

c1 host -> c1 pod (works)
c1 host -> w1 pod (does not work)
w1 host -> c1 pod (does not work)
w1 host -> w1 pod (works)

@aaronlevy
Contributor Author

One more random data point: I tried just changing the Kubernetes code to skip the first subnet range. After bootstrap everything works, but if I reboot the worker, I see the same behavior as above.

diff --git a/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go b/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
index c4d9aaa..d86ad3a 100644
--- a/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
+++ b/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
@@ -45,6 +45,7 @@ func newCIDRSet(clusterCIDR *net.IPNet, subNetMaskSize int) *cidrSet {
                clusterMaskSize: clusterMaskSize,
                maxCIDRs:        maxCIDRs,
                subNetMaskSize:  subNetMaskSize,
+               nextCandidate:   1,
        }
 }

@aaronlevy
Contributor Author

Also, FWIW, the above was just meant as a short-term test (we would also need to carry the patch in hyperkube releases if we wanted to go this route).

@aaronlevy
Contributor Author

aaronlevy commented Nov 30, 2016

Dug deeper, and my hunch now is that after a reboot, the MAC address of the vxlan device has changed. You can see that the MAC address was updated in the Kubernetes node annotations, but the change doesn't seem to be reflected in bridge fdb show, and the stale MAC address also gets added to the ARP table on L2 misses.

So this makes me think flannel is caching the old MAC address and never updating it - but if you restart the flannel process on the node with the stale entries, everything starts working.

My hunch is the issue is here: https://github.com/coreos/flannel/blob/master/subnet/watch.go#L142

We are only comparing existing leases against the subnet - but not a changed mac address. We would likely need to extend this to also consider changes to the vxlan mac address (which should be in the lease attrs).

However, I'm now wondering, if this is the issue -- how has this ever worked? The vxlan mac address will always change regardless of using the kube backend...

Anyway -- enough digging for tonight, I'll try and look some more tomorrow.

@aaronlevy
Contributor Author

I tested this using the existing coreos-kubernetes installation, and everything worked as expected (after reboots, etc). So it's not that we've magically missed this always being a problem.

So the MAC address is actually checked, but only in handleInitialSubnetEvents (https://github.com/coreos/flannel/blob/master/backend/vxlan/network.go#L187), and that only runs once on start (which makes sense: the MAC address should not change unless the process is restarted).

What I think is happening is that our cache of node objects is not yet populated, so the call to WatchLeases() returns nothing and we never reconfigure after the initial check.

I'm going to test a change where we block until the informer has successfully synced from the apiserver and see if that helps.
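Roughly along these lines, a hedged sketch using client-go's informer helpers; waitForNodeCache and its wiring are hypothetical, not the actual kube subnet manager code.

package sketch

import (
        "fmt"
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
)

// waitForNodeCache starts a node informer and blocks until its cache has
// synced with the apiserver, so the first lease snapshot is not built from
// an empty store.
func waitForNodeCache(client kubernetes.Interface, stopCh <-chan struct{}) error {
        factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
        nodeInformer := factory.Core().V1().Nodes().Informer()

        factory.Start(stopCh)
        if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
                return fmt.Errorf("timed out waiting for the node cache to sync")
        }
        return nil
}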

@aaronlevy
Contributor Author

Okay, the above was a red herring.

I think I found the issue, but gonna have to test tomorrow.

In subnet/kube we always return a snapshot as the LeaseWatchResult:
https://github.com/coreos/flannel/blob/master/subnet/kube/kube.go#L220

In subnet/local_manager we return an event:
https://github.com/coreos/flannel/blob/master/subnet/local_manager.go#L288-L291

When returning a snapshot (subnet/kube), we call listwatcher.reset:
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L55
And this determines if this is a "new" lease based on subnet comparison (not mac address):
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L79
In our case, the lease is never considered new because the subnet has not changed, so it continues to use the incorrect MAC address.

Whereas an event only has "added/removed" states, so no such check is done:
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L53
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L115-L120
This causes any event that is received to trigger an add/remove.
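To make that concrete, a simplified sketch with stand-in types (not flannel's real Lease/LeaseAttrs): the subnet-only comparison keeps the stale VTEP MAC after a reboot, while also comparing the backend attrs would flag the lease as changed.

package sketch

// Simplified stand-ins for flannel's lease bookkeeping, purely to illustrate
// the comparison above; the real types live in the subnet package.
type lease struct {
        Subnet  string // e.g. "10.2.1.0/24"
        VtepMAC string // vxlan device MAC carried in the lease's backend attrs
}

// subnetOnlyEqual mirrors the snapshot reset check: a lease with an unchanged
// subnet is treated as "not new", so a stale VTEP MAC is silently kept.
func subnetOnlyEqual(prev, curr lease) bool {
        return prev.Subnet == curr.Subnet
}

// subnetAndAttrsEqual is the stricter check being suggested: also compare the
// MAC from the lease attrs, so a rebooted node shows up as a changed lease.
func subnetAndAttrsEqual(prev, curr lease) bool {
        return prev.Subnet == curr.Subnet && prev.VtepMAC == curr.VtepMAC
}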

@aaronlevy
Contributor Author

WIP fix for this:
https://github.com/coreos/flannel/compare/master...aaronlevy:533?expand=1

There may be two separate issues at play here:

  • One is the issue of using the network address as a subnet assigned to a node (what @bison identified)
  • Another is that after reboot, other nodes have stale lease information.

The fix linked above is for the latter issue -- I was skipping the network address to just test this issue in isolation. I'll test both tomorrow.

@tomdee
Contributor

tomdee commented Dec 4, 2016

I think that blowing away all the local table routes isn't the best idea now. Having identified the difference in the broadcast routes, there is a potential solution in this approach, but the code as it stands is too heavy-handed and removes more than just the broadcast routes.

Ideally we wouldn't be manipulating the local table at all - but I need to dig into the ordering of how we're creating the device, bringing it up, assigning the IP address to it, and then creating the additional route. I'll keep mulling this over and I'll try to work on a fix on Monday.
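For the record, a sketch of the narrower cleanup I have in mind (illustrative only, not the eventual fix): filter the local table to RTN_BROADCAST routes on the flannel device and delete only those, leaving the rest of the local table alone.

package sketch

import (
        "fmt"
        "syscall"

        "github.com/vishvananda/netlink"
)

// deleteBroadcastRoutes removes only the kernel-added broadcast entries for
// the given device from the local routing table, instead of flushing every
// route on the device.
func deleteBroadcastRoutes(link netlink.Link) error {
        filter := &netlink.Route{
                LinkIndex: link.Attrs().Index,
                Table:     syscall.RT_TABLE_LOCAL,
        }
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_V4, filter,
                netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
        if err != nil {
                return fmt.Errorf("failed to list local-table routes: %v", err)
        }
        for _, r := range routes {
                if r.Type != syscall.RTN_BROADCAST {
                        continue // leave local/unicast entries untouched
                }
                if err := netlink.RouteDel(&r); err != nil {
                        return fmt.Errorf("failed to delete broadcast route %s: %v", r.String(), err)
                }
        }
        return nil
}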

@philips
Contributor

philips commented Dec 9, 2016

@tomdee new PR removes just the broadcast route. Think that is OK?

@aoxn
Contributor

aoxn commented Jan 11, 2017

@tomdee @philips Any update? I came across the same issue, and I found that simply giving the flannel.1 device a regular IP address (rather than a network address) works for me, e.g. replacing 172.16.0.0/16 with 172.16.0.1/16.
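For reference, roughly that workaround as a netlink sketch; the helper and the addresses are just my example above, not flannel code.

package sketch

import (
        "github.com/vishvananda/netlink"
)

// useHostAddrOnFlannel swaps the network address (172.16.0.0/16) on flannel.1
// for a regular host address (172.16.0.1/16).
func useHostAddrOnFlannel() error {
        link, err := netlink.LinkByName("flannel.1")
        if err != nil {
                return err
        }
        oldAddr, err := netlink.ParseAddr("172.16.0.0/16")
        if err != nil {
                return err
        }
        // Best effort: the network address may not be configured everywhere.
        _ = netlink.AddrDel(link, oldAddr)

        newAddr, err := netlink.ParseAddr("172.16.0.1/16")
        if err != nil {
                return err
        }
        return netlink.AddrAdd(link, newAddr)
}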

@aaronlevy
Contributor Author

@spacexnice this should have been resolved by: #576

Does that change resolve the issue for you?

@aoxn
Contributor

aoxn commented Jan 11, 2017

@aaronlevy That works for me, thanks very much!

@aaronlevy
Contributor Author

closed by #576
