kube-flannel: host->remote-pod networking does not work on first node #533

Closed
aaronlevy opened this issue Oct 19, 2016 · 26 comments

@aaronlevy
Contributor

aaronlevy commented Oct 19, 2016

I've been trying to use kube-flannel as part of bootkube, but I'm still seeing an issue on bootstrap.

What I'm seeing is that on some nodes, host-network --> remote-pod traffic (seemingly) hits some kind of race condition when bootstrapping with kube-flannel.

Pod-to-pod traffic (both on the same node and to remote nodes) works fine; this only affects host-to-remote-pod traffic, and it starts working after restarting kube-flannel on the remote node.

Repro steps:

Launch a cluster:

go get github.com/kubernetes-incubator/bootkube
git remote add aaronlevy https://github.com/aaronlevy/bootkube
cd $GOPATH/src/github.com/kubernetes-incubator/bootkube
git fetch aaronlevy
git checkout -b v1.4.0+flannel aaronlevy/v1.4.0+flannel
make clean all
cd hack/multi-node
./bootkube-up

Start nginx pods for testing:

kubectl --kubeconfig=cluster/auth/kubeconfig run n1 --image=nginx --replicas=2

You should have one nginx pod on each node; you can verify by checking the node IP of each pod:

kubectl --kubeconfig=cluster/auth/kubeconfig get pods -owide

Make note of the podIP for each of the pods above, and which node they are assigned to. E.g.:

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE       IP             NODE
default       n1-1110676790-e0w0c                        1/1       Running   0          12m       10.2.1.2       172.17.4.201
default       n1-1110676790-s3qfi                        1/1       Running   0          12m       10.2.0.5       172.17.4.101
  • 10.2.0.5 is assigned to the "controller" (c1) node (172.17.4.101)
  • 10.2.1.2 is assigned to the "worker" (w1) node (172.17.4.201)

Test the routability from each host:

  • note: This sometimes fails in different ways (e.g. both directions, or just one). From what I've seen, the most common case is that master (host) --> worker (pod) fails.

From controller node (route to remote pod likely doesn't work):

vagrant ssh c1
$ curl 10.2.0.5 # same node, should work
$ curl 10.2.1.2 # different node, probably doesn't work

From worker node (all routes should work):

vagrant ssh w1
$ curl 10.2.0.5 # different node, probably works
$ curl 10.2.1.2 # same node, should work

Now to resolve the issue:

Kill the kube-flannel pod on the failing "remote destination" side (in this case, worker)

kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system delete pod kube-flannel-abcde

Wait until the kube-flannel daemonset re-launches a replacement pod, then re-run the tests above; all networking should now work.

@aaronlevy
Contributor Author

aaronlevy commented Nov 2, 2016

To add some more details.

I ran an audit script while broken and again after the fix (restarting the kube-flannel pod on the worker node):

#!/bin/bash
CMD="ip link show && ip addr && ip route && arp -an && bridge fdb show && ip route show table local && sudo iptables-save"
for SERVER in w1 c1; do echo "Server: $SERVER"; vagrant ssh $SERVER -- "${CMD}"; done

Then, diffing the output, I saw that:

In the broken state, ip route showed a flannel.1 route with scope link and src 10.2.1.0:

10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0

In the working state, ip route showed a flannel.1 route with global scope and no src:

10.2.0.0/16 dev flannel.1

In the broken state, ip route list table local had two extra broadcast entries:

broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0

So here is a manual resolution that worked.

On the worker machine:

sudo ip route del 10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route add 10.2.0.0/16 dev flannel.1 scope global
sudo ip route del table local broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route del table local broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0

@tomdee
Contributor

tomdee commented Nov 2, 2016

Thanks @aaronlevy. Does the flannel.1 device persist across restarts of flanneld but not across reboots? Maybe there's a code path that creates routes differently when the device already exists.
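For illustration, roughly the kind of path I mean, as a hypothetical helper: the vishvananda/netlink calls are real, but ensureVxlanLink and its arguments are just a sketch, not flannel's actual device setup.

package sketch

import (
        "fmt"

        "github.com/vishvananda/netlink"
)

// ensureVxlanLink returns an existing flannel.1-style link if one survived a
// flanneld restart, and only creates the vxlan device when it is missing.
// Hypothetical helper, purely to illustrate the "device already exists" path.
func ensureVxlanLink(name string, vni int, parent netlink.Link) (netlink.Link, error) {
        if link, err := netlink.LinkByName(name); err == nil {
                // The device (and any routes attached to it) persisted across the
                // daemon restart, so a different setup path runs than on first boot.
                return link, nil
        }
        vxlan := &netlink.Vxlan{
                LinkAttrs:    netlink.LinkAttrs{Name: name},
                VxlanId:      vni,
                VtepDevIndex: parent.Attrs().Index,
        }
        if err := netlink.LinkAdd(vxlan); err != nil {
                return nil, fmt.Errorf("failed to create %s: %v", name, err)
        }
        return netlink.LinkByName(name)
}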

@philips
Contributor

philips commented Nov 22, 2016

@tomdee Hrm, good idea. I will dig through that.

@philips
Contributor

philips commented Nov 23, 2016

@tomdee @aaronlevy the instructions in the original posting were wrong, I fixed them and I am investigating deeper now.

aaronlevy changed the title from "kube-flannel seems to hit race condition on bootstrap" to "kube-flannel: host->remote-pod networking does not work on first node" on Nov 23, 2016
@philips
Contributor

philips commented Nov 23, 2016

Changing the backend to UDP causes the same behavior! This is useful information!

It means we can likely point our fingers at @mikedanese :-P (love you mike)

@bison

bison commented Nov 24, 2016

I looked at this for a bit this afternoon. When Flannel is picking subnets, it skips the first, i.e. the network address: config.go. The allocator in Kubernetes doesn't: cidr_set.go.

Using that first subnet may still work with the routes set up correctly, but Flannel only adds a route with the UNIVERSE scope if the default link-scoped route wasn't added automatically: device.go.

I have clusters up and running fine with the default link-scoped routing because the network address isn't being used as a subnet somewhere. Anyway, it seems like either Flannel needs to ensure the routing is definitely set up correctly, or Kubernetes needs to not allocate that first subnet.
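To make the difference concrete, a small illustrative sketch (not the actual config.go or cidr_set.go code) of the first /24 each allocator would hand out of 10.2.0.0/16:

package main

import (
        "fmt"
        "net"
)

// nthSubnet returns the nth /24 inside a /16, purely to show the allocation
// order described above; it is not how either allocator is implemented.
func nthSubnet(base net.IP, n int) *net.IPNet {
        ip := base.To4()
        return &net.IPNet{
                IP:   net.IPv4(ip[0], ip[1], ip[2]+byte(n), 0),
                Mask: net.CIDRMask(24, 32),
        }
}

func main() {
        cluster := net.ParseIP("10.2.0.0")

        // Kubernetes' cidr_set starts at index 0, so the first node gets
        // 10.2.0.0/24 -- the network address of the cluster CIDR.
        fmt.Println("kubernetes first subnet:", nthSubnet(cluster, 0))

        // flannel skips that first subnet and starts at 10.2.1.0/24.
        fmt.Println("flannel first subnet:   ", nthSubnet(cluster, 1))
}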

@aaronlevy
Contributor Author

Related: #535

The proposed change in that issue is to force flannel to skip to the next subnet - but I don't think that is a viable option, as we would have all nodes essentially off-by-one from what kubernetes thinks is the subnet assigned to the node.

I'll try and make a change to stomp on the route even if it already exists, and see how well that works.

Another suggestion from @philips is to make sure the cni* interfaces are ignored by networkd: https://github.com/coreos/coreos-overlay/blob/22746580d4ac75ddbbd0a1330c98eb9273d0d699/app-admin/flannel/files/50-flannel.network

@aaronlevy
Contributor Author

I gave this patch a try (remove all routes, then explicitly re-add them), but it does not seem to help:

diff --git a/backend/vxlan/device.go b/backend/vxlan/device.go
index a8c6a92..17c18cf 100644
--- a/backend/vxlan/device.go
+++ b/backend/vxlan/device.go
@@ -135,7 +135,18 @@ func (dev *vxlanDevice) Configure(ipn ip.IP4Net) error {
                Scope:     netlink.SCOPE_UNIVERSE,
                Dst:       ipn.Network().ToIPNet(),
        }
-       if err := netlink.RouteAdd(&route); err != nil && err != syscall.EEXIST {
+
+       existingRoutes, err := netlink.RouteList(dev.link, netlink.FAMILY_ALL)
+       if err != nil {
+               return fmt.Errorf("failed to get route list for %s: %v", dev.link.Attrs().Name, err)
+       }
+
+       for _, er := range existingRoutes {
+               log.Infof("Removing route: %s", er.String())
+               netlink.RouteDel(&er)
+       }
+
+       if err := netlink.RouteAdd(&route); err != nil {
                return fmt.Errorf("failed to add route (%s -> %s): %v", ipn.Network().String(), dev.link.Attrs().Name, err)
        }

The resolution is still, on the destination node, to delete the local route table entry:

broadcast 10.2.0.0 dev flannel.1 proto kernel scope link src 10.2.1.0

@bison

bison commented Nov 29, 2016

RouteList() only lists the main table. I just tested, and this does work if you delete the routes from the local table as well:

// Match only routes via the flannel device, one filter per routing table.
mainFilter := &netlink.Route{
        LinkIndex: dev.link.Attrs().Index,
        Table:     syscall.RT_TABLE_MAIN,
}

localFilter := &netlink.Route{
        LinkIndex: dev.link.Attrs().Index,
        Table:     syscall.RT_TABLE_LOCAL,
}

mainRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, mainFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
        return fmt.Errorf("Failed to list routes: %v", err)
}

localRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, localFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
        return fmt.Errorf("Failed to list routes: %v", err)
}

// Flush every existing route on the device from both tables before
// re-adding the single UNIVERSE-scoped route.
for _, er := range append(mainRoutes, localRoutes...) {
        log.Infof("Removing route: %s", er.String())
        if err := netlink.RouteDel(&er); err != nil {
                return fmt.Errorf("Failed to delete route: %v", err)
        }
}

if err := netlink.RouteAdd(&route); err != nil {
        return fmt.Errorf("Failed to add route: %v", err)
}

@bison

bison commented Nov 29, 2016

It might also be an option to have the netlink library support some way of adding the link without the routes. Haven't looked to see if that's a thing that would be doable / desirable.

@aaronlevy
Contributor Author

Awesome, thanks. I was just looking into copying RouteList but removing the "ignore all but main table" segment -- this looks much cleaner.

My naive understanding is that we shouldn't need to muck directly with the local route table - and I'm still not really understanding what about restarting the kube-flannel process would change this behavior.

@bison

bison commented Nov 29, 2016

That's true. I'm still not sure what the deal is there. If you delete the daemonset, you can see that the device and routes are still there. If you recreate it, when Flannel comes back, it adds the correct routes. No idea why.

@aaronlevy
Contributor Author

With your patch it works after bootstrap, but not after a machine reboot. My one test was rebooting the worker machine. After that:

c1 host -> c1 pod (works)
c1 host -> w1 pod (does not work)
w1 host -> c1 pod (does not work)
w1 host -> w1 pod (works)

@aaronlevy
Contributor Author

One more random data point: I tried just changing the Kubernetes code to skip the first subnet range. After bootstrap everything works, but if I reboot the worker, I see the same behavior as above.

diff --git a/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go b/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
index c4d9aaa..d86ad3a 100644
--- a/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
+++ b/vendor/k8s.io/kubernetes/pkg/controller/node/cidr_set.go
@@ -45,6 +45,7 @@ func newCIDRSet(clusterCIDR *net.IPNet, subNetMaskSize int) *cidrSet {
                clusterMaskSize: clusterMaskSize,
                maxCIDRs:        maxCIDRs,
                subNetMaskSize:  subNetMaskSize,
+               nextCandidate:   1,
        }
 }

@aaronlevy
Contributor Author

Also, FWIW, the above was just meant as a short-term test (we would also need to carry the patch in hyperkube releases if we wanted to go this route).

@aaronlevy
Contributor Author

aaronlevy commented Nov 30, 2016

Dug deeper, and my hunch now is that after a reboot, the MAC address of the vxlan device has changed. You can see that the MAC address was updated in the Kubernetes node annotations, but the change doesn't seem to be reflected in bridge fdb show, and the stale MAC address also gets added to the ARP table on L2 misses.

So this makes me think flannel is caching the old MAC address and never updating it - but if you restart the flannel process on the node with the stale entries, everything starts working.

My hunch is the issue is here: https://github.com/coreos/flannel/blob/master/subnet/watch.go#L142

We are only comparing existing leases against the subnet - but not a changed mac address. We would likely need to extend this to also consider changes to the vxlan mac address (which should be in the lease attrs).

However, I'm now wondering, if this is the issue -- how has this ever worked? The vxlan mac address will always change regardless of using the kube backend...

Anyway -- enough digging for tonight, I'll try and look some more tomorrow.

@aaronlevy
Contributor Author

I tested this using the existing coreos-kubernetes installation, and everything worked as expected (after reboots, etc). So it's not that we've magically missed this always being a problem.

So the MAC address is actually checked, but only in handleInitialSubnetEvents (https://github.com/coreos/flannel/blob/master/backend/vxlan/network.go#L187), and that only runs once on start (which makes sense: the MAC address should not change unless the process is restarted).

What I think is happening is that our cache of node objects is not yet populated, so the call to WatchLeases() returns nothing and we never reconfigure after the initial check.

I'm going to test a change where we block until the informer has successfully synced from the apiserver and see if that helps.
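Roughly along these lines, a hedged sketch using client-go's informer helpers; waitForNodeCache and its wiring are hypothetical, not the actual kube subnet manager code.

package sketch

import (
        "fmt"
        "time"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
)

// waitForNodeCache starts a node informer and blocks until its cache has
// synced with the apiserver, so the first lease snapshot is not built from
// an empty store.
func waitForNodeCache(client kubernetes.Interface, stopCh <-chan struct{}) error {
        factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
        nodeInformer := factory.Core().V1().Nodes().Informer()

        factory.Start(stopCh)
        if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
                return fmt.Errorf("timed out waiting for the node cache to sync")
        }
        return nil
}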

@aaronlevy
Contributor Author

Okay, the above was a red herring.

I think I found the issue, but gonna have to test tomorrow.

In subnet/kube we always return a snapshot as the LeaseWatchResult:
https://github.com/coreos/flannel/blob/master/subnet/kube/kube.go#L220

In subnet/local_manager we return an event:
https://github.com/coreos/flannel/blob/master/subnet/local_manager.go#L288-L291

When returning a snapshot (subnet/kube), we call listwatcher.reset:
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L55
And this determines if this is a "new" lease based on subnet comparison (not mac address):
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L79
In our case, the lease is never considered new because the subnet has not changed, so it continues to use the incorrect MAC address.

Whereas an event only has "added/removed" states, so no such check is done:
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L53
https://github.com/coreos/flannel/blob/master/subnet/watch.go#L115-L120
This causes any event that is received to trigger an add/remove.
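To make that concrete, a simplified sketch with stand-in types (not flannel's real Lease/LeaseAttrs): the subnet-only comparison keeps the stale VTEP MAC after a reboot, while also comparing the backend attrs would flag the lease as changed.

package sketch

// Simplified stand-ins for flannel's lease bookkeeping, purely to illustrate
// the comparison above; the real types live in the subnet package.
type lease struct {
        Subnet  string // e.g. "10.2.1.0/24"
        VtepMAC string // vxlan device MAC carried in the lease's backend attrs
}

// subnetOnlyEqual mirrors the snapshot reset check: a lease with an unchanged
// subnet is treated as "not new", so a stale VTEP MAC is silently kept.
func subnetOnlyEqual(prev, curr lease) bool {
        return prev.Subnet == curr.Subnet
}

// subnetAndAttrsEqual is the stricter check being suggested: also compare the
// MAC from the lease attrs, so a rebooted node shows up as a changed lease.
func subnetAndAttrsEqual(prev, curr lease) bool {
        return prev.Subnet == curr.Subnet && prev.VtepMAC == curr.VtepMAC
}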

@aaronlevy
Contributor Author

WIP fix for this:
https://github.com/coreos/flannel/compare/master...aaronlevy:533?expand=1

There may be two separate issues at play here:

  • One is the issue of using the network address as a subnet assigned to a node (what @bison identified)
  • Another is that after reboot, other nodes have stale lease information.

The fix linked above is for the latter issue -- I was skipping the network address to just test this issue in isolation. I'll test both tomorrow.

@tomdee
Contributor

tomdee commented Dec 4, 2016

I think that blowing away all the local table routes isn't the best idea now. Having identified the difference in the broadcast routes, there is a potential solution in this approach, but the code as it stands is too heavy-handed and removes more than just the broadcast routes.

Ideally we wouldn't be manipulating the local table at all - but I need to dig into the ordering of how we're creating the device, bringing it up, assigning the IP address to it, and then creating the additional route. I'll keep mulling this over and I'll try to work on a fix on Monday.
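For the record, a sketch of the narrower cleanup I have in mind (illustrative only, not the eventual fix): filter the local table to RTN_BROADCAST routes on the flannel device and delete only those, leaving the rest of the local table alone.

package sketch

import (
        "fmt"
        "syscall"

        "github.com/vishvananda/netlink"
)

// deleteBroadcastRoutes removes only the kernel-added broadcast entries for
// the given device from the local routing table, instead of flushing every
// route on the device.
func deleteBroadcastRoutes(link netlink.Link) error {
        filter := &netlink.Route{
                LinkIndex: link.Attrs().Index,
                Table:     syscall.RT_TABLE_LOCAL,
        }
        routes, err := netlink.RouteListFiltered(netlink.FAMILY_V4, filter,
                netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
        if err != nil {
                return fmt.Errorf("failed to list local-table routes: %v", err)
        }
        for _, r := range routes {
                if r.Type != syscall.RTN_BROADCAST {
                        continue // leave local/unicast entries untouched
                }
                if err := netlink.RouteDel(&r); err != nil {
                        return fmt.Errorf("failed to delete broadcast route %s: %v", r.String(), err)
                }
        }
        return nil
}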

@philips
Contributor

philips commented Dec 9, 2016

@tomdee new PR removes just the broadcast route. Think that is OK?

@aoxn
Contributor

aoxn commented Jan 11, 2017

@tomdee @philips Any update? I came across the same issue, and I found that simply giving the flannel.1 device a regular IP address (rather than a network address) works for me, e.g. replacing 172.16.0.0/16 with 172.16.0.1/16.
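For reference, roughly that workaround as a netlink sketch; the helper and the addresses are just my example above, not flannel code.

package sketch

import (
        "github.com/vishvananda/netlink"
)

// useHostAddrOnFlannel swaps the network address (172.16.0.0/16) on flannel.1
// for a regular host address (172.16.0.1/16).
func useHostAddrOnFlannel() error {
        link, err := netlink.LinkByName("flannel.1")
        if err != nil {
                return err
        }
        oldAddr, err := netlink.ParseAddr("172.16.0.0/16")
        if err != nil {
                return err
        }
        // Best effort: the network address may not be configured everywhere.
        _ = netlink.AddrDel(link, oldAddr)

        newAddr, err := netlink.ParseAddr("172.16.0.1/16")
        if err != nil {
                return err
        }
        return netlink.AddrAdd(link, newAddr)
}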

@aaronlevy
Contributor Author

@spacexnice this should have been resolved by: #576

Does that change resolve the issue for you?

@aoxn
Contributor

aoxn commented Jan 11, 2017

@aaronlevy That works for me, thanks very much!

@aaronlevy
Contributor Author

closed by #576
