kube-flannel: host->remote-pod networking does not work on first node #533
To add some more detail: I ran an audit script while broken and again after the fix (restarting the kube-flannel pod on the worker node):
Then diff-ing the output I saw the following:
Broken
Working
Broken
So here is a manual resolution that worked. On the worker machine:
Thanks @aaronlevy. Does the flannel0.1 device persist across restarts of flanneld but not across reboots? Maybe there's a code path that creates routes differently when the device already exists.
@tomdee Hrm, good idea. I will dig through that.
@tomdee @aaronlevy The instructions in the original posting were wrong; I fixed them and am now investigating deeper.
Changing the backend to UDP causes the same behavior! This is useful information! It means we can likely point our fingers at @mikedanese :-P (love you mike)
I looked at this for a bit this afternoon. When Flannel is picking subnets, it skips the first, i.e. the network address: config.go. The allocator in Kubernetes doesn't: cidr_set.go. Using that first subnet may still work with the routes set up correctly, but Flannel only adds a route with the … I have clusters up and running fine with the default link-scoped routing because the network address isn't being used as a subnet anywhere. Anyway, it seems like either Flannel needs to ensure the routing is definitely set up correctly, or Kubernetes needs to not allocate that first subnet.
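To make the mismatch concrete, here is a minimal sketch of the two allocation behaviours. It is not the code from flannel's config.go or Kubernetes' cidr_set.go; it just assumes a 10.2.0.0/16 overlay carved into /24 node subnets, with a hypothetical nthSubnet helper standing in for both allocators:

package main

import (
    "fmt"
    "net"
)

// nthSubnet carves the nth /24 out of the /16 that starts at base; purely illustrative.
func nthSubnet(base net.IP, n int) *net.IPNet {
    ip := base.To4()
    return &net.IPNet{
        IP:   net.IPv4(ip[0], ip[1], byte(n), 0),
        Mask: net.CIDRMask(24, 32),
    }
}

func main() {
    base := net.ParseIP("10.2.0.0")

    // Kubernetes-style allocation: the first node can get subnet index 0,
    // i.e. the network address of the overlay (10.2.0.0/24).
    fmt.Println("kubernetes allocates:", nthSubnet(base, 0))

    // Flannel-style allocation: the network address is skipped, so the first
    // lease handed out is index 1 (10.2.1.0/24).
    fmt.Println("flannel allocates:   ", nthSubnet(base, 1))
}

Under that assumption, the first Kubernetes-allocated node lands on exactly the subnet flannel itself would never hand out.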
Related: #535. The proposed change in that issue is to force flannel to skip to the next subnet, but I don't think that is a viable option, as all nodes would essentially be off-by-one from what Kubernetes thinks is the subnet assigned to the node. I'll try to make a change to stomp on the route even if it already exists, and see how well that works. Another suggestion from @philips is to make sure the cni* interfaces are ignored by networkd: https://github.com/coreos/coreos-overlay/blob/22746580d4ac75ddbbd0a1330c98eb9273d0d699/app-admin/flannel/files/50-flannel.network
I gave this patch a try (remove all routes, then explicitly re-add them), but it does not seem to help:
The resolution is still to delete, on the destination node, the local route table entry for:
// Patch attempt: list every route on the flannel device in both the main and
// local tables, delete them all, then explicitly re-add the intended route.
mainFilter := &netlink.Route{
    LinkIndex: dev.link.Attrs().Index,
    Table:     syscall.RT_TABLE_MAIN,
}
localFilter := &netlink.Route{
    LinkIndex: dev.link.Attrs().Index,
    Table:     syscall.RT_TABLE_LOCAL,
}
mainRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, mainFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
    return fmt.Errorf("Failed to list routes: %v", err)
}
localRoutes, err := netlink.RouteListFiltered(netlink.FAMILY_ALL, localFilter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
if err != nil {
    return fmt.Errorf("Failed to list routes: %v", err)
}
// Remove every existing route on the device, in both tables.
for _, er := range append(mainRoutes, localRoutes...) {
    log.Infof("Removing route: %s", er.String())
    if err := netlink.RouteDel(&er); err != nil {
        return fmt.Errorf("Failed to delete route: %v", err)
    }
}
// Re-add the route flannel actually wants for the overlay network.
if err := netlink.RouteAdd(&route); err != nil {
    return fmt.Errorf("Failed to add route: %v", err)
}
It might also be an option to have the netlink library support some way of adding the link without the routes. I haven't looked into whether that would be doable / desirable.
Awesome, thanks. I was just looking into copying … My naive understanding is that we shouldn't need to muck directly with the local route table, and I'm still not really understanding what about restarting the kube-flannel process would change this behavior.
That's true. I'm still not sure what the deal is there. If you delete the daemonset, you can see that the device and routes are still there. If you recreate it, when Flannel comes back, it adds the correct routes. No idea why.
With your patch it works after bootstrap, but does not work after a machine reboot. My one test was rebooting the worker machine. After that: c1 host -> c1 pod (works)
One more random datapoint: I tried just changing the Kubernetes code to skip the first subnet range. After bootstrap everything works, but if I reboot the worker, it has the same behavior as above.
Also, FWIW, the above was just meant as a short-term test (we would also need to carry the patch in hyperkube releases if we wanted to go this route).
Dug deeper, and my hunch now is that after a reboot, the MAC address for the vxlan device has changed. You can see that the MAC address was updated in the Kubernetes node annotations, but it doesn't seem to be reflected in …

So this makes me think flannel is caching the old MAC address and never updating it, but if you restart the flannel process on the node with the stale entries, everything starts working. My hunch is the issue is here: https://github.com/coreos/flannel/blob/master/subnet/watch.go#L142

We are only comparing existing leases against the subnet, but not a changed MAC address. We would likely need to extend this to also consider changes to the vxlan MAC address (which should be in the lease attrs). However, I'm now wondering: if this is the issue, how has this ever worked? The vxlan MAC address will always change regardless of using the kube backend...

Anyway, enough digging for tonight; I'll try to look some more tomorrow.
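As a rough illustration of the stricter comparison being suggested here, below is a minimal sketch. The Lease/LeaseAttrs types and the sameLease helper are simplified stand-ins rather than flannel's actual subnet types; the point is only that comparing on subnet alone treats a lease with a changed VTEP MAC as unchanged:

package main

import "fmt"

// Simplified stand-ins for a lease and its attributes; the comment above notes
// the vxlan MAC should be available in the lease attrs.
type LeaseAttrs struct {
    PublicIP string
    VtepMAC  string
}

type Lease struct {
    Subnet string
    Attrs  LeaseAttrs
}

// sameLease is the stricter check: both the subnet and the VTEP MAC must match.
func sameLease(a, b Lease) bool {
    return a.Subnet == b.Subnet && a.Attrs.VtepMAC == b.Attrs.VtepMAC
}

func main() {
    before := Lease{Subnet: "10.2.1.0/24", Attrs: LeaseAttrs{VtepMAC: "aa:bb:cc:dd:ee:01"}}
    after := Lease{Subnet: "10.2.1.0/24", Attrs: LeaseAttrs{VtepMAC: "aa:bb:cc:dd:ee:02"}}

    // A subnet-only comparison treats these as identical; including the MAC does not.
    fmt.Println("subnet-only match:", before.Subnet == after.Subnet)
    fmt.Println("subnet+MAC match: ", sameLease(before, after))
}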
I tested this using the existing coreos-kubernetes installation, and everything worked as expected (after reboots, etc.), so it's not that we've magically missed this always being a problem. So the MAC address is actually checked, but in the …

What I'm thinking is happening is that our cache of the node objects is not yet populated, so the call to WatchLeases() is returning nothing, and we don't reconfigure after the initial check. I'm going to test a change where we block until the informer has successfully synced from the apiserver and see if that helps.
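For reference, a minimal sketch of the "block until the informer has synced" idea using client-go's cache.WaitForCacheSync. It uses a fake clientset purely so the snippet is self-contained, and it is not the actual kube-flannel code; as the next comment notes, this particular hypothesis turned out to be a red herring anyway:

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes/fake"
    "k8s.io/client-go/tools/cache"
)

func main() {
    // Fake clientset keeps the example self-contained; a real deployment would
    // build the client from in-cluster config instead.
    client := fake.NewSimpleClientset()
    factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    nodeInformer := factory.Core().V1().Nodes().Informer()

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)

    // Block until the initial list of nodes has been loaded into the cache, so a
    // subsequent WatchLeases-style call does not see an empty store.
    if !cache.WaitForCacheSync(stopCh, nodeInformer.HasSynced) {
        fmt.Println("timed out waiting for node cache to sync")
        return
    }
    fmt.Println("node cache synced; safe to reconcile leases and routes")
}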
Okay, the above was a red herring. I think I found the issue, but I'm going to have to test tomorrow.
In subnet/kube we always return a …
In subnet/local_manager we return an …
When returning a snapshot (subnet/kube), we call …
Whereas an …
WIP fix for this: … There may be two separate issues at play here:
The fix linked above is for the latter issue; I was skipping the network address just to test this issue in isolation. I'll test both tomorrow.
I think that blowing away all the local table routes isn't the best idea now. Having identified the difference in the broadcast routes, there is a potential solution in this approach, but the code as it stands is too heavy-handed and removes more than just the broadcast routes. Ideally we wouldn't be manipulating the local table at all, but I need to dig into the ordering of how we're creating the device, bringing it up, assigning the IP address to it, and then creating the additional route. I'll keep mulling this over and I'll try to work on a fix on Monday.
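For context on that ordering, here is a hedged sketch of the sequence being described (create the vxlan device, bring it up, assign the address, then add the overlay route) using the vishvananda/netlink library. The device name, VNI, port, and addresses are illustrative rather than taken from flannel, and it needs root on Linux; the relevant point is that the address-assignment step is where the kernel itself installs the local and broadcast entries in the local route table:

package main

import (
    "fmt"
    "net"

    "github.com/vishvananda/netlink"
)

func main() {
    // 1. Create the vxlan device (name, VNI, and port are illustrative).
    if err := netlink.LinkAdd(&netlink.Vxlan{
        LinkAttrs: netlink.LinkAttrs{Name: "flannel.1"},
        VxlanId:   1,
        Port:      8472,
    }); err != nil {
        fmt.Println("link add:", err)
        return
    }
    link, err := netlink.LinkByName("flannel.1")
    if err != nil {
        fmt.Println("link lookup:", err)
        return
    }

    // 2. Bring the device up.
    if err := netlink.LinkSetUp(link); err != nil {
        fmt.Println("link up:", err)
        return
    }

    // 3. Assign the node's subnet address. This is the step where the kernel
    //    installs local and broadcast entries into the local route table.
    addr, _ := netlink.ParseAddr("10.2.1.0/24")
    if err := netlink.AddrAdd(link, addr); err != nil {
        fmt.Println("addr add:", err)
        return
    }

    // 4. Add the route for the whole overlay network via the vxlan device.
    _, overlay, _ := net.ParseCIDR("10.2.0.0/16")
    if err := netlink.RouteAdd(&netlink.Route{
        LinkIndex: link.Attrs().Index,
        Dst:       overlay,
        Scope:     netlink.SCOPE_LINK,
    }); err != nil {
        fmt.Println("route add:", err)
    }
}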
@tomdee The new PR removes just the broadcast route. Do you think that is OK?
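For illustration, here is a sketch of what removing just the broadcast route could look like with the netlink library; this is an assumption, not the code from the actual PR. The difference from the earlier patch is the RTN_BROADCAST filter, which leaves the kernel's local and host entries in place:

package main

import (
    "fmt"
    "syscall"

    "github.com/vishvananda/netlink"
)

// removeBroadcastRoutes deletes only the broadcast entries in the local route
// table for the given device, leaving the other local-table routes alone.
func removeBroadcastRoutes(link netlink.Link) error {
    filter := &netlink.Route{
        LinkIndex: link.Attrs().Index,
        Table:     syscall.RT_TABLE_LOCAL,
    }
    routes, err := netlink.RouteListFiltered(netlink.FAMILY_V4, filter, netlink.RT_FILTER_OIF|netlink.RT_FILTER_TABLE)
    if err != nil {
        return fmt.Errorf("failed to list local routes: %v", err)
    }
    for _, r := range routes {
        // Skip the local/host routes the kernel also keeps in this table.
        if r.Type != syscall.RTN_BROADCAST {
            continue
        }
        if err := netlink.RouteDel(&r); err != nil {
            return fmt.Errorf("failed to delete broadcast route %s: %v", r, err)
        }
    }
    return nil
}

func main() {
    link, err := netlink.LinkByName("flannel.1")
    if err != nil {
        fmt.Println("link lookup:", err)
        return
    }
    if err := removeBroadcastRoutes(link); err != nil {
        fmt.Println(err)
    }
}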
@spacexnice This should have been resolved by #576. Does that change resolve the issue for you?
@aaronlevy That works for me, thanks very much!!
Closed by #576.
I've been trying to use kube-flannel as part of bootkube, but I'm still seeing an issue on bootstrap.
What I'm seeing is that on some nodes, host-network --> remote-pod traffic (seemingly) hits some kind of race condition when bootstrapping using kube-flannel.
Pod-to-pod traffic (on the same node and to remote nodes) works fine; this only affects host to remote pod, and it begins working after restarting kube-flannel on the remote node.
Repro steps:
Launch a cluster:
Start nginx pods for testing:
You should have one nginx pod on each node, but you can verify by checking the nodeIP of each pod:
Make a note of the podIP for each of the pods above, and which node each is assigned to, e.g.:
Test the routability from each host:
From the controller node (the route to the remote pod likely doesn't work):
From the worker node (all routes should work):
Now to resolve the issue:
Kill the kube-flannel pod on the failing "remote destination" side (in this case, the worker).
Wait until the kube-flannel daemonset re-launches a replacement pod; then you should be able to re-do the tests above and all networking should work.