weave-net pod fails to get peers after Kubernetes v1.12.0 upgrade #3420
Thanks for the detailed report. The "running pod" is running because you manually inserted the list, right? Can you shell into the …
The running pod is scheduled on an uncordoned control plane. It can retrieve peers even without the peer list variable, while the other weave-net pod, scheduled on a worker node, can't retrieve them automatically. Here are my logs: …
This is printing the details of calls done by …; however, it turns out …
The fact that the second container returns only one address seems relevant; I'm wondering if this is the same as #3392 (comment). Can you do …
OK, not the same as #3392. I figured out how to make `kubectl -n kube-system exec -it weave-net-5txmw -c weave -- /home/weave/kube-utils -v=8 -alsologtostderr` … Then it times out and restarts.
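Not from the original thread, but a quick hedged way to separate kube-utils behaviour from plain connectivity (pod name taken from the command above; curl may need to be swapped for wget depending on what the weave image of that version ships):

```
# Probe the api-server service VIP directly from inside the weave
# container; a timeout here means the node, not kube-utils, is broken.
kubectl -n kube-system exec -it weave-net-5txmw -c weave -- \
  curl -sk --max-time 5 https://10.96.0.1:443/version
```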
@tailtwo Can you check …
I believe what we are looking at now is not exactly the same as the original behaviour, because it shouldn't get as far as "Failed to get peers" if the whole pod times out. Nonetheless, your system is not going to do anything if the node can't access the Kubernetes api-server. I wonder if this is the thing where Linux picks the wrong source IP address for DNATted packets?
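One hedged way to test that hypothesis from the affected node (10.96.0.1 is the kubeadm default VIP for the `kubernetes` service; adjust for your cluster):

```
# Ask the kernel which source address it would pick for the service
# VIP; if "src" is not the node's primary address, replies on the
# DNATted connection will not match the conntrack entry.
ip route get 10.96.0.1
```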
Hello, same behaviour here in some of our clusters. It seems that weave wants to connect to the api-server (kubernetes.default:443), but for some reason kube-proxy does not forward it to master_ip:6443.
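A hedged way to check that from the node (chain names are what iptables-mode kube-proxy programs; the KUBE-SVC hash shown is the usual one for default/kubernetes:https, but follow whatever the first command actually prints):

```
# The kubernetes.default service VIP should appear in KUBE-SERVICES...
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1
# ...and its KUBE-SVC-* chain should DNAT to master_ip:6443:
iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n
```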
@gousse it's impossible to diagnose without a lot of detail; please open your own issue and fill in the information requested.
@bboreham understood; I just wanted to point out that we have the exact same situation, and found that the problem is kube-proxy failing to reach the apiserver.
OK, I appreciate you are trying to help, but just to be clear for anyone else reading this: your comment makes assumptions ahead of the information available in this particular case. "The thing where Linux picks the wrong source address" is described more fully at kubernetes/kubeadm#102 (comment).
@bboreham just hit this issue after updating to 1.12, and after seeing your comment (kubernetes/kubeadm#102 (comment)), adding the … I do see that … If you want to reproduce this you can go to https://play-with-k8s.com (PWK co-creator here) and bootstrap the cluster with …
Why do you think this is an issue with Weave Net? My understanding of the problem is that a process on a host cannot establish a TCP connection to an address it has been given. The same thing would happen for any process.
@bboreham I'm aware it's not a Weave Net issue. I'm just adding some information here as it's the only place where I found this issue being referenced. I'll try to run some more tests and probably open an issue in the k8s tracker.
@marcosnils sorry, I misread the context.
After investigating, I found what was causing the problem in v1.12.1. As stated in kubernetes/kubeadm#102 (comment), the following iptables rule was missing in the new version: … Not sure how / when / if that rule is necessary, but after adding it things started working. I'd also love to know where K8s networking uses that MARK to forward the packet to the correct interface; I haven't found any routing table that actually uses that mark.
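For what it's worth, a sketch of where that mark is consumed (shown as the equivalent iptables commands; exact chains and comments vary by kube-proxy version). The mark is never read by any routing table; it is matched again in the nat POSTROUTING path to decide masquerading:

```
# kube-proxy tags packets that will need SNAT with a firewall mark:
iptables -t nat -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
# ...and masquerades anything still carrying that mark on the way
# out, so no routing table ever needs to look at it:
iptables -t nat -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 -j MASQUERADE
```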
I prefer adding a route as it has less overhead, and you might want to keep the original source address for tracing or policy reasons. |
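A minimal sketch of that route-based alternative (addresses are illustrative: 10.96.0.0/12 is the kubeadm default service CIDR, and the `src` should be the node address the api-server can reach):

```
# Pin the source address the kernel selects for traffic towards
# service VIPs, instead of whatever the default route would pick on
# a multi-homed host.
ip route add 10.96.0.0/12 dev eth0 src 192.168.0.10
```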
I think I just figured out what was wrong with my setup. The pod CIDR (10.32.0.0/12) was not passed to some Kubernetes system components because I did not initialize the cluster with the correct kubeadm parameter. This resulted in unreachable Kubernetes API calls on the slave node. The Kubernetes documentation is slightly misleading: it advises setting a pod CIDR for various pod network add-ons but not for Weave. See: https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/#pod-network This issue can be closed.
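For anyone landing here, a hedged sketch of the initialization being described (10.32.0.0/12 is Weave Net's default allocation range):

```
# Pass the pod CIDR at init time so kube-proxy and the
# controller-manager are configured consistently with Weave Net.
kubeadm init --pod-network-cidr=10.32.0.0/12
```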
thanks for reporting back @tailtwo. FYI it's documented under …
Could you kindly explain the command a bit, so that someone referring to this can understand what they're trying to do and adapt it if need be? I keep getting an error for …
Please help; this has turned into recursive troubleshooting for me.
The question about what MARK does is answered at #3420 (comment); you can find descriptions of other iptables options at https://linux.die.net/man/8/iptables
That suggests your Linux doesn't have the "comment" extension for iptables. It is normally installed as standard. |
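A hedged way to confirm whether the extension is present (module name assumed from stock kernels; the comment match lives in `xt_comment`):

```
# If the extension is available, this prints its option help:
iptables -m comment --help
# And the kernel module should be loaded or loadable:
lsmod | grep xt_comment
```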
This worked and solved my issue. |
What you expected to happen?
Weave automatically fills the `KUBE_PEERS` env var and the connection between peers can initiate.

What happened?
After upgrading a working v1.11.3 Kubernetes cluster to v1.12.0, the weave container in one of the two weave pods fails to obtain the peer list and enters the `CrashLoopBackOff` state, only logging `Failed to get peers`. The weave-npc container fails to contact 10.96.0.1 (kube-api) and every list request fails with a timeout. Manually entering the peers in the daemonset allows the connection between weave pods to initiate successfully.
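For reference, a hedged sketch of that manual workaround (peer addresses are made up; `kubectl set env` touches every container in the daemonset unless restricted with `-c`):

```
# Pin the peer list on the weave-net daemonset so weave skips the
# Kubernetes API lookup that is timing out.
kubectl -n kube-system set env daemonset/weave-net \
  KUBE_PEERS="192.168.0.10 192.168.0.11"
```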
How to reproduce it?

Upgrade a K8s cluster from v1.11.3 to v1.12.0 with kubeadm.
Anything else we need to know?
The cluster is running on bare metal with a single node and a control plane. I also upgraded CNI from 0.6.0 to 0.7.1. I have specified the cluster CIDR in the kube-proxy daemonset (`--cluster-cidr=10.32.0.0/12`).

Versions:
Logs:
Network: