This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

weave-npc not adding running pod's IP back to existing entries during transition from pending to running #3610

Closed
mmerrill3 opened this issue Mar 8, 2019 · 6 comments

Comments

@mmerrill3

What you expected to happen?

When the pod transitioned back to running, weave should have recognized the IP address and added it back to its collection of running IP addresses (updating routes and network-policy structures).

What happened?

The IP address was not reachable from any pod within the weave network.

How to reproduce it?

I saw a pod transition to pending, which in our case causes an autoscaling event. This event (the switch to pending) caused all of the weave-npc controllers to remove the IP from their metadata structures; I saw these events from every weave-npc in the daemonset:

INFO: 2019/03/08 16:15:21.767275 deleting entry 100.106.0.4 from weave-%.n=[2]4bcU)@h+_HH}[kHl}x of 0a901e63-3533-11e9-8abb-0a2c58af28b4

But when the pod was scheduled and its status switched back to running, I did not see weave add the entries back. The IP address stayed the same when it was running again.

But the pod is now not reachable.
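(One way to confirm the symptom, assuming ipset is available on the affected node: weave-npc programs its ipsets in the host network namespace, so listing the set named in the log above, or grepping all sets for the pod IP, shows whether 100.106.0.4 is still a member anywhere.)

$ sudo ipset list 'weave-%.n=[2]4bcU)@h+_HH}[kHl}x'
$ sudo ipset list | grep 100.106.0.4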

Anything else we need to know?

AWS cloud provider, using weave-net version 2.5.1.

Versions:

k8s version 1.11.7
$ weave version 2.5.1
$ docker version
$ uname -a
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T19:45:23Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

The logged event that caused the removal:

DEBU: 2019/03/08 16:15:21.779114 EVENT UpdatePod {"metadata":{"creationTimestamp":"2019-02-20T17:14:54Z","deletionGracePeriodSeconds":30,"deletionTimestamp":"2019-03-08T16:15:20Z","generateName":"kibana-68db7c85b4-","labels":{"app":"kibana","es-proxy-client":"true","internalcert":"wildcard","pod-template-hash":"2486374160","wildcardcertmanager":"aware"},"name":"kibana-68db7c85b4-9hcbn","namespace":"qa-logging","resourceVersion":"248625274","selfLink":"/api/v1/namespaces/qa-logging/pods/kibana-68db7c85b4-9hcbn","uid":"0a901e63-3533-11e9-8abb-0a2c58af28b4"},"spec":{"containers":[{"image":"docker.elastic.co/kibana/kibana-oss:6.4.2","imagePullPolicy":"IfNotPresent","name":"kibana","ports":[{"containerPort":5601,"name":"ui","protocol":"TCP"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-202-20-134.ec2.internal","priority":0,"restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:18:46Z","status":"True","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":null,"status":"True","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"PodScheduled"}],"hostIP":"10.202.20.134","phase":"Running","podIP":"100.106.0.4","qosClass":"Burstable","startTime":"2019-02-20T17:14:54Z"}} {"metadata":{"creationTimestamp":"2019-02-20T17:14:54Z","deletionGracePeriodSeconds":30,"deletionTimestamp":"2019-03-08T16:15:20Z","generateName":"kibana-68db7c85b4-","labels":{"app":"kibana","es-proxy-client":"true","internalcert":"wildcard","pod-template-hash":"2486374160","wildcardcertmanager":"aware"},"name":"kibana-68db7c85b4-9hcbn","namespace":"qa-logging","resourceVersion":"248625512","selfLink":"/api/v1/namespaces/qa-logging/pods/kibana-68db7c85b4-9hcbn","uid":"0a901e63-3533-11e9-8abb-0a2c58af28b4"},"spec":{"containers":[{"image":"docker.elastic.co/kibana/kibana-oss:6.4.2","imagePullPolicy":"IfNotPresent","name":"kibana","ports":[{"containerPort":5601,"name":"ui","protocol":"TCP"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-202-20-134.ec2.internal","priority":0,"restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2019-03-08T16:15:21Z","message":"containers with unready status: [kibana]","reason":"ContainersNotReady","status":"False","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":null,"message":"containers with unready status: [kibana]","reason":"ContainersNotReady","status":"False","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"PodScheduled"}],"hostIP":"10.202.20.134","phase":"Pending","qosClass":"Burstable","startTime":"2019-02-20T17:14:54Z"}}

The logged event that did not cause the addition of the IP back:

DEBU: 2019/03/08 16:14:50.177228 EVENT UpdatePod {"metadata":{"creationTimestamp":"2019-02-20T17:14:54Z","generateName":"kibana-68db7c85b4-","labels":{"app":"kibana","es-proxy-client":"true","internalcert":"wildcard","pod-template-hash":"2486374160","wildcardcertmanager":"aware"},"name":"kibana-68db7c85b4-9hcbn","namespace":"qa-logging","resourceVersion":"239696997","selfLink":"/api/v1/namespaces/qa-logging/pods/kibana-68db7c85b4-9hcbn","uid":"0a901e63-3533-11e9-8abb-0a2c58af28b4"},"spec":{"containers":[{"image":"docker.elastic.co/kibana/kibana-oss:6.4.2","imagePullPolicy":"IfNotPresent","name":"kibana","ports":[{"containerPort":5601,"name":"ui","protocol":"TCP"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-202-20-134.ec2.internal","priority":0,"restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:18:46Z","status":"True","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":null,"status":"True","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"PodScheduled"}],"hostIP":"10.202.20.134","phase":"Running","podIP":"100.106.0.4","qosClass":"Burstable","startTime":"2019-02-20T17:14:54Z"}} {"metadata":{"creationTimestamp":"2019-02-20T17:14:54Z","deletionGracePeriodSeconds":30,"deletionTimestamp":"2019-03-08T16:15:20Z","generateName":"kibana-68db7c85b4-","labels":{"app":"kibana","es-proxy-client":"true","internalcert":"wildcard","pod-template-hash":"2486374160","wildcardcertmanager":"aware"},"name":"kibana-68db7c85b4-9hcbn","namespace":"qa-logging","resourceVersion":"248625274","selfLink":"/api/v1/namespaces/qa-logging/pods/kibana-68db7c85b4-9hcbn","uid":"0a901e63-3533-11e9-8abb-0a2c58af28b4"},"spec":{"containers":[{"image":"docker.elastic.co/kibana/kibana-oss:6.4.2","imagePullPolicy":"IfNotPresent","name":"kibana","ports":[{"containerPort":5601,"name":"ui","protocol":"TCP"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}],"dnsPolicy":"ClusterFirst","nodeName":"ip-10-202-20-134.ec2.internal","priority":0,"restartPolicy":"Always","schedulerName":"default-scheduler","securityContext":{},"serviceAccount":"default","serviceAccountName":"default","terminationGracePeriodSeconds":30},"status":{"conditions":[{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"Initialized"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:18:46Z","status":"True","type":"Ready"},{"lastProbeTime":null,"lastTransitionTime":null,"status":"True","type":"ContainersReady"},{"lastProbeTime":null,"lastTransitionTime":"2019-02-20T17:14:54Z","status":"True","type":"PodScheduled"}],"hostIP":"10.202.20.134","phase":"Running","podIP":"100.106.0.4","qosClass":"Burstable","startTime":"2019-02-20T17:14:54Z"}}
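(To make the two snapshots in each event easier to compare, the phase and podIP can be pulled out with jq after saving each JSON object to a file, e.g. hypothetical old.json and new.json:)

$ jq '.status | {phase, podIP}' old.json new.json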

@mmerrill3
Author

I had to delete the pod to resolve the issue. That fixed it because the pod received a new IP address from weave, different from the previous one.
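For reference, the workaround was roughly:

$ kubectl delete pod kibana-68db7c85b4-9hcbn -n qa-logging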

@bboreham
Contributor

bboreham commented Mar 8, 2019

Thanks for the detailed report. The way the code works, it compares against the previous version of the pod data, so if you could dig those out of your log, or just post the entire log as an attachment, that would help in understanding what happened.

In particular, I don't think pending should cause any change in iptables rules.
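To illustrate, a minimal sketch of that kind of old-vs-new comparison (not the actual weave-npc code; addToIPSets/removeFromIPSets are hypothetical stand-ins):

package npcsketch

import corev1 "k8s.io/api/core/v1"

// Hypothetical stand-ins for however weave-npc programs its ipsets.
func addToIPSets(pod *corev1.Pod)      {}
func removeFromIPSets(pod *corev1.Pod) {}

// onUpdatePod sketches the comparison described above: the handler acts on the
// difference between the two pod snapshots, so whether an IP is (re)added
// depends on the previous object's status.podIP, not on the phase alone.
func onUpdatePod(oldPod, newPod *corev1.Pod) {
	oldIP, newIP := oldPod.Status.PodIP, newPod.Status.PodIP
	switch {
	case oldIP == "" && newIP != "":
		addToIPSets(newPod) // pod just acquired an address
	case oldIP != "" && newIP == "":
		removeFromIPSets(oldPod) // address disappeared (as in the deletion event above)
	case oldIP != newIP:
		removeFromIPSets(oldPod) // address changed: swap it
		addToIPSets(newPod)
	}
}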

@mmerrill3
Author

ok, digging that up.

@mmerrill3
Author

@bboreham, I will close this since I think I jumped to conclusions. The end result was that a pod was unreachable within the weave-net overlay, but I would like to do more investigation to figure out why. I don't think it's because of the status changes.

@mmerrill3
Author

@bboreham Just to follow up, and to give a reason why I'm closing this.

I had two nodes in our cluster where every pod on the node was unreachable. It wasn't just one pod, as I had first thought.

From the logs, it looks like I had a MAC conflict? I'm running in AWS. It's as if I had two EC2 instances sending frames from the same MAC on the LAN interface?

I have been seeing these errors since I upgraded weave to version 2.5.1 on the EC2 nodes:

ERRO: 2019/03/04 15:03:19.562815 Captured frame from MAC (02:a7:12:86:21:25) to (ea:a5:0a:71:d9:c2) associated with another peer ce:87:81:de:c9:70(ip-10-202-20-71.ec2.internal)
ERRO: 2019/03/04 15:03:19.634335 Captured frame from MAC (ae:c1:c1:b7:c6:e9) to (ea:a5:0a:71:d9:c2) associated with another peer 02:96:6c:08:9a:b1(ip-10-202-21-219.ec2.internal)
ERRO: 2019/03/04 15:03:19.746968 Captured frame from MAC (a2:a0:39:71:f0:3e) to (5a:b1:ac:7f:fa:7b) associated with another peer a2:a0:39:71:f0:3e(ip-10-202-19-226.ec2.internal)
ERRO: 2019/03/04 15:03:19.748920 Captured frame from MAC (c6:31:40:83:64:05) to (5a:b1:ac:7f:fa:7b) associated with another peer c6:31:40:83:64:05(ip-10-202-21-78.ec2.internal)
ERRO: 2019/03/04 15:03:19.966312 Captured frame from MAC (02:96:6c:08:9a:b1) to (5a:b1:ac:7f:fa:7b) associated with another peer 02:96:6c:08:9a:b1(ip-10-202-21-219.ec2.internal)
ERRO: 2019/03/04 15:03:19.994579 Captured frame from MAC (12:23:6c:24:3a:20) to (5a:b1:ac:7f:fa:7b) associated with another peer 12:23:6c:24:3a:20(ip-10-202-20-112.ec2.internal)

and the weave-net log is saturated with these messages right up to the current time:

ERRO: 2019/03/08 04:53:00.829664 Captured frame from MAC (0e:8c:b4:52:64:c3) to (26:a3:81:ae:25:48) associated with another peer 12:23:6c:24:3a:20(ip-10-202-20-112.ec2.internal)
ERRO: 2019/03/08 04:53:00.849558 Captured frame from MAC (aa:5f:9a:5b:2d:15) to (26:a3:81:ae:25:48) associated with another peer 46:a4:d9:69:3a:93(ip-10-202-19-125.ec2.internal)
ERRO: 2019/03/08 04:53:00.955338 Captured frame from MAC (22:25:09:4a:42:1b) to (26:a3:81:ae:25:48) associated with another peer de:58:69:27:a3:30(ip-10-202-21-154.ec2.internal)

I am just destroying the ec2 instances for now.
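(A quick way to check which host interface actually owns one of those source MACs, assuming iproute2 on the node, is something like:)

$ ip -o link | grep -i '02:a7:12:86:21:25'
$ ip -o link show weave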

@bboreham
Contributor

bboreham commented Mar 11, 2019

Some discussion at #2877 of the "Captured frame from MAC" message.
Without extensive investigation we won't know exactly what happened.
