
Captured frame from MAC associated with another peer #3713

Closed
henro001 opened this issue Oct 10, 2019 · 13 comments

@henro001

henro001 commented Oct 10, 2019

What happened?

My logs are flooded with the following messages:
ERRO: 2019/10/10 17:16:55.967023 Captured frame from MAC (be:ed:6f:3c:e5:6f) to (92:43:ab:22:f9:76) associated with another peer ba:93:aa:cc:34:a8(knode27-hf)

How to reproduce it?

The issue is ongoing; the messages never stop.

Anything else we need to know?

Kubernetes installed by Kubespray on OpenStack.
I am looking for help troubleshooting this. I have read some older issues here saying this can be innocuous in some cases, but here it is tens of millions of these messages in 12 hours.
Edit: the MAC addresses in question are always associated with Elasticsearch/Fluentd pods.
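
For scale, a rough count of the messages in the current log (the pod name is a placeholder, as in the Logs section below):

$ kubectl logs -n kube-system <weave-net-pod> weave | grep -c 'Captured frame from MAC'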

Versions:

$ weave version
2.5.1
$ docker version
18.09.5
$ uname -a
Linux knode30-hf 4.15.0-65-generic #74-Ubuntu SMP Tue Sep 17 17:06:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

Logs:

$ kubectl logs -n kube-system <weave-net-pod> weave

ERRO: 2019/10/10 17:16:55.967023 Captured frame from MAC (be:ed:6f:3c:e5:6f) to (92:43:ab:22:f9:76) associated with another peer ba:93:aa:cc:34:a8(knode27-hf)

@bboreham
Contributor

I hear what you're saying about having millions of lines of logs, but sending a single line gives us no chance of diagnosing this.
How about the first thousand lines?
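
For example, something along these lines would capture them (the pod name is a placeholder):

$ kubectl logs -n kube-system <weave-net-pod> weave | head -n 1000 > weave-first-1000.log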

Possibly related: #2433 #2842 #2877

@henro001
Author

henro001 commented Oct 11, 2019

That makes sense!
I attached a sample log from one of the pods.
weave.log
Note: these logs are from right after a restart of all the weave pods, hence the connectivity errors at the beginning.

@henro001
Author

henro001 commented Oct 11, 2019

This is from syslog on the node:

[Oct11 16:02] weave: port 3(vethwedu) entered blocking state
[  +0.000003] weave: port 3(vethwedu) entered disabled state
[  +0.000139] device vethwedu entered promiscuous mode
[  +0.000463] device vethwedu left promiscuous mode
[  +0.000003] weave: port 3(vethwedu) entered disabled state
[  +0.413014] device vxlan-6784 left promiscuous mode
[  +0.045550] device vxlan-6784 entered promiscuous mode
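
A quick way to check the current promiscuity count for a device on the node (vxlan-6784 is just the device from the messages above; the other weave interfaces can be checked the same way):

$ ip -d link show dev vxlan-6784 | grep promiscuity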

@henro001
Author

One thing I noticed: it starts in fastdp mode, then switches to sleeve mode and remains in sleeve.

-> 10.10.0.68:6783       established encrypted   sleeve 76:c6:75:cf:c2:3a(knode28-hf) encrypted=true mtu=1368
-> 10.10.0.50:6783       established encrypted   sleeve 32:88:01:82:54:be(knode30-hf) encrypted=true mtu=1368
-> 10.10.0.57:6783       established encrypted   sleeve 7a:9f:95:7a:53:e8(knode29-hf) encrypted=true mtu=1368
-> 10.10.0.43:6783       established encrypted   sleeve e2:37:67:c4:52:8c(knode25-hf) encrypted=true mtu=1368
-> 10.10.0.55:6783       established encrypted   sleeve ba:93:aa:cc:34:a8(knode27-hf) encrypted=true mtu=1368
-> 10.10.0.49:6783       established encrypted   sleeve 62:ad:da:ec:38:3d(knode23-hf) encrypted=true mtu=1368
<- 10.10.0.65:43149      established encrypted   sleeve 36:38:71:ad:fc:62(knode26-hf) encrypted=true mtu=1368
-> 10.10.0.51:6783       established encrypted   sleeve 7a:59:81:37:5c:27(kmstr22-hf) encrypted=true mtu=1368
-> 10.10.0.18:6783       established encrypted   sleeve b6:0c:81:e2:81:b4(knode24-hf) encrypted=true mtu=1368
-> 10.10.0.37:6783       established encrypted   sleeve b6:96:1c:9d:4d:8d(kmstr23-hf) encrypted=true mtu=1368
-> 10.10.0.59:6783       failed      cannot connect to ourself, retry: never
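
This is the output of weave status connections; on a Kubernetes install it can be pulled from the weave container roughly like this (the pod name is a placeholder, and the in-container path to the weave script may differ by version):

$ kubectl exec -n kube-system <weave-net-pod> -c weave -- /home/weave/weave --local status connections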

@bboreham
Contributor

Given this log message:

[  +0.045550] device vxlan-6784 entered promiscuous mode

It's quite believable that promiscuous mode would cause packets to appear where they aren't expected. Any idea why those devices are entering promiscuous mode?

We've seen some strange symptoms in the past when systemd-networkd was attempting to control our devices. Could you paste the output of networkctl status?

@henro001
Author

That promiscuous-mode event happens when the weave pods are restarted.

root@knode31-hf:~# networkctl status
●        State: routable
       Address: 10.10.0.59 on ens3
                172.17.0.1 on docker0
                10.233.0.1 on kube-ipvs0
                10.233.0.3 on kube-ipvs0
                10.233.0.47 on kube-ipvs0
                10.233.0.134 on kube-ipvs0
                10.233.5.149 on kube-ipvs0
                10.233.5.254 on kube-ipvs0
                10.233.8.154 on kube-ipvs0
                10.233.12.212 on kube-ipvs0
                10.233.18.43 on kube-ipvs0
                10.233.23.70 on kube-ipvs0
                10.233.23.125 on kube-ipvs0
                10.233.25.80 on kube-ipvs0
                10.233.37.49 on kube-ipvs0
                10.233.37.80 on kube-ipvs0
                10.233.39.124 on kube-ipvs0
                10.233.39.139 on kube-ipvs0
                10.233.41.134 on kube-ipvs0
                10.233.41.194 on kube-ipvs0
                10.233.42.27 on kube-ipvs0
                10.233.45.164 on kube-ipvs0
                10.233.47.157 on kube-ipvs0
                10.233.47.187 on kube-ipvs0
                10.233.48.213 on kube-ipvs0
                10.233.49.220 on kube-ipvs0
                10.233.50.96 on kube-ipvs0
                10.233.59.118 on kube-ipvs0
                10.233.60.125 on kube-ipvs0
                10.233.63.144 on kube-ipvs0
                10.233.63.146 on kube-ipvs0
                10.233.70.0 on weave
                fe80::f816:3eff:fea5:4249 on ens3
                fe80::899:b9ff:feeb:ff96 on datapath
                fe80::ca4:1aff:fe57:526 on weave
                fe80::c4db:acff:fec8:c3cc on vethwe-datapath
                fe80::645c:dcff:feef:d3fc on vethwe-bridge
                fe80::e4b2:32ff:fef8:2f71 on vethweplaceb4fa
                fe80::8e5:7fff:fef3:53f8 on vethwepl8c31c67
                fe80::3ce8:ceff:fe25:7f29 on vethwepla349f33
                fe80::9426:b3ff:fe6c:ccce on vethwepl8638e88
                fe80::6064:e5ff:fe53:2f40 on vxlan-6784
       Gateway: 10.10.0.1 on ens3
           DNS: 10.10.0.137
                10.10.0.33
                10.10.0.132
Search Domains: collab.local

@bboreham
Contributor

Sorry, I think I wanted networkctl list.

@henro001
Author

henro001 commented Oct 11, 2019

root@knode31-hf:~# networkctl list
IDX LINK             TYPE               OPERATIONAL SETUP
  1 lo               loopback           carrier     unmanaged
  2 ens3             ether              routable    configured
  3 docker0          ether              no-carrier  unmanaged
  4 kube-ipvs0       ether              off         unmanaged
  5 datapath         ether              degraded    unmanaged
  7 weave            ether              routable    unmanaged
  9 vethwe-datapath  ether              degraded    unmanaged
 10 vethwe-bridge    ether              degraded    unmanaged
 13 vethweplaceb4fa  ether              degraded    unmanaged
 15 vethwepl8c31c67  ether              degraded    unmanaged
 17 vethwepla349f33  ether              degraded    unmanaged
 19 vethwepl8638e88  ether              degraded    unmanaged
 22 vxlan-6784       ether              degraded    unmanaged

@bboreham
Contributor

OK, we'll let systemd off this time.

And I looked at my own machine, and I have the same "promiscuous" messages, so that's probably a red herring also.

@henro001
Author

henro001 commented Nov 6, 2019

I thought this was just cosmetic in my case until recently, when I observed the same issue on a different cluster. Connection problems were reported with a specific pod, and I could correlate these log events to around the same time. Is it possible that the actual frames are being dropped?

I really have no clue what is happening; it may just be simpler to switch to a different plugin until this has been resolved.

@henro001
Author

henro001 commented Nov 8, 2019

I may have some progress here. I followed the clue about fastdp falling back to sleeve, found #3378, and investigated the "certain MTU conditions" mentioned there.

I am running the Kubernetes nodes on OpenStack instances. The default MTU on Ubuntu was 1500. Just as a test, I lowered the Weave MTU from the default 1376 to 1366, and after restarting the pods, fastdp no longer fell back to sleeve. At that point I had all kinds of issues with my apps, so ultimately I started looking into enabling jumbo frames.

Jumbo frames are enabled on the OpenStack compute nodes, so I set MTU 8000 on the Kubernetes nodes (Ubuntu) and MTU 7000 on Weave.
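
The changes can be applied roughly like this (assuming the stock weave-net DaemonSet with the container named weave, the WEAVE_MTU environment variable documented for the Kubernetes addon, and ens3 as the node NIC; the node-side change also needs to be made persistent, e.g. via netplan):

$ sudo ip link set dev ens3 mtu 8000
$ kubectl set env daemonset/weave-net -n kube-system -c weave WEAVE_MTU=7000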

I restarted Weave and have not seen the "captured frame" messages since the restart. I will keep an eye on this.

My question: what could cause frames associated with another peer to be observed by Weave in sleeve mode but not in fastdp mode?

@henro001
Author

I can confirm all the issues are resolved.

@henro001
Author

Just a note: since this cluster handles ingress traffic from outside, where I don't have control over the MTU, I had to revert from jumbo frames. The node MTU is now 1454 and the Weave MTU is 1364. (Not 1366 as I mentioned previously, since the value needs to be divisible by four for fastdp mode; jeez, even my cat knows that.)
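
A quick sanity check on those numbers (the 90-byte gap is just what this setup ended up with, not a general rule):

$ echo $(( 1454 - 1364 ))   # 90 bytes of headroom for Weave's encapsulation/encryption overhead
$ echo $(( 1364 % 4 ))      # 0, so 1364 is a multiple of four and acceptable for fastdp
$ echo $(( 1366 % 4 ))      # 2, so 1366 is not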
