This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

"Sometimes containers cannot connect but they still respond to ping" #2433

Open · bboreham opened this issue Jul 7, 2016 · 40 comments

@bboreham (Contributor) commented Jul 7, 2016

Seems similar to another report on Slack from @fermayo; TL;DR is that Weave is apparently seeing the same MAC as coming from two different peers, and this happens at the same time as networking gets broken to the container that really owns that MAC.

I tried to recreate locally and failed.

As reported on weave-users Slack:

jakolehm "we have really strange problem with weave net.. sometimes containers cannot connect but they still respond to ping (weave 1.4.5)
and this happens on coreos stable (latest)
and this happens pretty randomly
let's say we have two peers, A and B ... if we send request from peer A to container in peer B we don't see packet in peer B at all

bryan: do you see it on the weave bridge?
jakolehm: no
only on peer A weave bridge
bryan: ok, this is crossing two hosts, so you have two weave bridges
jakolehm: yes
ping seems to work, and we see the packets

bryan: check the destination MAC address does actually correspond to the container you are trying to hit
jakolehm: it does, checked
but, ping reply mac was different
jakolehm: actually ping reply mac address is something that we cannot find in any of the machines in this cluster
jakolehm: actually it seems that request destination mac is wrong also for tcp connections

12:56:37.698792 4e:fa:a8:29:0a:e2 (oui Unknown) > ba:6e:b6:d8:d8:b9 (oui Unknown), ethertype IPv4 (0x0800), length 74: 10.81.31.139.50716 > weave-test-3.gcp-1.kontena.local.http: Flags [S], seq 3193307477, win 27400, options [mss 1370,sackOK,TS val 286470583 ecr 0,nop,wscale 7], length 0

ba:6e:b6:d8:d8:b9 should be f2:e2:6e:f4:a3:ce

matthias: what host does weave think that MAC is on? logs should tell you.
jakolehm:

INFO: 2016/07/06 10:42:08.675404 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:26:34.352968 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 11:41:12.600574 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:07:34.364139 Expired MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:10:06.398987 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
INFO: 2016/07/06 12:14:08.173212 Discovered remote MAC ba:6e:b6:d8:d8:b9 at 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:18:01.123991 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)
ERRO: 2016/07/06 12:20:26.996998 Captured frame from MAC (ba:6e:b6:d8:d8:b9) associated with another peer 5e:5a:4a:45:b3:a9(gcp-1-4.c.kontena-1162.internal)

jakolehm: bryan: what is a "local mac" ... where does weave get that?

bryan: It's printed when we see a packet with a source mac we have never seen before, on the weave bridge.

Since there ought to be no way for packets to get onto the bridge except from a locally-running container, we think it's from one of those.

jakolehm: but I can't find that mac on "gcp-1-4" machine
bryan: it's possible it went away
jakolehm: but I restarted weave and it's coming back...
bryan: that's interesting
jakolehm: one of the first locally discovered macs
bryan: I guess you could tcpdump the weave bridge and see if the packet itself gives any clues
this is somewhat consistent with the "MAC associated with another peer" message - if we've never seen the src address before we print "local MAC", and if we have seen it on another peer we print "associated ..."
so, since you do get the latter, it must be something of a race which one is taken as the "real" home of the packet
and the real question is how come we are seeing packets with the same src address on two different weave bridges?

jakolehm:

13:59:53.379567 ARP, Request who-has weave-test-3.gcp-1.kontena.local tell 10.81.31.139, length 28
13:59:53.379578 ARP, Reply weave-test-3.gcp-1.kontena.local is-at ba:6e:b6:d8:d8:b9 (oui Unknown), length 28

bryan: and is that container on a different machine?
jakolehm: yes
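For anyone hitting the same symptom, a minimal way to cross-check this kind of MAC mismatch (a sketch only: it assumes the router runs as the usual weave container, that the affected container image ships iproute2, and the name and MAC below are placeholders taken from the transcript):

C=weave-test-3          # placeholder: the affected container's name or ID
MAC=ba:6e:b6:d8:d8:b9   # the suspect MAC from the tcpdump above

# On the host running the target container: what MAC does its ethwe really have?
weave ps "$C"                        # MAC and overlay IP as Weave sees them
docker exec "$C" ip link show ethwe  # MAC inside the container's namespace

# On the sending host: watch the weave bridge for frames using the suspect MAC
tcpdump -e -n -i weave ether host "$MAC"

# Which peer does the router currently think owns that MAC?
docker logs weave 2>&1 | grep -i "$MAC"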

bboreham added the bug label Jul 7, 2016
@bboreham (Contributor, Author) commented Jul 7, 2016

The error message is the same as that noted in #2297.

@jakolehm commented Jul 8, 2016

It seems that we can reproduce this pretty easily with a 5-node cluster by scaling one service ~20 times up/down (between 0-100 containers).

Weave: 1.4.5
OS: CoreOS 1010.6.0
Kernel: 4.5.7
Docker: 1.10.3

@rade (Member) commented Jul 9, 2016

@bboreham was pondering whether we can make a better decision by inspecting the destination mac, i.e. only drop the mac if the source mac appears to be from a different peer and the destination mac is for a local peer.

Which, by my reading of the code, simply means turning the error into a warning and carrying on.

We'd need to think carefully in what situations that might create a loop.

btw, afaict, we'd get this error if a MAC moved or was duplicated. Correct? Surely we ought to handle the former.

rade added this to the 1.6.1 milestone Jul 9, 2016
@rade (Member) commented Jul 9, 2016

> Surely we ought to handle the former (MAC move).

We have code in the handler of forwarded packets for that, but not captured packets. Surely we need both. Without the latter, when a MAC moves from another peer to ourselves, we will be suppressing all outgoing packets from that MAC, won't we?

@rade (Member) commented Jul 10, 2016

> Surely we ought to handle the former (MAC move).

Reproduced and filed #2436.

@crockpotveggies commented:

Hi gents, downstream user of Weave. We're experiencing this issue and happy to help in any way we can.

@awh (Contributor) commented Jul 18, 2016

Hi @crockpotveggies - thanks. I have a specific question - are you taking any steps outside of weave to adjust or set container MACs, e.g. for the purposes of migration? The MACs of weave's ethwe interfaces are generated randomly by the kernel, and given 2^46 possibilities (the top two bits are fixed) the likelihood of clashes is remote, birthday paradox notwithstanding. Under what circumstances is this happening for you? Can you reproduce it reliably?

@brb (Contributor) commented Jul 21, 2016

@jakolehm Could you provide steps for reproducing the issue? Thanks.

@awh (Contributor) commented Jul 21, 2016

@jakolehm @crockpotveggies next time you get the Captured frame from MAC ... associated with another peer ... messages, can you run

ip -d link show vethwe-bridge

on the affected host and see if you have hairpin on in the output? e.g.:

36: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc pfifo_fast master weave state UP mode DEFAULT group default qlen 1000
    link/ether 3e:52:3d:13:ac:3d brd ff:ff:ff:ff:ff:ff promiscuity 1 
    veth 
    bridge_slave state forwarding priority 32 cost 2 hairpin on guard off root_block off fastleave off learning on flood on addrgenmode eui64 
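The hairpin flag can also be read directly from sysfs for every port on the bridge, which is harder to misread than the ip -d output (a small sketch, assuming the bridge is named weave; this is the same sysfs path mentioned later in the thread):

# 1 = hairpin on, 0 = hairpin off, for every port attached to the weave bridge
for port in /sys/class/net/weave/brif/*; do
  printf '%s hairpin_mode=%s\n' "$(basename "$port")" "$(cat "$port/hairpin_mode")"
done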

@crockpotveggies commented Jul 21, 2016

Sorry for the late reply here, fairly busy on my end. Regarding "taking any steps outside of weave", I can't fully answer the question because, as a downstream user, we rely on what the Docker Cloud agent does for us.

The circumstances leading to failure involve redeploys within our infrastructure. Basically, we "reload" our backend services (which involves around 10 containers to reproduce this issue) and suddenly we start seeing timeouts to our RabbitMQ container and our main database.

When we open up RabbitMQ to a public address and bypass weave so the containers connect to a public IP directly, it completely alleviates the problem...which, of course, opens us up to more shenanigans.

@jakolehm commented:

@awh yes it has hairpin:

$ ip -d link show vethwe-bridge
303: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP mode DEFAULT group default qlen 1000
    link/ether 82:3a:1c:32:d5:1b brd ff:ff:ff:ff:ff:ff promiscuity 1 
    veth 
    bridge_slave state forwarding priority 32 cost 2 hairpin off guard off root_block off fastleave off learning on flood on addrgenmode eui64 

@rade (Member) commented Jul 21, 2016

bingo! any idea how it ended up like that?

@jakolehm commented:

no idea, where should I look? :)

@jakolehm commented:

Oh, sorry for confusion.. hairpin is off in that output :/

@rade (Member) commented Jul 21, 2016

bah, that output format is terrible.

@awh (Contributor) commented Jul 22, 2016

Thanks @jakolehm - back to the drawing board then!

We'd like to try to replicate what you mentioned earlier:

> It seems that we can reproduce this pretty easily with a 5-node cluster by scaling one service ~20 times up/down (between 0-100 containers).
>
> Weave: 1.4.5
> OS: CoreOS 1010.6.0
> Kernel: 4.5.7
> Docker: 1.10.3

Could you answer a few more questions?

  • Nature of the hosts (e.g. bare metal or VM) as well as provider if hosted
  • Description of the service you mentioned, and how you were driving traffic to it
  • Orchestrator used for scaling (e.g. swarm, kubernetes, ....)

Essentially as much concrete information as you are able to give about the setup and testing methodology - if there are images and configs you can share so much the better. Thanks!

@jakolehm commented Jul 22, 2016

  • Hosts: Google Compute Engine, 5 x n1-standard-8 (us-east, spread across 3 az's)
  • OS: CoreOS Stable (now 1068.8.0)
  • Services: Rails & Node.js apps behind load balancer(s) (haproxy); haproxy health-checks the services periodically and surfaces these connection problems
  • Orchestrator: Kontena (kontena.io)
    • Kontena attaches containers to the weave network by running weave attach after the container is started (logic is similar to weave-proxy; see the sketch at the end of this comment)
  • Weave: sleeve mode with password (no fastdp)

We have been trying to reproduce this locally without success :(
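For context, the attach step mentioned above is roughly equivalent to running something like the following after the container starts (a sketch only; Kontena's actual logic differs, and the image name is just an example):

# Start the workload with plain Docker networking, then attach it to the overlay.
CID=$(docker run -d --name web nginx)   # example container
weave attach "$CID"                     # allocate an IP from IPAM and add ethwe
# or with an explicit address:
# weave attach 10.81.1.5/16 "$CID"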

@awh (Contributor) commented Jul 22, 2016

Great detail, thanks!

> We have been trying to reproduce this locally without success

That's interesting... what're you using locally?

@jakolehm commented:

Exactly the same stuff, but running inside Vagrant boxes.

@crockpotveggies commented:

Hi gents, I've been doing packet captures. Here's what's coming from ethwe on my main Mongo container; would you like the full file? Is it helpful?

[screenshot: packet capture from ethwe on the Mongo container]

@crockpotveggies commented:

Additionally, here's what eth0 captures on a bare metal server using dockercloud-agent:

[screenshot: packet capture from eth0 on a bare-metal host running dockercloud-agent]

@crockpotveggies commented:

I may have identified a culprit. I checked the logs in the router that handles all of our offline processing nodes and found that it seems to think it's under constant attack. Is it possible that Weave is triggering security systems in networks meant to protect against DoS attacks?

[screenshot: router log showing suspected attack/DoS alerts]

@awh (Contributor) commented Jul 25, 2016

Thanks @crockpotveggies - let us digest that. Out of interest, are you using sleeve as well?

@awh (Contributor) commented Jul 25, 2016

@jakolehm @crockpotveggies is there any possibility we could gain interactive access to one of your affected systems when this occurs again?

@crockpotveggies commented Jul 25, 2016

@awh negative, not using sleeve (unless Docker Cloud has implemented it upstream). FYI, the IP address shown in the logs is the IP of the router that handles our small on-site datacenter, meaning it came from within.

I'm happy to grant access if it happens again, though I have a feeling we won't. Literally hacked a special version of router software just to accommodate this special case.

brb modified the milestones: 1.6.1, 1.6.2 Aug 18, 2016
bboreham modified the milestones: 1.8.2, 1.8.1 Nov 21, 2016
brb modified the milestones: 1.8.2, 1.8.3 Dec 8, 2016
@bboreham (Contributor, Author) commented Dec 9, 2016

Note that #2674, released in Weave Net 1.8.2, added more checks and logging which will be useful if this reoccurs.

bboreham modified the milestones: 1.8.3, 1.9.0 Dec 9, 2016
@SpComb commented Dec 9, 2016

I debugged the original issue with the GCE cluster (running Weave 1.5.2) in situ, and it is actually something closer to #1455: zombie network namespaces get left behind after the Docker container itself is removed, and they still have an ethwe interface attached to the weave bridge on the host and configured with the same overlay network IP address.

ARP from the client containers returns one of two Ethernet addresses. One of those belongs to the "broken" container running the service we're attempting to connect to; the other can be found on a different host, where the same Weave IP address was used by a container that is no longer running and whose overlay IP has since been re-allocated to the newer container. In this case the zombie container was started immediately after the initial host reboot and terminated approximately a week before we debugged this issue.

TCP connections fail with RST -> ECONNREFUSED because the server app process is no longer running in that netns, but ARP and ICMP ping still work because the netns is alive in the kernel.

The symptom of this is a host-side vethweplXXXX where there is no corresponding pid=XXXX anymore, and the peer veth interface cannot be found in any Docker container network sandbox: weave-zombie-hunter.sh

Unfortunately, while I've been able to track down these zombie veths and verify that unlinking them from the weave bridge fixes the issue, I haven't been able to figure out why those netns's are leaking.
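Roughly the kind of check the script performs, as described above (an approximation, not the actual weave-zombie-hunter.sh; PID reuse makes it a heuristic, and it assumes the bridge is named weave):

# vethwepl<PID> ports on the weave bridge whose PID no longer exists are
# candidates for zombie network namespaces still attached to the bridge.
for port in /sys/class/net/weave/brif/vethwepl*; do
  [ -e "$port" ] || continue
  dev=$(basename "$port")
  pid=${dev#vethwepl}
  if [ ! -d "/proc/$pid" ]; then
    echo "possible zombie veth: $dev (no process with pid $pid)"
  fi
done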

@awh (Contributor) commented Dec 9, 2016

@SpComb are you running cadvisor?

@SpComb commented Dec 9, 2016

> @SpComb are you running cadvisor?

Yes, but IIRC I dug through the nsenter -p ... -m cat /proc/mounts output for cadvisor and there weren't any netns mounts there. docker inspect ... for the cadvisor container shows that it does have the host / fs bind-mounted, but with "Propagation": "rprivate", so the netns mounts shouldn't be leaking...?

I can take a better dig at those next time I run across such a zombie to dissect.
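A quick way to look for such leaked netns bind mounts from the host (a sketch; cadvisor is just the suspect here and the container name is an example):

# Inspect a container's mount namespace for lingering network-namespace mounts.
PID=$(docker inspect --format '{{.State.Pid}}' cadvisor)
nsenter -t "$PID" -m cat /proc/mounts | grep netns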

@awh (Contributor) commented Dec 9, 2016

@SpComb kernel version?

@SpComb commented Dec 9, 2016

I'm not sure I noted that down, but it would have been CoreOS stable as of October, so likely Linux 4.7.3

@Starefossen commented:

I am experiencing this problem on my Kubernetes cluster. Here are some details:

Kubernetes Version: v1.4.4
Host Operating System: Oracle Linux 7.2, Linux kernel 4.1.12-61.1.16.el7uek.x86_64

2016-12-09T14:38:00.791531314Z INFO: 2016/12/09 14:38:00.791289 Discovered remote MAC 96:98:93:52:e8:29 at 82:c6:b0:56:fe:a4(node-02)
2016-12-09T14:39:04.945256721Z ERRO: 2016/12/09 14:39:04.945049 Captured frame from MAC (96:98:93:52:e8:29) to (aa:d6:1b:bd:d2:e4) associated with another peer 82:c6:b0:56:fe:a4(node-02)

@brb (Contributor) commented Dec 9, 2016

@Starefossen what does cat /sys/devices/virtual/net/weave/brif/vethwe-bridge/hairpin_mode return on the affected host?

@panuhorsmalahti commented:

I'm probably having the same issue.

docker: Docker version 1.12.5, build 047e51b/1.12.5
kontena version: 1.1.2
Linux: Linux version 3.10.0-514.6.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #1 SMP Sat Dec 10 11:15:38 EST 2016
Storage driver: devicemapper

[root]# ip addr
2817: vethwepl7963@if2816: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 36:d6:20:80:bd:80 brd ff:ff:ff:ff:ff:ff link-netnsid 36
2561: veth138dd2e@if2560: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether ae:85:5f:3b:4a:c6 brd ff:ff:ff:ff:ff:ff link-netnsid 18
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:b9:bb:b3 brd ff:ff:ff:ff:ff:ff
    inet 10.193.38.14/26 brd 10.193.38.63 scope global ens192
       valid_lft forever preferred_lft forever
2563: vethwepl7566@if2562: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether ee:3b:7e:50:bb:4f brd ff:ff:ff:ff:ff:ff link-netnsid 17
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 02:42:85:79:0f:74 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
4: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UNKNOWN qlen 1000
    link/ether d6:34:e1:5e:3b:ae brd ff:ff:ff:ff:ff:ff
2565: vethwepl8134@if2564: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 32:49:00:30:19:55 brd ff:ff:ff:ff:ff:ff link-netnsid 18
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue state UP qlen 1000
    link/ether 4a:b4:26:7b:6c:89 brd ff:ff:ff:ff:ff:ff
    inet 10.81.0.1/16 scope global weave
       valid_lft forever preferred_lft forever
2567: vethbc67172@if2566: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 5e:ad:c4:17:a1:54 brd ff:ff:ff:ff:ff:ff link-netnsid 19
7: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether ee:0c:2e:b5:34:96 brd ff:ff:ff:ff:ff:ff
3337: vethwepl19161@if3336: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 8e:09:f4:81:4a:b1 brd ff:ff:ff:ff:ff:ff link-netnsid 45
9: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master datapath state UP qlen 1000
    link/ether 2a:9e:2c:13:04:ca brd ff:ff:ff:ff:ff:ff
10: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP qlen 1000
    link/ether 76:b7:5a:73:fe:06 brd ff:ff:ff:ff:ff:ff
2571: vethwepl8796@if2570: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 72:68:00:78:c0:00 brd ff:ff:ff:ff:ff:ff link-netnsid 19
3341: vethwepl19781@if3340: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 0a:94:a5:96:7f:34 brd ff:ff:ff:ff:ff:ff link-netnsid 46
2575: vethwepl9143@if2574: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 62:04:15:25:85:ae brd ff:ff:ff:ff:ff:ff link-netnsid 20
3345: vethwepl20347@if3344: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 9a:47:0f:20:e3:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 26
2577: vethwepl9586@if2576: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 56:af:fa:68:86:09 brd ff:ff:ff:ff:ff:ff link-netnsid 21
3603: veth0fedb9d@if3602: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether fe:71:0d:65:6f:45 brd ff:ff:ff:ff:ff:ff link-netnsid 6
3347: vethffb0999@if3346: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 7e:ff:ae:24:e8:2c brd ff:ff:ff:ff:ff:ff link-netnsid 29
2579: vetha23e2bc@if2578: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 52:66:da:fe:79:e1 brd ff:ff:ff:ff:ff:ff link-netnsid 22
3605: vethwepl9350@if3604: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 76:6d:b8:56:94:b8 brd ff:ff:ff:ff:ff:ff link-netnsid 6
3349: vethwepl26401@if3348: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 86:80:62:de:ae:9b brd ff:ff:ff:ff:ff:ff link-netnsid 29
2581: veth7aae92b@if2580: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 1e:26:2e:43:ef:27 brd ff:ff:ff:ff:ff:ff link-netnsid 23
2583: vethwepl10354@if2582: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 1e:b0:24:75:f5:c0 brd ff:ff:ff:ff:ff:ff link-netnsid 22
2585: veth16cd5c7@if2584: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 2a:d1:59:19:19:aa brd ff:ff:ff:ff:ff:ff link-netnsid 24
2587: vethwepl10748@if2586: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 2e:07:33:cb:84:20 brd ff:ff:ff:ff:ff:ff link-netnsid 23
2591: vethwepl11214@if2590: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether fa:9f:4e:9f:5b:c4 brd ff:ff:ff:ff:ff:ff link-netnsid 24
2595: vethwepl11666@if2594: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 3e:38:8a:c3:47:7c brd ff:ff:ff:ff:ff:ff link-netnsid 25
3365: vethwepl8386@if3364: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 2e:1d:4a:6b:5b:6f brd ff:ff:ff:ff:ff:ff link-netnsid 30
3381: vethwepl10930@if3380: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether b2:65:33:4a:88:81 brd ff:ff:ff:ff:ff:ff link-netnsid 43
3385: vethwepl11609@if3384: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 12:cc:58:09:10:bb brd ff:ff:ff:ff:ff:ff link-netnsid 44
3389: vethwepl12234@if3388: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 4a:06:e6:ee:90:ac brd ff:ff:ff:ff:ff:ff link-netnsid 33
3391: veth55f29a7@if3390: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether aa:c0:5e:ce:40:b9 brd ff:ff:ff:ff:ff:ff link-netnsid 0
3393: vethwepl13508@if3392: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 7a:99:df:df:24:57 brd ff:ff:ff:ff:ff:ff link-netnsid 0
3137: vethwepl24638@if3136: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 9a:e4:41:e2:15:60 brd ff:ff:ff:ff:ff:ff link-netnsid 53
3395: vethd0ab2fa@if3394: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 86:16:c7:ea:df:2b brd ff:ff:ff:ff:ff:ff link-netnsid 1
3397: vethwepl13154@if3396: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 36:6e:2d:76:1a:7f brd ff:ff:ff:ff:ff:ff link-netnsid 1
2653: vethwepl6228@if2652: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 32:47:99:04:80:d1 brd ff:ff:ff:ff:ff:ff link-netnsid 31
2411: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master datapath state UNKNOWN qlen 1000
    link/ether fe:aa:25:88:65:72 brd ff:ff:ff:ff:ff:ff
2701: vethwepl17929@if2700: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 6e:35:fa:99:bc:a9 brd ff:ff:ff:ff:ff:ff link-netnsid 28
2705: vethwepl18402@if2704: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 4e:da:f8:65:c6:88 brd ff:ff:ff:ff:ff:ff link-netnsid 37
2709: vethwepl18940@if2708: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 26:33:50:bb:ce:5d brd ff:ff:ff:ff:ff:ff link-netnsid 38
2713: vethwepl19668@if2712: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether d6:e9:30:83:c3:9c brd ff:ff:ff:ff:ff:ff link-netnsid 39
3753: vethwepl23259@if3752: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 96:08:f6:02:73:6d brd ff:ff:ff:ff:ff:ff link-netnsid 9
3757: vethwepl23941@if3756: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether da:5b:8e:10:8d:9e brd ff:ff:ff:ff:ff:ff link-netnsid 10
2479: vethwepl6832@if2478: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 1a:6f:72:3e:6b:41 brd ff:ff:ff:ff:ff:ff link-netnsid 14
3761: vethwepl24474@if3760: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 02:d5:ed:85:db:e7 brd ff:ff:ff:ff:ff:ff link-netnsid 4
3779: veth9ebf941@if3778: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 72:5e:91:e1:fc:81 brd ff:ff:ff:ff:ff:ff link-netnsid 2
3781: vethwepl30712@if3780: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 46:6e:c3:b7:02:50 brd ff:ff:ff:ff:ff:ff link-netnsid 2
3793: vethwepl32388@if3792: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 26:d8:c4:4e:a6:76 brd ff:ff:ff:ff:ff:ff link-netnsid 5
3795: veth31fb42f@if3794: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 2a:4f:d0:ca:f2:f2 brd ff:ff:ff:ff:ff:ff link-netnsid 8
3797: vethwepl590@if3796: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 86:9d:2f:95:3b:ed brd ff:ff:ff:ff:ff:ff link-netnsid 8
3799: veth2be23ec@if3798: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether ea:ba:d7:2e:25:c8 brd ff:ff:ff:ff:ff:ff link-netnsid 12
3801: vethwepl1331@if3800: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether c2:08:f7:8c:74:7e brd ff:ff:ff:ff:ff:ff link-netnsid 12
3803: veth9776d0d@if3802: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether ba:fc:6d:3a:af:9f brd ff:ff:ff:ff:ff:ff link-netnsid 3
3805: vethwepl1894@if3804: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 5e:ea:d9:6c:e1:f9 brd ff:ff:ff:ff:ff:ff link-netnsid 3
2789: vethwepl23742@if2788: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 96:81:59:9e:0e:9f brd ff:ff:ff:ff:ff:ff link-netnsid 32
2533: vethwepl23876@if2532: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 1a:5c:38:b4:9d:00 brd ff:ff:ff:ff:ff:ff link-netnsid 11
2537: vethwepl24545@if2536: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 12:7d:df:ca:79:19 brd ff:ff:ff:ff:ff:ff link-netnsid 13
2795: vethwepl24844@if2794: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 1e:5e:bf:07:1d:db brd ff:ff:ff:ff:ff:ff link-netnsid 27
2797: vethwepl24989@if2796: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether 36:7b:54:87:9b:99 brd ff:ff:ff:ff:ff:ff link-netnsid 41
2541: vethwepl25828@if2540: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether fe:62:ba:24:00:2f brd ff:ff:ff:ff:ff:ff link-netnsid 15
2545: vethwepl28113@if2544: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether da:aa:21:6b:78:27 brd ff:ff:ff:ff:ff:ff link-netnsid 16
3321: vethwepl16816@if3320: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether be:90:4b:41:c5:9b brd ff:ff:ff:ff:ff:ff link-netnsid 42
2813: vethwepl7304@if2812: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc noqueue master weave state UP 
    link/ether f6:03:b1:7d:d2:76 brd ff:ff:ff:ff:ff:ff link-netnsid 35
2815: vethce44706@if2814: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP 
    link/ether 52:7c:1c:f6:bb:52 brd ff:ff:ff:ff:ff:ff link-netnsid 36

@bboreham (Contributor, Author) commented:

@panuhorsmalahti "the same issue" being "seeing the same MAC as coming from two different peers" ?

Can you show some of the evidence that leads you to that thought?

@panuhorsmalahti commented Mar 10, 2017

I can ping and curl the container IP from the host, and I can ping the container from another container, but curl from the other container doesn't work.

weave-zombie-hunter.sh output:
https://gist.github.com/panuhorsmalahti/302d520353fb23196fc8c179925ce501

arping output:

Host:
arping -I weave 10.81.128.20
ARPING 10.81.128.20 from 10.81.0.1 weave
Unicast reply from 10.81.128.20 [82:D0:28:6B:7B:42]  0.682ms
Unicast reply from 10.81.128.20 [06:FC:01:5E:B5:FA]  0.706ms
Unicast reply from 10.81.128.20 [FE:FA:0B:55:78:10]  0.720ms
Unicast reply from 10.81.128.20 [FE:FA:0B:55:78:10]  0.530ms

Container:
arping 10.81.128.20         
ARPING 10.81.128.20
42 bytes from 82:d0:28:6b:7b:42 (10.81.128.20): index=0 time=5.955 msec
42 bytes from 06:fc:01:5e:b5:fa (10.81.128.20): index=1 time=6.010 msec
42 bytes from fe:fa:0b:55:78:10 (10.81.128.20): index=2 time=6.034 msec
42 bytes from 82:d0:28:6b:7b:42 (10.81.128.20): index=3 time=8.709 msec

Not 100% sure this is the same issue, was linked here by Kontena slack.
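When several MACs answer ARP for one overlay IP like this, mapping each responder back to a weave bridge port can show which interface is stale (a sketch; it needs iproute2's bridge tool, and the MACs are the ones from the arping output above):

# Which vethwepl* port did the bridge learn each responding MAC on?
for mac in 82:d0:28:6b:7b:42 06:fc:01:5e:b5:fa fe:fa:0b:55:78:10; do
  echo "== $mac"
  bridge fdb show br weave | grep -i "$mac"
done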

@bboreham (Contributor, Author) commented:

@panuhorsmalahti unfortunately we have two sets of symptoms in this issue; please can you open your own issue to avoid making things worse?

@bboreham (Contributor, Author) commented:

@panuhorsmalahti I read carefully through the output you supplied from weave-zombie-hunter.sh, and the scripts, and I cannot see how they fit.

The way I read the script, if it prints id=8dfc61bc31d8 ip=10.81.128.55/16 then it should not subsequently print missing for the same veth. Yet most of your veths have both.

Also, it seems to be saying that every single one of the veths on your system is "nomaster".

Are you running the same scripts I am reading?

@panuhorsmalahti commented:

Created a new issue: #2842
