What you expected to happen?
Weave should always reconnect after a network failure.
What happened?
Weave occasionally fails to properly reconnect after a network failure. The failed node can receive ESP traffic but does not transmit any.
The output of ip xfrm policy list is empty on the failed node.
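For anyone trying to confirm the same state, this is a minimal check (assuming weave's fastdp IPsec entries are the only xfrm policies/states on the host):

# Run on both nodes and compare. On the failed node the policy list
# comes back empty while ESP states can still be present, which
# matches "receives ESP traffic but transmits none".
sudo ip xfrm policy list
sudo ip xfrm state list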
How to reproduce it?
Disconnect the node's physical/virtual network until heartbeat failure occurs, then reconnect it. The problem only happens very occasionally.
I used the following script to reproduce it on a Proxmox server:
#!/bin/bash
while true; do
    # disconnect network
    qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=1
    sleep 55
    # reconnect network
    qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=0
    sleep 10
    # check if weave still works
    if ssh [email protected] ping 10.42.128.0 -c1; then
        date
        echo pass
    else
        date
        echo broken
        break
    fi
done
This usually triggers it within 30 minutes.
Anything else we need to know?
This was reproduced on a 2-node Proxmox cluster running RancherOS 1.5.2. We first discovered it on a 3-node VMware cluster, where it was triggered by excessive etcd load that eventually caused network timeouts.
Versions:
/home/weave # ./weave --local version
weave 2.5.2
$ docker version
Client:
 Version:           18.06.3-ce
 API version:       1.38
 Go version:        go1.10.4
 Git commit:        d7080c1
 Built:             Wed Feb 20 02:24:22 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.3-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       d7080c1
  Built:            Wed Feb 20 02:25:33 2019
  OS/Arch:          linux/amd64
  Experimental:     false
$ uname -a
Linux hostyname0 4.14.122-rancher #1 SMP Tue May 28 01:50:21 UTC 2019 x86_64 GNU/Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:02:58Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
$ sudo ip xfrm policy list
src 192.168.128.10/32 dst 192.168.128.11/32 proto udp
    dir out priority 0 ptype main
    mark 0x20000/0x20000
    tmpl src 192.168.128.10 dst 192.168.128.11
        proto esp spi 0x939f9099 reqid 0 mode transport
The failed node also never notices that fastdp has failed, because it keeps receiving heartbeats. The other node fell back to sleeve mode because it never received any heartbeats. In this state, no communication is possible between the nodes using weave.
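The asymmetry can also be confirmed from the weave CLI (run on each node, same ./weave --local invocation as in the version check above):

# The healthy node keeps reporting the peer connection as fastdp,
# while the node that timed out shows it established via sleeve.
./weave --local status connections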
Logs:
I suspect the relevant bits are in the log excerpts from node0 and node1 (omitted).
Looking at how the netlink package used by weave determines xfrm policy selectors (https://github.com/vishvananda/netlink/blob/b1e98597921720cee4147d6476110fc8fc56072d/xfrm_policy_linux.go#L8), it does not seem to use the SPI at all, meaning the netlink.XfrmPolicyDel() delete done by ipsec.Destroy() will select any xfrm policy with the same local and remote IPs and delete it.
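The same selector-only matching can be sketched with iproute2, using the addresses and mark from the output above (this mirrors, rather than reproduces, weave's Go code path):

# A policy delete identifies the entry by selector (src/dst/proto/dir)
# plus mark; there is no way to pin it to the SPI of a particular ESP
# SA, so a stale cleanup can remove a policy that was re-installed for
# a newer SA.
sudo ip xfrm policy delete src 192.168.128.10/32 dst 192.168.128.11/32 \
    proto udp dir out mark 0x20000 mask 0x20000

That would explain how a late ipsec.Destroy() for the old session can wipe out the policy belonging to the re-established one, leaving the failed node with ESP states but no outbound policy.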