This repository has been archived by the owner on Jun 20, 2024. It is now read-only.
What you expected to happen?
Weave should always reconnect after a network failure.
What happened?
Very occasionally, when reconnecting, InitSARemote gets called after Destroy has already been called on a connection. This causes a valid xfrm policy to be updated to an old SPI, and the connection stops working.
How to reproduce it?
This was found while testing #3669. I used the following script:
#!/bin/bash
while true; do
    # disconnect network
    qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=1
    sleep 55
    # reconnect network
    qm set 102 --net1 model=virtio,bridge=vmbr1,macaddr=62:40:98:FF:02:72,link_down=0
    sleep 10
    # check if weave still works
    if ssh [email protected] ping 10.42.128.0 -c1; then
        date
        echo pass
    else
        # give weave another 10 seconds, then retry once
        sleep 10
        if ssh [email protected] ping 10.42.128.0 -c1; then
            date
            echo pass
        else
            date
            echo broken
            break
        fi
    fi
done
The bug only appeared after about 26 hours of running this script.
Without #3669, it is unlikely to trigger, since the same method would hit #3666 much more frequently.
This looks like it could be caused by a very unlikely scheduling of the mesh.receiveTCP goroutine.
Adding a check to fastDatapathForwarder.handleCryptoInitSARemote so that it does not run on a stopped forwarder seems to fix it.
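To make the proposed check concrete, here is a minimal, self-contained sketch of the idea. It is not the Weave code: the `forwarder` and `policyTable` types, their fields, the `errForwarderStopped` error, the peer name, and the SPI 0x1234abcd are all hypothetical stand-ins. Only the SPI 0xb66479db comes from the logs in this report. The sketch assumes the stale message arrives on the old forwarder after it has been stopped, and shows that a guard under the forwarder's lock keeps the shared policy pointing at the current SPI.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errForwarderStopped is a hypothetical sentinel error for this sketch.
var errForwarderStopped = errors.New("forwarder stopped")

// policyTable is a simplified stand-in for the kernel xfrm policies:
// one SPI per peer, shared by every forwarder that talks to that peer.
type policyTable struct {
	mu  sync.Mutex
	spi map[string]uint32
}

func newPolicyTable() *policyTable {
	return &policyTable{spi: make(map[string]uint32)}
}

func (p *policyTable) set(peer string, spi uint32) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.spi[peer] = spi
}

func (p *policyTable) get(peer string) uint32 {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.spi[peer]
}

// forwarder is a hypothetical, heavily reduced model of a per-connection
// forwarder. It is not the real fastDatapathForwarder.
type forwarder struct {
	mu       sync.Mutex
	stopped  bool
	peer     string
	policies *policyTable
}

func newForwarder(peer string, policies *policyTable) *forwarder {
	return &forwarder{peer: peer, policies: policies}
}

// stop marks the forwarder as torn down, as destroying the connection would.
func (f *forwarder) stop() {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.stopped = true
}

// handleInitSARemote models handling an InitSARemote message. The guard
// rejects the message if the forwarder was stopped, so a late message
// cannot repoint the shared policy at an SPI that is no longer in use.
func (f *forwarder) handleInitSARemote(spi uint32) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.stopped {
		return errForwarderStopped
	}
	f.policies.set(f.peer, spi)
	return nil
}

func main() {
	policies := newPolicyTable()
	old := newForwarder("peer-a", policies) // connection being torn down
	cur := newForwarder("peer-a", policies) // replacement connection

	old.stop()
	_ = cur.handleInitSARemote(0x1234abcd)    // hypothetical current SPI
	err := old.handleInitSARemote(0xb66479db) // stale message arrives late

	fmt.Printf("late message: %v, policy SPI: %#x\n", err, policies.get("peer-a"))
}
```

Checking the stopped flag under the same lock that stop takes means the check and the policy update cannot interleave with teardown in this model. Whether the real handler can rely on an existing lock in the same way is something the actual patch would have to confirm.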
Anything else we need to know?
The setup is exactly the same as in #3666.
Versions:
Shortened Logs:
Failing node:
Other node:
On the failing node, InitSARemote for 0xb66479db happens after Destroy for that connection. This updates the xfrm policy to an SPI that is no longer in use.
More complete logs:
Failing node: https://gist.github.com/hpdvanwyk/a9649492882d3ce8ac0ec474dc2e4ef1
Non failing node: https://gist.github.com/hpdvanwyk/027d8584bdf8740254b5c0da76ecff20