Beacon node's P2P degrades permanently after 40 minutes of no connectivity #6323
Comments
Could you please provide execution arguments of |
|
Could you please confirm that you are using latest release version of |
@catwith1hat Do you also have the full log line for:
? Normally this log line should also print the |
@cheatfate I can confirm that I haven't set the
@kdeme: Sorry, I truncated the line while copying it. Here is the full line:
|
Thank you, that's very useful. In terms of reproducing this: Do I understand correctly that you are running a Nimbus Docker container inside a QEMU VM? edit: Additional question, did all the |
That's correct.
Pretty much:
(the awk script counts how many times a line repeats)
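For readers without the original awk script at hand, here is a minimal Python sketch of the same idea. It assumes the intent is to count total occurrences of each distinct line. The function name `count_repeats` and the sample lines are illustrative only and are not taken from the actual logs.

```python
from collections import Counter


def count_repeats(lines):
    """Return (count, line) pairs, most frequent line first.

    Trailing newlines are stripped so that otherwise identical
    lines are counted together.
    """
    counts = Counter(line.rstrip("\n") for line in lines)
    return [(n, text) for text, n in counts.most_common()]


# Illustrative input only; real log lines carry timestamps and fields
# that would have to be stripped before counting.
sample = [
    "Discovery send failed\n",
    "Discovery send failed\n",
    "some other message\n",
]
result = count_repeats(sample)
for n, text in result:
    print(n, text)
```

On a real log you would first cut away the timestamp and any per-message fields, otherwise every line looks unique.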
This is probably correct that you would not get But spotting @kdeme If you have your setup still at hand, would you mind trying to reproduce this by removing the default route of the Docker container (or the whole VM/host) for, let's say, 60 minutes? |
Two more datapoints:
|
Can you repro this with the 24.8 release? There's a fix in there that might be related. |
I retested this on my Holesky VM. Cut network access in virt-manager (untick the box), wait 1h, restore. Result: the node didn't rejoin the Holesky network after 10h. |
Describe the bug
When a node loses connectivity for an extended period of time, it eventually tries and fails every peer it knows. Once connectivity is restored, the node does not heal: it tries to discover new peers, but can't find any. I straced the beacon_node binary to see what's going on, and it seems that logging to syslog is the only activity of the node. The node quickly recovers with a restart.
To Reproduce
Cut connectivity on a Holesky node for about 40 minutes till you see the "Peer low" warning. Restore connection. Observe that the node does not heal. If you are using a node in a libvirt/qemu VM, you can easily toggle the link connectivity in the virt-manager interface.
Screenshots
Our test starts off with a working node with 158 peers.
I cut connection around 17:22
Around 18:00, I restore the connection. The "Discovery send failed" messages stop, but no new peers are discovered.
From that I assume that some kind of Discovery is being done by Nimbus. But it probably only finds peers that it already knew about and that it marked as bad during the period of lost connectivity. Two hours later, the node is still stuck at this:
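To make that guess concrete, here is a toy model of it. This is not Nimbus code and makes no claim about how Nimbus actually scores peers: `PeerTable`, `mark_failed`, `dialable` and `ban_seconds` are all hypothetical names. It only shows why a node could stay stuck if peers marked bad during an outage never become eligible again, and why an expiring ban would let it heal.

```python
import math


class PeerTable:
    """Toy peer table: a failed peer is excluded until its ban expires."""

    def __init__(self, peers, ban_seconds=math.inf):
        # ban_seconds=math.inf models "marked bad forever" (hypothetical).
        self.ban_seconds = ban_seconds
        self.banned_until = {peer: 0.0 for peer in peers}

    def mark_failed(self, peer, now):
        # Called when a dial attempt fails, e.g. while the link is down.
        self.banned_until[peer] = now + self.ban_seconds

    def dialable(self, now):
        # Peers whose ban (if any) has expired at time `now`.
        return [p for p, until in self.banned_until.items() if until <= now]


def simulate(ban_seconds):
    table = PeerTable(["a", "b", "c"], ban_seconds=ban_seconds)
    # Outage: every known peer fails once, 100 s into the test.
    for peer in ["a", "b", "c"]:
        table.mark_failed(peer, now=100.0)
    # Connectivity comes back after 40 minutes (2400 s).
    return table.dialable(now=2400.0)


stuck = simulate(ban_seconds=math.inf)   # bans never expire
healed = simulate(ban_seconds=600.0)     # bans expire after 10 minutes
print("permanent ban:", stuck)
print("expiring ban:", healed)
```

In this model a restart corresponds to building a fresh `PeerTable`, which would match the observation that the node recovers right after restarting.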
I restart the node and things go back to normal quickly:
As you see above, the node immediately picks up 22 peers at 19:53:49.
Additional context
Nimbus 24.5.1 from official Docker image.