Finality Lagging issues in Kusama archive & full nodes #13295
There's likely a bug in ancestry search as indicated by paritytech/polkadot-sdk#528
This is interesting because it's not clear to me why the remote would choose to provide us a block with an unknown parent. It would be good to see debug logs for this case.
The remote is not deciding which blocks it is providing to us. We are actively requesting blocks from the remote. It could either be a bug in sync where we drain the received blocks incorrectly and did not import the parent first, or the parent failed to import. Or we requested something for which we don't have the parent.
This is probably related to all the other syncing-related issues we've seen recently. We're working on a couple of fixes.
I have a similar observation on our node.
Old behavior (v.0.9.36): for example, the screenshot of the node which had an incorrect
New behavior (v.0.9.37):
Fix: after flag
@BulatSaif thank you, this is very valuable information. I wonder if this has something to do with libp2p because it was updated for
@altonen maybe we should test if downgrading fixes it and then find out what the difference is.
I've been prioritizing the CPU usage issue since adjusting peer ratios should've bought us some time with this "node keeps losing peers" issue, and because I believe we can deploy a fix for the CPU issue relatively soon that would take some of the load off of
The same team reported low peer issues this morning after upgrading their KSM nodes to
I will share more detailed logs as soon as/if they are provided. One interesting fact was that these issues appear only in the US and the Tokyo region. However, their nodes running in the EU (Ireland) are stable enough and they have at least 9-15 peers connected at any given time.
There will be a PSA related to these low peer counts with the next release. Basically, if the team is experiencing this and they're running the nodes behind a NAT, it would be a good idea to test opening the P2P port for inbound connections. The reason why the node is only able to have 0-1 peers is that the public nodes reachable to it are fully occupied; that is not an issue with the node implementation but with the network having too many nodes. If they have the P2P port open for inbound connections and they're still experiencing this, then the issue is something else. In that case,
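As a rough illustration of what "opening the P2P port for inbound connections" means in practice (the invocation, address and port below are placeholders around the usual defaults, not taken from the teams' setups):

```
# Listen for inbound P2P connections on the default port (30333) on all
# interfaces; forward this port through the router/NAT to the node's host.
polkadot --chain kusama --listen-addr /ip4/0.0.0.0/tcp/30333

# From a machine outside the network, check that the port is reachable
# (replace the IP with the node's public address).
nc -vz 203.0.113.10 30333
```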
Thank you @altonen for the quick feedback! I forwarded your suggestions to the three teams who are experiencing the same/similar issue (low peer count/sync issue in
If they can't open the inbound port, even temporarily to rule out an issue with accepting inbound connections, there's very little we can do about this. If they can give at least trace logs, there might be something there.
I looked at the trace logs and they seem to have the same problem:
I await a response from the third team.
First team also shared logs after adding
The first issue should be fixed by this and it will be included in the next release. In the logs there is this pattern that repeats constantly:
The node imports a block, announces it, and soon after peers start disconnecting. This is because right now the
I believe the underlying problem to be fully occupied peer slots. The fact that it's observable in Japan and the US but not in Europe is interesting, and I'll ask DevOps if we could have access to a VM in these problem regions.
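For context on the slot mechanics mentioned above, the slot counts are plain CLI options; a sketch follows (the numbers are illustrative, not recommendations):

```
# --in-peers / --out-peers control how many inbound and outbound peer slots a
# node keeps. Raising --in-peers on publicly reachable nodes leaves more room
# for peers that can only dial out (e.g. nodes behind NAT).
polkadot --chain kusama --in-peers 50 --out-peers 25
```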
ksm-low-peers.log
Hey team, we're still having the same sync issue after we opened the 30333 port. The logs with
Full command: --chain kusama --unsafe-rpc-external --unsafe-ws-external --rpc-cors=all --pruning=100000 -l afg=trace,sync=debug
Can you run the node with
@johnsonco reported that their node is syncing properly now. It required approximately 12 hours after they opened the p2p port to sync correctly.
It is quite interesting: the node is getting block announcements and its best block is increasing, but the finalized block is not. There is another bug report opened today that is very similar to this: paritytech/polkadot-sdk#632
There's a suspicious pattern in the logs, the node keeps exchanging some message(s) over the GRANDPA substream hundreds of times a second:
So it could be that it's actually getting finality notifications but they're buried under whatever this flood of other messages is. Then at some point
@andresilva do you have any thoughts on this?
The grandpa log target was changed from afg to grandpa.
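So on releases with the renamed target, the detailed logs would be requested roughly like this (a sketch assuming the grandpa target):

```
# Equivalent of the earlier -l afg=trace,sync=debug on newer releases where
# the GRANDPA log target is named `grandpa`.
polkadot --chain kusama -l grandpa=trace,sync=debug
```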
The team shared new logs (with
Please note that the new logs are from after the issue was solved, so the node no longer had the peer issue. However, I still requested the file in case it shows any interesting pattern/weird behaviour that is not causing any issues now but might affect the peer count again in the future.
Indeed these logs don't show anything abnormal but that's expected since everything is working properly there. Could still be helpful to keep running with those log targets in case the issue happens again.
I have confirmed that these 150-byte messages are indeed GRANDPA votes. Since we didn't have
Team, we are facing the same issue. How do we open the p2p port?
The P2P port by default should be 30333.
An example that I have at hand is from the ansible scripts I used to set up my validator:
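As a generic illustration only (the actual ansible snippet is not reproduced here), opening the port on a typical Linux host boils down to something like:

```
# Allow inbound TCP on the P2P port with ufw, then check the node is
# actually listening on it.
sudo ufw allow 30333/tcp
ss -tlnp | grep 30333
```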
Since the node is working fine, I'm not sure how much interesting information there is about a possible issue, but sadly these logs don't help me debug this further. GRANDPA just tells me it received a message, but I don't know what that message is and whether it's expected behavior.
The team of @bhargavtheerthamcb is also experiencing the peer/sync issue. Summary of their Setup & Changes
I looked at the logs and the node is only establishing outbound connections, i.e., the role is always
@Imod7 which helm chart are they using to set up a node in Kubernetes? If it is this paritytech/helm-charts, try to set up:
This will proxy port
@altonen, upon deeper investigation, we are noticing the same finality lag and syncing issues in our non-k8s env also. We noticed that our deploy times (restore the latest snapshot and catch up to tip) are now taking several days, whereas a month or two ago they consistently completed in 30 mins or so. In the non-k8s env, we didn't have 30333 open and it had worked fine for a couple of years. Is opening the port 30333 a new requirement? Also curious how opening the port helps keep stable peers.
Are you able to pinpoint a release when the deploy times started to degrade? I'm not sure if we're talking about the same thing, but opening the P2P port is not a requirement; however, if you have a low number of peers and that negatively affects the node, then opening the port should help with that.
As discussed in paritytech/polkadot#7259 I'm providing the startup logs and the first few minutes of grandpa trace. The node is generating approx. 100k lines of log per minute with
The following is a graylog export (just in case you wonder about the format).
@altonen the issue seems to have started with 0.9.41. We are still not seeing any improvement in our nodes' stability/availability. We are using our own helm chart and not the one above. We have opened up the P2P port but it doesn't seem to help. Validated that the port is open: telnet xx.xx.xx.xx 30333
We had zero issues with peering or node stability before 0.9.41 and the nodes were very stable. Are other customers having the same issue with nodes? This is becoming very critical for us.
Can you give
Hi @altonen The logs were already uploaded. They're in one of the threads above. Please let me know if you are not able to access them.
I've looked at these logs once already and the node was only able to establish outbound connections. I believe this is further corroborated by the fact that the node is only showing the four following listen addresses:
which I believe are all local addresses. Since the last time you provided logs, have you made any changes to your network configuration? Where is the node located? Instead of telnet, can you try connecting to the node from another Polkadot node by setting the node that is having trouble with peers as a reserved peer with
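A sketch of that kind of test, assuming the flag in question is --reserved-nodes; the address and peer ID below are placeholders for the problem node's public address and identity:

```
# On a separate, healthy Polkadot node, dial the problem node directly.
polkadot --chain kusama \
  --reserved-nodes /ip4/203.0.113.10/tcp/30333/p2p/12D3KooW...
```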
Hi @altonen the one change we made after uploading the logs was to open up the P2P port. The nodes are located in the US East region. I can try using the --reserved-peers parameter from a different node pointing to the prod pods. The only challenge is that in K8s the pod IPs can change if they reboot, but I can certainly try. Do we need to collect the traces while doing the above?
@altonen Would you be available for a quick call on Monday US time so we can discuss this? Thanks
Sure, I'm available until 9 PM EEST.
Thanks @altonen, would you please share your email so I can send an invite? Thank you.
Hi @altonen @Imod7, over the last month we have been working to set up a new k8s cluster with nodes that have public IPs assigned so other nodes can talk to them. I deployed a polkadot node in the new cluster and enabled tracing (attached). Last time you reviewed the logs, you mentioned that you did not see any incoming connections. Can you please review this one and let me know if you see what you expect to? Thanks
The node didn't get any inbound connections, it only dialed other nodes, which is not itself suspicious since the node was running for only 2 minutes and if it was run with a fresh
But these are the listen addresses the node discovered for itself:
which are all local addresses and will be ignored by other nodes, so your node won't be able to accept inbound connections.
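For reference, a node that should accept inbound connections typically needs to advertise an externally reachable address, roughly like this (a sketch; the IP is a placeholder and the right value depends on your deployment):

```
# Listen on all interfaces and explicitly advertise the public endpoint so
# other peers can dial in (203.0.113.10 stands in for the node's public IP).
polkadot --chain kusama \
  --listen-addr /ip4/0.0.0.0/tcp/30333 \
  --public-addr /ip4/203.0.113.10/tcp/30333
```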
@altonen, thanks. How do we configure the listen addresses on the public IP? I saw some messages in the log that said
Is that the outbound connection?
@bhargavtheertham-cb Since your environment is k8s, can you please first check your helm chart/configuration files and see if the values are the same as the ones mentioned by @BulatSaif in the previous comment here?
We are not using the same helm charts as them, but I am looking at the code to see how we pass the public-addr flag to the startup command.
@altonen I have set the public addr correctly now. Please review the logs.
@bhargavtheertham-cb The logs start with this line:
You may need to provide
Hi @altonen I did add the public-addr flag to the end of the command, which is why it is able to listen on it. I will fix the echo command, which was not updated to include the latest flag. I will try running it for longer.
It looks to be working now, there are a few lines like this:
meaning
Closing as resolved since the finality hasn't been lagging after #13829 |
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
The Integrations Team at Parity was contacted by a team who is running KSM nodes and they reported sync issues as soon as they upgraded to v.0.9.37.

A small summary of the issues they faced & the solution they found:

- On v.0.9.36 everything was very stable and no sync issues.
- On v.0.9.37 their peer count dropped significantly.
- They added the -l afg=trace,sync=debug flags to get more detailed logs and restarted the node.
- They went back to v.0.9.36 and now it looks like there is no lagging.

Their Setup
Flags
Two examples of the flags they are using when running their nodes:

and:
Lagging from Log
One output they shared from their logs that shows the sync issue / lagging:
Errors from Logs
After adding -l afg=trace,sync=debug and restarting, they shared their logs, so I paste here some of the errors found there:

If needed, we can share internally the full list of logs. cc @IkerAlus @CurlyBracketEffect @Imod7
Similar Past Issues?
I am adding here some past issues I found that look similar:
Related Current Issue?
There is one issue that was brought to my attention by @gabreal (devops) and it looks like it is related:
polkadot-sync-0 + kusama-sync-0 lagging behind paritytech/substrate#26
Steps to reproduce
Upgrade from v.0.9.36 to v.0.9.37.