Node sync gets stuck towards the end of the sync process #2122
Logs are filled with these (block number increases over time, same number for all peers, but always off by one):
Looks like an off-by-one somewhere. Full log:
@nazar-pc I didn't have much luck reproducing the issue (in close to 10 attempts). Also, the devnet chain being almost 90k blocks now makes the retries time consuming. I also see these logs in my node frequently, but the node is able to sync fine. From reading the code, these logs are expected in some situations. Since you are able to see the issue frequently on your setup, can you help me with these:
FWIW my node has not unstuck itself yet, so I can archive the whole thing if needed.
I think I understand the sequence now. While the node is syncing, a reorg needs to happen among the peers for the issue to occur. As a reorg is more likely near the tip, this would explain why nodes get stuck as they near the end of the sync process. The node is at
After this, we download blocks. I am in the process of adding a unit test to verify the scenario and looking at potential fixes (and will also start a conversation with the upstream folks once there is more concrete data).
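As an illustration only, the scenario can be modelled with a small self-contained test; the `Block` struct, `build_chain` helper and hash scheme below are made up for the sketch and are not the actual sc-network-sync test harness. The point is that after a reorg among the peers, the block at the supposedly common height is no longer the block the node actually has.

```rust
// Toy model of the stall scenario: not the real sc-network-sync test
// harness; all types and helpers here are made up for illustration.
#[derive(Clone, Debug)]
struct Block {
    number: u64,
    hash: u64, // stand-in for a real block hash
    parent_hash: u64,
}

// Build `len` blocks on top of `parent`, salting hashes with `salt` so two
// branches built from the same parent get distinct hashes.
fn build_chain(parent: &Block, len: u64, salt: u64) -> Vec<Block> {
    let mut chain = Vec::new();
    let mut prev = parent.clone();
    for i in 0..len {
        let block = Block {
            number: prev.number + 1,
            hash: (prev.number + 1) * 1_000 + salt * 100 + i,
            parent_hash: prev.hash,
        };
        chain.push(block.clone());
        prev = block;
    }
    chain
}

#[test]
fn reorg_breaks_height_only_common_number() {
    let genesis = Block { number: 0, hash: 0, parent_hash: 0 };
    // The syncing node and its peers agree up to block 10.
    let shared = build_chain(&genesis, 10, 0);
    let fork_point = shared.last().unwrap().clone();

    // The node keeps the old branch it downloaded while the peers reorg to
    // a new branch built on the same fork point.
    let old_branch = build_chain(&fork_point, 5, 1);
    let new_branch = build_chain(&fork_point, 5, 2);

    // A height-only "common number" claims block 13 is common to both sides...
    let ours = old_branch.iter().find(|b| b.number == 13).unwrap();
    let theirs = new_branch.iter().find(|b| b.number == 13).unwrap();

    // ...but the hashes differ, so block requests anchored at that "common"
    // block can never connect to the peers' chain and the sync stalls.
    assert_ne!(ours.hash, theirs.hash);
}
```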
Thanks for the details!
With the sync fix, we don't try to import. Instead we see this message:
Without the fix, we would have tried to import, which would have failed with parent not found, and restarted the sync process (the original symptom fixed by paritytech/polkadot-sdk#1812). Now we stall instead, though for a different reason. If the common number is somehow restored, the sync should complete because of 1812. I currently don't have a concrete proposal for a fix and need to think more. One option is to trigger an ancestry search, but the sync state machine is pretty obtuse and I am worried about breaking something else. The worst-case workaround would be to just kill/restart the node binary if it gets stuck :-
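For reference, the ancestry search mentioned above amounts to a binary search for the highest height at which our chain and a peer's chain share the same hash. Below is a minimal sketch of the idea with in-memory lookups; the `common_ancestor` function and its closure parameters are assumptions for illustration, while the real logic lives in the Substrate sync state machine and works over network requests.

```rust
// Toy ancestry search: binary-search for the highest block number at which
// our chain and a peer's chain share the same hash. This only illustrates
// the general idea; the real implementation queries the peer over the
// network rather than calling closures.
fn common_ancestor(
    our_hash_at: impl Fn(u64) -> Option<u64>,
    peer_hash_at: impl Fn(u64) -> Option<u64>,
    our_best: u64,
    peer_best: u64,
) -> Option<u64> {
    let (mut lo, mut hi) = (0u64, our_best.min(peer_best));
    let mut found = None;
    while lo <= hi {
        let mid = lo + (hi - lo) / 2;
        match (our_hash_at(mid), peer_hash_at(mid)) {
            // The chains agree at `mid`; the fork point is at or above it.
            (Some(a), Some(b)) if a == b => {
                found = Some(mid);
                lo = mid + 1;
            }
            // The chains already diverged at `mid` (or the block is
            // unknown); keep searching below.
            _ => {
                if mid == 0 {
                    break;
                }
                hi = mid - 1;
            }
        }
    }
    found
}
```

The binary search is valid because if two chains agree at some height, they agree at every lower height.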
The last fix didn't cover all the cases. I am reverting it and going to think about it from a high level, instead of applying band-aids. The main issue can be seen here:
The reorg happens and the peers are split into two groups (
Are you able to create a test case with this information? Also, can you create an upstream discussion about this now that you know more about it?
Yeah, I can create the test (there are some existing tests in blocks.rs; we would need a repro test from the sync layer instead). And let me start something on the Polkadot forum.
A simplified explanation of the issue:
So the node remains stuck. The downloaded blocks are not imported, and no new requests go out after this.
So it looks like common numbers should be advanced individually and only if the common block is actually common, not just at the same block height.
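A minimal sketch of that idea, using hypothetical stand-ins rather than the actual sc-network-sync types (`PeerSync`, `Sync` and the `peer_has_block` closure are assumptions for illustration): the per-peer common number only advances when the peer is known to have the exact block by hash.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the sync-side bookkeeping; only meant to show
// gating the common-number update on the hash actually matching.
type Hash = [u8; 32];

struct PeerSync {
    best_number: u64,
    common_number: u64,
}

struct Sync {
    peers: HashMap<u64, PeerSync>, // keyed by a peer id
}

impl Sync {
    /// Called when a block on our canonical chain is imported. Each peer's
    /// common number only advances if that peer is known to have this exact
    /// block (same hash), not merely some block at this height.
    fn on_block_imported(
        &mut self,
        number: u64,
        hash: Hash,
        peer_has_block: impl Fn(u64, &Hash) -> bool,
    ) {
        for (peer_id, peer) in self.peers.iter_mut() {
            if peer.common_number < number && peer_has_block(*peer_id, &hash) {
                peer.common_number = number;
            }
        }
    }
}
```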
Yes, it goes back to the analysis above: #2122 (comment)
This works fine if there is no re-org (the nodes are on the same chain), so block-number-based logic alone is good enough. In the presence of re-orgs, this breaks and we run into issues. I am also trying to add tests upstream to confirm this. TL;DR: the sync layer should manage a tree of forks, where blocks are linked to their parent blocks by hash. This will understandably be a bigger change.
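A minimal sketch of what such a tree of forks could look like, with toy types (`ForkTree`, `TreeBlock` and `branch_to_import` are made up for illustration, not a proposal for the actual upstream API): downloaded blocks are keyed by hash and linked to their parents by hash, so a branch only becomes importable once it connects to a block we already know.

```rust
use std::collections::HashMap;

// Toy fork tree: downloaded blocks are keyed by hash and linked to their
// parents by hash, so branches from a reorg coexist instead of being
// collapsed into a single per-height slot.
type Hash = [u8; 32];

struct TreeBlock {
    number: u64,
    parent_hash: Hash,
    body: Vec<u8>, // placeholder for the downloaded block body
}

#[derive(Default)]
struct ForkTree {
    blocks: HashMap<Hash, TreeBlock>,
}

impl ForkTree {
    fn insert(&mut self, hash: Hash, block: TreeBlock) {
        self.blocks.insert(hash, block);
    }

    /// Walk parent links from `tip` down to (but not including) a block we
    /// already have, returning the branch in import order (oldest first).
    /// Only blocks that actually connect by hash are ever returned.
    fn branch_to_import(
        &self,
        tip: Hash,
        is_known: impl Fn(&Hash) -> bool,
    ) -> Vec<Hash> {
        let mut branch = Vec::new();
        let mut cursor = tip;
        while let Some(block) = self.blocks.get(&cursor) {
            branch.push(cursor);
            if is_known(&block.parent_hash) {
                branch.reverse();
                return branch;
            }
            cursor = block.parent_hash;
        }
        // The branch does not (yet) connect to anything we know; nothing
        // can be imported until the gap is downloaded.
        Vec::new()
    }
}
```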
Yesterday I tried a simpler approach of moving the common number back on re-org, but that ran into other issues.
…#1950) This reverts #1812 until we know why it causes syncing issues reported in autonomys/subspace#2122.
I was looking at something similar over the last two days; I am going to continue with this today and see how it goes (and hopefully we can bring back the rolled-back 1812).
paritytech/polkadot-sdk#493 was re-resolved with paritytech/polkadot-sdk#2045, which landed in #2424, so I'm closing this for now.
I have hit this a few times:
In both cases the node exits the "Preparing" state back into "Syncing" (though I wouldn't expect that to happen) and then gets stuck at whatever height was the target at the moment of exiting the "Preparing" state. It syncs successfully on restart and doesn't always get stuck in the first place.
I'm not yet sure what the reason is; I will try to collect trace logs for Substrate sync.