Gracefully shut down node on critical errors like full disk #650
Conversation
force-pushed from 5b9a7c3 to 4cd7381
So creating the RAMDisk on Linux requires …

Interestingly, the two platforms crash at different stages of the block-writing process: Linux always crashes on the LevelDB log, but OSX always crashes on the blockstore.

Linux:

OSX:
force-pushed from fdde5ff to 0b6c1f5
This error should be emitted on an out-of-bounds reorg on a pruning or tree-compacted node as well (see #669).
Other questions:
- Should we cover WalletDB with these as well, and test it?
- Do we actually need to change anything in the migration code? A failed migration will throw in chain.open(), which will crash FullNode.open() (see the sketch below).
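For illustration, a simplified stand-in (not hsd's real Chain or FullNode classes) showing why no extra handling may be needed: an error thrown by a failed migration inside chain.open() rejects FullNode.open() as well, so the node never finishes starting.

```js
'use strict';

// Simplified stand-ins, not hsd's actual classes: a migration error
// thrown inside chain.open() bubbles up through the node's open().
class Chain {
  async open() {
    // Stand-in for a migration that fails partway through.
    throw new Error('Migration failed.');
  }
}

class FullNode {
  constructor() {
    this.chain = new Chain();
  }

  async open() {
    await this.chain.open(); // rethrows, so FullNode.open() rejects too
  }
}

new FullNode().open().catch(err => console.error(err.message));
```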
This PR, as far as I can understand, wants all disk write failures (not reads)
to be critical and to shut down the node.
There are several critical errors in the code, but regardless
of hsd's ability to recover from them (even some disk
failures could be recoverable), we want to shut down the node.
First of all, given the situation, I believe this approach is good
enough for now (with tweaks), but I will also try to describe the ideal scenario.
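A minimal sketch of that policy (illustrative only; the wrapper class and the 'abort' event name are assumptions, not hsd's actual API): wrap the batch write and, on any failure, surface an abort so the node can begin a graceful shutdown instead of trying to recover.

```js
'use strict';

const EventEmitter = require('events');

// Illustrative wrapper only; not hsd's real ChainDB/BlockStore code.
class StoreWrapper extends EventEmitter {
  constructor(db) {
    super();
    this.db = db;
  }

  async commit(batch) {
    try {
      await batch.write();
    } catch (e) {
      // Treat every write failure (ENOSPC, EIO, ...) as critical:
      // notify the node so it can shut down, then rethrow.
      this.emit('abort', e);
      throw e;
    }
  }
}

module.exports = StoreWrapper;
```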
Missing handlers:
- ChainDB.getLocks also has a database corruption check; I would include that in this as well.
- Mempool cache flush, if persistence is enabled.
- WalletDB: same thing.
Ideally, a graceful shutdown would shut down everything without many
internal errors and assertion failures (even though they don't cause much
of a problem). For this, we would need better static analysis of the
callers of these methods to properly guard ALL the places, to orchestrate the
shutdown. The first thing that came to my mind is that all queued tasks need to
fail (after the currently running one). That can only happen in
handlePreClose, where we could destroy the locks in the prioritized
task for that lock. We can't just destroy the locks in the process, because
we want to finish answering in-flight queries, like "was the write a success?"
E.g. a chain failure does not necessarily mean a wallet failure and vice versa, so
a write could be finished before we shut down. Current locks don't support this,
so that would be a good thing to add for this.
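Roughly, such a lock could look like the sketch below (a standalone illustration, not hsd's actual lock implementation): destroy() lets the job currently holding the lock finish and answer its caller, while everything still queued is rejected once that job releases the lock.

```js
'use strict';

// Standalone illustration, not hsd's actual lock. destroy() does not
// interrupt the in-flight job; it only fails the jobs still waiting in
// the queue once the current holder releases the lock.
class AbortableLock {
  constructor() {
    this.busy = false;
    this.destroyed = false;
    this.jobs = [];
  }

  lock() {
    if (this.destroyed)
      return Promise.reject(new Error('Lock is destroyed.'));

    if (!this.busy) {
      this.busy = true;
      return Promise.resolve(() => this.unlock());
    }

    // Queue behind the current holder; resolves with an unlock callback.
    return new Promise((resolve, reject) => {
      this.jobs.push({resolve, reject});
    });
  }

  unlock() {
    if (this.destroyed) {
      // The in-flight job has finished; fail the rest of the queue.
      for (const job of this.jobs)
        job.reject(new Error('Lock is destroyed.'));
      this.jobs.length = 0;
      this.busy = false;
      return;
    }

    if (this.jobs.length === 0) {
      this.busy = false;
      return;
    }

    const job = this.jobs.shift();
    job.resolve(() => this.unlock());
  }

  destroy() {
    this.destroyed = true;
  }
}
```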
There are more complicated calls, e.g. mempool verifyLocks/verifyContext or
anything related to those failures. Those failures do not mean the mempool DB had
issues, but that the chain did, so we want to abort the chain and close the mempool normally. But
a mempool cache flush failure (if persistence is enabled) would need the mempool
to shut down critically and everything else normally, and vice versa.
Fortunately, even now this should not cause any complications for the writes,
because our writes are atomic, whether via LevelDB or logically (as we
can see in the Tree compaction and Block Store PRs). Also, the majority of users will
be using the same disk for chain/wallet/mempool, so failures affect all of them.
force-pushed from 33b95ee to 9f5fedf
rebased on master
Added a new commit to throw …

The error is a bit unexpected but makes sense: on OSX at least, the … In my test setup, adding only 713 blocks instead of 723 leaves enough room on the volume for …

Otherwise, tiny-volume testing on OSX went great. I used full and SPV nodes to fill the volume and close gracefully:

Full:

SPV:
The error makes more sense on Ubuntu; otherwise the same test (using a 4 MB tmpfs volume):
Looking good on Windows as well, inside MSYS2:
Closes #642 (although there may be more places where this extreme action is warranted).

This PR adds a new test utility: RAMDisk. This is a cute trick where we create a very small (1 MB) virtual storage device in RAM, then run a full node inside it until we fill up the disk. I tried to ensure that we don't end up with a bunch of tiny volumes still mounted even when the test suite fails, but it may take some more work to make sure everything is super clean.
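For reference, the general trick looks roughly like this (illustrative only; the commands, sizes and paths below are assumptions, not necessarily what the utility in this PR does): on OSX a RAM-backed device is attached with hdiutil and formatted/mounted with diskutil, while on Linux a small tmpfs is mounted, which normally needs elevated privileges.

```js
'use strict';

// Illustration of the RAMDisk trick, not the utility added in this PR.
const {execSync} = require('child_process');
const os = require('os');

function createRAMDisk(name, sizeMB) {
  if (os.platform() === 'darwin') {
    // ram:// takes a size in 512-byte sectors.
    const sectors = sizeMB * 2048;
    const dev = execSync(`hdiutil attach -nomount ram://${sectors}`)
      .toString().trim();
    // Formats the device and mounts it under /Volumes/<name>.
    execSync(`diskutil erasevolume HFS+ ${name} ${dev}`);
    return {dev, path: `/Volumes/${name}`};
  }

  // Linux: a tiny tmpfs mount (typically requires root).
  const path = `/mnt/${name}`;
  execSync(`mkdir -p ${path}`);
  execSync(`mount -t tmpfs -o size=${sizeMB}m tmpfs ${path}`);
  return {dev: null, path};
}

function destroyRAMDisk(disk) {
  execSync(`umount ${disk.path}`);
  if (disk.dev)
    execSync(`hdiutil detach ${disk.dev}`);
}
```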
TODO: