
Gracefully shut down node on critical errors like full disk #650

Merged 5 commits on Jun 8, 2022

Conversation

pinheadmz
Member

@pinheadmz pinheadmz commented Oct 21, 2021

Closes #642
(although there may be more places where this extreme action is warranted)

This PR adds a new test utility: RAMDisk. This is a cute trick where we create a virtual storage device in RAM that is really small (1 MB), then run a full node inside it until we fill up the disk. I tried to ensure that even in test-suite failures we don't end up with a bunch of tiny volumes still mounted, but it may take some more work to make sure everything is super clean.

TODO:

  • Apply to MigrateBlockstore and test
  • Either skip test on Windows or implement RAMDisk for Windows (currently CI only runs on Linux)
  • Remove logConsole from test nodes (I just wanna see how it runs on CI first)
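For reference, here is a minimal sketch of how such a tiny RAM disk could be created per platform. This is a hypothetical helper, not the RAMDisk utility from this PR; returning the command strings (rather than executing them) keeps it testable without root.

```javascript
'use strict';

// Hypothetical sketch (not the actual RAMDisk utility from this PR) of how
// a tiny RAM-backed volume could be created per platform. On Linux this is
// a tmpfs mount (requires root); on macOS, hdiutil attaches a ram:// device.
function ramdiskCommands(platform, sizeMB, mountPath) {
  if (platform === 'linux') {
    return [
      `mkdir -p ${mountPath}`,
      `sudo mount -t tmpfs -o size=${sizeMB}m tmpfs ${mountPath}`
    ];
  }

  if (platform === 'darwin') {
    // hdiutil sizes ram:// devices in 512-byte sectors.
    const sectors = sizeMB * 1024 * 1024 / 512;
    return [
      `hdiutil attach -nomount ram://${sectors}`
      // The device printed by that command is then formatted and mounted,
      // e.g.: diskutil eraseVolume HFS+ ramdisk <device>
    ];
  }

  throw new Error(`Unsupported platform: ${platform}`);
}
```

Tearing down the mount (umount / hdiutil detach) in an after-hook is what keeps failed runs from leaving stray volumes behind.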

@pinheadmz pinheadmz force-pushed the disk-full branch 2 times, most recently from 5b9a7c3 to 4cd7381 Compare October 21, 2021 18:39
@coveralls

coveralls commented Oct 21, 2021

Coverage Status

Coverage increased (+0.05%) to 66.773% when pulling 6b4fea0 on pinheadmz:disk-full into a53f877 on handshake-org:master.

@pinheadmz
Member Author

So creating the RAMDisk on Linux requires sudo, which is "free" on GitHub Actions but on my laptop here requires a password, which sucks. OSX did not require root. I may try to trick this test into only running on CI so developers don't have to run anything as root locally.
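One way to gate the root-requiring test to CI would be a guard like the following (hypothetical helper; GitHub Actions sets the `CI` env var and provides passwordless sudo):

```javascript
'use strict';

// Hypothetical guard for a test that needs root to mount a RAM disk.
// GitHub Actions sets CI=true and sudo is non-interactive there.
// On macOS no root is needed, so the test can always run.
function shouldRunRamdiskTest(env, platform) {
  if (platform === 'darwin')
    return true;

  // On Linux, only run where sudo will not prompt for a password (CI).
  return env.CI === 'true';
}

// In a mocha test file this could be used as:
//   if (!shouldRunRamdiskTest(process.env, process.platform))
//     this.skip();
```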

Interestingly, the two platforms crash at different stages of the block-writing process. Linux always crashes on the LevelDB log, but OSX always crashes on the blockstore fs.write():

Linux

[info] (chain) Block 73003637e1028461c67ad83b4adebf6a9c1ed7073e500c39226bcd4a96cf659b (827) added to chain (size=351 txs=1 time=0.942803).
[debug] (miner) Created block tmpl (height=828, weight=4000, fees=0, txs=1, diff=0, bits=545259519).
Warning:  (node) Critical error, shutting down: IO error: /tmp/hsd-ramdisk-1634842610320/regtest/chain/000003.log: No space left on device
[info] (blockstore) Closing FileBlockStore...

OSX

[info] (chain) Block 3541fe44db876076e1c0969aabfabbec53795ecd53b4feb91a3c11ab3d74ad06 (706) added to chain (size=351 txs=1 time=0.543917).
[debug] (miner) Created block tmpl (height=707, weight=4000, fees=0, txs=1, diff=0, bits=545259519).
[warning] (node) Critical error, shutting down: Could not write block.
[info] (blockstore) Closing FileBlockStore...

@pinheadmz pinheadmz force-pushed the disk-full branch 2 times, most recently from fdde5ff to 0b6c1f5 Compare October 22, 2021 18:52
@nodech nodech added labels: blockchain (part of the codebase), node (part of the codebase - Full, SPV, Pruned nodes), tests (part of the codebase), enhancement (general - improving existing feature) — Nov 18, 2021
@nodech nodech added the patch Release version label Dec 9, 2021
@pinheadmz
Member Author

This error should be emitted on out-of-bounds reorg on a pruning or tree-compacted node as well (see #669)

Member Author

@pinheadmz pinheadmz left a comment


Other questions:

  • Should we cover WalletDB with these as well and test?
  • Do we actually need to change anything in the migration code, since a failed migration will throw in chain.open(), which will crash FullNode.open()?

package.json (resolved thread)
lib/node/fullnode.js (resolved thread)
Contributor

@nodech nodech left a comment


This PR, as far as I can understand, wants all disk write failures (not reads) to be treated as critical and to shut down the node. There are several critical errors in the code, but regardless of hsd's ability to recover from those errors (even some disk failures could be recoverable), we want to shut down the node. First of all, given the situation, I believe this approach is good enough for now (with tweaks), but I will also try to describe the ideal scenario.


Missing handlers:

  • ChainDB.getLocks also has a database corruption check; I would include that here as well.
  • Mempool cache flush, if persistence is enabled.
  • WalletDB: same thing.

Ideally, a graceful shutdown would shut everything down without many internal errors and assertion failures (even though those don't cause much of a problem). For this, we would need better static analysis of the callers of these methods to properly guard ALL the places that orchestrate shutdown. The first thing that comes to mind is that all queued tasks need to fail (after the currently running one). That can only happen in handlePreClose, where we could destroy the locks in the prioritized task for that lock. We can't just destroy the locks outright, because we want to finish answering in-flight queries, like "was the write a success?" E.g. a chain failure does not necessarily mean a wallet failure and vice versa, so a write could be finished before we shut down. The current locks don't support this, so that would be a good thing to add.

There are more complicated calls, e.g. mempool verifyLocks/verifyContext or anything related to those failures. Such failures do not mean the mempool db had issues, but that the chain did, so we want to abort the chain and close the mempool normally. But a mempool cache flush failure (if persistence is enabled) requires shutting down the mempool critically and everything else normally, and vice versa. Fortunately, even now this should not cause any complications for the writes, because our writes are atomic, whether via LevelDB or logically (as we can see in the Tree compaction and Block Store PRs). Also, the majority of users will be using the same disk for chain/wallet/mempool, so failures affect all of them.
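The atomicity point above can be illustrated with a toy batch (illustrative only; LevelDB's real write batch is implemented natively): if the write fails, e.g. on a full disk, none of the queued operations reach the store.

```javascript
'use strict';

// Toy illustration (not LevelDB itself) of the atomic-batch property the
// review relies on: if write() fails partway, e.g. because the disk is
// full, none of the queued operations are applied to the store.
class MemBatch {
  constructor(store) {
    this.store = store;
    this.ops = [];
  }

  put(key, value) {
    this.ops.push({key, value});
    return this;
  }

  write({diskFull = false} = {}) {
    // Simulate the failure before mutating anything, so the store stays
    // untouched: all-or-nothing, like a LevelDB write batch.
    if (diskFull)
      throw new Error('ENOSPC: no space left on device, write');

    for (const {key, value} of this.ops)
      this.store.set(key, value);
  }
}

const store = new Map();
const batch = new MemBatch(store).put('tip', 'block1522');

try {
  batch.write({diskFull: true});
} catch (e) {
  // Store unchanged: the node can still report a consistent state
  // while shutting down.
}
```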

Review threads:
lib/blockchain/chain.js (3 resolved threads)
lib/node/fullnode.js (3 resolved threads)
test/node-disk-fill-test.js (2 resolved threads)
lib/node/spvnode.js (resolved thread)
package.json (resolved thread)
@pinheadmz
Member Author

rebased on master

@pinheadmz
Member Author

pinheadmz commented Jun 7, 2022

Added a new commit to throw CriticalError and abort the node if compactTree throws. I can reproduce the behavior:

The error is a bit unexpected but makes sense:
ENOENT: no such file or directory, mkdir '/Volumes/test_4MB/regtest/tree~'

On OSX at least, the ~ directory cannot even be created. It looks like a failed mkdir doesn't throw, but opening that directory afterward does; sure, ok.

In my test setup, adding only 713 blocks instead of 723 leaves enough room on the volume for compactTree to succeed.


Otherwise, tiny-volume testing on OSX went great; I used full and SPV nodes to fill the volume and close gracefully:

Full

[info] (chain) Block 00000000000000d37e43c16b2a4aa86b83ea85bb831dbe58acc97f0ebb642d4e (1522) added to chain (size=354 txs=1 time=0.718416).
[debug] (chain) Retargetting to: 00000000000008546a0000000000000000000000000000000000000000000000 (0x1a08546a).
[error] (node) Critical Error: IO error: /Volumes/test_4MB/chain/000003.log: No space left on device
    at Chain.setBestChain (/Users/matthewzipkin/Desktop/work/hsd/lib/blockchain/chain.js:1903:21)

SPV

[debug] (wallet) Adding block: 706.
[debug] (chain) Retargetting to: 00000000000040ce9a0000000000000000000000000000000000000000000000 (0x1a40ce9a).
[error] (node) Critical Error: IO error: /Volumes/test_4MB/spvchain/000003.log: No space left on device
    at Chain.setBestChain (/Users/matthewzipkin/Desktop/work/hsd/lib/blockchain/chain.js:1903:21)

@pinheadmz
Member Author

The error message makes more sense on Ubuntu; otherwise this is the same test (using a 4 MB tmpfs volume):

[info] (chain) Compacting Urkel Tree...
[error] (node) Critical Error: ENOSPC: no space left on device, write
    at Chain.compactTree (/home/pinheadmz/Desktop/work/hsd/lib/blockchain/chain.js:2176:21)
    at async RPC.compactTree (/home/pinheadmz/Desktop/work/hsd/lib/node/rpc.js:1117:7)

@pinheadmz
Member Author

Looking good on Windows as well, inside MSYS2:

[debug] (chain) Retargetting to: 000000000000091b8e0000000000000000000000000000000000000000000000 (0x1a091b8e).
[error] (node) Critical Error: IO error: E:\hsd\chain/000003.log: There is not enough space on the disk.
    at Chain.setBestChain (C:\msys64\home\pinheadmz\work\hsd\lib\blockchain\chain.js:1903:21)

@pinheadmz pinheadmz merged commit dfccf4e into handshake-org:master Jun 8, 2022
@nodech nodech added the breaking-minor (Backwards compatible - Release version) label and removed the patch (Release version) label — Jun 10, 2022
Linked issue: Consensus failure due to full disk (#642)