
Gracefully shut down node on critical errors like full disk #650

Merged 5 commits on Jun 8, 2022

Conversation

pinheadmz
Member

@pinheadmz pinheadmz commented Oct 21, 2021

Closes #642
(although there may be more places where this extreme action is warranted)

This PR adds a new test utility: RAMDisk. This is a cute trick where we create a virtual storage device in RAM that is really small (1 MB), then run a full node inside it until we fill up the disk. I tried to ensure that even in test-suite failures we don't end up with a bunch of tiny volumes still mounted, but it may take some more work to make sure everything is super clean.

TODO:

  • Apply to MigrateBlockstore and test
  • Either skip test on Windows or implement RAMDisk for Windows (currently CI only runs on Linux)
  • Remove logConsole from test nodes (I just wanna see how it runs on CI first)
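For reference, here is a minimal sketch of how such a tiny RAM disk could be created per platform. This is a hypothetical helper, not the RAMDisk utility from this PR; returning the command strings (rather than executing them) keeps it testable without root.

```javascript
'use strict';

// Hypothetical sketch (not the actual RAMDisk utility from this PR) of how
// a tiny RAM-backed volume could be created per platform. On Linux this is
// a tmpfs mount (requires root); on macOS, hdiutil attaches a ram:// device.
function ramdiskCommands(platform, sizeMB, mountPath) {
  if (platform === 'linux') {
    return [
      `mkdir -p ${mountPath}`,
      `sudo mount -t tmpfs -o size=${sizeMB}m tmpfs ${mountPath}`
    ];
  }

  if (platform === 'darwin') {
    // hdiutil sizes ram:// devices in 512-byte sectors.
    const sectors = sizeMB * 1024 * 1024 / 512;
    return [
      `hdiutil attach -nomount ram://${sectors}`
      // The device printed by that command is then formatted and mounted,
      // e.g.: diskutil eraseVolume HFS+ ramdisk <device>
    ];
  }

  throw new Error(`Unsupported platform: ${platform}`);
}
```

Tearing down the mount (umount / hdiutil detach) in an after-hook is what keeps failed runs from leaving stray volumes behind.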

@pinheadmz pinheadmz force-pushed the disk-full branch 2 times, most recently from 5b9a7c3 to 4cd7381 Compare October 21, 2021 18:39
@coveralls

coveralls commented Oct 21, 2021

Coverage Status

Coverage increased (+0.05%) to 66.773% when pulling 6b4fea0 on pinheadmz:disk-full into a53f877 on handshake-org:master.

@pinheadmz
Member Author

So creating the RAMDisk on Linux requires sudo, which is "free" on GitHub Actions but on my laptop here requires a password, which sucks. OSX did not require root. I may try to trick this test into only running on CI so developers don't have to run anything as root locally.
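One way to gate the root-requiring test to CI would be a guard like the following (hypothetical helper; GitHub Actions sets the `CI` env var and provides passwordless sudo):

```javascript
'use strict';

// Hypothetical guard for a test that needs root to mount a RAM disk.
// GitHub Actions sets CI=true and sudo is non-interactive there.
// On macOS no root is needed, so the test can always run.
function shouldRunRamdiskTest(env, platform) {
  if (platform === 'darwin')
    return true;

  // On Linux, only run where sudo will not prompt for a password (CI).
  return env.CI === 'true';
}

// In a mocha test file this could be used as:
//   if (!shouldRunRamdiskTest(process.env, process.platform))
//     this.skip();
```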

Interestingly, the two platforms crash at different stages of the block-writing process. Linux always crashes on the LevelDB log, but OSX always crashes on the blockstore fs.write():

Linux

[info] (chain) Block 73003637e1028461c67ad83b4adebf6a9c1ed7073e500c39226bcd4a96cf659b (827) added to chain (size=351 txs=1 time=0.942803).
[debug] (miner) Created block tmpl (height=828, weight=4000, fees=0, txs=1, diff=0, bits=545259519).
Warning:  (node) Critical error, shutting down: IO error: /tmp/hsd-ramdisk-1634842610320/regtest/chain/000003.log: No space left on device
[info] (blockstore) Closing FileBlockStore...

OSX

[info] (chain) Block 3541fe44db876076e1c0969aabfabbec53795ecd53b4feb91a3c11ab3d74ad06 (706) added to chain (size=351 txs=1 time=0.543917).
[debug] (miner) Created block tmpl (height=707, weight=4000, fees=0, txs=1, diff=0, bits=545259519).
[warning] (node) Critical error, shutting down: Could not write block.
[info] (blockstore) Closing FileBlockStore...

@pinheadmz pinheadmz force-pushed the disk-full branch 2 times, most recently from fdde5ff to 0b6c1f5 Compare October 22, 2021 18:52
@nodech nodech added labels: blockchain (part of the codebase), node (part of the codebase - Full, SPV, Pruned nodes), tests (part of the codebase), enhancement (general - improving existing feature) — Nov 18, 2021
@nodech nodech added the patch Release version label Dec 9, 2021
@pinheadmz
Member Author

This error should be emitted on out-of-bounds reorg on a pruning or tree-compacted node as well (see #669)

Member Author

@pinheadmz pinheadmz left a comment


Other questions:

  • Should we cover WalletDB with these as well and test?
  • Do we actually need to change anything in the migration code, since a failed migration will throw in chain.open(), which will crash FullNode.open()?

package.json (resolved thread)
lib/node/fullnode.js (resolved thread)
Contributor

@nodech nodech left a comment


This PR, as far as I can understand, wants all disk write failures (not reads) to be treated as critical and to shut down the node. There are several critical errors in the code, but regardless of hsd's ability to recover from those errors (even some disk failures could be recoverable), we want to shut down the node. First of all, given the situation, I believe this approach is good enough for now (with tweaks), but I will also try to describe the ideal scenario.


Missing handlers:

  • ChainDB.getLocks also has a database corruption check; I would include that here as well.
  • Mempool cache flush, if persistence is enabled.
  • WalletDB: same thing.

Ideally, a graceful shutdown would shut everything down without many internal errors and assertion failures (even though those don't cause much of a problem). For this, we would need better static analysis of the callers of these methods to properly guard ALL the places that orchestrate shutdown. The first thing that comes to mind is that all queued tasks need to fail (after the currently running one). That can only happen in handlePreClose, where we could destroy the locks in the prioritized task for that lock. We can't just destroy the locks outright, because we want to finish answering in-flight queries, like "was the write a success?" E.g. a chain failure does not necessarily mean a wallet failure and vice versa, so a write could be finished before we shut down. The current locks don't support this, so that would be a good thing to add.

There are more complicated calls, e.g. mempool verifyLocks/verifyContext or anything related to those failures. Such failures do not mean the mempool db had issues, but that the chain did, so we want to abort the chain and close the mempool normally. But a mempool cache flush failure (if persistence is enabled) requires shutting down the mempool critically and everything else normally, and vice versa. Fortunately, even now this should not cause any complications for the writes, because our writes are atomic, whether via LevelDB or logically (as we can see in the Tree compaction and Block Store PRs). Also, the majority of users will be using the same disk for chain/wallet/mempool, so failures affect all of them.
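The atomicity point above can be illustrated with a toy batch (illustrative only; LevelDB's real write batch is implemented natively): if the write fails, e.g. on a full disk, none of the queued operations reach the store.

```javascript
'use strict';

// Toy illustration (not LevelDB itself) of the atomic-batch property the
// review relies on: if write() fails partway, e.g. because the disk is
// full, none of the queued operations are applied to the store.
class MemBatch {
  constructor(store) {
    this.store = store;
    this.ops = [];
  }

  put(key, value) {
    this.ops.push({key, value});
    return this;
  }

  write({diskFull = false} = {}) {
    // Simulate the failure before mutating anything, so the store stays
    // untouched: all-or-nothing, like a LevelDB write batch.
    if (diskFull)
      throw new Error('ENOSPC: no space left on device, write');

    for (const {key, value} of this.ops)
      this.store.set(key, value);
  }
}

const store = new Map();
const batch = new MemBatch(store).put('tip', 'block1522');

try {
  batch.write({diskFull: true});
} catch (e) {
  // Store unchanged: the node can still report a consistent state
  // while shutting down.
}
```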

Review threads:
lib/blockchain/chain.js (3 resolved threads)
lib/node/fullnode.js (3 resolved threads)
test/node-disk-fill-test.js (2 resolved threads)
lib/node/spvnode.js (resolved thread)
package.json (resolved thread)
@pinheadmz
Member Author

rebased on master

@pinheadmz
Member Author

pinheadmz commented Jun 7, 2022

Added a new commit to throw CriticalError and abort the node if compactTree throws. I can reproduce the behavior:

The error is a bit unexpected but makes sense:
ENOENT: no such file or directory, mkdir '/Volumes/test_4MB/regtest/tree~'

On OSX at least, the ~ directory cannot even be created. It looks like a failed mkdir doesn't throw, but opening that directory afterward does; sure, ok.

In my test setup, adding only 713 blocks instead of 723 leaves enough room on the volume for compactTree to succeed.


Otherwise, tiny-volume testing on OSX went great; I used full and SPV nodes to fill the volume and close gracefully:

Full

[info] (chain) Block 00000000000000d37e43c16b2a4aa86b83ea85bb831dbe58acc97f0ebb642d4e (1522) added to chain (size=354 txs=1 time=0.718416).
[debug] (chain) Retargetting to: 00000000000008546a0000000000000000000000000000000000000000000000 (0x1a08546a).
[error] (node) Critical Error: IO error: /Volumes/test_4MB/chain/000003.log: No space left on device
    at Chain.setBestChain (/Users/matthewzipkin/Desktop/work/hsd/lib/blockchain/chain.js:1903:21)

SPV

[debug] (wallet) Adding block: 706.
[debug] (chain) Retargetting to: 00000000000040ce9a0000000000000000000000000000000000000000000000 (0x1a40ce9a).
[error] (node) Critical Error: IO error: /Volumes/test_4MB/spvchain/000003.log: No space left on device
    at Chain.setBestChain (/Users/matthewzipkin/Desktop/work/hsd/lib/blockchain/chain.js:1903:21)

@pinheadmz
Member Author

The error message makes more sense on Ubuntu; otherwise this is the same test (using a 4 MB tmpfs volume):

[info] (chain) Compacting Urkel Tree...
[error] (node) Critical Error: ENOSPC: no space left on device, write
    at Chain.compactTree (/home/pinheadmz/Desktop/work/hsd/lib/blockchain/chain.js:2176:21)
    at async RPC.compactTree (/home/pinheadmz/Desktop/work/hsd/lib/node/rpc.js:1117:7)

@pinheadmz
Member Author

Looking good on Windows as well, inside MSYS2:

[debug] (chain) Retargetting to: 000000000000091b8e0000000000000000000000000000000000000000000000 (0x1a091b8e).
[error] (node) Critical Error: IO error: E:\hsd\chain/000003.log: There is not enough space on the disk.
    at Chain.setBestChain (C:\msys64\home\pinheadmz\work\hsd\lib\blockchain\chain.js:1903:21)

@pinheadmz pinheadmz merged commit dfccf4e into handshake-org:master Jun 8, 2022
@nodech nodech added the breaking-minor (Backwards compatible - Release version) label and removed the patch (Release version) label — Jun 10, 2022
Linked issue: Consensus failure due to full disk (#642)