lightningd question: Can we expect db_write's after a "shutdown" notification? #4785

Closed
SimonVrouwe opened this issue Sep 13, 2021 · 8 comments


@SimonVrouwe
Collaborator

While working on the backup plugin here to use the "shutdown" notification for cleanup, a question arose: can there be db_write's after the "shutdown" notification?

The code below suggests it can:

```
/* Tell plugins we're shutting down. */
shutdown_plugins(ld);
shutdown_subdaemons(ld);
/* Clean up the JSON-RPC. This needs to happen in a DB transaction since
 * it might actually be touching the DB in some destructors, e.g.,
 * unreserving UTXOs (see #1737) */
db_begin_transaction(ld->wallet->db);
tal_free(ld->jsonrpc);
db_commit_transaction(ld->wallet->db);
```

The separate freeing of ld->jsonrpc inside a db transaction was added a long time ago in PR #1755, for a particular case of un-reserving the utxos (in destructor destroy_utxos) of a fundchannel command that didn't return when the remote peer hung.

But since v0.9.1 (PR #3867) that destructor (destroy_utxos) is no longer used; AFAIU utxos are now unreserved by blockheight.

Are there other destructors hanging off ld->jsonrpc that touch the db? If not, can PR #1755 be reverted?

Maybe the documentation can be more explicit about what plugins can expect, such as no RPC calls (e.g. datastore) after "shutdown".
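
For context, this is roughly what subscribing looks like from the plugin side; a minimal pyln-client sketch (the cleanup body is just a placeholder):

```
#!/usr/bin/env python3
from pyln.client import Plugin

plugin = Plugin()

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # Last chance for cleanup; per this issue, don't assume RPC calls
    # (e.g. datastore) still work or that no further db_write's happen.
    plugin.log("shutdown notification received, cleaning up")

plugin.run()
```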

@vincenzopalazzo
Collaborator

Can there be db_write's after the "shutdown" notification?

Yes, lightningd doesn't wait for an answer here, so at this point you can receive any sequence of events; it is not deterministic at all. As you correctly said, shutdown_subdaemons will make a transaction on the db.

Maybe the documentation can be more explicit what plugins can expect, such as no rpc calls (e.g. datastore) after "shutdown".

I'm wondering if we need to admit all types of calls during the "shutdown". I mean, if we have a datastore, in my opinion the plugin needs to have time to store the data (but not much time here; imagine that the plugin has a `while true {}`, that is not good at all).
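
A hedged sketch of what "not much time" could look like in a plugin's shutdown handler; the `pending` list and the 5-second budget are made up for illustration, not lightningd constants:

```
#!/usr/bin/env python3
import time
from pyln.client import Plugin

plugin = Plugin()
pending = []  # hypothetical: items the plugin still wants to persist

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # Flush within a fixed budget instead of looping forever:
    # lightningd will not wait indefinitely before killing the plugin.
    deadline = time.monotonic() + 5.0
    while pending and time.monotonic() < deadline:
        plugin.log("persisting %s before exit" % pending.pop())

plugin.run()
```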

@SimonVrouwe
Collaborator Author

SimonVrouwe commented Sep 13, 2021

TL;DR: I think plugins that subscribe to the db_write hook and are important-plugin should be killed separately, after all the other ones (and their callbacks handled), i.e. after the last potential db write.

Ok, I think I get the intention of the "shutdown" notification. But if the backup plugin is killed on purpose (which also unregisters the db_write hook?) before the last db transaction, then the backup misses those writes, or there could be a deadlock; see examples below.

BTW when DEVELOPER=1, all built-in plugins are subscribed by default.

```
#if DEVELOPER
	/* For memleak detection, always get notified of shutdown. */
	if (!has_shutdown_notif)
		json_add_string(params, NULL, "shutdown");
```

This already gave me one deadlock in DEVELOPER=1 mode when running with a (modified) backup.py plugin.
edit: below is the gdb backtrace when the stop command hangs. Not 100% sure it's not my own fault, but the trace originates at shutdown_plugins.

```
#0  0x00007f4ace5367e4 in __GI___poll (fds=0x55bcfea99208, nfds=7, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000055bcfe8bda27 in daemon_poll (fds=0x55bcfea99208, nfds=7, timeout=-1) at common/daemon.c:95
#2  0x000055bcfe85e5a9 in io_poll_lightningd (fds=0x55bcfea99208, nfds=7, timeout=-1) at lightningd/lightningd.c:676
#3  0x000055bcfe92518f in io_loop (timers=0x0, expired=0x0) at ccan/ccan/io/poll.c:412
#4  0x000055bcfe88c2bf in plugins_exclusive_loop (plugins=0x55bcff536178) at lightningd/plugin.c:2014
#5  0x000055bcfe88e33e in plugin_hook_db_sync (db=0x55bcfeaaae28) at lightningd/plugin_hook.c:388
#6  0x000055bcfe897792 in db_report_changes (db=0x55bcfeaaae28, final=0x0, min=0) at wallet/db.c:941
#7  0x000055bcfe8979d0 in db_commit_transaction (db=0x55bcfeaaae28) at wallet/db.c:1011
#8  0x000055bcfe8381ff in getrawblockbyheight_callback (
    buf=0x7f4acd22e038 "\n\n{\"jsonrpc\":\"2.0\",\"id\":30,\"result\":{\"blockhash\":\"", '0' <repeats 19 times>, "e452ff32311401cc77f4b6bf1a431bc3a36ec28eb2128\",\"block\":\"0400002081ba55399f7b030250c16c016866df56e95d82a1ae5d09", '0' <repeats 18 times>, "221"..., 
    toks=0x55bcff2b1148, idtok=0x55bcff2b1198, call=0x55bd0230a798) at lightningd/bitcoind.c:394
#9  0x000055bcfe887d43 in plugin_response_handle (plugin=0x55bcfea7e1d8, toks=0x55bcff2b1148, idtok=0x55bcff2b1198) at lightningd/plugin.c:517
#10 0x000055bcfe887f68 in plugin_read_json_one (plugin=0x55bcfea7e1d8, complete=0x7ffe160367fd, destroyed=0x7ffe160367fe) at lightningd/plugin.c:623
#11 0x000055bcfe888147 in plugin_read_json (conn=0x55bcfea89cd8, plugin=0x55bcfea7e1d8) at lightningd/plugin.c:668
#12 0x000055bcfe922503 in next_plan (conn=0x55bcfea89cd8, plan=0x55bcfea89cf8) at ccan/ccan/io/io.c:59
#13 0x000055bcfe9230a3 in do_plan (conn=0x55bcfea89cd8, plan=0x55bcfea89cf8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
#14 0x000055bcfe9230e1 in io_ready (conn=0x55bcfea89cd8, pollflags=1) at ccan/ccan/io/io.c:417
#15 0x000055bcfe9252b6 in io_loop (timers=0x55bcfea648c8, expired=0x7ffe16036958) at ccan/ccan/io/poll.c:445
#16 0x000055bcfe85907c in io_loop_with_timers (ld=0x55bcfea62118) at lightningd/io_loop_with_timers.c:24
#17 0x000055bcfe88c650 in shutdown_plugins (ld=0x55bcfea62118) at lightningd/plugin.c:2103
#18 0x000055bcfe85f262 in main (argc=2, argv=0x7ffe16036b68) at lightningd/lightningd.c:1151
```

Another example: using the current backup plugin (which doesn't subscribe to "shutdown" and is thus immediately killed at shutdown) in combination with another plugin that subscribes to "shutdown" and (at shutdown) waits a few seconds and then calls datastore: the data appears to be written to the DB but NOT to the backup, and a subsequent startup says:

```
plugin-backup.py: Backup is out of date, we cannot continue safely. Emergency shutdown.
```

because:

lightning/lightningd/plugin.c

Lines 2087 to 2091 in cdb93ee

```
/* Tell them all to shutdown; if they care. */
list_for_each_safe(&ld->plugins->plugins, p, next, list) {
	/* Kill immediately, deletes self from list. */
	if (!notify_plugin_shutdown(ld, p))
		tal_free(p);
```

and

```
/* When we destroy a plugin, we remove its hooks */
```
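
For illustration, the second plugin in that scenario could be as simple as this hedged sketch (the key name and delay are made up):

```
#!/usr/bin/env python3
import time
from pyln.client import Plugin

plugin = Plugin()

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # By now the backup plugin (not subscribed to "shutdown") has already
    # been killed and its db_write hook removed, so this write reaches the
    # DB but never the backup.
    time.sleep(3)
    plugin.rpc.call("datastore", {"key": "shutdown-marker", "string": "bye"})

plugin.run()
```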

@SimonVrouwe
Collaborator Author

So to answer

Can there be db_write's after the "shutdown" notification?

I have been running pytests on a test branch, which logs dirty db transactions when they happen right after shutdown_plugins():

```
PYTHONPATH=contrib/pylightning:contrib/pyln-client:contrib/pyln-testing:contrib/pyln-proto PYTEST_PAR=5 PRESERVE_TEST_DIRS=1 DEVELOPER=1 VALGRIND=0 pytest tests/
```

note: I needed to modify pyln-testing to preserve the test_* directories, so I could grep for the particular log line afterward:

```
grep -r 'sivr calling plugin_hook_db_sync with plugins_are_shuttingdown' /tmp/ltests-*
```

which returned nothing.

So I guess that's good news, and it probably explains why there have not been issues so far with nodes running the backup.py plugin. But it doesn't prove that the lines after shutdown_plugins() never trigger a dirty db transaction, or can this be guaranteed somehow @cdecker ?

```
/* Tell plugins we're shutting down. */
shutdown_plugins(ld);
shutdown_subdaemons(ld);
/* Clean up the JSON-RPC. This needs to happen in a DB transaction since
 * it might actually be touching the DB in some destructors, e.g.,
 * unreserving UTXOs (see #1737) */
db_begin_transaction(ld->wallet->db);
tal_free(ld->jsonrpc);
db_commit_transaction(ld->wallet->db);
```

Anyway, inside shutdown_plugins() we can trigger a dirty db transaction, for example by making a datastore call when a plugin receives the "shutdown" notification, see this test.

So I think plugins-that-registered-db_write-hook should be kept alive a little longer. At least until other plugins are killed, and maybe until after the last db_commit_transaction in line 1146. The latter seems the safest, but probably requires shutdown_plugins to be split in two. I will give it a try and post a PR for that.

@vincenzopalazzo
Collaborator

So I guess that's good news, and it probably explains why there have not been issues so far with nodes running the backup.py plugin. But it doesn't prove that the lines after shutdown_plugins() never trigger a dirty db transaction, or can this be guaranteed somehow @cdecker ?

Could it be a lucky ordering inside the array of plugins?

So I think plugins-that-registered-db_write-hook should be kept alive a little longer. At least until other plugins are killed, and maybe until after the last db_commit_transaction in line 1146. The latter seems the safest, but probably requires shutdown_plugins to be split in two. I will give it a try and post a PR for that.

Why preserve only a specific hook like that? IMO, with this new shutdown notification we need to traverse plugins by importance, or maybe we can also use important-plugin for this case? I would like to talk about a general solution to make all plugins safe.

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 17, 2021
so the ones that registered db_write hook are kept alive
until after the last db transaction

issue ElementsProject#4785

TODO:
- maybe revert PR ElementsProject#1755, i.e. remove the obsolete db wrapper around freeing jsonrpc;
    the wrapper was added in PR ElementsProject#1755, but since PR ElementsProject#3867 nothing touches
    the db at this point?
@rustyrussell
Contributor

Note that db plugin hooks already get called before init, so calling after shutdown is probably minor.

But we should definitely not kill these until last, and ideally put an assert in if we do a db operation after that!

@SimonVrouwe
Collaborator Author

SimonVrouwe commented Sep 19, 2021

@rustyrussell

Note that db plugin hooks already get called before init, so calling after shutdown is probably minor.

Yes, but 'before init' the plugin is already running. At shutdown it could've been killed, which also un-registers its hooks, even for important-plugins.

Thanks for sharing that perspective; considering how plugin processes are started quite early, wouldn't it make sense to kill them late?

edit: How about a shutdown_hook instead of a notification (as also mentioned here)? Then we wouldn't need to wait the 30s at shutdown every time some plugin subscribes to "shutdown", and it would give them control. And then keep them running until the very end, somewhere just before here maybe? No, hooks (except db_write) don't work in shutdown.

```
tal_free(ld);
timers_cleanup(timers);
tal_free(timers);
opt_free_table();
daemon_shutdown();
```

The tal_free(ld) closes the db, so no worries about transactions.

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 24, 2021
so that plugins that registered db_write hook are kept alive
until after the last db transaction

individual plugins can now be in state SHUTDOWN

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 28, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 29, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 5, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 8, 2021

Ideally, a backup plugin should not terminate on a shutdown notification, only on an EOF on the stdin. Basically, my understanding is that a backup plugin must in fact stall its db_write response until after it has ensured that everything has been saved as indelibly as possible, and only then respond to db_write.

Since not responding to db_write causes the lightningd event loop to actually stall, if done that way there is no way to cleanly shut down while a db_write is in-flight. In that case, there should not be anything left for a backup plugin to do at shutdown, because all writes should have completed before db_write was responded to.

Basically, my understanding is that a backup plugin should ignore shutdown notifications and only terminate on EOF of stdin, and should not respond to db_write until after the write has been indelibly confirmed in the backup medium (see the sketch after the list below).

  • By ignoring shutdown it will keep on working even if we have a bug somewhere and accidentally trigger a db_write after the shutdown notification (admittedly this arguably just hides a bug....).
  • By not relying on shutdown the backup plugin is more robust against something (Linux OOM killer, SystemD getting impatient) SIGKILLing the main lightningd.
  • The safety model of db_write assumes that all backups have been updated by the time db_write responds; it is unsafe for us to send revocation keys to the counterparty unless all replicas of our current state are up-to-date, and responding to db_write "early" (and if you do not respond "early", what are you doing in the shutdown notification handler? there should not be anything to do in that situation!) would allow revocation keys to be sent, which makes the backups toxic waste (admittedly only in the short term, but still).
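
Put as a hedged pyln-client sketch (commit_to_backup is a hypothetical stand-in for the actual persistence code):

```
#!/usr/bin/env python3
from pyln.client import Plugin

plugin = Plugin()

def commit_to_backup(data_version, writes):
    """Hypothetical: synchronously write the statements to the backup
    medium and fsync, only returning once they are indelible."""

@plugin.hook("db_write")
def on_db_write(writes, data_version, plugin, **kwargs):
    # lightningd's event loop stalls until we respond, so by construction
    # no revocation can be sent before the backup is up to date.
    commit_to_backup(data_version, writes)
    return {"result": "continue"}

# Deliberately no "shutdown" subscription: plugin.run() returns when
# lightningd closes our stdin (EOF), and by then every acknowledged
# db_write is already durable, so there is nothing left to flush.
plugin.run()
```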

@cdecker
Member

cdecker commented Oct 22, 2021

Ideally, a backup plugin should not terminate on a shutdown notification, only on an EOF on the stdin. Basically, my understanding is that a backup plugin must in fact stall its db_write response until after it has ensured that everything has been saved as indelibly as possible, and only then respond to db_write.

Since not responding to db_write causes the lightningd event loop to actually stall, if done that way there is no way to cleanly shut down while a db_write is in-flight. In that case, there should not be anything left for a backup plugin to do at shutdown, because all writes should have completed before db_write was responded to.

Basically, my understanding is that a backup plugin should ignore shutdown notifications and only terminate on EOF of stdin, and should not respond to db_write until after the write has been indelibly confirmed in the backup medium.

Correct. The exact reason for db_write being synchronous, locking the entire event loop, is that in order to be safe the transaction must be written before lightningd commits the DB transaction to disk, and before we start the communication round with the peers, which could otherwise lock us into an uncomfortable situation if we were to lose our DB right then (basically after sending commitment revocations). So unless you synchronously stream the changes to the backup medium and commit them there before responding to db_write, you're likely at risk of losing funds.

In this sense I do agree that any subscriber to the db_write hook must ignore any shutdown notification, and should rather perform pre-flight checks when starting, potentially aborting if we notice that the backup is not in sync with the actual database (the data_version of the DB must be equal to the backup's, or at most lag it by 1; the latter is the case where the DB commit was not completed successfully).
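
That version rule can be written down as a small sketch (this is just the invariant, not backup.py's actual code):

```
def check_backup_sync(db_version: int, backup_version: int) -> None:
    # In sync: versions equal. Interrupted commit: the backup is exactly
    # one ahead, because db_write was acknowledged but lightningd's own
    # commit never completed. Anything else is unsafe to continue from.
    if backup_version not in (db_version, db_version + 1):
        raise SystemExit("Backup is out of date, we cannot continue safely.")
```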

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 25, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 25, 2021
… in shutdown

Here important-plugin implies `important hook`.

Before this commit, when in shutdown:
- existing in-flight hooks were abandoned, cutting the hook-chain and never calling hook_final_cb
- hooks were removed when their plugin died, even for an important-plugin, because `shutdown` overrules
- but hook events can still be called while waiting for plugins to self-terminate (up to 30s) and
  subdaemons are still alive, and it then looks as if no plugin ever registered the hook.

After this commit, when in shutdown:
- existing in-flight hook (chains) are honoured and can finalize, same semantics as LD_STATE_RUNNING
- important-plugins are kept alive until after shutdown_subdaemons, so they don't miss hooks
- JSON RPC commands are functional, but anything unimportant-plugin related cannot be relied on

TODO:
- Run tests -> hangs forever on test_closing, so skip them

- Q. Does this open a can of worms or races when (normal) plugins with hooks die randomly?
  A. Yes, for example an htlc_accepted call triggers the invoice_payment hook, but the plugin (fetchinvoice?) already died

**
* CONCLUSION: If you want to give more control over shutdown, I think there could be
* a plugin `shutdown_clean.py` with RPC method `shutdown_clean`. When called, that
* plugin starts additional (important) plugin(s) that register relevant hooks and, for example, hold off
* new htlc's and wait for existing in-flight htlc's to resolve ... and finally call RPC `stop`.
*
* Note: --important-plugin only seems to work at start, not via `plugin start shutdown_clean.py`,
* maybe we can add that? Or do something with disable?
*
* Some parts of this commit are still good, i.e. hook semantics of important plugins should be consistent
* until the very last potential hook call.
**

- What if an important-plugin dies unexpectedly and lightningd_exit() calls io_break(), is that bad?
- What are the benefits? Add example where on shutdown inflight htlc's are resolved/cleared and
  new htlc's blocked, see ElementsProject#4842

- Split commit into hook-related stuff and others, for clarity of reasoning
- Q. How does this relate (hook-wise) to db_write plugins? A. Looks like this hook is treated like
  any other hook: when the plugin dies, the hook is removed, so to be safe backup needs to be `important`.
  Hook documentation does not mention `important-plugin` but BACKUP.md does.
  TODO: Tested this -> `plugin stop backup.py` -> "plugin-backup.py: Killing plugin: exited during normal operation"
  In fact, running the current backup.py with current master misses a couple of writes in
  shutdown (because its hook is removed, see issue ElementsProject#4785).
SimonVrouwe added a commit to SimonVrouwe/plugins that referenced this issue Nov 18, 2021
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 25, 2021
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down, making RPC calls fail with error code -5.
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 26, 2021
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down, RPC calls then fail with error code -5.
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 4, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 8, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 12, 2023
…er closing db

in the second shutdown_plugins call, trigger EOF on the plugin's stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

inspired by ElementsProject#4790 and also
credit to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 13, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 29, 2023
…er closing db

in the second shutdown_plugins call, trigger EOF on the plugin's stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

inspired by ElementsProject#4790 and also
credit to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785

Hopefully this fixes the "Nope" from prev commit, these:
```
ERROR tests/test_invoices.py::test_amountless_invoice - ValueError:
ERROR tests/test_misc.py::test_htlc_out_timeout - ValueError:
ERROR tests/test_pay.py::test_sendpay_msatoshi_arg - ValueError:
ERROR tests/test_pay.py::test_pay_peer - ValueError:
ERROR tests/test_pay.py::test_listpays_with_filter_by_status - ValueError:
ERROR tests/test_pay.py::test_channel_receivable - ValueError:
```

were still triggering db_write's:
```
lightningd: FATAL SIGNAL 11 (version v22.11.1-149-g93f1458-modded)
0x555e78c30d0f send_backtrace
        common/daemon.c:33
0x555e78c30db4 crashdump
        common/daemon.c:46
0x7fd20cc60d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x555e78c53665 db_begin_transaction_
        db/exec.c:124
0x555e78bc7557 read_json
        lightningd/jsonrpc.c:1090
0x555e78ce8bc3 next_plan
        ccan/ccan/io/io.c:59
0x555e78ce976d do_plan
        ccan/ccan/io/io.c:407
0x555e78ce97ab io_ready
        ccan/ccan/io/io.c:417
0x555e78ceb9c1 io_loop
        ccan/ccan/io/poll.c:453
0x555e78bfdcb2 shutdown_plugins
        lightningd/plugin.c:2179
0x555e78bcae76 main
        lightningd/lightningd.c:1281
0x7fd20cc4bd09 __libc_start_main
        ../csu/libc-start.c:308
0x555e78b9d3c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126160704
```

Not from plugin_response_handle !!!
What are they? It seems every parse_request() in read_json() is wrapped in a db transaction:
```
	if (!in_transaction) {
		db_begin_transaction(jcon->ld->wallet->db);
		in_transaction = true;
	}
	parse_request(jcon, jcon->input_toks);
```

But wasn't jsonrpc_stop_listening supposed to block new requests? <== Well, sort of, it only blocks new rpc connections.
But (surprise) the built-in plugins reuse their rpc connection and don't close it, so their requests go through!