lightningd question: Can we expect db_write's after a "shutdown" notification? #4785

Closed
SimonVrouwe opened this issue Sep 13, 2021 · 8 comments


@SimonVrouwe
Collaborator

While working on the backup plugin here to use the "shutdown" notification for cleanup, a question arose: can there be db_write's after the "shutdown" notification?

The code below suggests it can:

```
/* Tell plugins we're shutting down. */
shutdown_plugins(ld);
shutdown_subdaemons(ld);
/* Clean up the JSON-RPC. This needs to happen in a DB transaction since
 * it might actually be touching the DB in some destructors, e.g.,
 * unreserving UTXOs (see #1737) */
db_begin_transaction(ld->wallet->db);
tal_free(ld->jsonrpc);
db_commit_transaction(ld->wallet->db);
```

The separate freeing of ld->jsonrpc inside a db transaction was added a long time ago in PR #1755, for a particular case of un-reserving the utxos (in destructor destroy_utxos) of a fundchannel command that didn't return when the remote peer hung.

But since v0.9.1 (PR #3867) that destructor (destroy_utxos) is no longer used; AFAIU utxos are now unreserved by blockheight.

Are there other destructors hanging off ld->jsonrpc that touch the db? If not, can PR #1755 be reverted?

Maybe the documentation can be more explicit about what plugins can expect, such as no RPC calls (e.g. datastore) after "shutdown".
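
For context, this is roughly what subscribing looks like from the plugin side; a minimal pyln-client sketch (the cleanup body is just a placeholder):

```
#!/usr/bin/env python3
from pyln.client import Plugin

plugin = Plugin()

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # Last chance for cleanup; per this issue, don't assume RPC calls
    # (e.g. datastore) still work or that no further db_write's happen.
    plugin.log("shutdown notification received, cleaning up")

plugin.run()
```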

@vincenzopalazzo
Collaborator

Can there be db_write's after the "shutdown" notification?

Yes, lightningd doesn't wait for an answer here, so at this point you can receive any sequence of events; it is not deterministic at all. As you correctly said, shutdown_subdaemons will make a transaction on the db.

Maybe the documentation can be more explicit what plugins can expect, such as no rpc calls (e.g. datastore) after "shutdown".

I'm wondering if we need to admit all types of calls during the "shutdown". I mean, if we have a datastore, in my opinion the plugin needs to have time to store the data (but not much time here; imagine that the plugin has a `while true {}`, that is not good at all).
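
A hedged sketch of what "not much time" could look like in a plugin's shutdown handler; the `pending` list and the 5-second budget are made up for illustration, not lightningd constants:

```
#!/usr/bin/env python3
import time
from pyln.client import Plugin

plugin = Plugin()
pending = []  # hypothetical: items the plugin still wants to persist

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # Flush within a fixed budget instead of looping forever:
    # lightningd will not wait indefinitely before killing the plugin.
    deadline = time.monotonic() + 5.0
    while pending and time.monotonic() < deadline:
        plugin.log("persisting %s before exit" % pending.pop())

plugin.run()
```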

@SimonVrouwe
Collaborator Author

SimonVrouwe commented Sep 13, 2021

TL;DR: I think plugins that subscribe to the db_write hook and are important-plugin should be killed separately, after all the other ones (and their callbacks handled), i.e. after the last potential db write.

Ok, I think I get the intention of the "shutdown" notification. But if the backup plugin is killed on purpose (which also unregisters the db_write hook?) before the last db transaction, then the backup misses those writes, or there could be a deadlock; see examples below.

BTW when DEVELOPER=1, all built-in plugins are subscribed by default.

```
#if DEVELOPER
	/* For memleak detection, always get notified of shutdown. */
	if (!has_shutdown_notif)
		json_add_string(params, NULL, "shutdown");
```

This already gave me one deadlock in DEVELOPER=1 mode when running with a (modified) backup.py plugin.
edit: below is the gdb backtrace when the stop command hangs. Not 100% sure it's not my own fault, but the trace originates at shutdown_plugins.

```
#0  0x00007f4ace5367e4 in __GI___poll (fds=0x55bcfea99208, nfds=7, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000055bcfe8bda27 in daemon_poll (fds=0x55bcfea99208, nfds=7, timeout=-1) at common/daemon.c:95
#2  0x000055bcfe85e5a9 in io_poll_lightningd (fds=0x55bcfea99208, nfds=7, timeout=-1) at lightningd/lightningd.c:676
#3  0x000055bcfe92518f in io_loop (timers=0x0, expired=0x0) at ccan/ccan/io/poll.c:412
#4  0x000055bcfe88c2bf in plugins_exclusive_loop (plugins=0x55bcff536178) at lightningd/plugin.c:2014
#5  0x000055bcfe88e33e in plugin_hook_db_sync (db=0x55bcfeaaae28) at lightningd/plugin_hook.c:388
#6  0x000055bcfe897792 in db_report_changes (db=0x55bcfeaaae28, final=0x0, min=0) at wallet/db.c:941
#7  0x000055bcfe8979d0 in db_commit_transaction (db=0x55bcfeaaae28) at wallet/db.c:1011
#8  0x000055bcfe8381ff in getrawblockbyheight_callback (
    buf=0x7f4acd22e038 "\n\n{\"jsonrpc\":\"2.0\",\"id\":30,\"result\":{\"blockhash\":\"", '0' <repeats 19 times>, "e452ff32311401cc77f4b6bf1a431bc3a36ec28eb2128\",\"block\":\"0400002081ba55399f7b030250c16c016866df56e95d82a1ae5d09", '0' <repeats 18 times>, "221"..., 
    toks=0x55bcff2b1148, idtok=0x55bcff2b1198, call=0x55bd0230a798) at lightningd/bitcoind.c:394
#9  0x000055bcfe887d43 in plugin_response_handle (plugin=0x55bcfea7e1d8, toks=0x55bcff2b1148, idtok=0x55bcff2b1198) at lightningd/plugin.c:517
#10 0x000055bcfe887f68 in plugin_read_json_one (plugin=0x55bcfea7e1d8, complete=0x7ffe160367fd, destroyed=0x7ffe160367fe) at lightningd/plugin.c:623
#11 0x000055bcfe888147 in plugin_read_json (conn=0x55bcfea89cd8, plugin=0x55bcfea7e1d8) at lightningd/plugin.c:668
#12 0x000055bcfe922503 in next_plan (conn=0x55bcfea89cd8, plan=0x55bcfea89cf8) at ccan/ccan/io/io.c:59
#13 0x000055bcfe9230a3 in do_plan (conn=0x55bcfea89cd8, plan=0x55bcfea89cf8, idle_on_epipe=false) at ccan/ccan/io/io.c:407
#14 0x000055bcfe9230e1 in io_ready (conn=0x55bcfea89cd8, pollflags=1) at ccan/ccan/io/io.c:417
#15 0x000055bcfe9252b6 in io_loop (timers=0x55bcfea648c8, expired=0x7ffe16036958) at ccan/ccan/io/poll.c:445
#16 0x000055bcfe85907c in io_loop_with_timers (ld=0x55bcfea62118) at lightningd/io_loop_with_timers.c:24
#17 0x000055bcfe88c650 in shutdown_plugins (ld=0x55bcfea62118) at lightningd/plugin.c:2103
#18 0x000055bcfe85f262 in main (argc=2, argv=0x7ffe16036b68) at lightningd/lightningd.c:1151
```

Another example: using the current backup plugin (which doesn't subscribe to "shutdown" and is thus immediately killed at shutdown) in combination with another plugin that subscribes to "shutdown" and (at shutdown) waits a few seconds and then calls datastore: the data appears to be written to the DB but NOT to the backup, and a subsequent startup says:

```
plugin-backup.py: Backup is out of date, we cannot continue safely. Emergency shutdown.
```

because:

lightning/lightningd/plugin.c

Lines 2087 to 2091 in cdb93ee

```
/* Tell them all to shutdown; if they care. */
list_for_each_safe(&ld->plugins->plugins, p, next, list) {
	/* Kill immediately, deletes self from list. */
	if (!notify_plugin_shutdown(ld, p))
		tal_free(p);
```

and

```
/* When we destroy a plugin, we remove its hooks */
```
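
For illustration, the second plugin in that scenario could be as simple as this hedged sketch (the key name and delay are made up):

```
#!/usr/bin/env python3
import time
from pyln.client import Plugin

plugin = Plugin()

@plugin.subscribe("shutdown")
def on_shutdown(plugin, **kwargs):
    # By now the backup plugin (not subscribed to "shutdown") has already
    # been killed and its db_write hook removed, so this write reaches the
    # DB but never the backup.
    time.sleep(3)
    plugin.rpc.call("datastore", {"key": "shutdown-marker", "string": "bye"})

plugin.run()
```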

@SimonVrouwe
Collaborator Author

So to answer

Can there be db_write's after the "shutdown" notification?

I have been running pytests on a test branch, which logs dirty db transactions when they happen right after shutdown_plugins():

```
PYTHONPATH=contrib/pylightning:contrib/pyln-client:contrib/pyln-testing:contrib/pyln-proto PYTEST_PAR=5 PRESERVE_TEST_DIRS=1 DEVELOPER=1 VALGRIND=0 pytest tests/
```

note: I needed to modify pyln-testing to preserve the test_* directories, so I could grep for the particular log line afterward:

```
grep -r 'sivr calling plugin_hook_db_sync with plugins_are_shuttingdown' /tmp/ltests-*
```

which returned nothing.

So I guess that's good news, and it probably explains why there have not been issues so far with nodes running the backup.py plugin. But it doesn't prove that the lines after shutdown_plugins() never trigger a dirty db transaction, or can this be guaranteed somehow @cdecker ?

```
/* Tell plugins we're shutting down. */
shutdown_plugins(ld);
shutdown_subdaemons(ld);
/* Clean up the JSON-RPC. This needs to happen in a DB transaction since
 * it might actually be touching the DB in some destructors, e.g.,
 * unreserving UTXOs (see #1737) */
db_begin_transaction(ld->wallet->db);
tal_free(ld->jsonrpc);
db_commit_transaction(ld->wallet->db);
```

Anyway, inside shutdown_plugins() we can trigger a dirty db transaction, for example by making a datastore call when a plugin receives the "shutdown" notification, see this test.

So I think plugins-that-registered-db_write-hook should be kept alive a little longer. At least until other plugins are killed, and maybe until after the last db_commit_transaction in line 1146. The latter seems the safest, but probably requires shutdown_plugins to be split in two. I will give it a try and post a PR for that.

@vincenzopalazzo
Collaborator

So I guess that's good news, and it probably explains why there have not been issues so far with nodes running the backup.py plugin. But it doesn't prove that the lines after shutdown_plugins() never trigger a dirty db transaction, or can this be guaranteed somehow @cdecker ?

Could it be a lucky ordering inside the array of plugins?

So I think plugins-that-registered-db_write-hook should be kept alive a little longer. At least until other plugins are killed, and maybe until after the last db_commit_transaction in line 1146. The latter seems the safest, but probably requires shutdown_plugins to be split in two. I will give it a try and post a PR for that.

Why preserve only a specific hook like that? IMO, with this new shutdown notification we need to traverse plugins by importance, or maybe we can also use important-plugin for this case? I would like to talk about a general solution to make all plugins safe.

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 17, 2021
so the ones that registered db_write hook are kept alive
until after the last db transaction

issue ElementsProject#4785

TODO:
- maybe revert PR ElementsProject#1755, i.e. remove the obsolete db wrapper around freeing jsonrpc;
    the wrapper was added in PR ElementsProject#1755, but since PR ElementsProject#3867 nothing touches
    the db at this point?
@rustyrussell
Contributor

Note that db plugin hooks already get called before init, so calling after shutdown is probably minor.

But we should definitely not kill these until last, and ideally put an assert in if we do a db operation after that!

@SimonVrouwe
Collaborator Author

SimonVrouwe commented Sep 19, 2021

@rustyrussell

Note that db plugin hooks already get called before init, so calling after shutdown is probably minor.

Yes, but 'before init' the plugin is already running. At shutdown it could've been killed, which also un-registers its hooks, even for important-plugins.

Thanks for sharing that perspective; considering how plugin processes are started quite early, wouldn't it make sense to kill them late?

edit: How about a shutdown_hook instead of a notification (as also mentioned here)? Then we wouldn't need to wait the 30s at shutdown every time some plugin subscribes to "shutdown", and it would give them control. And then keep them running until the very end, somewhere just before here maybe? No, hooks (except db_write) don't work in shutdown.

```
tal_free(ld);
timers_cleanup(timers);
tal_free(timers);
opt_free_table();
daemon_shutdown();
```

The tal_free(ld) closes the db, so no worries about transactions.

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 24, 2021
so that plugins that registered db_write hook are kept alive
until after the last db transaction

individual plugins can now be in state SHUTDOWN

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 28, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Sep 29, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 5, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
@ZmnSCPxj
Collaborator

ZmnSCPxj commented Oct 8, 2021

Ideally, a backup plugin should not terminate on a shutdown notification, only on an EOF on the stdin. Basically, my understanding is that a backup plugin must in fact stall its db_write response until after it has ensured that everything has been saved as indelibly as possible, and only then respond to db_write.

Since not responding to db_write causes the lightningd event loop to actually stall, if done that way there is no way to cleanly shut down while a db_write is in-flight. In that case, there should not be anything left for a backup plugin to do at shutdown, because all writes should have completed before db_write was responded to.

Basically, my understanding is that a backup plugin should ignore shutdown notifications and only terminate on EOF of stdin, and should not respond to db_write until after the write has been indelibly confirmed in the backup medium (see the sketch after the list below).

  • By ignoring shutdown it will keep on working even if we have a bug somewhere and accidentally trigger a db_write after the shutdown notification (admittedly this arguably just hides a bug....).
  • By not relying on shutdown the backup plugin is more robust against something (Linux OOM killer, SystemD getting impatient) SIGKILLing the main lightningd.
  • The safety model of db_write assumes that all backups have been updated by the time db_write responds; it is unsafe for us to send revocation keys to the counterparty unless all replicas of our current state are up-to-date, and responding to db_write "early" (and if you do not respond "early", what are you doing in the shutdown notification handler? there should not be anything to do in that situation!) would allow revocation keys to be sent, which makes the backups toxic waste (admittedly only in the short term, but still).
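
Put as a hedged pyln-client sketch (commit_to_backup is a hypothetical stand-in for the actual persistence code):

```
#!/usr/bin/env python3
from pyln.client import Plugin

plugin = Plugin()

def commit_to_backup(data_version, writes):
    """Hypothetical: synchronously write the statements to the backup
    medium and fsync, only returning once they are indelible."""

@plugin.hook("db_write")
def on_db_write(writes, data_version, plugin, **kwargs):
    # lightningd's event loop stalls until we respond, so by construction
    # no revocation can be sent before the backup is up to date.
    commit_to_backup(data_version, writes)
    return {"result": "continue"}

# Deliberately no "shutdown" subscription: plugin.run() returns when
# lightningd closes our stdin (EOF), and by then every acknowledged
# db_write is already durable, so there is nothing left to flush.
plugin.run()
```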

@cdecker
Member

cdecker commented Oct 22, 2021

Ideally, a backup plugin should not terminate on a shutdown notification, only on an EOF on the stdin. Basically, my understanding is that a backup plugin must in fact stall its db_write response until after it has ensured that everything has been saved as indelibly as possible, and only then respond to db_write.

Since not responding to db_write causes the lightningd event loop to actually stall, if done that way there is no way to cleanly shut down while a db_write is in-flight. In that case, there should not be anything left for a backup plugin to do at shutdown, because all writes should have completed before db_write was responded to.

Basically, my understanding is that a backup plugin should ignore shutdown notifications and only terminate on EOF of stdin, and should not respond to db_write until after the write has been indelibly confirmed in the backup medium.

Correct. The exact reason for db_write being synchronous, locking the entire event loop, is that in order to be safe the transaction must be written before lightningd commits the DB transaction to disk, and before we start the communication round with the peers, which could otherwise lock us into an uncomfortable situation if we were to lose our DB right then (basically after sending commitment revocations). So unless you synchronously stream the changes to the backup medium and commit them there before responding to db_write, you're likely at risk of losing funds.

In this sense I do agree that any subscriber to the db_write hook must ignore any shutdown notification, and should rather perform pre-flight checks when starting, potentially aborting if we notice that the backup is not in sync with the actual database (the data_version of the DB must be equal to the backup's, or at most lag it by 1; the latter is the case where the DB commit was not completed successfully).
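
That version rule can be written down as a small sketch (this is just the invariant, not backup.py's actual code):

```
def check_backup_sync(db_version: int, backup_version: int) -> None:
    # In sync: versions equal. Interrupted commit: the backup is exactly
    # one ahead, because db_write was acknowledged but lightningd's own
    # commit never completed. Anything else is unsafe to continue from.
    if backup_version not in (db_version, db_version + 1):
        raise SystemExit("Backup is out of date, we cannot continue safely.")
```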

SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 25, 2021
*first* call keeps db_write plugins alive, subscribed plugins have 30s to
finish their business

*second* call notifies subscribed db_write plugins for 2nd time
and gives them 5s to self-terminate

issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Oct 25, 2021
… in shutdown

Here important-plugin implies `important hook`.

Before this commit, when in shutdown:
- existing in-flight hooks were abandoned, cutting the hook-chain and never calling hook_final_cb
- hooks were removed when their plugin died, even for an important-plugin, because `shutdown` overrules
- but hook events can still be called while waiting for plugins to self-terminate (up to 30s) and
  subdaemons are still alive, and it then looks as if no plugin ever registered the hook.

After this commit, when in shutdown:
- existing in-flight hook (chains) are honoured and can finalize, same semantics as LD_STATE_RUNNING
- important-plugins are kept alive until after shutdown_subdaemons, so they don't miss hooks
- JSON RPC commands are functional, but anything unimportant-plugin related cannot be relied on

TODO:
- Run tests -> hangs forever on test_closing, so skip them

- Q. Does this open a can of worms or races when (normal) plugins with hooks die randomly?
  A. Yes, for example an htlc_accepted call triggers the invoice_payment hook, but the plugin (fetchinvoice?) already died

**
* CONCLUSION: If you want to give more control over shutdown, I think there could be
* a plugin `shutdown_clean.py` with RPC method `shutdown_clean`. When called, that
* plugin starts additional (important) plugin(s) that register relevant hooks and, for example, hold off
* new htlc's and wait for existing in-flight htlc's to resolve ... and finally call RPC `stop`.
*
* Note: --important-plugin only seems to work at start, not via `plugin start shutdown_clean.py`,
* maybe we can add that? Or do something with disable?
*
* Some parts of this commit are still good, i.e. hook semantics of important plugins should be consistent
* until the very last potential hook call.
**

- What if an important-plugin dies unexpectedly and lightningd_exit() calls io_break(), is that bad?
- What are the benefits? Add example where on shutdown inflight htlc's are resolved/cleared and
  new htlc's blocked, see ElementsProject#4842

- Split commit into hook-related stuff and others, for clarity of reasoning
- Q. How does this relate (hook-wise) to db_write plugins? A. Looks like this hook is treated like
  any other hook: when the plugin dies, the hook is removed, so to be safe backup needs to be `important`.
  Hook documentation does not mention `important-plugin` but BACKUP.md does.
  TODO: Tested this -> `plugin stop backup.py` -> "plugin-backup.py: Killing plugin: exited during normal operation"
  In fact, running the current backup.py with current master misses a couple of writes in
  shutdown (because its hook is removed, see issue ElementsProject#4785).
SimonVrouwe added a commit to SimonVrouwe/plugins that referenced this issue Nov 18, 2021
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 25, 2021
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down, making RPC calls fail with error code -5.
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Nov 26, 2021
Fixes: ElementsProject#4785
Fixes: ElementsProject#4883

Changelog-Changed: Plugins: `shutdown` notification is now sent when lightningd is almost completely shut down, RPC calls then fail with error code -5.
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 4, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 8, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 12, 2023
…er closing db

in the second shutdown_plugins call, trigger EOF on the plugin's stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

inspired by ElementsProject#4790 and also
credit to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 13, 2023
So a db_write plugin can still self-terminate when it's ready for it.

Credit to ZmnSCPxj for the EOF idea, see issue ElementsProject#4785
SimonVrouwe added a commit to SimonVrouwe/lightning that referenced this issue Jan 29, 2023
…er closing db

in the second shutdown_plugins call, trigger EOF on the plugin's stdin.

This correctly handles corner cases in shutdown_plugins:
- in a non-dev build, built-in plugins don't subscribe to shutdown,
  but we still need a (brief) io_loop to notify interested db_write
  plugins
- only an awaited-for plugin should call io_break
- always send EOF to db_write plugins in the final call (subscribed or not)

This adds two helper functions:
 plugin_registered_db_write_hook
 jsonrpc_command_del

and a new plugin state: SHUTDOWN

inspired by ElementsProject#4790 and also
credit to ZmnSCPxj for mentioning the EOF idea in ElementsProject#4785

Hopefully this fixes the "Nope" from prev commit, these:
```
ERROR tests/test_invoices.py::test_amountless_invoice - ValueError:
ERROR tests/test_misc.py::test_htlc_out_timeout - ValueError:
ERROR tests/test_pay.py::test_sendpay_msatoshi_arg - ValueError:
ERROR tests/test_pay.py::test_pay_peer - ValueError:
ERROR tests/test_pay.py::test_listpays_with_filter_by_status - ValueError:
ERROR tests/test_pay.py::test_channel_receivable - ValueError:
```

were still triggering db_write's:
```
lightningd: FATAL SIGNAL 11 (version v22.11.1-149-g93f1458-modded)
0x555e78c30d0f send_backtrace
        common/daemon.c:33
0x555e78c30db4 crashdump
        common/daemon.c:46
0x7fd20cc60d5f ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x555e78c53665 db_begin_transaction_
        db/exec.c:124
0x555e78bc7557 read_json
        lightningd/jsonrpc.c:1090
0x555e78ce8bc3 next_plan
        ccan/ccan/io/io.c:59
0x555e78ce976d do_plan
        ccan/ccan/io/io.c:407
0x555e78ce97ab io_ready
        ccan/ccan/io/io.c:417
0x555e78ceb9c1 io_loop
        ccan/ccan/io/poll.c:453
0x555e78bfdcb2 shutdown_plugins
        lightningd/plugin.c:2179
0x555e78bcae76 main
        lightningd/lightningd.c:1281
0x7fd20cc4bd09 __libc_start_main
        ../csu/libc-start.c:308
0x555e78b9d3c9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
Log dumped in crash.log.20230126160704
```

Not from plugin_response_handle !!!
What are they? It seems every parse_request() in read_json() is wrapped in a db transaction:
```
	if (!in_transaction) {
		db_begin_transaction(jcon->ld->wallet->db);
		in_transaction = true;
	}
	parse_request(jcon, jcon->input_toks);
```

But wasn't jsonrpc_stop_listening supposed to block new requests? <== Well, sort of, it only blocks new rpc connections.
But (surprise) the built-in plugins reuse their rpc connection and don't close it, so their requests go through!