Handle plugins being terminated correctly #3539

cdecker · 2020-02-19T16:11:19Z

Due to a bug in our plugin cleanup logic we'd be waiting for both the plugin's
stdout and stdin to get closed before cleaning it up. However, since we
only poll stdout for incoming messages from the plugin, and do not poll its
stdin, we'd defer cleaning up indefinitely, until we either send something
to the plugin or we shut down the node entirely. This means that we'd also not
detect crashes in a reasonable time, and a plugin crashing while handling a
hook event could hang forever.

This PR first demonstrates this issue using a plugin that exits as soon as its
`htlc_accepted` hook is called, and then proceeds to fix the issue. We take
the closing of a plugin's stdout as the sole signal that the plugin is
exiting and trigger the cleanup immediately. This then surfaced a number of
issues where we had memory either lingering around or the tal_free order
being incorrect, resulting in a number of use-after-free issues. So I had to
dive in and clean things up a bit.
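As a rough illustration of the "stdout closing is the sole exit signal" idea, here is a minimal sketch using plain POSIX `poll()` and `read()`. It is not the lightningd code (which uses its own I/O loop); `struct plugin_sketch` and `plugin_poll_once` are hypothetical names made up for this example.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the daemon's per-plugin bookkeeping. */
struct plugin_sketch {
	int stdout_fd;   /* read end: messages coming from the plugin */
	bool cleaned_up; /* set once cleanup has been triggered */
};

/* Poll the plugin's stdout once. Returns true while the plugin is alive.
 * EOF or a read error on stdout is treated as the only exit signal;
 * the plugin's stdin is never consulted. */
static bool plugin_poll_once(struct plugin_sketch *p, int timeout_ms)
{
	struct pollfd pfd = { .fd = p->stdout_fd, .events = POLLIN };
	char buf[256];
	ssize_t n;

	assert(!p->cleaned_up);
	if (poll(&pfd, 1, timeout_ms) <= 0)
		return true; /* timeout or interruption: nothing to do yet */

	n = read(p->stdout_fd, buf, sizeof(buf));
	if (n > 0)
		return true; /* a real daemon would parse the message here */

	/* n == 0 means the write end was closed: the plugin is gone,
	 * so clean up immediately instead of waiting for stdin too. */
	close(p->stdout_fd);
	p->cleaned_up = true;
	return false;
}
```

With a pipe standing in for the plugin's stdout, closing the write end makes the next `plugin_poll_once` call return false right away, without any write to the plugin being needed to notice the exit.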

In order to facilitate skipping a crashed plugin I changed the call chain for
hook events to be a list instead of an array. Each hook event now has a list
of plugins that are still to be called, and we pop off elements as we receive
responses or plugins exit.
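The list-based call chain can be sketched roughly as follows. This is an assumption-laden illustration, not the actual data structure from the PR: it uses a hand-rolled singly linked list and integer plugin ids, and the names `chain_link`, `hook_request_sketch`, `chain_append`, `chain_remove` and `chain_next` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One entry per plugin that still has to be called for this hook event. */
struct chain_link {
	int plugin_id;
	struct chain_link *next;
};

/* Hypothetical per-event state: just the remaining call chain. */
struct hook_request_sketch {
	struct chain_link *call_chain;
};

/* Append a plugin to the end of the chain (registration order). */
static void chain_append(struct hook_request_sketch *req, int plugin_id)
{
	struct chain_link **tail = &req->call_chain;
	struct chain_link *link = malloc(sizeof(*link));

	assert(link);
	link->plugin_id = plugin_id;
	link->next = NULL;
	while (*tail)
		tail = &(*tail)->next;
	*tail = link;
}

/* Unlink a plugin wherever it sits in the chain, e.g. because it
 * responded or because its process exited. Returns 1 if it was found. */
static int chain_remove(struct hook_request_sketch *req, int plugin_id)
{
	for (struct chain_link **pp = &req->call_chain; *pp; pp = &(*pp)->next) {
		if ((*pp)->plugin_id == plugin_id) {
			struct chain_link *dead = *pp;
			*pp = dead->next;
			free(dead);
			return 1;
		}
	}
	return 0;
}

/* The next plugin to call, or -1 once the chain is exhausted. */
static int chain_next(const struct hook_request_sketch *req)
{
	return req->call_chain ? req->call_chain->plugin_id : -1;
}
```

Because removal only relinks neighbours, a plugin that exits before its turn simply disappears from the chain, and the plugins after it are still reached in order.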

Another issue was that we'd now be falsely detecting a node shutdown during
hook calls as the plugin exiting spontaneously. To facilitate these back out
cases in future I added a state variable that indicates whether we are
operational or we are shutting down. This used to be detected through a number
of side-effects that weren't well documented (variables being set to NULL
etc), so making this explicit should make this clearer.
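A minimal sketch of what such an explicit state variable might look like is shown below. The enum, struct and function names (`node_state`, `node_sketch`, `on_plugin_gone`) are hypothetical and chosen for this example only; the real field in `lightningd` may be named and used differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Explicit daemon state, instead of inferring a shutdown from
 * side effects such as pointers having been set to NULL. */
enum node_state {
	NODE_RUNNING,
	NODE_SHUTTING_DOWN,
};

/* Hypothetical daemon handle holding the state variable. */
struct node_sketch {
	enum node_state state;
	int hooks_continued; /* how often we moved on to the next plugin */
};

/* Called when a plugin disappears while a hook call is pending.
 * Returns true if the call chain was continued. */
static bool on_plugin_gone(struct node_sketch *node)
{
	assert(node);
	/* During shutdown every plugin is being killed on purpose, so we
	 * must not treat this as a crash and walk further down the chain. */
	if (node->state == NODE_SHUTTING_DOWN)
		return false;

	node->hooks_continued++;
	return true;
}
```

The point of the design is that the back-out condition becomes a single readable comparison that future code paths can reuse.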

cdecker · 2020-02-19T16:11:57Z

I have one more cleanup: removing the plugin array from the hook call altogether and relying solely on the call_chain list. Working on it now.

cdecker · 2020-02-20T21:32:21Z

Caught one more memory-leak :-)

darosior · 2020-02-21T13:52:04Z

lightningd/plugin_control.c

@@ -133,9 +144,21 @@ static struct command_result *plugin_start(struct dynamic_plugin *dp)
 	/* Give the plugin 20 seconds to respond to `getmanifest`, so we don't hang
 	 * too long on the RPC caller. */
 	p->timeout_timer = new_reltimer(dp->cmd->ld->timers, dp,
-	                                time_from_sec((10)),
+	                                time_from_sec((20)),


Wasn't 10 seconds long enough?

cdecker · 2020-02-21

The documentation says 20 seconds, and this was not matching, so I adapted the value to the documented one. We can also go the other way around :-)

darosior · 2020-02-21T15:59:47Z

I think I just forgot to update it the last time I lowered the timeout ^^ But that's really a nit!


darosior · 2020-02-22T16:39:34Z

lightningd/plugin_hook.c

+		/* Call next will unlink, so we don't need to. This is treated
+		 * equivalent to the plugin returning a continue-result.
+		 */
+		plugin_hook_callback(NULL, NULL, NULL, link->req);


OK, so this makes plugin_hook_callback call the callback with NULL, so that it behaves as if no plugin were registered for this hook and can continue operations? This is neat since we no longer crash as before, but don't we want to log at the broken or unusual level here, so we know that the plugin actually crashed, and during which call?
Also, since it crashed, don't we want to unregister it?

cdecker · 2020-02-23

We get called either because plugin_killed was called or because we tal_free(plugin), which itself prints a warning to the logs and cleans up / unregisters the plugin from all hooks, hook calls, and anything else it might have registered for.
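The mechanism discussed here, where a plugin going away is treated like a continue-result for the pending hook call, can be sketched without tal as a plain callback that is armed when the request is sent and cleared once the plugin answers. All names below (`hook_req_sketch`, `plug_sketch`, `plug_send`, `plug_response`, `plug_exit`) are hypothetical; the PR itself relies on tal destructors rather than this hand-rolled callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pending hook call. */
struct hook_req_sketch {
	int continued_after_death; /* times we continued because the plugin died */
};

/* Hypothetical plugin with at most one request in flight. */
struct plug_sketch {
	void (*on_death)(struct hook_req_sketch *req);
	struct hook_req_sketch *pending;
};

/* Stand-in for "treat the exit like a continue-result". */
static void req_plugin_died(struct hook_req_sketch *req)
{
	req->continued_after_death++;
}

/* Sending a request arms the death notification. */
static void plug_send(struct plug_sketch *p, struct hook_req_sketch *req)
{
	assert(p->pending == NULL);
	p->pending = req;
	p->on_death = req_plugin_died;
}

/* A normal response disarms it, so a later exit cannot touch a
 * request that has already been handled (and possibly freed). */
static void plug_response(struct plug_sketch *p)
{
	p->on_death = NULL;
	p->pending = NULL;
}

/* Plugin exit: notify the pending request, if one is still armed. */
static void plug_exit(struct plug_sketch *p)
{
	if (p->on_death)
		p->on_death(p->pending);
	p->on_death = NULL;
	p->pending = NULL;
}
```

In this sketch a plugin that answers and then exits leaves the finished request untouched, while a plugin that exits mid-request causes exactly one continuation.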

darosior

ACK 229202e

We were waiting for both stdin and stdout to close, however that resulted in us deferring cleanup indefinitely since we did not poll stdin for being writable most of the time. On the other hand we are almost always polling the plugin's stdout, so that notifies us as soon as the plugin stops.

Changelog-Fixed: plugin: Plugins no longer linger indefinitely if their process terminates

We make the current state of `lightningd` explicit so we don't have to identify a shutdown by its side-effects. We then use this to prevent the killing and freeing of plugins during shutdown from continuing down the chain of registered plugins.

The removed field was a pointer into the list of plugins for the hook, but it was rather unstable: if a plugin exited after handling the event we could end up skipping a later plugin. We now rely on the much more stable `call_chain` list, so we can clean up that useless field.

cdecker · 2020-02-25T09:44:56Z

Rebased on top of master, ACK was automatically re-applied (so @bitcoin-bot works... sometimes...)

cdecker added the plugin label Feb 19, 2020

cdecker added this to the 0.8.2 milestone Feb 19, 2020

cdecker requested a review from rustyrussell February 19, 2020 16:11

cdecker self-assigned this Feb 19, 2020

cdecker force-pushed the plugin-hook-crash branch from 4e6abbe to 61ab064 Compare February 19, 2020 18:08

cdecker force-pushed the plugin-hook-crash branch from b2f1fe5 to 229202e Compare February 21, 2020 07:34

cdecker marked this pull request as ready for review February 21, 2020 08:42

darosior reviewed Feb 21, 2020

darosior mentioned this pull request Feb 22, 2020

Test lightningd/plugins#92

Closed

darosior reviewed Feb 22, 2020

darosior approved these changes Feb 24, 2020

cdecker added 7 commits February 25, 2020 10:44

pytest: Test a plugin crash while handling a hook call

1901033

plugin: Fix hanging hook calls if the plugin dies

81da76c

Changelog-Fixed: plugin: A crashing plugin will no longer cause a hook call to be delayed indefinitely

plugin: Avoid calling a destructor on a request that was freed

d33a6aa

We are attaching the destructor to notify us when the plugin exits, but we also need to clear them once the request is handled correctly, so we don't call the destructor when it exits later.

plugin: Actually wait the 20 seconds promised in the docs

a8bc1ee

We promised we'd be waiting up to 20 seconds, but were only waiting for 10. Fix that by bumping to the documented 20.

cdecker force-pushed the plugin-hook-crash branch from 229202e to acaec67 Compare February 25, 2020 09:44

rustyrussell merged commit 8f87579 into ElementsProject:master Feb 26, 2020
