Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle plugins being terminated correctly #3539

Merged
merged 7 commits into from
Feb 26, 2020

Conversation

cdecker
Copy link
Member

@cdecker cdecker commented Feb 19, 2020

Due to a bug in our plugin cleanup logic we'd be waiting for both the plugin's
stdout and stdin to get closed before cleaning it up. However, since we
only poll stdout for incoming messages from the plugin, and not poll its
stdin, we'd defer cleaning up indefinitely, until we either send something
to the plugin or we shut down the node entirely. This means that we'd also not
detect crashes in a reasonable time, and a plugin crashing while handling a
hook event could hang forever.

This PR first demonstrates this issue using a plugin that exits as soon as it
htlc_accepted hook is called, and then proceeds to fix the issue. We take
the closing of a plugin's stdout as the sole signal that the plugin is
exiting and trigger the cleanup immediately. This then surfaced a number of
issue where we had memory either lingering around or the tal_free order
being incorrect, resulting in a number of use-after-free issues. So I had to
dive in and clean things up a bit.

In order to facilitate skipping a crashed plugin I changed the call chain for
hook events to be a list instead of an array. Each hook event now has a list
of plugins that are still to call, and we pop off elements as we receive
responses or plugins exit.

Another issue was that we'd now be falsely detecting a node shutdown during
hook calls as the plugin exiting spontaneously. To facilitate these back out
cases in future I added a state variable that indicates whether we are
operational or we are shutting down. This used to be detected through a number
of side-effects that weren't well documented (variables being set to NULL
etc), so making this explicit should make this clearer.

@cdecker cdecker added the plugin label Feb 19, 2020
@cdecker cdecker added this to the 0.8.2 milestone Feb 19, 2020
@cdecker cdecker self-assigned this Feb 19, 2020
@cdecker
Copy link
Member Author

cdecker commented Feb 19, 2020

I have one more cleanup, removing the plugin array from the hook call altogether and rely solely on the call_chain list. Working on it now.

@cdecker
Copy link
Member Author

cdecker commented Feb 20, 2020

Caught one more memory-leak :-)

@cdecker cdecker marked this pull request as ready for review February 21, 2020 08:42
@@ -133,9 +144,21 @@ static struct command_result *plugin_start(struct dynamic_plugin *dp)
/* Give the plugin 20 seconds to respond to `getmanifest`, so we don't hang
* too long on the RPC caller. */
p->timeout_timer = new_reltimer(dp->cmd->ld->timers, dp,
time_from_sec((10)),
time_from_sec((20)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't 10 seconds long enough ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation says 20 seconds, and this was not matching, so I adapted the value to the documented one. We can also go the other way around :-)

@darosior
Copy link
Collaborator

darosior commented Feb 21, 2020 via email

@darosior darosior mentioned this pull request Feb 22, 2020
/* Call next will unlink, so we don't need to. This is treated
* equivalent to the plugin returning a continue-result.
*/
plugin_hook_callback(NULL, NULL, NULL, link->req);
Copy link
Collaborator

@darosior darosior Feb 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so this makes plugin_hook_callback call the callback with NULL so it thinks actually no plugin was registered for this hook and can continue operations ? This is neat as we don't crash as before, but don't we want to log broken or unusual here so we know the plugin actually crashed, and for which call ?
Also, since it crashed don't we want to unregister it ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get called due to plugin_killed being called or because we tal_free(plugin) which itself prints a warning to the logs and cleans up / unregisters the plugin from all hooks, hook-calls, and other things they might have registered for.

Copy link
Collaborator

@darosior darosior left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 229202e

We were waiting for both stdin and stdout to close, however that resulted in
us deferring cleanup indefinitely since we did not poll stdout for being
writable most of the time. On the other hand we are almost always polling
the plugin's stdout, so that notifies us as soon as the plugin stops.

Changelog-Fixed: plugin: Plugins no longer linger indefinitely if their process terminates
Changelog-Fixed: plugin: A crashing plugin will no longer cause a hook call to be delayed indefinitely
We make the current state of `lightningd` explicit so we don't have to
identify a shutdown by its side-effects. We then use this in order to prevent
the killing and freeing of plugins to continue down the chain of registered
plugins.
We are attaching the destructor to notify us when the plugin exits, but we
also need to clear them once the request is handled correctly, so we don't
call the destructor when it exits later.
We promised we'd be waiting up to 20 seconds, but were only waiting for
10. Fix that by bumping to the documented 20.
It was a pointer into the list of plugins for the hook, but it was rather
unstable: if a plugin exits after handling the event we could end up skipping
a later plugin. We now rely on the much more stable `call_chain` list, so we
can clean up that useless field.
@cdecker
Copy link
Member Author

cdecker commented Feb 25, 2020

Rebased on top of master, ACK was automatically re-applied (so @bitcoin-bot works... sometimes...)

@rustyrussell rustyrussell merged commit 8f87579 into ElementsProject:master Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants