
lightningd: make shutdown smoother and safer, PART II (again) #4959

Merged

Conversation

@SimonVrouwe (Collaborator) commented Dec 6, 2021:

Follow-up of #4897; fixes the issues raised in that PR. (Similar to #4947, but that one could not be reopened after a force push.)

  • Simply close the db in shutdown_plugins; this partially reverts 5f69674 (the part that touched wallet code).
  • Use a simpler io_loop in shutdown_plugins.
  • Remove test_htlc_accepted_hook_shutdown; the relevant parts moved into test_plugin_shutdown.

Some cleanup: timer_expired() had an obsolete parameter.

BTW, I found the comment below inside io_loop confusing:

    /* Call functions for expired timers. */
    *expired = timers_expire(timers, now);

Maybe more descriptive would be: /* Look for expired timers. */

The only thing that needs ld->wallet after this is destroy_invoices_waiter (off jsonrpc).
I could not find any other destructors (destroy_*) that need wallet or db access after this; any db access would now segfault.
@rustyrussell (Contributor) commented:

CI failure not your fault:

    2021-12-06T19:53:16.984Z **BROKEN** plugin-offers: Failed invoice_request lnr1qvsqvgnwgcg35z6ee2h3yczraddm72xrfua9uve2rlrm9deu7xyfzrcyypwtes9k0sxxc6djuglrps2nygs5900e7l357c9d882krxghs4lt2rqrqfpqqf3q3kwsuapwqgw3j2nqtz0pecn27vavghx8tt84zwnm0gv6zsgvcpxnyyxaul4fyuky2ru90r854h6glznq7pqdsh2acq3qevz62lpr3d69k0qr0nvh440txmm4h8nq4jzc0fn0q4ujwdf5kterusewlsfekhlncfeapkhzcrxg39dm4e847xjgcw7gdq for offer 5cbcc0b67c0c6c69b2e23e30c153222142bdf9f7e34f60ad39d5619917857eb5: Got JSON error: {\"code\":-5,\"message\":\"lightningd is shutting down\"}

I think we shouldn't log **BROKEN** if the error code is -5.
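For illustration, a minimal sketch of that idea as a pyln.client plugin. The real offers plugin is written in C, and the createinvoicerequest call here is only a stand-in for whatever RPC the offers flow actually issues; the point is just to treat error code -5 ("lightningd is shutting down") as expected and log it at debug level instead of **BROKEN**:

    from pyln.client import Plugin, RpcError

    plugin = Plugin()

    def fetch_invoice(plugin, request):
        """Stand-in for the offers plugin's invoice_request handling."""
        try:
            # Hypothetical RPC call, only for illustration.
            return plugin.rpc.call("createinvoicerequest", request)
        except RpcError as e:
            if e.error.get("code") == -5:
                # -5 means lightningd is shutting down: expected during
                # shutdown, so log at debug rather than **BROKEN**.
                plugin.log("invoice_request failed: shutting down",
                           level="debug")
                return None
            raise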

Did you want to patch it, or should I?

@SimonVrouwe (Collaborator, Author) commented:

> Did you want to patch it, or should I?

@rustyrussell Sure, go ahead! I guess l4 is stopped too quickly?

@vincenzopalazzo (Collaborator) commented:

> @rustyrussell Sure, go ahead! I guess l4 is stopped too quickly?

    lightningd-4: 2021-12-06T19:53:16.984Z **BROKEN** plugin-offers: Failed invoice_request lnr1qvsqvgnwgcg35z6ee2h3yczraddm72xrfua9uve2rlrm9deu7xyfzrcyypwtes9k0sxxc6djuglrps2nygs5900e7l357c9d882krxghs4lt2rqrqfpqqf3q3kwsuapwqgw3j2nqtz0pecn27vavghx8tt84zwnm0gv6zsgvcpxnyyxaul4fyuky2ru90r854h6glznq7pqdsh2acq3qevz62lpr3d69k0qr0nvh440txmm4h8nq4jzc0fn0q4ujwdf5kterusewlsfekhlncfeapkhzcrxg39dm4e847xjgcw7gdq for offer 5cbcc0b67c0c6c69b2e23e30c153222142bdf9f7e34f60ad39d5619917857eb5: Got JSON error: {\"code\":-5,\"message\":\"lightningd is shutting down\"}

Hmm, given the error above, is it possible that the error is expected from c-lightning, and that we should just ignore error -5?

I'm not sure where we can ignore it; the idea is to ignore every plugin error that returns -5 and print it as debug info.

Am I missing something?

@SimonVrouwe (Collaborator, Author) commented:

It looks like it happens in test_fetchinvoice_3hop:

lightning/tests/test_pay.py, lines 4353 to 4356 in bad0988:

    l1.rpc.call('fetchinvoice', {'offer': offer1['bolt12']})
    # Test with obsolete onion.
    l4.stop()

The rpc.call returns immediately after it receives the first onion_reply from l4. But l4 receives 2 onions (one obsolete and one not) and is stopped while handling the 2nd onion, causing the "lightningd is shutting down" error.

So we have a race between l4.stop() and l4 handling the 2nd onion message. I could not reproduce the error, but a simple solution seems to be a sleep before l4.stop(), so it has time to finish handling the 2nd onion.
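A minimal sketch of that workaround against the test code quoted above (the one-second delay is an arbitrary choice, just long enough for l4 to finish the 2nd onion):

    import time

    l1.rpc.call('fetchinvoice', {'offer': offer1['bolt12']})
    # Give l4 time to finish handling the 2nd (non-obsolete) onion
    # message, so its reply doesn't race with the shutdown below.
    time.sleep(1)
    # Test with obsolete onion.
    l4.stop()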

@SimonVrouwe (Collaborator, Author) commented:

The (hopefully) last failing test is test_htlc_accepted_hook_shutdown. I have seen it fail before on my local machine, but only with VALGRIND=1 (whereas in CI it's 0). It appears to be an unresponsive l1 that won't stop by itself (stopping it manually makes the test pass).

Running test_htlc_accepted_hook_shutdown locally also showed differences between good cases (passing the test), which seem related to the adaptive (payment) splitter.
The last log entries of the test then become:

    lightningd-2: 2021-12-07T09:26:12.112Z INFO    plugin-misc_notifications.py: delaying shutdown with 5s
    lightningd-1: 2021-12-07T09:26:12.192Z INFO    plugin-pay: cmd 71 partid 1: No path found
    lightningd-1: 2021-12-07T09:26:12.271Z INFO    plugin-pay: cmd 71 partid 2: No path found
    lightningd-1: 2021-12-07T09:26:12.289Z INFO    plugin-pay: cmd 71 partid 3: No path found
    lightningd-1: 2021-12-07T09:26:12.308Z INFO    plugin-pay: cmd 71 partid 4: No path found
    lightningd-1: 2021-12-07T09:26:12.376Z INFO    plugin-pay: cmd 71 partid 5: No path found
    lightningd-1: 2021-12-07T09:26:12.505Z INFO    plugin-pay: cmd 71 partid 6: No path found
    lightningd-1: 2021-12-07T09:26:12.528Z INFO    plugin-pay: cmd 71 partid 7: No path found
    lightningd-1: 2021-12-07T09:26:12.553Z INFO    plugin-pay: cmd 71 partid 8: No path found
    lightningd-1: 2021-12-07T09:26:12.579Z INFO    plugin-pay: cmd 71 partid 9: No path found
    lightningd-1: 2021-12-07T09:26:12.674Z INFO    plugin-pay: cmd 71 partid 10: No path found
    lightningd-1: 2021-12-07T09:26:12.675Z INFO    plugin-pay: cmd 71 partid 10: Lower limit of adaptive splitter reached (1000msat < 100000msat), not splitting further.
    lightningd-2: 2021-12-07T09:26:17.155Z DEBUG   plugin-misc_notifications.py: Killing plugin: exited during normal operation
    lightningd-2: 2021-12-07T09:26:17.160Z DEBUG   lightningd: Command returned result after jcon close
    .lightningd-1: 2021-12-07T09:26:17.189Z UNUSUAL lightningd: JSON-RPC shutdown

or like this:

    lightningd-2: 2021-12-07T09:32:20.894Z INFO    plugin-misc_notifications.py: delaying shutdown with 5s
    lightningd-1: 2021-12-07T09:32:20.895Z INFO    022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-chan#1: Attempt to send HTLC but unowned (CHANNELD_NORMAL)
    lightningd-1: 2021-12-07T09:32:20.896Z INFO    plugin-pay: cmd 48 partid 1: Failing a partial payment due to a failed RPC call: {\"code\":204,\"message\":\"failed: WIRE_TEMPORARY_CHANNEL_FAILURE (First peer not ready)\",\"data\":{\"erring_index\":0,\"failcode\":4103,\"failcodename\":\"WIRE_TEMPORARY_CHANNEL_FAILURE\",\"erring_node\":\"0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518\",\"erring_channel\":\"103x1x1\",\"erring_direction\":1}}
    lightningd-1: 2021-12-07T09:32:20.896Z INFO    plugin-pay: cmd 48 partid 1: Lower limit of adaptive splitter reached (1000msat < 100000msat), not splitting further.
    lightningd-2: 2021-12-07T09:32:25.938Z DEBUG   plugin-misc_notifications.py: Killing plugin: exited during normal operation
    lightningd-2: 2021-12-07T09:32:25.938Z DEBUG   lightningd: Command returned result after jcon close
    .lightningd-1: 2021-12-07T09:32:25.950Z UNUSUAL lightningd: JSON-RPC shutdown

Not sure what is happening there; will dig into it.

@SimonVrouwe (Collaborator, Author) commented:

I think I found the cause of the flakiness: the wait_for_log (below) didn't really wait, because that line occurs twice. So l2 was in a state somewhere between the stop command and the shutdown loop when l1 tried to pay, giving flaky results (a possible cleanup is sketched after the quoted code).

lightning/tests/test_plugin.py, lines 1108 to 1113 in c2d2cc1:

    f_stop = executor.submit(l2.rpc.stop)
    l2.daemon.wait_for_log(r'plugin-misc_notifications.py: delaying shutdown with 5s')
    # Should still fail htlc while shutting down
    with pytest.raises(RpcError):
        l1.rpc.pay(i1)
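One possible cleanup, sketched under the assumption that the duplicate line shows up in l2's own log, using pyln-testing's wait_for_logs() (also suggested in the final review below): wait for both occurrences, so l2 has really reached the shutdown loop before l1 pays:

    f_stop = executor.submit(l2.rpc.stop)
    # The log line occurs twice, so wait for both occurrences before
    # continuing; a plain wait_for_log can match the earlier one.
    l2.daemon.wait_for_logs(
        [r'plugin-misc_notifications.py: delaying shutdown with 5s'] * 2)
    # Should still fail htlc while shutting down
    with pytest.raises(RpcError):
        l1.rpc.pay(i1)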

In retrospect, the HTLC part of this test was mainly to demonstrate failure before this (and the previous) PR, because hooks were silently unregistered when a plugin was shut down. Now if we can incorporate the shutdown-notification and datastore parts of this test into another one, that is probably sufficient, and then this test can be removed.

Commit: …into test_plugin_shutdown

The fact that plugins are kept alive until the very end also ensures
their hooks are not silently unregistered.
@SimonVrouwe (Collaborator, Author) commented:

> Now if we can incorporate the shutdown-notification and datastore parts of this test into another one, that is probably sufficient, and then this test (test_htlc_accepted_hook_shutdown) can be removed.

I found the flake, but used the time to consolidate the tests.

The FreeBSD test fails on installing packages; not related.
@rustyrussell All patched, I think this is complete.

Commit: After leaving the main event loop, the only path to destroy_plugin goes via shutdown_plugins.
@rustyrussell (Contributor) left a comment:


Ack 8132509

We could do a little test cleanup using wait_for_logs(), but I really want to apply this so I'm going to do that immediately (we're seeing some flakes in unrelated PRs).

tests/test_plugin.py: review comment (resolved)