Connectd demux part 2 #4985

Merged

Conversation

rustyrussell
Contributor

@rustyrussell rustyrussell commented Dec 20, 2021

(Based on #4984, which has been merged.)

This removes the direct connection from per-peer daemons to gossipd.

  1. Removes the requirement to query gossipd for the latest channel_update when building errors; gossipd now pushes updates out.
  2. Makes the connectd<->gossipd fd asynchronous, and uses it to transfer gossip messages for each peer.
  3. Makes connectd handle onion messages, pings and custom messages itself (ping handling is sketched below).
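
For point 3, the ping handling connectd takes over is just the BOLT #1 rule: reply to a `ping` with a `pong` carrying `num_pong_bytes` zero bytes, unless `num_pong_bytes` is 65532 or more, in which case stay silent. A minimal sketch under those assumptions (simplified framing; the helper name is not the actual c-lightning API):

```
/* Toy sketch of BOLT #1 ping handling, not the real connectd code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WIRE_PING 18
#define WIRE_PONG 19

/* Build a pong in reply to a ping: reply only if num_pong_bytes < 65532,
 * and fill the pong with that many zero bytes. */
uint8_t *make_pong(uint16_t num_pong_bytes, size_t *len)
{
	uint8_t *pong;

	if (num_pong_bytes >= 65532)
		return NULL;			/* BOLT #1: do not reply */

	*len = 2 + 2 + num_pong_bytes;		/* type + byteslen + ignored */
	pong = calloc(1, *len);
	pong[0] = 0;				/* big-endian type */
	pong[1] = WIRE_PONG;
	pong[2] = num_pong_bytes >> 8;		/* big-endian byteslen */
	pong[3] = num_pong_bytes & 0xff;
	/* remaining bytes stay zero */
	return pong;
}

int main(void)
{
	size_t len;
	uint8_t *pong = make_pong(16, &len);
	printf("pong is %zu bytes\n", pong ? len : 0);
	free(pong);
	return 0;
}
```
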

@rustyrussell
Contributor Author

Trivial rebase.

This is in preparation for gossipd feeding us the latest channel_updates,
rather than having lightningd and channeld query gossipd whenever they
want to send an onion error with an update included.

This means gossipd will start telling us the updates, so we need the
channels loaded first.

Signed-off-by: Rusty Russell <[email protected]>
We want it to keep the latest channel_update, so it can make its own
error messages without asking us.  This installs (but does not use!) the
message handler.

Signed-off-by: Rusty Russell <[email protected]>
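
The payoff is that an UPDATE-type onion failure (BOLT #4) can be assembled locally from a cached copy of the most recent channel_update. A rough sketch of that idea, with made-up structure and helper names rather than the real lightningd/channeld code (the cache is assumed to start zero-initialized):

```
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WIRE_TEMPORARY_CHANNEL_FAILURE ((uint16_t)(0x1000 | 7))	/* UPDATE|7 */

struct cached_update {
	uint8_t *msg;		/* raw channel_update, including 2-byte type */
	uint16_t len;
};

/* Called whenever gossipd pushes a fresher update for our side of the channel. */
void remember_update(struct cached_update *cache,
		     const uint8_t *update, uint16_t len)
{
	free(cache->msg);
	cache->msg = malloc(len);
	memcpy(cache->msg, update, len);
	cache->len = len;
}

/* Build failuremsg = [u16 failure_code][u16 len][len bytes channel_update]. */
uint8_t *make_temporary_channel_failure(const struct cached_update *cache,
					size_t *outlen)
{
	uint8_t *buf;

	*outlen = 2 + 2 + cache->len;
	buf = malloc(*outlen);
	buf[0] = WIRE_TEMPORARY_CHANNEL_FAILURE >> 8;
	buf[1] = WIRE_TEMPORARY_CHANNEL_FAILURE & 0xff;
	buf[2] = cache->len >> 8;
	buf[3] = cache->len & 0xff;
	memcpy(buf + 4, cache->msg, cache->len);
	return buf;
}
```
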
Even if we're deferring putting them in the store and broadcasting them,
we tell lightningd about updates so it can use them in any error messages.

Signed-off-by: Rusty Russell <[email protected]>
This way it can flush it if it was pending.

Signed-off-by: Rusty Russell <[email protected]>
…g gossipd.

We also no longer strip the type off: everyone handles both forms, and
Eclair doesn't strip (and it's easier!).

Signed-off-by: Rusty Russell <[email protected]>
Now we don't ask gossipd, but lightningd keeps channeld up-to-date.

Signed-off-by: Rusty Russell <[email protected]>
We're weaning per-peer daemons off having a direct gossipd connection.

Signed-off-by: Rusty Russell <[email protected]>
The last change exposed a race: the peer sends funding_locked
then immediately sends a channel_update.  channeld used to process
the funding_locked from the peer, tell gossipd about the new
channel, then finally forward the channel_update.

We can have the channel_update hit gossipd before we've told it about
the channel.  It ignores the channel_update for the currently-unknown
channel: we get a 'bad gossip' message, but the immediate symptom
is a timeout in tests/test_closing.py::test_onchain_multihtlc_their_unilateral:

```
node_factory = <pyln.testing.utils.NodeFactory object at 0x7fdf93f42190>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fdf940b99d0>

    @pytest.mark.developer("needs DEVELOPER=1 for dev_ignore_htlcs")
    @pytest.mark.slow_test
    def test_onchain_multihtlc_their_unilateral(node_factory, bitcoind):
        """Node pushes a channel onchain with multiple HTLCs with same payment_hash """
>       h, nodes = setup_multihtlc_test(node_factory, bitcoind)

tests/test_closing.py:2938: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_closing.py:2780: in setup_multihtlc_test
    nodes = node_factory.line_graph(7, wait_for_announce=True,
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1416: in line_graph
    self.join_nodes(nodes, fundchannel, fundamount, wait_for_announce, announce_channels)
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1394: in join_nodes
    nodes[i + 1].wait_channel_active(scids[i])
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:958: in wait_channel_active
    wait_for(lambda: self.is_channel_active(chanid))
```

Note that messages between subdaemons are usually delivered much faster
than messages between peers, but under CI this race is common, since
everything runs on the same machine.

Signed-off-by: Rusty Russell <[email protected]>
Once we send funding_locked, gossipd could start seeing channel_updates
from the peer (which get sent so we can use the channel in routehints
even before it's announceable).

Signed-off-by: Rusty Russell <[email protected]>
We want to stream gossip through this, but currently connectd treats the
fd as synchronous.  While we work on getting rid of that, it's easiest to
have two fds.

Signed-off-by: Rusty Russell <[email protected]>
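
One way to picture the two-fd arrangement: a dedicated pair for streaming gossip keeps the existing synchronous request/reply channel from being blocked behind a large gossip backlog. A plain-POSIX sketch of the setup, not the actual lightningd fd-passing code:

```
#include <sys/socket.h>
#include <stdio.h>

struct daemon_fds {
	int sync_fd;	/* request/response, answered in lockstep */
	int gossip_fd;	/* async, gossip messages streamed as they arrive */
};

int make_fd_pairs(struct daemon_fds *connectd_end, struct daemon_fds *gossipd_end)
{
	int sync_pair[2], gossip_pair[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pair) != 0)
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, gossip_pair) != 0)
		return -1;

	connectd_end->sync_fd = sync_pair[0];
	gossipd_end->sync_fd = sync_pair[1];
	connectd_end->gossip_fd = gossip_pair[0];
	gossipd_end->gossip_fd = gossip_pair[1];
	return 0;
}

int main(void)
{
	struct daemon_fds c, g;

	if (make_fd_pairs(&c, &g) != 0) {
		perror("socketpair");
		return 1;
	}
	printf("connectd: sync=%d gossip=%d; gossipd: sync=%d gossip=%d\n",
	       c.sync_fd, c.gossip_fd, g.sync_fd, g.gossip_fd);
	return 0;
}
```
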
@rustyrussell rustyrussell force-pushed the connectd-demux-part-2 branch 5 times, most recently from 456f9bb to b4ca774 on January 26, 2022 02:36
@rustyrussell rustyrussell force-pushed the connectd-demux-part-2 branch 3 times, most recently from 58c9044 to b99d304 on January 28, 2022 06:25
The next patch starts a timeout ping, which can interfere with results.

In theory, we should reply, but in practice (so far!) we seem to get enough
time that it doesn't hang up on us.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: JSON-RPC: `ping` now works with connected peers, even without a channel.
We don't need to log messages from subdaemons, but we do need to log our own, and we weren't.

1. Rename queue_peer_msg to inject_peer_msg for clarity, make it do logging
2. In the one place where we're relaying, call msg_queue() directly.

Signed-off-by: Rusty Russell <[email protected]>
This is mainly useful for connectd.

Signed-off-by: Rusty Russell <[email protected]>
This is neater than what we had before, and slightly more general.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: JSON-RPC: `sendcustommsg` now works with any connected peer, even when shutting down a channel.
We currently die when gossipd vanishes, but our direct connection will
go away.  We then complain if the node is shutting down while we're talking
to hsmd.

Signed-off-by: Rusty Russell <[email protected]>
Gossipd now simply gets told by channeld when peers arrive or leave.
(It only needs to know for the seeker.)

Signed-off-by: Rusty Russell <[email protected]>
Now we only send and receive gossip messages on this fd.

Signed-off-by: Rusty Russell <[email protected]>
…msg.

We don't need the connection to ourselves, just to free it.

Signed-off-by: Rusty Russell <[email protected]>
Don't send the EOF marker to the peer, as seen e.g. in tests/test_gossip.py::test_gossip_store_compact:

```
lightningd-2: 2022-01-24T03:34:22.925Z DEBUG   connectd: gossip_store at end, new fd moved to 1875
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Sending gossip INVALID 4105
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: peer_in WIRE_WARNING
lightningd-2: 2022-01-24T03:34:22.941Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_out INVALID 4105
lightningd-2: 2022-01-24T03:34:22.949Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: billboard perm: Received warning channel 2c7cf1dc9dada7ed14f10c78ade8f0de907c1b70e736c12ff6f7472dc69c3db3: Peer sent unknown message 4105 (INVALID 4105)
```

Signed-off-by: Rusty Russell <[email protected]>
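
The guard this implies is a whitelist of the gossip wire types that are meaningful to a peer; internal gossip_store records, like the type 4105 the peer warned about above, are consumed locally instead. A hedged sketch (type constants are from BOLT #7; the helper name is invented for illustration):

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WIRE_CHANNEL_ANNOUNCEMENT 256
#define WIRE_NODE_ANNOUNCEMENT    257
#define WIRE_CHANNEL_UPDATE       258

/* Only genuine gossip messages should ever be relayed to the peer. */
bool is_peer_gossip(const uint8_t *msg, size_t len)
{
	uint16_t type;

	if (len < 2)
		return false;
	type = ((uint16_t)msg[0] << 8) | msg[1];

	switch (type) {
	case WIRE_CHANNEL_ANNOUNCEMENT:
	case WIRE_NODE_ANNOUNCEMENT:
	case WIRE_CHANNEL_UPDATE:
		return true;
	default:
		return false;	/* internal record: don't send to peer */
	}
}
```
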
We were relying on the fee update to create an additional tx.  That's
ugly; do an actual payment and make sure we definitely complete a new
tx by waiting for the payment and *then* for both revoke_and_acks.

(Without this, we could get a unilateral close instead of a penalty).

Signed-off-by: Rusty Russell <[email protected]>
Don't assume gossip send order: explicitly disconnect and reconnect.

Signed-off-by: Rusty Russell <[email protected]>
…fast.

If we fund a channel between two nodes, then mine all the blocks to
announce it, any other nodes may see the announcement before the
blocks, causing CI to complain about "bad gossip":

```
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Ignoring future channel_announcment for 113x1x1 (current block 112)
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/0
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/1
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_NODE_ANNOUNCEMENT before announcement 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e
```

Add a new helper for this case, and use it where there are more than 2 nodes.

Cleans up test_routing_gossip and a few other places which did this manually.

Signed-off-by: Rusty Russell <[email protected]>
`hc` is never NULL, since it comes from `hc = &chan->half[direction];`.
We really meant "is it initialized?", and valgrind under CI finally
caught it:

```
==69243== Conditional jump or move depends on uninitialised value(s)
==69243==    at 0x11C595: handle_local_channel_update (gossip_generation.c:758)
==69243==    by 0x115254: recv_req (gossipd.c:986)
==69243==    by 0x128F8D: handle_read (daemon_conn.c:31)
==69243==    by 0x16BEE1: next_plan (io.c:59)
==69243==    by 0x16CAE9: do_plan (io.c:407)
==69243==    by 0x16CB2B: io_ready (io.c:417)
==69243==    by 0x16EE1E: io_loop (poll.c:453)
==69243==    by 0x1154DA: main (gossipd.c:1089)
==69243==
```

Signed-off-by: Rusty Russell <[email protected]>
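
A stripped-down reproduction of that bug class, with illustrative field names rather than gossipd's real struct layout:

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct half_chan {
	bool have_update;	/* set once we've seen a channel_update */
	uint32_t timestamp;
};

struct chan {
	struct half_chan half[2];
};

void handle_update(struct chan *chan, int direction)
{
	struct half_chan *hc = &chan->half[direction];

	/* BUG: always false -- &chan->half[direction] can never be NULL. */
	if (!hc)
		return;

	/* FIX: test whether this side is actually initialized. */
	if (!hc->have_update)
		return;

	printf("previous update timestamp: %u\n", hc->timestamp);
}
```
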
Otherwise we get weird effects, as htlcs are being freed:

```
2022-01-26T05:07:37.8774610Z lightningd-1: 2022-01-26T04:47:48.770Z DEBUG   030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Failing HTLC 18446744073709551615 due to peer death
2022-01-26T05:07:37.8775287Z lightningd-1: 2022-01-26T04:47:48.770Z **BROKEN** 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Neither origin nor in?
```

Signed-off-by: Rusty Russell <[email protected]>
…ght.

If we call update_channel_from_inflight *twice* with the same inflight, we
will get bad results.  Using tal_steal() here was a premature optimization:

```
Valgrind error file: valgrind-errors.496395
==496395== Invalid read of size 8
==496395==    at 0x22A9D3: to_tal_hdr (tal.c:174)
==496395==    by 0x22B4B5: tal_steal_ (tal.c:498)
==496395==    by 0x16A13D: update_channel_from_inflight (peer_control.c:1225)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==    by 0x1772D9: plugin_read_json (plugin.c:735)
==496395==  Address 0x89fbb08 is 24 bytes inside a block of size 104 free'd
==496395==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22B193: del_tree (tal.c:421)
==496395==    by 0x22B461: tal_free (tal.c:486)
==496395==    by 0x16A123: update_channel_from_inflight (peer_control.c:1223)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==  Block was alloc'd at
==496395==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22AC1C: allocate (tal.c:250)
==496395==    by 0x22B1DD: tal_alloc_ (tal.c:428)
==496395==    by 0x22B3A6: tal_alloc_arr_ (tal.c:471)
==496395==    by 0x22C094: tal_dup_ (tal.c:805)
==496395==    by 0x12B274: new_inflight (channel.c:187)
==496395==    by 0x136D4C: wallet_commit_channel (dual_open_control.c:1260)
==496395==    by 0x13B084: handle_commit_received (dual_open_control.c:2839)
==496395==    by 0x13B6AF: dual_opend_msg (dual_open_control.c:2976)
==496395==    by 0x1809FF: sd_msg_read (subd.c:553)
==496395==    by 0x218F5D: next_plan (io.c:59)
==496395==    by 0x219B65: do_plan (io.c:407)
```

Signed-off-by: Rusty Russell <[email protected]>
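
A plain-C caricature of the failure mode (field and function names are made up; the real code uses tal allocations, so the second tal_steal() tripped valgrind on the freed tal header): moving ownership out of the inflight means a second call with the same inflight frees the very buffer the inflight still points to and then re-installs the dangling pointer, whereas copying is idempotent.

```
#include <stdlib.h>
#include <string.h>

struct inflight { unsigned char *data; size_t len; };
struct channel  { unsigned char *data; size_t len; };

/* Buggy shape: "steal" the buffer out of the inflight.  A second call
 * with the same inflight frees the buffer the inflight still points to,
 * then installs that dangling pointer again. */
void update_by_stealing(struct channel *c, struct inflight *inf)
{
	free(c->data);
	c->data = inf->data;
	c->len = inf->len;
}

/* Fixed shape: duplicate instead, leaving the inflight untouched, so
 * calling this any number of times with the same inflight is safe. */
void update_by_copying(struct channel *c, struct inflight *inf)
{
	free(c->data);
	c->data = malloc(inf->len);
	memcpy(c->data, inf->data, inf->len);
	c->len = inf->len;
}
```
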
If the HTLCs are completely negotiated, we can get a channel break when
we mine a pile of blocks.  This is mainly seen with Postgres, due to the db
speed.

Signed-off-by: Rusty Russell <[email protected]>
We really need our own lnprototest tests for packet-based stuff;
these message-based tests are inherently delicate and awkward.

In particular, connectd now does dev-disconnect, so the socket is not
immediately closed after a dev-disconnect command.  In this case, the
WIRE_SHUTDOWN has often already been written from connectd to channeld.

But it sometimes works, too.

Signed-off-by: Rusty Russell <[email protected]>
@@ -3438,6 +3438,7 @@ def test_closing_higherfee(node_factory, bitcoind, executor):
@pytest.mark.developer("needs dev_disconnect")
def test_htlc_rexmit_while_closing(node_factory, executor):
"""Retranmitting an HTLC revocation while shutting down should work"""
# FIXME: This should be in lnprototest! UNRELIABLE.
Collaborator

Got it, Sir :) I will add it to my todo list.

Member

@cdecker cdecker left a comment

Excellent pull request! This massively simplifies a lot of our former
headaches and stabilizes a whole lot of tests as a bonus. I just have
some minor nits and questions that I found while reviewing.

Other than those clarifications I look forward to merging this :-)

@@ -1748,7 +1746,9 @@ def listpays_nofail(b11):
def test_pay_routeboost(node_factory, bitcoind, compat):
"""Make sure we can use routeboost information. """
# l1->l2->l3--private-->l4
l1, l2 = node_factory.line_graph(2, announce_channels=True, wait_for_announce=True)
# Note: l1 gets upset because it extracts update for private channel.
Member

I'm not sure I understand the comment here. Since we trigger "Bad gossip" only when receiving messages from peers, does this mean that we are accidentally sending a private update for which we don't have a matching announcement (being private)?

Contributor Author

Hmm, good question! I've removed the suppression of bad gossip detection, and am re-running.

Contributor Author

Yes, we are. Apparently Eclair will actually use these, though we don't currently.

Should we suppress them?

Collaborator

@vincenzopalazzo vincenzopalazzo left a comment

ACK 73e325c

@rustyrussell rustyrussell merged commit d4fee83 into ElementsProject:master Feb 8, 2022