Connectd demux part 2 #4985

Merged

Conversation

rustyrussell
Contributor

@rustyrussell rustyrussell commented Dec 20, 2021

(Based on #4984, which has been merged.)

This removes the direct connection from per-peer daemons to gossipd.

  1. Removes the requirement to query gossipd for the latest channel_update when building errors; gossipd now pushes updates out.
  2. Makes the connectd<->gossipd fd asynchronous, and uses it to transfer gossip messages for each peer.
  3. Makes connectd handle onion messages, pings and custom messages itself (ping handling is sketched below).
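
For point 3, the ping handling connectd takes over is just the BOLT #1 rule: reply to a `ping` with a `pong` carrying `num_pong_bytes` zero bytes, unless `num_pong_bytes` is 65532 or more, in which case stay silent. A minimal sketch under those assumptions (simplified framing; the helper name is not the actual c-lightning API):

```
/* Toy sketch of BOLT #1 ping handling, not the real connectd code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WIRE_PING 18
#define WIRE_PONG 19

/* Build a pong in reply to a ping: reply only if num_pong_bytes < 65532,
 * and fill the pong with that many zero bytes. */
uint8_t *make_pong(uint16_t num_pong_bytes, size_t *len)
{
	uint8_t *pong;

	if (num_pong_bytes >= 65532)
		return NULL;			/* BOLT #1: do not reply */

	*len = 2 + 2 + num_pong_bytes;		/* type + byteslen + ignored */
	pong = calloc(1, *len);
	pong[0] = 0;				/* big-endian type */
	pong[1] = WIRE_PONG;
	pong[2] = num_pong_bytes >> 8;		/* big-endian byteslen */
	pong[3] = num_pong_bytes & 0xff;
	/* remaining bytes stay zero */
	return pong;
}

int main(void)
{
	size_t len;
	uint8_t *pong = make_pong(16, &len);
	printf("pong is %zu bytes\n", pong ? len : 0);
	free(pong);
	return 0;
}
```
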

@rustyrussell
Contributor Author

Trivial rebase.

This is in preparation for gossipd feeding us the latest channel_updates,
rather than having lightningd and channeld query gossipd whenever they
want to send an onion error with an update included.

This means gossipd will start telling us the updates, so we need the
channels loaded first.

Signed-off-by: Rusty Russell <[email protected]>
We want it to keep the latest channel_update, so it can make its own
error messages without asking us.  This installs (but does not use!) the
message handler.

Signed-off-by: Rusty Russell <[email protected]>
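
The payoff is that an UPDATE-type onion failure (BOLT #4) can be assembled locally from a cached copy of the most recent channel_update. A rough sketch of that idea, with made-up structure and helper names rather than the real lightningd/channeld code (the cache is assumed to start zero-initialized):

```
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WIRE_TEMPORARY_CHANNEL_FAILURE ((uint16_t)(0x1000 | 7))	/* UPDATE|7 */

struct cached_update {
	uint8_t *msg;		/* raw channel_update, including 2-byte type */
	uint16_t len;
};

/* Called whenever gossipd pushes a fresher update for our side of the channel. */
void remember_update(struct cached_update *cache,
		     const uint8_t *update, uint16_t len)
{
	free(cache->msg);
	cache->msg = malloc(len);
	memcpy(cache->msg, update, len);
	cache->len = len;
}

/* Build failuremsg = [u16 failure_code][u16 len][len bytes channel_update]. */
uint8_t *make_temporary_channel_failure(const struct cached_update *cache,
					size_t *outlen)
{
	uint8_t *buf;

	*outlen = 2 + 2 + cache->len;
	buf = malloc(*outlen);
	buf[0] = WIRE_TEMPORARY_CHANNEL_FAILURE >> 8;
	buf[1] = WIRE_TEMPORARY_CHANNEL_FAILURE & 0xff;
	buf[2] = cache->len >> 8;
	buf[3] = cache->len & 0xff;
	memcpy(buf + 4, cache->msg, cache->len);
	return buf;
}
```
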
Even if we're deferring putting them in the store and broadcasting them,
we tell lightningd about updates so it can use them in any error messages.

Signed-off-by: Rusty Russell <[email protected]>
This way it can flush it if it was pending.

Signed-off-by: Rusty Russell <[email protected]>
…g gossipd.

We also no longer strip the type off: everyone handles both forms, and
Eclair doesn't strip (and it's easier!).

Signed-off-by: Rusty Russell <[email protected]>
Now we don't ask gossipd, but lightningd keeps channeld up-to-date.

Signed-off-by: Rusty Russell <[email protected]>
We're weaning per-peer daemons off having a direct gossipd connection.

Signed-off-by: Rusty Russell <[email protected]>
The last change exposed a race: the peer sends funding_locked
then immediately sends a channel_update.  channeld used to process
the funding_locked from the peer, tell gossipd about the new
channel, then finally forward the channel_update.

We can have the channel_update hit gossipd before we've told it about
the channel.  It ignores the channel_update for the currently-unknown
channel: we get a 'bad gossip' message, but the immediate symptom
is a timeout in tests/test_closing.py::test_onchain_multihtlc_their_unilateral:

```
node_factory = <pyln.testing.utils.NodeFactory object at 0x7fdf93f42190>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fdf940b99d0>

    @pytest.mark.developer("needs DEVELOPER=1 for dev_ignore_htlcs")
    @pytest.mark.slow_test
    def test_onchain_multihtlc_their_unilateral(node_factory, bitcoind):
        """Node pushes a channel onchain with multiple HTLCs with same payment_hash """
>       h, nodes = setup_multihtlc_test(node_factory, bitcoind)

tests/test_closing.py:2938: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_closing.py:2780: in setup_multihtlc_test
    nodes = node_factory.line_graph(7, wait_for_announce=True,
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1416: in line_graph
    self.join_nodes(nodes, fundchannel, fundamount, wait_for_announce, announce_channels)
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1394: in join_nodes
    nodes[i + 1].wait_channel_active(scids[i])
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:958: in wait_channel_active
    wait_for(lambda: self.is_channel_active(chanid))
```

Note that messages between subdaemons are usually delivered much faster
than messages between peers, but under CI this race is common, since
everything runs on the same machine.

Signed-off-by: Rusty Russell <[email protected]>
Once we send funding_locked, gossipd could start seeing channel_updates
from the peer (which get sent so we can use the channel in routehints
even before it's announceable).

Signed-off-by: Rusty Russell <[email protected]>
We want to stream gossip through this, but currently connectd treats the
fd as synchronous.  While we work on getting rid of that, it's easiest to
have two fds.

Signed-off-by: Rusty Russell <[email protected]>
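
One way to picture the two-fd arrangement: a dedicated pair for streaming gossip keeps the existing synchronous request/reply channel from being blocked behind a large gossip backlog. A plain-POSIX sketch of the setup, not the actual lightningd fd-passing code:

```
#include <sys/socket.h>
#include <stdio.h>

struct daemon_fds {
	int sync_fd;	/* request/response, answered in lockstep */
	int gossip_fd;	/* async, gossip messages streamed as they arrive */
};

int make_fd_pairs(struct daemon_fds *connectd_end, struct daemon_fds *gossipd_end)
{
	int sync_pair[2], gossip_pair[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sync_pair) != 0)
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, gossip_pair) != 0)
		return -1;

	connectd_end->sync_fd = sync_pair[0];
	gossipd_end->sync_fd = sync_pair[1];
	connectd_end->gossip_fd = gossip_pair[0];
	gossipd_end->gossip_fd = gossip_pair[1];
	return 0;
}

int main(void)
{
	struct daemon_fds c, g;

	if (make_fd_pairs(&c, &g) != 0) {
		perror("socketpair");
		return 1;
	}
	printf("connectd: sync=%d gossip=%d; gossipd: sync=%d gossip=%d\n",
	       c.sync_fd, c.gossip_fd, g.sync_fd, g.gossip_fd);
	return 0;
}
```
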
@rustyrussell rustyrussell force-pushed the connectd-demux-part-2 branch 5 times, most recently from 456f9bb to b4ca774 on January 26, 2022 02:36
@rustyrussell rustyrussell force-pushed the connectd-demux-part-2 branch 3 times, most recently from 58c9044 to b99d304 on January 28, 2022 06:25
The next patch starts a timeout ping, which can interfere with results.

In theory, we should reply, but in practice (so far!) we seem to get enough
time that it doesn't hang up on us.

Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: JSON-RPC: `ping` now works with connected peers, even without a channel.
We don't need to log messages from subdaemons, but we do need to log our own, and we weren't.

1. Rename queue_peer_msg to inject_peer_msg for clarity, make it do logging
2. In the one place where we're relaying, call msg_queue() directly.

Signed-off-by: Rusty Russell <[email protected]>
This is mainly useful for connectd.

Signed-off-by: Rusty Russell <[email protected]>
This is neater than what we had before, and slightly more general.

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: JSON-RPC: `sendcustommsg` now works with any connected peer, even when shutting down a channel.
We currently die when gossipd vanishes, but our direct connection will
go away.  We then complain if the node is shutting down while we're talking
to hsmd.

Signed-off-by: Rusty Russell <[email protected]>
Gossipd now simply gets told by channeld when peers arrive or leave.
(It only needs to know for the seeker.)

Signed-off-by: Rusty Russell <[email protected]>
Now we only send and receive gossip messages on this fd.

Signed-off-by: Rusty Russell <[email protected]>
…msg.

We don't need the connection to ourselves, just to free it.

Signed-off-by: Rusty Russell <[email protected]>
Don't send the EOF marker to the peer, as seen e.g. in tests/test_gossip.py::test_gossip_store_compact:

```
lightningd-2: 2022-01-24T03:34:22.925Z DEBUG   connectd: gossip_store at end, new fd moved to 1875
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Sending gossip INVALID 4105
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: peer_in WIRE_WARNING
lightningd-2: 2022-01-24T03:34:22.941Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_out INVALID 4105
lightningd-2: 2022-01-24T03:34:22.949Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: billboard perm: Received warning channel 2c7cf1dc9dada7ed14f10c78ade8f0de907c1b70e736c12ff6f7472dc69c3db3: Peer sent unknown message 4105 (INVALID 4105)
```

Signed-off-by: Rusty Russell <[email protected]>
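
The guard this implies is a whitelist of the gossip wire types that are meaningful to a peer; internal gossip_store records, like the type 4105 the peer warned about above, are consumed locally instead. A hedged sketch (type constants are from BOLT #7; the helper name is invented for illustration):

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WIRE_CHANNEL_ANNOUNCEMENT 256
#define WIRE_NODE_ANNOUNCEMENT    257
#define WIRE_CHANNEL_UPDATE       258

/* Only genuine gossip messages should ever be relayed to the peer. */
bool is_peer_gossip(const uint8_t *msg, size_t len)
{
	uint16_t type;

	if (len < 2)
		return false;
	type = ((uint16_t)msg[0] << 8) | msg[1];

	switch (type) {
	case WIRE_CHANNEL_ANNOUNCEMENT:
	case WIRE_NODE_ANNOUNCEMENT:
	case WIRE_CHANNEL_UPDATE:
		return true;
	default:
		return false;	/* internal record: don't send to peer */
	}
}
```
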
We were relying on the fee update to create an additional tx.  That's
ugly; do an actual payment and make sure we definitely complete a new
tx by waiting for the payment and *then* for both revoke_and_acks.

(Without this, we could get a unilateral close instead of a penalty).

Signed-off-by: Rusty Russell <[email protected]>
Don't assume gossip send order: explicitly disconnect and reconnect.

Signed-off-by: Rusty Russell <[email protected]>
…fast.

If we fund a channel between two nodes, then mine all the blocks to
announce it, any other nodes may see the announcement before the
blocks, causing CI to complain about "bad gossip":

```
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Ignoring future channel_announcment for 113x1x1 (current block 112)
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/0
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/1
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG   032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_NODE_ANNOUNCEMENT before announcement 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e
```

Add a new helper for this case, and use it where there are more than 2 nodes.

Cleans up test_routing_gossip and a few other places which did this manually.

Signed-off-by: Rusty Russell <[email protected]>
`hc` is never NULL, since it comes from `hc = &chan->half[direction];`.
We really meant "is it initialized?", and valgrind under CI finally
caught it:

```
==69243== Conditional jump or move depends on uninitialised value(s)
==69243==    at 0x11C595: handle_local_channel_update (gossip_generation.c:758)
==69243==    by 0x115254: recv_req (gossipd.c:986)
==69243==    by 0x128F8D: handle_read (daemon_conn.c:31)
==69243==    by 0x16BEE1: next_plan (io.c:59)
==69243==    by 0x16CAE9: do_plan (io.c:407)
==69243==    by 0x16CB2B: io_ready (io.c:417)
==69243==    by 0x16EE1E: io_loop (poll.c:453)
==69243==    by 0x1154DA: main (gossipd.c:1089)
==69243==
```

Signed-off-by: Rusty Russell <[email protected]>
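
A stripped-down reproduction of that bug class, with illustrative field names rather than gossipd's real struct layout:

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct half_chan {
	bool have_update;	/* set once we've seen a channel_update */
	uint32_t timestamp;
};

struct chan {
	struct half_chan half[2];
};

void handle_update(struct chan *chan, int direction)
{
	struct half_chan *hc = &chan->half[direction];

	/* BUG: always false -- &chan->half[direction] can never be NULL. */
	if (!hc)
		return;

	/* FIX: test whether this side is actually initialized. */
	if (!hc->have_update)
		return;

	printf("previous update timestamp: %u\n", hc->timestamp);
}
```
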
Otherwise we get weird effects, as htlcs are being freed:

```
2022-01-26T05:07:37.8774610Z lightningd-1: 2022-01-26T04:47:48.770Z DEBUG   030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Failing HTLC 18446744073709551615 due to peer death
2022-01-26T05:07:37.8775287Z lightningd-1: 2022-01-26T04:47:48.770Z **BROKEN** 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Neither origin nor in?
```

Signed-off-by: Rusty Russell <[email protected]>
…ght.

If we call update_channel_from_inflight *twice* with the same inflight, we
will get bad results.  Using tal_steal() here was a premature optimization:

```
Valgrind error file: valgrind-errors.496395
==496395== Invalid read of size 8
==496395==    at 0x22A9D3: to_tal_hdr (tal.c:174)
==496395==    by 0x22B4B5: tal_steal_ (tal.c:498)
==496395==    by 0x16A13D: update_channel_from_inflight (peer_control.c:1225)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==    by 0x1772D9: plugin_read_json (plugin.c:735)
==496395==  Address 0x89fbb08 is 24 bytes inside a block of size 104 free'd
==496395==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22B193: del_tree (tal.c:421)
==496395==    by 0x22B461: tal_free (tal.c:486)
==496395==    by 0x16A123: update_channel_from_inflight (peer_control.c:1223)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==  Block was alloc'd at
==496395==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22AC1C: allocate (tal.c:250)
==496395==    by 0x22B1DD: tal_alloc_ (tal.c:428)
==496395==    by 0x22B3A6: tal_alloc_arr_ (tal.c:471)
==496395==    by 0x22C094: tal_dup_ (tal.c:805)
==496395==    by 0x12B274: new_inflight (channel.c:187)
==496395==    by 0x136D4C: wallet_commit_channel (dual_open_control.c:1260)
==496395==    by 0x13B084: handle_commit_received (dual_open_control.c:2839)
==496395==    by 0x13B6AF: dual_opend_msg (dual_open_control.c:2976)
==496395==    by 0x1809FF: sd_msg_read (subd.c:553)
==496395==    by 0x218F5D: next_plan (io.c:59)
==496395==    by 0x219B65: do_plan (io.c:407)
```

Signed-off-by: Rusty Russell <[email protected]>
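
A plain-C caricature of the failure mode (field and function names are made up; the real code uses tal allocations, so the second tal_steal() tripped valgrind on the freed tal header): moving ownership out of the inflight means a second call with the same inflight frees the very buffer the inflight still points to and then re-installs the dangling pointer, whereas copying is idempotent.

```
#include <stdlib.h>
#include <string.h>

struct inflight { unsigned char *data; size_t len; };
struct channel  { unsigned char *data; size_t len; };

/* Buggy shape: "steal" the buffer out of the inflight.  A second call
 * with the same inflight frees the buffer the inflight still points to,
 * then installs that dangling pointer again. */
void update_by_stealing(struct channel *c, struct inflight *inf)
{
	free(c->data);
	c->data = inf->data;
	c->len = inf->len;
}

/* Fixed shape: duplicate instead, leaving the inflight untouched, so
 * calling this any number of times with the same inflight is safe. */
void update_by_copying(struct channel *c, struct inflight *inf)
{
	free(c->data);
	c->data = malloc(inf->len);
	memcpy(c->data, inf->data, inf->len);
	c->len = inf->len;
}
```
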
If the HTLCs are completely negotiated, we can get a channel break when
we mine a pile of blocks.  This is mainly seen with Postgres, due to the db
speed.

Signed-off-by: Rusty Russell <[email protected]>
We really need our own lnprototest tests for packet-based stuff;
these message-based tests are inherently delicate and awkward.

In particular, connectd now does dev-disconnect, so the socket is not
immediately closed after a dev-disconnect command.  In this case, the
WIRE_SHUTDOWN has often already been written from connectd to channeld.

But it sometimes works, too.

Signed-off-by: Rusty Russell <[email protected]>
@@ -3438,6 +3438,7 @@ def test_closing_higherfee(node_factory, bitcoind, executor):
@pytest.mark.developer("needs dev_disconnect")
def test_htlc_rexmit_while_closing(node_factory, executor):
"""Retranmitting an HTLC revocation while shutting down should work"""
# FIXME: This should be in lnprototest! UNRELIABLE.
Collaborator

Got it, Sir :) I will add it to my todo list.

Member

@cdecker cdecker left a comment

Excellent pull request! This massively simplifies a lot of our former
headaches and stabilizes a whole lot of tests as a bonus. I just have
some minor nits and questions that I found while reviewing.

Other than those clarifications I look forward to merging this :-)

@@ -1748,7 +1746,9 @@ def listpays_nofail(b11):
def test_pay_routeboost(node_factory, bitcoind, compat):
"""Make sure we can use routeboost information. """
# l1->l2->l3--private-->l4
l1, l2 = node_factory.line_graph(2, announce_channels=True, wait_for_announce=True)
# Note: l1 gets upset because it extracts update for private channel.
Member

I'm not sure I understand the comment here. Since we trigger "Bad gossip" only when receiving messages from peers, does this mean that we are accidentally sending a private update for which we don't have a matching announcement (being private)?

Contributor Author

Hmm, good question! I've removed the suppression of bad gossip detection, and am re-running.

Contributor Author

Yes, we are. Apparently Eclair will actually use these, though we don't currently.

Should we suppress them?

Collaborator

@vincenzopalazzo vincenzopalazzo left a comment

ACK 73e325c

@rustyrussell rustyrussell merged commit d4fee83 into ElementsProject:master Feb 8, 2022