Connectd demux part 2 #4985
Conversation
Force-pushed 8950384 to 0d7bf21.
Trivial rebase.
Force-pushed 0d7bf21 to d02bc73.
Force-pushed aba5c0c to 0e88938.
This is in preparation for gossipd feeding us the latest channel_updates, rather than having lightningd and channeld query gossipd when they want to send an onion error with an update included. This means gossipd will start telling us the updates, so we need the channels loaded first. Signed-off-by: Rusty Russell <[email protected]>
We want it to keep the latest, so it can make its own error msgs without asking us. This installs (but does not use!) the message handler. Signed-off-by: Rusty Russell <[email protected]>
Even if we're deferring putting them in the store and broadcasting them, we tell lightningd so it will use it in any error messages. Signed-off-by: Rusty Russell <[email protected]>
This way it can flush it if it was pending. Signed-off-by: Rusty Russell <[email protected]>
…g gossipd. We also no longer strip the type off: everyone handles both forms, and Eclair doesn't strip (and it's easier!). Signed-off-by: Rusty Russell <[email protected]>
Now we don't ask gossipd, but lightningd keeps channeld up-to-date. Signed-off-by: Rusty Russell <[email protected]>
We're weaning per-peer daemons off having a direct gossipd connection. Signed-off-by: Rusty Russell <[email protected]>
The last change exposed a race: the peer sends funding_locked then immediately sends a channel_update. channeld used to process the funding_locked from the peer, tell gossipd about the new channel, then finally forward the channel_update. We can have the channel_update hit gossipd before we've told it about the channel. It ignores the channel_update for the currently-unknown channel: we get a 'bad gossip' message, but the immediate symptom is a timeout in tests/test_closing.py::test_onchain_multihtlc_their_unilateral:
```
node_factory = <pyln.testing.utils.NodeFactory object at 0x7fdf93f42190>
bitcoind = <pyln.testing.utils.BitcoinD object at 0x7fdf940b99d0>

    @pytest.mark.developer("needs DEVELOPER=1 for dev_ignore_htlcs")
    @pytest.mark.slow_test
    def test_onchain_multihtlc_their_unilateral(node_factory, bitcoind):
        """Node pushes a channel onchain with multiple HTLCs with same payment_hash """
>       h, nodes = setup_multihtlc_test(node_factory, bitcoind)

tests/test_closing.py:2938:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_closing.py:2780: in setup_multihtlc_test
    nodes = node_factory.line_graph(7, wait_for_announce=True,
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1416: in line_graph
    self.join_nodes(nodes, fundchannel, fundamount, wait_for_announce, announce_channels)
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:1394: in join_nodes
    nodes[i + 1].wait_channel_active(scids[i])
/usr/local/lib/python3.8/dist-packages/pyln/testing/utils.py:958: in wait_channel_active
    wait_for(lambda: self.is_channel_active(chanid))
```
Note that we are usually much faster to send between subds than we are between peers, but during CI this is common, as we're all running on the same machine. Signed-off-by: Rusty Russell <[email protected]>
Once we send funding_locked, gossipd could start seeing channel_updates from the peer (which get sent so we can use the channel in routehints even before it's announceable). Signed-off-by: Rusty Russell <[email protected]>
We want to stream gossip through this, but currently connectd treats the fd as synchronous. While we work on getting rid of that, it's easiest to have two fds. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Force-pushed 456f9bb to b4ca774.
Force-pushed 58c9044 to b99d304.
Next patch starts a timeout ping, which can interfere with results. In theory, we should reply, but in practice (so far!) we seem to get enough time that it doesn't hang up on us. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: JSON_RPC: `ping` now works with connected peers, even without a channel.
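A minimal usage sketch of the relaxed `ping` behaviour, assuming a pyln-testing fixture with two connected nodes `l1` and `l2` (names and values illustrative, not from this PR):
```python
# Sketch only: l1 and l2 are connected but have no channel between them.
l1.rpc.connect(l2.info['id'], 'localhost', l2.port)

# With this change, ping only needs a connection, not a channel.
# 'id', 'len' and 'pongbytes' are the documented ping parameters.
reply = l1.rpc.call('ping', {'id': l2.info['id'], 'len': 128, 'pongbytes': 128})
print(reply)
```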
Signed-off-by: Rusty Russell <[email protected]>
We don't need to log msgs from subds, but we do need to log our own, and we weren't.
1. Rename queue_peer_msg to inject_peer_msg for clarity, and make it do logging.
2. In the one place where we're relaying, call msg_queue() directly.
Signed-off-by: Rusty Russell <[email protected]>
This is mainly useful for connectd. Signed-off-by: Rusty Russell <[email protected]>
This is neater than what we had before, and slightly more general. Signed-off-by: Rusty Russell <[email protected]> Changelog-Changed: JSON_RPC: `sendcustommsg` now works with any connected peer, even when shutting down a channel.
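A minimal sketch of the relaxed `sendcustommsg` behaviour, assuming connected nodes `l1` and `l2`; the message type 0x7777 and payload below are arbitrary illustrations:
```python
# Sketch only: sendcustommsg now just needs a connected peer, even if the
# channel is mid-shutdown. Custom messages should use an odd message type.
custom = '7777' + 'deadbeef'
l1.rpc.call('sendcustommsg', {'node_id': l2.info['id'], 'msg': custom})
```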
We currently die when gossipd vanishes, but our direct connection will go away. We then complain if the node is shutting down while we're talking to hsmd. Signed-off-by: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
Gossipd now simply gets told by channeld when peers arrive or leave. (It only needs to know for the seeker.) Signed-off-by: Rusty Russell <[email protected]>
Now we only send and receive gossip messages on this fd. Signed-off-by: Rusty Russell <[email protected]>
…msg. We don't need the connection to ourselves, just to free it. Signed-off-by: Rusty Russell <[email protected]>
Don't send EOF marker to peer, e.g. in tests/test_gossip.py::test_gossip_store_compact:
```
lightningd-2: 2022-01-24T03:34:22.925Z DEBUG connectd: gossip_store at end, new fd moved to 1875
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Sending gossip INVALID 4105
lightningd-2: 2022-01-24T03:34:22.933Z DEBUG 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: peer_in WIRE_WARNING
lightningd-2: 2022-01-24T03:34:22.941Z DEBUG 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_out INVALID 4105
lightningd-2: 2022-01-24T03:34:22.949Z DEBUG 035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-channeld-chan#2: billboard perm: Received warning channel 2c7cf1dc9dada7ed14f10c78ade8f0de907c1b70e736c12ff6f7472dc69c3db3: Peer sent unknown message 4105 (INVALID 4105)
```
Signed-off-by: Rusty Russell <[email protected]>
We were relying on the fee update to create an additional tx. That's ugly; do an actual payment and make sure we definitely complete a new tx by waiting for that, *then* for both revoke_and_acks. (Without this, we could get a unilateral close instead of a penalty.) Signed-off-by: Rusty Russell <[email protected]>
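A hedged sketch of the pattern that commit describes, using pyln-testing helpers; the exact test body in the PR may differ:
```python
# Sketch only: force a fresh commitment tx with a real payment rather than a
# fee update, then wait until both sides have revoked the old state.
inv = l2.rpc.invoice(100000, 'new-commit', 'force a new commitment tx')['bolt11']
l1.rpc.pay(inv)

# Both revoke_and_acks must land before provoking the penalty path,
# otherwise we could see a plain unilateral close instead.
l1.daemon.wait_for_log('peer_in WIRE_REVOKE_AND_ACK')
l2.daemon.wait_for_log('peer_in WIRE_REVOKE_AND_ACK')
```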
Don't assume gossip send order: explicitly disconnect and reconnect. Signed-off-by: Rusty Russell <[email protected]>
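A hedged sketch of the explicit disconnect/reconnect approach (node names and the exact wait condition are illustrative):
```python
from pyln.testing.utils import wait_for

# Sketch only: rather than assuming gossip arrives in a particular order,
# drop the connection and reconnect to force a fresh exchange.
l1.rpc.disconnect(l2.info['id'], force=True)
l1.rpc.connect(l2.info['id'], 'localhost', l2.port)

# Then wait for the gossip we actually need, e.g. both channel directions.
wait_for(lambda: len(l1.rpc.listchannels()['channels']) == 2)
```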
Force-pushed b99d304 to 1c4481b.
…fast.
If we fund a channel between two nodes, then mine all the blocks to announce it, any other nodes may see the announcement before the blocks, causing CI to complain about "bad gossip":
```
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Ignoring future channel_announcment for 113x1x1 (current block 112)
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/0
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_CHANNEL_UPDATE before announcement 113x1x1/1
lightningd-4: 2022-01-25T22:33:25.468Z DEBUG 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e-gossipd: Bad gossip order: WIRE_NODE_ANNOUNCEMENT before announcement 032cf15d1ad9c4a08d26eab1918f732d8ef8fdc6abb9640bf3db174372c491304e
```
Add a new helper for this case, and use it where there are more than 2 nodes. Cleans up test_routing_gossip and a few other places which did this manually. Signed-off-by: Rusty Russell <[email protected]>
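A hedged sketch of what such a helper might look like; the name, signature, and block count here are illustrative, not necessarily what the PR added:
```python
from pyln.testing.utils import sync_blockheight

def mine_to_announce(bitcoind, nodes, blocks=6):
    """Mine announcement blocks without letting gossip outrun the chain."""
    # Mine all but the last block, syncing every node in between, so no node
    # can receive the channel_announcement before it has seen the blocks.
    for _ in range(blocks - 1):
        bitcoind.generate_block(1)
        sync_blockheight(bitcoind, nodes)
    bitcoind.generate_block(1)
```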
`hc` is never NULL, since it's `hc = &chan->half[direction];`; we really meant "is it initialized", and valgrind under CI finally caught it:
```
==69243== Conditional jump or move depends on uninitialised value(s)
==69243==    at 0x11C595: handle_local_channel_update (gossip_generation.c:758)
==69243==    by 0x115254: recv_req (gossipd.c:986)
==69243==    by 0x128F8D: handle_read (daemon_conn.c:31)
==69243==    by 0x16BEE1: next_plan (io.c:59)
==69243==    by 0x16CAE9: do_plan (io.c:407)
==69243==    by 0x16CB2B: io_ready (io.c:417)
==69243==    by 0x16EE1E: io_loop (poll.c:453)
==69243==    by 0x1154DA: main (gossipd.c:1089)
==69243==
```
Signed-off-by: Rusty Russell <[email protected]>
Otherwise we get weird effects, as htlcs are being freed:
```
2022-01-26T05:07:37.8774610Z lightningd-1: 2022-01-26T04:47:48.770Z DEBUG 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Failing HTLC 18446744073709551615 due to peer death
2022-01-26T05:07:37.8775287Z lightningd-1: 2022-01-26T04:47:48.770Z **BROKEN** 030eeb52087b9dbb27b7aec79ca5249369f6ce7b20a5684ce38d9f4595a21c2fda-chan#8: Neither origin nor in?
```
Signed-off-by: Rusty Russell <[email protected]>
…ght.
If we call update_channel_from_inflight *twice* with the same inflight, we will get bad results. Using tal_steal() here was a premature optimization:
```
Valgrind error file: valgrind-errors.496395
==496395== Invalid read of size 8
==496395==    at 0x22A9D3: to_tal_hdr (tal.c:174)
==496395==    by 0x22B4B5: tal_steal_ (tal.c:498)
==496395==    by 0x16A13D: update_channel_from_inflight (peer_control.c:1225)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==    by 0x1772D9: plugin_read_json (plugin.c:735)
==496395==  Address 0x89fbb08 is 24 bytes inside a block of size 104 free'd
==496395==    at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22B193: del_tree (tal.c:421)
==496395==    by 0x22B461: tal_free (tal.c:486)
==496395==    by 0x16A123: update_channel_from_inflight (peer_control.c:1223)
==496395==    by 0x16A4C7: funding_depth_cb (peer_control.c:1299)
==496395==    by 0x182807: txw_fire (watch.c:232)
==496395==    by 0x182AA9: watch_topology_changed (watch.c:300)
==496395==    by 0x1290ED: updates_complete (chaintopology.c:624)
==496395==    by 0x129BF4: get_new_block (chaintopology.c:835)
==496395==    by 0x125EEF: getrawblockbyheight_callback (bitcoind.c:362)
==496395==    by 0x176ECC: plugin_response_handle (plugin.c:584)
==496395==    by 0x1770F5: plugin_read_json_one (plugin.c:690)
==496395==  Block was alloc'd at
==496395==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==496395==    by 0x22AC1C: allocate (tal.c:250)
==496395==    by 0x22B1DD: tal_alloc_ (tal.c:428)
==496395==    by 0x22B3A6: tal_alloc_arr_ (tal.c:471)
==496395==    by 0x22C094: tal_dup_ (tal.c:805)
==496395==    by 0x12B274: new_inflight (channel.c:187)
==496395==    by 0x136D4C: wallet_commit_channel (dual_open_control.c:1260)
==496395==    by 0x13B084: handle_commit_received (dual_open_control.c:2839)
==496395==    by 0x13B6AF: dual_opend_msg (dual_open_control.c:2976)
==496395==    by 0x1809FF: sd_msg_read (subd.c:553)
==496395==    by 0x218F5D: next_plan (io.c:59)
==496395==    by 0x219B65: do_plan (io.c:407)
```
Signed-off-by: Rusty Russell <[email protected]>
Force-pushed 1c4481b to 93f8a11.
If the HTLCs are completely negotiated, we can get a channel break when we mine a pile of blocks. This is mainly seen with Postgres, due to the db speed. Signed-off-by: Rusty Russell <[email protected]>
We really need our own lnprototest tests for packet-based stuff; these message-based tests are inherently delicate and awkward. In particular, connectd now does dev-disconnect, so the socket is not immediately closed after a dev-disconnect command. In this case, the WIRE_SHUTDOWN has often already been written from connectd to channeld. But it sometimes works, too. Signed-off-by: Rusty Russell <[email protected]>
```
@@ -3438,6 +3438,7 @@ def test_closing_higherfee(node_factory, bitcoind, executor):
@pytest.mark.developer("needs dev_disconnect")
def test_htlc_rexmit_while_closing(node_factory, executor):
    """Retranmitting an HTLC revocation while shutting down should work"""
    # FIXME: This should be in lnprototest! UNRELIABLE.
```
Got it, sir :) I will add it to my todo list.
Excellent pull request! This massively simplifies a lot of our former
headaches and stabilizes a whole lot of tests as a bonus. I just have
some minor nits and questions that I found while reviewing.
Other than those clarifications I look forward to merging this :-)
```
@@ -1748,7 +1746,9 @@ def listpays_nofail(b11):
def test_pay_routeboost(node_factory, bitcoind, compat):
    """Make sure we can use routeboost information. """
    # l1->l2->l3--private-->l4
    l1, l2 = node_factory.line_graph(2, announce_channels=True, wait_for_announce=True)
    # Note: l1 gets upset because it extracts update for private channel.
```
I'm not sure I understand the comment here. Since we trigger "Bad gossip" only when receiving messages from peers, does this mean that we are accidentally sending a private update for which we don't have a matching announcement (being private)?
Hmm, good question! I've removed the suppression of bad gossip detection, and am re-running.
Yes, we are. Apparently Eclair will actually use these, though we don't currently.
Should we suppress them?
Signed-off-by: Rusty Russell <[email protected]>
ACK 73e325c
(Based on #4984, which has been merged.) This removes the direct connection from per-peer daemons to gossipd.