
channel updates and node announcements not propagating #7276

Closed
JssDWt opened this issue Apr 25, 2024 · 15 comments
Assignees
Labels
in diagnostic
Milestone

Comments

@JssDWt
Contributor

JssDWt commented Apr 25, 2024

Issue and Steps to Reproduce

lightningd appears to fail to propagate its own channel updates through the network.
Yesterday, one of our nodes hadn't sent out a channel update or node announcement in over 22 days.
This corresponds with the latest restart of the node.
So it appears gossip is only exchanged when the node is restarted.
During this period there were several updates made to channel policies, but they also had not been propagated through the network.
Today the node was restarted and (some, but not all) gossip made it to other nodes.

I need some guidance on how to troubleshoot this issue.

  • Which logs should I look for?
  • In which places in the codebase can I add logs to make it easier to troubleshoot?
  • Are there settings we can use to increase gossip propagation?

getinfo output

{
   "id": "redacted",
   "alias": "redacted",
   "color": "redacted",
   "num_peers": 25,
   "num_pending_channels": 0,
   "num_active_channels": 31,
   "num_inactive_channels": 1,
   "address": [redacted],
   "binding": [redacted],
   "version": "v24.02.1",
   "blockheight": 840822,
   "network": "bitcoin",
   "fees_collected_msat": redacted,
   "lightning-dir": "redacted",
   "our_features": {
      "init": "08a0000a8a5961",
      "node": "88a0000a8a5961",
      "channel": "",
      "invoice": "02000002024100"
   }
}
@JssDWt
Contributor Author

JssDWt commented Apr 25, 2024

I have gathered a view from other nodes of this node's channels, especially the last update time. It appears some gossip has been trickling out to other nodes, especially around 14-15 April, but very sparsely. So it's not that no gossip has propagated; most of it has not.

Does cln periodically update the channel policy? Can you point me to the place in the code where that happens?

@urza
Contributor

urza commented Apr 29, 2024

I also have an issue where my CLN channel updates are not propagating to the network.

In my case I traced it to an issue with LND nodes in the network that mark some of my channels as "zombie" channels and refuse to propagate any gossip about them, even though they are healthy channels.

It should hopefully be fixed in the LND 0.18 release.

lightningnetwork/lnd#8624

@JssDWt
Contributor Author

JssDWt commented May 29, 2024

Querying the node announcement (with a little program written with LDK) from this node directly just now gives me a node announcement with the timestamp 1714019380, which is from April 24 (over a month ago). It appears the node announcement is not updated at all.
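For reference, that raw timestamp can be decoded with a couple of lines of Python (it falls on April 25 in UTC, April 24 in western timezones):

```python
from datetime import datetime, timezone

# Timestamp taken from the node_announcement returned by the gossip query above.
ts = 1714019380
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# → 2024-04-25T04:29:40+00:00
```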

@vincenzopalazzo self-assigned this May 30, 2024
@vincenzopalazzo
Collaborator

Mh! I'm looking at this while thinking about how to debug lightningdevkit/rust-lightning#3075.

I think @rustyrussell or @endothermicdev know this part of the code base well, but if they do not have time I can start looking into it, since at some point I should understand gossipd well anyway.

@vincenzopalazzo added this to the v24.08 milestone May 30, 2024
@vincenzopalazzo added the in diagnostic label and removed the protocol label May 30, 2024
@JssDWt
Contributor Author

JssDWt commented May 30, 2024

Mh! I'm looking at this while thinking about how to debug lightningdevkit/rust-lightning#3075.

I customized the LDK program to just send a gossip query to the node and log the returned node announcement message for the node in question. It really is the only node announcement returned by the node, so this is not related to the ordering of the messages.

@vincenzopalazzo
Collaborator

Oh, sorry, I did not notice that you created a custom script to poll gossip :) I was talking about the general problem.

@JssDWt
Contributor Author

JssDWt commented May 30, 2024

It does appear to be sending the channel announcement first, then the channel update, then the node announcement, though.

@JssDWt
Contributor Author

JssDWt commented Jun 2, 2024

Looking at this a little bit more.
I have a node with 41 public channels.

If I query the gossip directly on the node itself, almost all channels have an update after 2024-05-28

$ lightning-cli listchannels source=02442d4249f9a93464aaf8cd8d522faa869356707b5f1537a8d6def2af50058c5b | jq '.channels[] | .last_update | strftime("%Y-%m-%dT%H:%M:%S %Z")'
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:36 UTC"
"2024-05-30T10:09:58 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:02 UTC"
"2024-05-28T18:19:02 UTC"
"2024-05-28T18:19:36 UTC"
"2024-05-28T18:19:03 UTC"
"2024-05-28T18:19:05 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:03 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:18:00 UTC"
"2024-05-28T18:18:00 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-29T18:57:07 UTC"
"2024-05-28T18:20:02 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-31T21:49:11 UTC"
"2024-05-28T18:19:02 UTC"
"2024-05-28T18:19:03 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-28T18:19:06 UTC"
"2024-05-31T23:35:12 UTC"
"2024-05-28T18:19:03 UTC"
"2024-05-28T18:18:00 UTC"
"2024-05-18T17:58:17 UTC"
"2024-06-02T10:31:33 UTC"
"2024-06-02T10:31:33 UTC"

If I send a gossip timestamp filter to this node with first_timestamp == 0, I get all 41 public channels.
If, however, I send a gossip timestamp filter with first_timestamp == 7 days ago, which is before 2024-05-28, I would expect to get almost all the channels, but I get only 8 of them.

  • 14 days: 10 channels
  • 28 days: 17 channels
  • 56 days: 26 channels
  • 112 days: 35 channels
  • 224 days: all 41 channels

The oldest channel on the node is 127 days old. So I think the channel's opening timestamp, rather than the update timestamp, is used to select which channel updates to return.
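For anyone wanting to reproduce the experiment, here is a minimal sketch of serializing the gossip_timestamp_filter message as defined in BOLT #7 (type 265); the helper name and defaults are my own:

```python
import struct
import time

GOSSIP_TIMESTAMP_FILTER = 265  # message type from BOLT #7

# Bitcoin mainnet chain_hash as sent on the wire (genesis hash, internal byte order)
BITCOIN_CHAIN_HASH = bytes.fromhex(
    "6fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000"
)

def gossip_timestamp_filter(first_timestamp: int,
                            timestamp_range: int = 0xFFFFFFFF,
                            chain_hash: bytes = BITCOIN_CHAIN_HASH) -> bytes:
    """Serialize: [u16 type][32-byte chain_hash][u32 first_timestamp][u32 timestamp_range]."""
    return (struct.pack(">H", GOSSIP_TIMESTAMP_FILTER)
            + chain_hash
            + struct.pack(">II", first_timestamp, timestamp_range))

# Ask the peer for gossip from the last 7 days:
msg = gossip_timestamp_filter(int(time.time()) - 7 * 86400)
```

The message still has to be delivered over an established, encrypted peer connection (e.g. via an LDK program, as above).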

@rustyrussell self-assigned this Jun 26, 2024
@roeierez

roeierez commented Jun 26, 2024

Reading the code here where the node announcement is broadcast:

if (ld->node_announcement

It seems to me that if the node announcement hasn't changed, it does not get signed and sent again with a new timestamp.
Perhaps we could also re-announce if the last announcement is too old?
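The proposed condition might look roughly like this (a Python sketch with hypothetical names; the actual logic lives in C in lightningd):

```python
def should_reannounce(last_announce_ts: int, now: int, content_changed: bool,
                      max_age: int = 13 * 86400) -> bool:
    """Re-sign and re-broadcast the node_announcement if its content changed,
    or (the suggestion here) if the last announcement is older than max_age."""
    return content_changed or (now - last_announce_ts) > max_age
```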

@JssDWt
Copy link
Contributor Author

JssDWt commented Jun 28, 2024

Connecting out to a peer should cause us to send the peer our gossip messages, especially when the peer requests it with a GOSSIP_TIMESTAMP_FILTER, but that doesn't happen (CLN 24.05):

2024-06-28T21:43:31.459Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: Connected out, starting crypto
2024-06-28T21:43:31.609Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: Connect OUT
2024-06-28T21:43:31.610Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: peer_out WIRE_INIT
2024-06-28T21:43:31.756Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: peer_in WIRE_INIT
2024-06-28T21:43:31.756Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-lightningd: Peer says it sees our address as: REDACTED
2024-06-28T21:43:31.757Z DEBUG   lightningd: best_remote_addr: peer 000000000000000000000000000000000000000000000000000000000000000000 gave addr REDACTED (no chan)
2024-06-28T21:43:31.758Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: seeker: disabling gossip
2024-06-28T21:43:31.758Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: Handed peer, entering loop
2024-06-28T21:43:31.759Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2024-06-28T21:43:31.904Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_ANNOUNCEMENT
2024-06-28T21:43:31.905Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:31.906Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: Received channel_update for channel 843114x1982x1/0 now ACTIVE
2024-06-28T21:43:31.906Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:31.907Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_ANNOUNCEMENT
2024-06-28T21:43:31.907Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:31.908Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:31.908Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: Received channel_update for channel 843155x360x0/1 now ACTIVE
2024-06-28T21:43:32.050Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-connectd: peer_in WIRE_GOSSIP_TIMESTAMP_FILTER
2024-06-28T21:43:32.051Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_ANNOUNCEMENT
2024-06-28T21:43:32.051Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:32.052Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_CHANNEL_UPDATE
2024-06-28T21:43:32.052Z DEBUG   000000000000000000000000000000000000000000000000000000000000000000-gossipd: handle_recv_gossip: WIRE_NODE_ANNOUNCEMENT
2024-06-28T21:44:01.008Z DEBUG   lightningd: best_remote_addr: peer 000000000000000000000000000000000000000000000000000000000000000000 gave addr REDACTED (no chan)

@rustyrussell
Contributor

rustyrussell commented Aug 12, 2024

Reading the code here where the node announcement is broadcast:

if (ld->node_announcement

It seems to me that if the node announcement hasn't changed, it does not get signed and sent again with a new timestamp. Perhaps we could also re-announce if the last announcement is too old?

How does a node announcement get "too old"?

@rustyrussell
Contributor

OK, let's clear some things up:

  1. We don't print out gossip messages without IO logging (there are simply way too many).
  2. We do, however, send all our own gossip on any incoming connection (independent of what they ask for).
  3. We don't update node_announcement unless something has changed. Why would we? They don't go stale.
  4. We should be refreshing our own channel_updates every 13 days. We won't do it if the peer is disconnected though, which could lead us to miss some?
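As a quick way to check point 4 against a node's own view, here is a small sketch that flags channels whose last_update is older than the 13-day refresh interval (the helper name and sample data are hypothetical; the JSON shape matches lightning-cli listchannels output):

```python
import json

THIRTEEN_DAYS = 13 * 86400

def stale_channel_updates(listchannels_json: str, now: int):
    """Return short_channel_ids from `lightning-cli listchannels source=<our id>`
    output whose last_update is older than 13 days."""
    channels = json.loads(listchannels_json)["channels"]
    return [c["short_channel_id"] for c in channels
            if now - c["last_update"] > THIRTEEN_DAYS]

# Synthetic example data:
sample = json.dumps({"channels": [
    {"short_channel_id": "843114x1982x1", "last_update": 1716920346},
    {"short_channel_id": "843155x360x0", "last_update": 1715000000},
]})
print(stale_channel_updates(sample, 1716920346 + 86400))
# → ['843155x360x0']
```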

Looking at this a little bit more. I have a node with 41 public channels.

If I query the gossip directly on the node itself, almost all channels have an update after 2024-05-28

[full listchannels timestamp list quoted from the Jun 2 comment above]

If I send a gossip timestamp filter to this node with first_timestamp == 0, I get all 41 public channels. If, however, I send a gossip timestamp filter with first_timestamp == 7 days ago, which is before 2024-05-28, I would expect to get almost all the channels, but I get only 8 of them.

  • 14 days: 10 channels
  • 28 days: 17 channels
  • 56 days: 26 channels
  • 112 days: 35 channels
  • 224 days: all 41 channels

The oldest channel on the node is 127 days old. So I think the channel's opening timestamp, rather than the update timestamp, is used to select which channel updates to return.

Woah, that's weird! But we definitely filter by update timestamp: I just checked.

However, in the latest versions of CLN we only start from the beginning of the gossip_store if the timestamp is 0; otherwise we start from the first entry that is more recent than two hours, rather than sweeping the entire store, i.e. only "recent" ones. (We still check that the timestamps are in the range you ask for; we just don't search everything.)

I'm surprised you're seeing anything more than all or nothing though.

And even before you send the filter message, we will spew all our own gossip.

rustyrussell added a commit to rustyrussell/lightning that referenced this issue Aug 12, 2024
When a peer connects, we always send all our own gossip (even if they
had set the timestamp filters to filter it out).  But we weren't
forcing it out to them when it changed, so this logic only applied to
unstable or frequently-restarting nodes.

So now, we tell all the peers whenever we tell gossipd about our new
gossip.

Fixes: ElementsProject#7276
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Changed: Protocol: We now send current peers our changed gossip (even if they set timestamp_filter otherwise), not just on reconnect.
@rustyrussell
Contributor

OK. I think I've found it, and figured out why you are seeing this.

We force our gossip on peers when they reconnect, but not if they're already connected. Obviously we should do that too; the gap shows up on nodes with stable connections.

@JssDWt
Contributor Author

JssDWt commented Aug 12, 2024

We do, however, send all our own gossip on any incoming connection (independent of what they ask for).

I'm not seeing this when connecting to the peer with an LDK-based program. It could be that this specific node of ours has very many private channels and only a few public ones, and runs into the send_threshold with GOSSIP_SPAM_REDUNDANCY.

Sharing the gossip matters most to peers we have public channels with, because they are the most likely to forward that gossip. Perhaps they could get an 'advantage' in the spam redundancy model?

How does a node announcement get "too old"?
We don't update node_announcement unless something has changed. Why would we? They don't go stale.

I see, I had in mind that a 2-week-old node announcement would get you removed from the graph. I'd have to double-check that. I at least verified that LND does not discard an old node announcement on receipt if it doesn't have one yet. I'll check the graph pruning logic too.

We force our gossip on peers when they reconnect, but not if they're already connected. Obviously we should do this, and it can be noticed on nodes with stable connections.

This is great news!

However, in the latest versions of CLN we only start from the beginning of the gossip_store if the timestamp is 0, otherwise we start from the first entry which is more recent than two hours rather then sweeping the entire store. i.e. only "recent" ones. (We still check the timestamps are in the range you ask for, just don't search everything).

So this explains the behavior of the timestamp filtering? The entries are first filtered by store entry date (which is the channel's entry date?) and then by timestamp?

@JssDWt
Contributor Author

JssDWt commented Aug 12, 2024

@rustyrussell another observation is that restarts/disconnects didn't change anything; only deleting the gossip_store file resolved the gossip propagation issues. Do you have an explanation for that?
