
failing link: unable to update commitment: cannot add duplicate keystone with error: internal error #6485

Closed
C-Otto opened this issue Apr 30, 2022 · 27 comments · Fixed by #6518
Labels: commitments (Commitment transactions containing the state of the channel), force closes

Comments

@C-Otto
Contributor

C-Otto commented Apr 30, 2022

Background

The error message "failing link: unable to update commitment: cannot add duplicate keystone with error: internal error" appeared in my logs, for reasons I don't understand. Afterwards the channel was unusable and my peer (running CLN) immediately force-closed the channel at 01:26.

The force-close transaction contains one outgoing HTLC (timeout, according to lncli closedchannels) of 200,003 sat.

My peer says:

I’m getting hit by several force closes for all kind of reasons in the last couple days. Ours is just “internal error” in my log

Relevant snippet from my logs:

01:26:09.438 [INF] PEER: NodeKey(PUBKEY) loading ChannelPoint(CHANPOINT)
01:26:09.439 [DBG] CNCT: New ChainEventSubscription(id=15) for ChannelPoint(CHANPOINT)
01:26:09.439 [INF] HSWC: ChannelLink(CHANPOINT): starting
01:26:09.439 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(CHANPOINT)
01:26:09.439 [INF] HSWC: ChannelLink(CHANPOINT): HTLC manager started, bandwidth=3663215408 mSAT
01:26:09.439 [INF] HSWC: ChannelLink(CHANPOINT): attempting to re-synchronize
01:26:09.439 [INF] PEER: Negotiated chan series queries with PUBKEY
01:26:09.519 [ERR] RPCS: [connectpeer]: error connecting to peer: already connected to peer: PUBKEY@IP2:48304
01:26:09.519 [ERR] RPCS: [/lnrpc.Lightning/ConnectPeer]: already connected to peer: PUBKEY@IP2:48304
01:26:09.839 [INF] HSWC: ChannelLink(CHANPOINT): received re-establishment message from remote side
01:26:09.851 [DBG] HSWC: ChannelLink(CHANPOINT): loaded 0 fwd pks
01:26:11.237 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=6133225)->(Chan ID=CHAN_ID, HTLC ID=4619)
01:26:11.991 [DBG] HSWC: ChannelLink(CHANPOINT): removing Add packet (Chan ID=0:0:0, HTLC ID=6133225) from mailbox
01:26:13.366 [DBG] HSWC: ChannelLink(CHANPOINT): settle-fail-filter &{1 [0]}
01:26:13.366 [DBG] HSWC: ChannelLink(CHANPOINT): Failed to send 500059997 mSAT
01:26:15.297 [DBG] CNCT: ChannelArbitrator(CHANPOINT): attempting state step with trigger=chainTrigger from state=StateDefault
01:26:15.297 [DBG] CNCT: ChannelArbitrator(CHANPOINT): new block (height=734163) examining active HTLC's
01:26:15.297 [DBG] CNCT: ChannelArbitrator(CHANPOINT): checking commit chain actions at height=734163, in_htlc_count=0, out_htlc_count=2
01:26:15.297 [DBG] CNCT: ChannelArbitrator(CHANPOINT): no actions for chain trigger, terminating
01:26:15.297 [DBG] CNCT: ChannelArbitrator(CHANPOINT): terminating at state=StateDefault
01:26:16.485 [DBG] HSWC: ChannelLink(CHANPOINT): settle-fail-filter &{1 [0]}
01:26:16.485 [DBG] HSWC: ChannelLink(CHANPOINT): Failed to send 1000047000 mSAT
01:26:26.905 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=6133240)->(Chan ID=CHAN_ID, HTLC ID=4620)

01:26:26.956 [ERR] HSWC: ChannelLink(CHANPOINT): failing link: unable to update commitment: cannot add duplicate keystone with error: internal error

01:26:26.956 [INF] HSWC: ChannelLink(CHANPOINT): exited
01:26:26.957 [INF] HSWC: ChannelLink(CHANPOINT): stopping

Your environment

  • lnd 0.14.3-beta-rc1
  • Linux server 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) x86_64 GNU/Linux
  • bitcoind v23

Steps to reproduce

Have a non-anchor channel with a CLN peer behind Tor. Have a somewhat flaky connection. Send HTLCs to the peer.

@Roasbeef
Member

Roasbeef commented May 2, 2022

lnd 0.14.3-beta-rc1

Any reason this node isn't running the production release of this version?

I wonder if this was actually an inadvertent HTLC replay attempt....

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

I checked the commits following rc1 and didn't see anything that interested me, so I didn't update (I might be wrong, of course; updating as we speak). What do you mean by "production release"? I'm running my own version with a few patches (logging, non-random channel picking for parallel channels).

@Crypt-iQ
Collaborator

Crypt-iQ commented May 2, 2022

This could happen if c-lightning was sending settles earlier than expected, but I don't think they're doing that: #6246

Do you have more in-depth logs you can share leading up to this point? Are PEER debug logs off?

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

I have debuglevel=info,CNCT=debug,DISC=warn,HSWC=debug,NANN=debug,WTCL=warn. What exactly do you need? The logs above show everything I have for the CHANPOINT (EDIT: In that timeframe).

@Crypt-iQ
Collaborator

Crypt-iQ commented May 2, 2022

Did your logs rotate or something? The channel has a history of 4620 HTLCs, so hopefully there are more logs from before the reestablish message was received.

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

I have lots of logs, no worries. Just tell me what you need; I won't send over gigabytes of logs, though :)

@Crypt-iQ
Collaborator

Crypt-iQ commented May 2, 2022

I think you should have at least one log line that is identical to the one pasted above (HTLC ID=4620 is what matters):

queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=<can be anything>)->(Chan ID=CHAN_ID, HTLC ID=4620)

If you could send from ~1 min before the earliest identical log line until the point pasted in the OP, that would help narrow things down. I don't need the channel IDs/payment hashes/IPs/pubkeys etc.; HSWC logs only.
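
For readers following along: in lnd's switch, a forwarded ADD opens a circuit keyed by its incoming (channel, HTLC) pair, and the "keystone" is the outgoing (channel, HTLC) pair bound to it; the log line above shows exactly that mapping. Below is a minimal sketch of how such a map could reject a second bind of the same outgoing slot, which is the shape of the "cannot add duplicate keystone" failure. Types and names are illustrative, not lnd's actual circuitmap code.

package main

import (
	"errors"
	"fmt"
)

// CircuitKey identifies one HTLC on one channel, mirroring the
// "(Chan ID=..., HTLC ID=...)" halves of the log lines above.
type CircuitKey struct {
	ChanID string
	HtlcID uint64
}

// ErrDuplicateKeystone mirrors the reported failure mode: an outgoing
// (channel, HTLC) slot is being bound a second time.
var ErrDuplicateKeystone = errors.New("cannot add duplicate keystone")

// CircuitMap binds incoming circuits to outgoing keystones. Purely
// illustrative; lnd's real circuit map is persisted and more involved.
type CircuitMap struct {
	opened    map[CircuitKey]CircuitKey // incoming -> outgoing
	keystones map[CircuitKey]struct{}   // outgoing keys already in use
}

func NewCircuitMap() *CircuitMap {
	return &CircuitMap{
		opened:    make(map[CircuitKey]CircuitKey),
		keystones: make(map[CircuitKey]struct{}),
	}
}

// OpenCircuit records that the ADD arriving on `in` was forwarded as `out`.
// If `out` is already taken, e.g. because a restarted link re-forwards an
// ADD whose keystone was already recorded, the bind is rejected.
func (m *CircuitMap) OpenCircuit(in, out CircuitKey) error {
	if _, ok := m.keystones[out]; ok {
		return ErrDuplicateKeystone
	}
	m.opened[in] = out
	m.keystones[out] = struct{}{}
	return nil
}

func main() {
	cm := NewCircuitMap()
	out := CircuitKey{ChanID: "CHAN_ID", HtlcID: 4619}

	// First forward binds the keystone successfully.
	fmt.Println(cm.OpenCircuit(CircuitKey{ChanID: "0:0:0", HtlcID: 6133225}, out))
	// A second ADD targeting the same outgoing HTLC slot is rejected,
	// which is the kind of condition behind the error reported above.
	fmt.Println(cm.OpenCircuit(CircuitKey{ChanID: "0:0:0", HtlcID: 6133240}, out))
}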

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

Yes.

22:10:16.275 [INF] HSWC: ChannelLink(CHANPOINT): HTLC manager started, bandwidth=4163275405 mSAT
22:10:16.275 [INF] HSWC: ChannelLink(CHANPOINT): attempting to re-synchronize
22:10:16.276 [INF] PEER: Negotiated chan series queries with PUBKEY
22:10:18.611 [INF] HSWC: ChannelLink(CHANPOINT): received re-establishment message from remote side
22:10:18.618 [DBG] HSWC: ChannelLink(CHANPOINT): loaded 46 fwd pks
22:10:18.618 [DBG] HSWC: ChannelLink(CHANPOINT): removing completed fwd pkg for height=13475
[...]
22:10:19.218 [DBG] HSWC: ChannelLink(CHANPOINT): removing completed fwd pkg for height=13520
22:10:32.588 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=6124133)->(Chan ID=CHANID, HTLC ID=4618)
22:10:32.659 [DBG] HSWC: ChannelLink(CHANPOINT): removing Add packet (Chan ID=0:0:0, HTLC ID=6124133) from mailbox
22:11:08.935 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=OTHER_CHANID, HTLC ID=7856)->(Chan ID=CHANID, HTLC ID=4619)
22:11:39.744 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=6124179)->(Chan ID=CHANID, HTLC ID=4620)
22:11:44.869 [DBG] CNCT: ChannelArbitrator(CHANPOINT): attempting state step with trigger=chainTrigger from state=StateDefault
22:11:44.869 [DBG] CNCT: ChannelArbitrator(CHANPOINT): new block (height=734142) examining active HTLC's
22:11:44.869 [DBG] CNCT: ChannelArbitrator(CHANPOINT): checking commit chain actions at height=734142, in_htlc_count=0, out_htlc_count=1
22:11:44.869 [DBG] CNCT: ChannelArbitrator(CHANPOINT): no actions for chain trigger, terminating
22:11:44.869 [DBG] CNCT: ChannelArbitrator(CHANPOINT): terminating at state=StateDefault
22:12:04.253 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=0:0:0, HTLC ID=6124191)->(Chan ID=CHANID, HTLC ID=4621)
22:12:08.995 [ERR] HSWC: ChannelLink(CHANPOINT): failing link: unable to complete dance with error: remote unresponsive
22:12:08.995 [INF] HSWC: ChannelLink(CHANPOINT): exited
22:12:08.995 [INF] HSWC: ChannelLink(CHANPOINT): stopping

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

In between those (22:12:08 and 01:26:09) I don't have any HSWC log message for that peer/channel.

@Crypt-iQ
Collaborator

Crypt-iQ commented May 2, 2022

Do you see this, and if so, at what time? I might be missing a space or something; this will be under OTHER_CHANID's chanpoint:

queueing removal of SETTLE closed circuit: (Chan ID=OTHER_CHANID, HTLC ID=7856)->(Chan ID=CHANID, HTLC ID=4619)

@C-Otto
Contributor Author

C-Otto commented May 2, 2022

22:11:08.935 [DBG] HSWC: ChannelLink(CHANPOINT): queueing keystone of ADD open circuit: (Chan ID=OTHER_CHANID, HTLC ID=7856)->(Chan ID=CHAN_ID, HTLC ID=4619)
22:12:08.996 [DBG] HSWC: ChannelLink(OTHER_CHANPOINT): queueing removal of FAIL closed circuit: (Chan ID=OTHER_CHANID, HTLC ID=7856)->(Chan ID=0:0:0, HTLC ID=0)
22:12:09.018 [DBG] HSWC: ChannelLink(OTHER_CHANPOINT): removing Fail/Settle packet (Chan ID=OTHER_CHANID, HTLC ID=7856) from mailbox

@Crypt-iQ
Collaborator

Crypt-iQ commented May 2, 2022

OK, I might have an idea of what's going on. If I'm right, this is an lnd problem, not a C-Lightning problem.

@Roasbeef
Member

Roasbeef commented May 3, 2022

Looks like this was introduced inadvertently in this PR (according to @Crypt-iQ): #4183

@Roasbeef
Member

Roasbeef commented May 4, 2022

Noted a related thread in #6482: in some cases we may not be properly cancelling inbound HTLCs if we attempt to send a commitment but the remote peer never replies. This is a bit trickier since we've technically already sent out that valid commitment, so we need to keep that HTLC in play (which may lead to a force close, since we want to be able to safely time out that incoming HTLC).
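
The force-close angle here comes from HTLC deadlines: once an incoming HTLC can no longer be failed off-chain, the only safe move near its expiry is to go on-chain so it can be timed out. A rough sketch of that deadline check follows, with an illustrative broadcast delta and simplified types; this is not lnd's actual contractcourt logic.

package main

import "fmt"

// htlc carries only what the deadline check needs: its absolute timeout
// height and whether we still have an off-chain way to remove it.
type htlc struct {
	expiryHeight        uint32
	cancellableOffchain bool
}

// broadcastDelta is how many blocks before expiry we act; lnd makes this
// configurable, the value here is only illustrative.
const broadcastDelta = 10

// needForceClose reports whether we must go on-chain now to be able to
// safely time out the HTLC.
func needForceClose(h htlc, height uint32) bool {
	if h.cancellableOffchain {
		return false // it can still be failed back cooperatively
	}
	return height+broadcastDelta >= h.expiryHeight
}

func main() {
	stuck := htlc{expiryHeight: 734200, cancellableOffchain: false}
	fmt.Println(needForceClose(stuck, 734163)) // false: deadline not reached yet
	fmt.Println(needForceClose(stuck, 734191)) // true: close to expiry, go on-chain
}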

@Crypt-iQ
Collaborator

Crypt-iQ commented May 4, 2022

We are cancelling back HTLCs properly in #6482.

@Crypt-iQ
Collaborator

Crypt-iQ commented May 4, 2022

Looked into it more, and #4183 did not introduce this, but I think it made it harder to recover from when peered with another lnd node.

@Filouman

I have seen 4 or 5 of these force closures in the last few days! All of them with CLN nodes.

Latest one:
2022-05-20T17:09:20.572Z UNUSUAL 03641a88d80a2a85bbecd770577aca9b5495616e9fef63d66ef2631b7cca1d395d-chan#4496: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel bea859af4b46d48f52d39b8e0e54b94c7e9bcb15785bccb0dbbc294759775115: internal error

Please let me know if I can offer any additional info that may help.

@zerofeerouting

zerofeerouting commented May 23, 2022

I have had about 20 of these over the last couple of days. This is a really pressing issue.

@Roasbeef
Member

Worth noting that we'll send an error there, but we won't actually attempt to force close the channel, to see if a reconnection fixes things. Some CL nodes force close on any error message received. #5602 will enable us to send the new warning message instead, but some older nodes also still rely on that error message.

We have a fix here (for the issue that leads to us failing to process a certain HTLC action): #6518.
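
For context on the warning-vs-error distinction mentioned above: BOLT 1 defines both message types (warning is type 1, error is type 17), and only error obliges the receiver to fail the channel. Here is a hedged sketch of the choice a sender might make; the types and helper below are illustrative, not lnd's peer code.

package main

import "fmt"

// Simplified BOLT 1 message type numbers: warning is 1, error is 17.
const (
	msgWarning uint16 = 1
	msgError   uint16 = 17
)

// channelMsg is a pared-down warning/error message.
type channelMsg struct {
	msgType   uint16
	channelID [32]byte // all-zero means "applies to all channels"
	data      string
}

// notifyPeer picks a warning for recoverable, link-local problems, so a
// spec-compliant peer is not obliged to fail the channel, and reserves
// error for genuinely unrecoverable channel state.
func notifyPeer(chanID [32]byte, recoverable bool, reason string) channelMsg {
	t := msgError
	if recoverable {
		t = msgWarning
	}
	return channelMsg{msgType: t, channelID: chanID, data: reason}
}

func main() {
	var cid [32]byte // stand-in channel id for the example
	m := notifyPeer(cid, true, "temporary link failure, please reconnect")
	fmt.Printf("type=%d data=%q\n", m.msgType, m.data)
}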

@Roasbeef
Member

AFAICT, something might have been inadvertently introduced on the CL end during their recent rewrite of some peer/channel handling code. Also, is everyone here running CL on the point release that fixed a force close issue?

@Filouman

Is there any relation between the 'Internal Error' and 'Unable to Read Message' errors? Have been seeing both of these regularly over the past few days with my CLN peers.

#6564

@Roasbeef
Member

Unable to read message happens when the other peer "forcibly" hangs up the connection; that might have been in response to an lnd node sending an error message. Some CL nodes (or all of them still?) will auto force close on any error message.

@zerofeerouting

I run the latest CLN (0.11.1) - my node has force closed multiple channels when my peers told me they had an internal error.

According to BOLT 1 (https://github.com/lightning/bolts/blob/master/01-messaging.md#requirements-2):

The receiving node:

upon receiving error:
    if channel_id is all zero:
        MUST fail all channels with the sending node.
    otherwise:
        MUST fail the channel referred to by channel_id, if that channel is with the sending node.

So the CLN behaviour is up to spec, AFAICT.
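
A small sketch of the receiver rule quoted above, i.e. why a CLN peer that receives error ends up force closing: fail every channel with the sender if channel_id is all zero, otherwise fail just the referenced channel. The code is illustrative, not CLN's implementation.

package main

import "fmt"

// isZeroChanID reports whether a channel_id is all zero, which per the
// quoted requirement means "fail all channels with the sending node".
func isZeroChanID(id [32]byte) bool {
	for _, b := range id {
		if b != 0 {
			return false
		}
	}
	return true
}

// handleError applies the quoted receiver rules to an incoming error
// message. failChannel stands in for whatever "fail the channel" means
// for the node (e.g. force close).
func handleError(peerChannels map[[32]byte]bool, chanID [32]byte, failChannel func([32]byte)) {
	if isZeroChanID(chanID) {
		for id := range peerChannels {
			failChannel(id) // MUST fail all channels with the sender
		}
		return
	}
	if peerChannels[chanID] {
		failChannel(chanID) // MUST fail the referenced channel
	}
}

func main() {
	var cid [32]byte
	cid[0] = 0xbe // stand-in channel id
	chans := map[[32]byte]bool{cid: true}
	handleError(chans, cid, func(id [32]byte) { fmt.Println("failing channel", id[0]) })
}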

@Roasbeef
Member

I run the latest CLN (0.11.1) - my node has force closed multiple channels when my peers told me they had an internal error.

OK, good to know; just trying to pinpoint things here, as I know there was a p2p overhaul on their end and we haven't had any significant changes in this area lately. We're reviewing and testing #6518 to see if it addresses the issue. It also seems related to another reported issue where writing to a zombie TCP socket (the other side no longer listening on it) would eventually cause force closures.
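
On the zombie-socket point: a write to a peer that silently went away can sit in kernel buffers and appear to succeed for a long time, so the sender needs write deadlines (or application-level pings) to notice the peer is gone. A generic Go sketch of that pattern follows; it is unrelated to lnd's actual peer/brontide code.

package main

import (
	"fmt"
	"net"
	"time"
)

// writeWithDeadline bounds how long a single write may block, so a
// connection to a dead peer surfaces an error instead of lingering until
// TCP itself gives up.
func writeWithDeadline(conn net.Conn, payload []byte, timeout time.Duration) error {
	if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	_, err := conn.Write(payload)
	return err
}

func main() {
	conn, err := net.DialTimeout("tcp", "127.0.0.1:9735", 2*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	if err := writeWithDeadline(conn, []byte("ping"), 5*time.Second); err != nil {
		fmt.Println("peer looks dead:", err)
	}
}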

@btweenthebars

Is there any relation between the 'Internal Error' and 'Unable to Read Message' errors? Have been seeing both of these regularly over the past few days with my CLN peers.

#6564

Is there any "keystone" error in your log? I doubt they are the same issue.

@Filouman

Is there any "Keystone" error in your log ? I doubt they are the same issue

This one?

failing link: unable to update commitment: cannot add duplicate keystone with error: internal error

@btweenthebars

failing link: unable to update commitment: cannot add duplicate keystone with error: internal error

Did it happen with the same peers for which you see the "Unable to Read Message" error?
