-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BOLT 2: quiescence protocol (feature 34/35) option_quiesce #869
BOLT 2: quiescence protocol (feature 34/35) option_quiesce #869
Conversation
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
99c0499
to
b96218b
Compare
Added the concept of "initiator" which makes splicing far simpler. |
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
Changelog-EXPERIMENTAL: Protocol: we support the quiescence protocol from lightning/bolts#869 Signed-off-by: Rusty Russell <[email protected]>
Trivial rebase on top of #880 |
b96218b
to
65f98ff
Compare
If
|
65f98ff
to
9832e17
Compare
Excellent point! The purpose of STFU is to support other things, but it's worth having its own feature bit IMHO anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
I think the wording around pending can be confusing and the cases if peers ignore the protocol could be elaborated a bit.
Also as a feedback: I initially thought that this could be generalized to the case of closing the channel but there we actually do want to send more updates to resolve / settle outstanding htlcs. I am writing this to demonstrate how easy it was to get confused by the notation.
The sender of `stfu`: | ||
- MUST NOT send `stfu` unless `option_quiesce` is negotiated. | ||
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals | ||
or fee updates are pending for either peer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way how I initially understood the word pending
was that there are HTLCs in flight which have not been settled or removed. Yet after going through the text I think pending
means that the statemachine is not at a point where revoke_and_ack
has been sent.
Maybe we could elaborate this a bit and make it more concrete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This refers to the state machine, where changes are referred to as pending.
Concretely, this is not allowed if you have sent update_* messages, without a commitment_signed. Is that clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would have been much clearer. But maybe not necessary at that point of the document if it is clear within BOLT2 that the word pending always refers to the state machine and not to in flight HTLCs. If not I suggest to introduce such a clarification the first time the word pending is being used in BOLT 2. Also as said I eventually understood the meaning of pending so the issue seems to be not that big.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having trouble understanding what "pending" means here.
In this flow, is the add considered pending?:
--update_add_htlc-->
--commit_sig------->
If so, is it only considered non-pending at the end of this flow?:
--update_add_htlc-->
--commit_sig------->
<--revoke_and_ack--
<--commit_sig------
--revoke_and_ack-->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that "pending" is confusing here. I think it would be more clear to write:
- MUST NOT send
stfu
if any of the sender's htlc additions, htlc removals
or fee updates have not been counter signed or failed by our peer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that "pending" is confusing here. I think it would be more clear to write:
* MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals or fee updates have not been counter signed or failed by our peer.
I agree. I think it's still vague.
- MUST reply with `stfu` once it can do so. | ||
|
||
Upon disconnection: | ||
- the channel is no longer considered quiescent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the only condition under which a channel can get out of being considered quiescent?
I guess future protocols that need a quiescent channel end with stopping the channel to be considered quiescent. But maybe this could also be done with a separate message or an additional explicit field in the stfu
that resets the request to be quiescent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. For now we only use this for channel upgrade: other changes may have different needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably missed something but why is a disconnect (of the connection) necessary for a channel upgrade? Do the peers after upgrade need to execute the routines in connection reestablishment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether we add a new message that explicitly ends quiescence as suggested by @ProofOfKeags or leave it to be described in each downstream protocol, I think we need to say you can exit quiescence without sending a disconnect.
I do not think all downstream protocols should be forced to trigger a disconnect to leave quiescence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is addressed.
- MUST NOT send `stfu` unless `option_quiesce` is negotiated. | ||
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals | ||
or fee updates are pending for either peer. | ||
- MUST NOT send `stfu` twice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did not specify what happens if a receiver receives two stfu
messages? I guess the second one SHOULD be ignored or should there be more drastic measures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's undefined; since the sender shouldn't do that, I'd probably reply with a warning and disconnect.
- MUST set `initiator` to 1 | ||
- MUST set `channel_id` to the id of the channel to quiesce. | ||
- MUST now consider the channel to be quiescing. | ||
- MUST NOT send an update message after `stfu`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again what happens if the sender does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, undefined. I would warning and disconnect, but that's not a requirement.
You're right! We would probably use the same behavior as what happens if the other peer disconnects while we're quiescent and haven't completed the inner protocol. In the splicing case, it means that if we've reached a state where we cannot abort anymore, we finalize the signatures exchange when reconnecting or we force-close (if our peer is buggy/malicious). That means that this explicit quiescence termination step is still some additional code, but it reuses existing behavior, so it may not be as bad as I thought. I agree with you that it is conceptually cleaner to have a way to explicitly end quiescence. I'm less strongly against this than I was in my previous comment, it's probably at least worth prototyping an implementation to see how much code it would add before deciding whether we want this or not.
Yes, that sounds good 👍. I'd also be very interested in hearing @rustyrussell's thoughts on this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My primary suggestion is to replacing pending
and acknowledge
with language that is more precise.
I think we also need to add something about ending quiescence when a downstream protocol succeeds and does not disconnect.
easiest on channels where both commitment transactions match, and no | ||
pending updates are in flight. We define a protocol to quiesce the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce the ambiguity with 'pending' meaning that an htlc is waiting for a pre-image vs. 'pending' meaning the update has not been mutually committed to by both sides, how about this instead:
"and all proposed updates are fully committed to by both parties."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do this
The sender of `stfu`: | ||
- MUST NOT send `stfu` unless `option_quiesce` is negotiated. | ||
- MUST NOT send `stfu` if any of the sender's htlc additions, htlc removals | ||
or fee updates are pending for either peer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that "pending" is confusing here. I think it would be more clear to write:
- MUST NOT send
stfu
if any of the sender's htlc additions, htlc removals
or fee updates have not been counter signed or failed by our peer.
|
||
If both sides send `stfu` simultaneously, they will both set | ||
`initiator` to `1`, in which case the "initiator" is arbitrarily | ||
considered to be the channel funder (the sender of `open_channel`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we also mention open_channel2
here for completeness after rebasing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we should still add this.
### Rationale | ||
|
||
The normal use would be to cease sending updates, then wait for all | ||
the current updates to be acknowledged by both peers, then start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose we replace "acknowledged" with "counter signed or rejected", or with whatever term we use in the "MUST NOT send stfu
" section above to replace the ambiguous use of "pending".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
- MUST reply with `stfu` once it can do so. | ||
|
||
Upon disconnection: | ||
- the channel is no longer considered quiescent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether we add a new message that explicitly ends quiescence as suggested by @ProofOfKeags or leave it to be described in each downstream protocol, I think we need to say you can exit quiescence without sending a disconnect.
I do not think all downstream protocols should be forced to trigger a disconnect to leave quiescence.
I have now come to believe that if we want this to be a standalone protocol gadget, we actually need an explicit resume message if we want to be able to test these gadgets in isolation. At the moment I have been able to complete testing that we defer any update inducing state transitions while we are in an stfu state. To complete the testing we have to ensure that the updates that would have normally taken place while we were quiesced are run. However, there is a question where if we have no downstream protocol that can signify the termination of quiescence, we will stay in that state indefinitely. We can, of course, disconnect and test that things start up that way, however we would feasibly want to also do some live testing to ensure that once a channel is no longer quiesced, we can resume traffic and never find ourselves in an inconsistent state. As it stands right now, without a EDIT: spec update proposal |
From our experience, tests for quiescence ended by disconnect are very valuable. A I think for the purpose of treating quiescence as a stand alone gadget for testing, disconnect is sufficient. However, there could be a place for |
I am not suggesting this as a substitution. I think we need to test both. However, live resume has some different subtleties than resume via reconnect, which is why I was hoping to be able to test both situations. At the moment I'm limited to disconnect only in lieu of a complete implementation of splicing or dyncomms. |
- otherwise: | ||
- SHOULD NOT send any more update messages. | ||
- MUST reply with `stfu` once it can do so. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you should add, "must now consider the channel as quiescent here as well" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't do that because quiescence is only achieved once both sides have sent it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah both sides have sent it, if a peer receives stfu and has sent one before, the peer must now consider the channel as quiescent (This is already stated in the spec, line 527)
The otherwise branch(L528-530), where the peer receives but has not sent one and so replies, I think it should also be stated that peer must now consider the channel as quiescent as well, just as in the other scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's redundant as that case is already covered by the requirements on the sender
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think it is. I could not find anywhere in the sender perspective where it is stated that the channel should now be considered as quiescent when stfu is sent. Rather, the channel is considered as quiescing in that case, line 522. I think when stfu is sent and received that is when the channel is considered quiescent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I will add this to the sender requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add it when the sender has received a reply so it might be more appropriate here, but maybe there is a better way to structure it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand. Is it quiescent only when you've send AND received the message? Or only when you send it?
The effort to add an explicit resume message is being abandoned after the discussion taking place here. The core observation is that it increases the number of situations where we can result in "half-committed" state. Protocol state is considered half-committed when it can no longer be aborted (usually due to the transmission of a signature or secret) but has not yet reached a state where both parties have done so. Currently we resolve all half-committed state issues through the I will open up a new PR that clarifies these more subtle requirements and link it back here. |
Weighing in late here, but is it worth adding an advisory note about disconnecting to exit quiescence if you need to resolve a HTLC to prevent a force close? Given that we don't settle our in-flight HTLCs, seems to me that a badly timed quiescence with some almost-expired HTLCs could be unfortunate unless you're paying attention to what's in your quiesced channel. |
Yes this is an important observation. I think we do need to add an advisory note recommending that nodes disconnect if quiescence lasts "too long" at their discretion. Nominally a quiescence session should not take more than a handful of seconds, as any negotiation that takes place on a quiesced channel should be able to complete imminently. Given that all of the HTLCs are block bound this should be improbable. One way to combat the edge case though would be that if a channel is quiesced and a new block comes in (which would be what usually triggers any HTLC timeout action), then we can use a new block as a cue to cycle the connection on quiesced channels. I'll need to think a bit more about the consequences of this but I think it would be simpler to implement than a quiescence controller that was fully "htlc aware" |
Agreed, on our side we have a plain duration-based timeout: once quiescence is initiated, we start a 90 seconds timer (that value can be overridden by the node operator) that is tied to the quiescence session (and include the quiescent action, e.g. a splice). If that timer ticks before we get out of quiescence, we disconnect. |
Using a version of Eclair modified for testing quiescence at commits 87b141f and 2fcdc4a I performed interop testing of quiescence with the LND quiescence branch ProofOfKeags:feature/stfu at commit 9225d29. I confirmed the following basic scenarios played out as expected:
From these tests I feel comfortable about the interoperability between LND and Eclair. I think we should be good to merge this PR after addressing clean-up comments. |
Signed-off-by: Rusty Russell <[email protected]> Header from folded patch 'bolt_2__set_an_initiator_in_quiescence.patch': BOLT #2: Set an initiator in quiescence. This is especially useful for protocols such as splicing; for simplified commitment transactions, there is already an implied initiator at each point, so having the negotiation at splicing time would be redundant. Signed-off-by: Rusty Russell <[email protected]> Header from folded patch 'option_quiesce__feature_to_support_stfu_method.patch': option_quiesce: feature to support stfu method. In practice, sftu is useless unless you have something (e.g. channel_upgrade) which uses it, but adding a feature is best practice IMHO. Signed-off-by: Rusty Russell <[email protected]>
Rebased, merged rustyrussell#15 |
a68d7b0
to
b474755
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK b474755, I believe this is ready to go 🚀
We may want to add a paragraph to clarify #869 (comment), or that can be a follow-up PR.
…ith HTLCs pending. Signed-off-by: Rusty Russell <[email protected]>
Added timeout requirement, as agreed. |
Both nodes: | ||
- MUST disconnect after 60 seconds of quiescence if the HTLCs are pending. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this has already been merged but I do not believe this is a MUST
. This is up to node policy and can be initiated by either side. Especially considering that time is both relative and difficult in distributed systems this ought to be a SHOULD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "pending" here? I believe it is a different definition than the one above.
A simple protocol for upgrades, splices and other things which are easier when no updates are in-flight.