
BOLT#2: Add message retransmission sub-system #156

Closed

Conversation

andrewshvv
Contributor

@andrewshvv andrewshvv commented Mar 2, 2017

Issue: #137

I know that we discussed that funding manager messages shouldn't be included in retransmission, but after thinking about this some more I decided to include them anyway, because the specification is not about us, but rather about convergence with other lightning network clients. If the funding messages are included in the specification, it means that other clients might be built with this logic in mind.

Instead, I believe that we should tolerantly ignore funding messages that were already processed or have expired.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 5 times, most recently from c18e4ec to 480beb6 Compare March 2, 2017 13:19
@andrewshvv andrewshvv changed the title Add a message retransmission sub-system BOLT#2 #137 BOLT#2: Add a message retransmission sub-system #137 Mar 2, 2017
@andrewshvv andrewshvv changed the title BOLT#2: Add a message retransmission sub-system #137 BOLT#2: Add message retransmission sub-system Mar 2, 2017
return nil
}

// Add indexes in additional array, because
Member

As the keys are stored on disk in big-endian order, the sorting isn't necessary here. When one performs an in order scan, the items will be retrieved in chronological order by the sequence number.

Member

See the comment above about re-working the schema to eliminate this sorting step.

basePeerBucketKey = []byte("peermessages")
)

// MessagesStore represents the boltdb storage for messages inside
Member

Keeping with the theme of only storing generic LN data with channeldb (the graph, invoices, etc.), I think this entire file should instead be moved to reside in the root lnd directory.

}

// Generate next sequence number to preserver the message order.
sequence, err := peerBucket.NextSequence()
Member

Hmm, what's implemented here currently isn't quite what we discussed offline. As it is now, a sorting step is inserted before retrieving all the messages for a peer, as the messages are stored in distinct buckets.

Instead, what I originally described was this:

  • A single top-level bucket that maps: index -> code || msg. Concatenating the message code to the stored data allows the db logic to properly parse the wire message without trial and error or needing an additional index.
  • Within this top-level bucket, another bucket would be stored which acts as an index into the top-level bucket. This bucket will be used to locate which messages can be deleted from the log in response to a retrieved ACK message.

The mapping for items in this bucket would be: messageCode -> {index_1, index_2, index_3, etc.}. So when receiving a new message, you check for the existence of the message code in this index bucket, then delete all the indexes from the top-level bucket that are returned.

Similarly, when adding a new message to the top-level bucket, another compile-time constant set of mappings needs to be consulted to determine which message ACKs the message being stored. So in addition to storing it in the top-level bucket, you'd also append to the record for the messageCode mappings.
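To make the proposed schema a bit more concrete, here is a hedged sketch of how it might look on top of boltdb. The bucket names, function names, parameters, and the uint32 encoding of message codes are assumptions for illustration only, not the actual channeldb code:

// Sketch only: bucket names, helpers, and the uint32 code encoding below
// are illustrative, not the actual channeldb schema.
package msgstore

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

var (
	// msgBucketKey maps: index -> code || msg.
	msgBucketKey = []byte("peer-messages")

	// ackIndexBucketKey is nested inside msgBucketKey and maps:
	// messageCode -> {index_1, index_2, ...}. An in-order scan of the
	// parent bucket should skip this key (its cursor value is nil).
	ackIndexBucketKey = []byte("peer-messages-ack-index")
)

// addMessage logs a wire message under the next sequence number and records
// its index under the code that will eventually ACK it (ackedByCode would
// come from the compile-time ACK mapping mentioned above).
func addMessage(db *bolt.DB, msgCode, ackedByCode uint32, payload []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs, err := tx.CreateBucketIfNotExists(msgBucketKey)
		if err != nil {
			return err
		}

		// Big-endian sequence numbers keep an in-order scan in
		// chronological order, so no sorting step is needed on read.
		seq, err := msgs.NextSequence()
		if err != nil {
			return err
		}
		var indexKey [8]byte
		binary.BigEndian.PutUint64(indexKey[:], seq)

		// Prepend the message code so the reader can parse the wire
		// message without trial and error.
		value := make([]byte, 4+len(payload))
		binary.BigEndian.PutUint32(value[:4], msgCode)
		copy(value[4:], payload)
		if err := msgs.Put(indexKey[:], value); err != nil {
			return err
		}

		// Append this index to the record kept for the ACKing code.
		ackIndex, err := msgs.CreateBucketIfNotExists(ackIndexBucketKey)
		if err != nil {
			return err
		}
		var codeKey [4]byte
		binary.BigEndian.PutUint32(codeKey[:], ackedByCode)
		indexes := append([]byte(nil), ackIndex.Get(codeKey[:])...)
		return ackIndex.Put(codeKey[:], append(indexes, indexKey[:]...))
	})
}

// ackMessages deletes every logged message recorded under the code of the
// ACK message that was just received.
func ackMessages(db *bolt.DB, ackCode uint32) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs := tx.Bucket(msgBucketKey)
		if msgs == nil {
			return nil
		}
		ackIndex := msgs.Bucket(ackIndexBucketKey)
		if ackIndex == nil {
			return nil
		}

		var codeKey [4]byte
		binary.BigEndian.PutUint32(codeKey[:], ackCode)
		indexes := ackIndex.Get(codeKey[:])
		for i := 0; i+8 <= len(indexes); i += 8 {
			if err := msgs.Delete(indexes[i : i+8]); err != nil {
				return err
			}
		}
		return ackIndex.Delete(codeKey[:])
	})
}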

Member

Switching to the schema above eliminates the unnecessary sorting logic and also still retains the message order required to properly perform retransmissions.

lnd_test.go Outdated
@@ -1459,6 +1459,9 @@ func testRevokedCloseRetribution(net *networkHarness, t *harnessTest) {
return bobChannelInfo.Channels[0], nil
}

// Wait for channel to be acquired by router.
time.Sleep(time.Second)
Member

The current set of tests seems to pass relatively reliably without these added sleeps. Rather than adding additional sleeps, with the topology notification code merged in, we can add hooks into the integration testing framework to properly wait for messages to propagate before attempting to dispatch payments through newly opened channels.

@@ -20,50 +20,124 @@ const MessageHeaderSize = 12
// individual limits imposed by messages themselves.
const MaxMessagePayload = 1024 * 1024 * 32 // 32MB

// Code represent the unique identifier of the lnwire command.
type Code uint32
Member

Naming suggestion: MessageCode.

networktest.go Outdated
// CopyLogs copy/dumps Alice and Bob lnd daemon logs with specified period
// of time in temporary directory.
// NOTE: Panics error logs will not be logged.
func (n *networkHarness) CopyLogs(period time.Duration) (func(), <-chan error) {
Member

What's the purpose of this method? It doesn't look to be used anywhere with the PR currently.

Contributor Author

@andrewshvv andrewshvv Mar 13, 2017

I am using it during debugging and I thought it might be useful to others, but essentially this just copies the lnd logs into the temp directory.

retranmission.go Outdated
type MessageStore interface {
// Get returns the sorted set of messages in the order they were
// added in storage.
Get() ([]lnwire.Message, error)
Member

Naming suggestion: GetUnackedMessages.

retranmission.go Outdated
//
// NOTE: The original purpose of creating this interface was in separation
// between the retransmission logic and specifics of storage itself. In case of
// such interface we may create the test storage and register the add,remove,get
Member

Missing spaces after the commas at the end of this sentence.

retranmission.go Outdated
// Is due to the fact of logic of retransmission subsystem, where we
// need remove messages not one by one but in contrary by groups of
// messages.
Remove(codes ...lnwire.Code) error
Member

With the modification to the schema I suggested, I think this method would be changed to something along the lines of an Ack method and instead take a single lnwire.MessageCode.
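For illustration only, the reworked interface might end up looking roughly like this; the names simply follow the naming suggestions elsewhere in this review and aren't a settled API:

// Illustrative sketch of the suggested MessageStore shape, not the actual
// lnd interface.
type MessageStore interface {
	// GetUnackedMessages returns the unacknowledged messages in the
	// order they were added to storage.
	GetUnackedMessages() ([]lnwire.Message, error)

	// Add appends a message to the retransmission log.
	Add(msg lnwire.Message) error

	// Ack removes every logged message that is acknowledged by the given
	// message code, using the compile-time ACK mapping kept in the
	// storage layer.
	Ack(code lnwire.MessageCode) error
}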


// Ack encapsulates the specification logic about which messages should be
// acknowledged by receiving this one.
func (rt *retransmitter) Ack(msg lnwire.Message) error {
Member

With the comment above, this method would be simplified a good bit, as it would perform a single unconditional delete from the database.

The mapping here (what gets deleted on receipt of a message) would be moved into the storage layer, as it would need to be consulted each time a message is written.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 2 times, most recently from 08df3e9 to 31cb691 Compare March 13, 2017 19:46
Member

@Roasbeef Roasbeef left a comment

Nice work on the latest iteration!

This PR is getting pretty close; I'm going to move on to some local testing of the functionality while the latest comments are being addressed.

@@ -30,4 +30,8 @@ var (
ErrNodeAliasNotFound = fmt.Errorf("alias for node not found")

ErrSourceNodeNotSet = fmt.Errorf("source node does not exist")

// ErrPeerMessagesNotFound is returned when no message have been
Member

have been found -> has been found

// MessagesStore represents the boltdb storage for messages inside
// retransmission sub-system.
type MessagesStore struct {
// id is a unique identificator of peer.
Member

id is a unique slice of bytes identifying a peer. This value is typically a peer's identity public key serialized in compressed format

Contributor Author

I rethought the meaning of this field a bit due to the recent changes in the discovery PR. I think it would be better to keep id as something not coupled to a peer at all, but mention that it is usually a compressed pub key.

Member

What changes in the discovery PR? Peers within the network are identified globally by their public keys.

In any case, the comment should be replaced with the first sentence of my suggestion:

id is a unique slice of bytes identifying a peer.

}

return peerBucket.Put(indexBytes, b.Bytes())

Member

Minor nit: there's an extra new line here.

if m.ID != 1 {
t.Fatal("wrong order of message")
}

Member

The test should also assert the deep equality of the message read from disk vs the original message.
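For instance, assuming the test reads the stored messages back into a slice (the variable names here are hypothetical), the assertion could be roughly:

// Hypothetical names: origMsg was written to storage, readMsgs is what the
// store returned on read-back; assumes "reflect" is imported in the test.
if !reflect.DeepEqual(readMsgs[0], origMsg) {
	t.Fatalf("message read from disk doesn't match original: "+
		"got %v, want %v", readMsgs[0], origMsg)
}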

if err != nil && err != ErrPeerMessagesNotFound {
t.Fatalf("can't get the message: %v", err)
} else if len(messages) != 0 {
t.Fatal("wrong lenght of messages")
Member

lenght -> length

lnwire.CmdUpdateFailHTLC,
lnwire.CmdUpdateFufillHTLC,
lnwire.CmdCommitSig,
lnwire.CmdCloseRequest,
Member

If I'm reading the current spec draft correctly, CloseRequest and FundingLocked should be omitted. They're not ACK'd by a RevokeAndAck message.

Contributor Author

Hmm, maybe I am missing something, but from the spec:

funding_locked: acknowledged by update_ messages, commitment_signed, revoke_and_ack or shutdown messages.
shutdown: acknowledged by closing_signed or revoke_and_ack

server.go Outdated
}
p.Disconnect()
Member

Why was the disconnect removed?

Member

If we're unable to create the peer, then it must be removed from the connmgr's set of pending persistent connections, hence the use of Disconnect here.

Honestly, we need to revisit the current connmgr integration for coherency as it was put together rather quickly in order to get the functionality out the door.

Contributor Author

@andrewshvv andrewshvv Mar 15, 2017

In this case p is nil, which causes a panic if an error occurs at this stage. Maybe we should always return an instance of peer from the newPeer function? In that case we could restore the previous logic.

Member

Ahh, nice find! In the future, I'd prefer for fixes like this to be either included in the PR in a distinct commit, or entirely within its own PR. Otherwise, it's easy to miss amidst all the other changes within the PR.

Contributor Author

@andrewshvv andrewshvv Mar 16, 2017

Sure, I will create an additional PR for that.

server.go Outdated
@@ -542,7 +544,7 @@ func (s *server) addPeer(p *peer) {
return
}

// Track the new peer in our indexes so we can quickly look it up either
// Track the new peer in our messages so we can quickly look it up either
Member

Similar comment here about reverting this line diff.

Contributor Author

oops, nice catch!

server.go Outdated
@@ -551,7 +553,7 @@ func (s *server) addPeer(p *peer) {
s.peersByPub[string(p.addr.IdentityKey.SerializeCompressed())] = p
s.peersMtx.Unlock()

// Once the peer has been added to our indexes, send a message to the
// Once the peer has been added to our messages, send a message to the
Member

Similar comment here about reverting this line diff.

@@ -236,7 +235,7 @@ func TestSerializeKidOutput(t *testing.T) {

deserializedKid, err := deserializeKidOutput(&b)
if err != nil {
fmt.Printf(err.Error())
t.Fatalf("can't deserialize kid output: %v", err)
Member

Nice catch! I'd missed this during my initial review.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 2 times, most recently from c37d1ba to 19239c7 Compare March 15, 2017 09:04
@coveralls

Coverage Status

Coverage increased (+0.09%) to 67.801% when pulling 19239c7541bdc368fe75652bd5b75b3b6dd7199c on AndrewSamokhvalov:retransmission_subsystem into d723aad on lightningnetwork:master.

In this commit lnwire message header encode/decode tests
were added. Without them, a newcomer programmer may change the
type inside the message header and spend hours debugging the
integration tests trying to understand why their node can't
start and interact properly.
Issue: lightningnetwork#137

In this commit the retransmission subsystem and boltdb message storage were
added. The retransmission subsystem is described in detail in the BOLT #2
(Message Retransmission) section. This subsystem keeps records
of all messages that were sent to the other peer, waits for the ACK
message to be received from the other side, and after that removes all
acked messages from the storage.
Issue: lightningnetwork#137

In this commit the retransmission subsystem was included in lnd;
now upon peer reconnection we fetch all messages from the message storage
that were not acked and send them again to the remote side.
@andrewshvv
Contributor Author

I have added a PR adding the stable gometalinter, because otherwise Travis will fail.

@coveralls

coveralls commented Mar 16, 2017

Coverage Status

Coverage increased (+0.09%) to 67.801% when pulling f8b2624 on AndrewSamokhvalov:retransmission_subsystem into d723aad on lightningnetwork:master.

@Roasbeef
Member

Roasbeef commented Mar 17, 2017

I've started to test this PR locally and noticed that the way it currently goes about implementing the retransmission is missing a key feature.

The description in the original issue stated that the retransmission sub-system should actually sit between the server and the peer. In the design as originally described, if the peer isn't currently online (sendToPeer can't locate the target peer), then the message would be queued on disk to be retransmitted once the peer comes online again, as it's "outside" the interaction of the peer itself. With this architecture, sub-systems never directly call peer.queueMsg; instead they are passed, directly or indirectly, the server's sendToPeer method, which transparently handles committing the relevant messages to disk if the peer is offline.

Such behavior would allow sub-systems like the fundingManager to opaquely gain access to a reliable messaging stream to the remote peer, regardless of whether the peer is/was online or not.

As an example, let's say we're nearing the completion of a funding workflow. The ultimate block finalizing the channel arrives, the fundingManager notifies the relevant systems, then goes to send the FundingLocked message to the channel counterparty so the channel itself can start to be updated. However, let's say that the channel peer went offline right before the final block was mined. As implemented in this PR, the call to sendToPeer will fail (as the peer isn't online) and the FundingLocked message will never be sent.

This PR is 80% of the way there, functionality wise. To get that last 20%, the following behavior needs to be implemented:

  • When the server is handling sendToPeer, if the target peer isn't online, then the server should write directly to the MessageStore of the target peer.
  • Any routing/discovery messages should be omitted from the behavior above.
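A rough sketch of that fallback path, with hypothetical helper names (messageStoreForPeer, isRoutingMessage) standing in for whatever the final implementation uses; the real sendToPeer signature and server fields may differ:

// Sketch only: helper names are hypothetical and the real signature/fields
// may differ from what's shown here.
func (s *server) sendToPeer(target *btcec.PublicKey, msgs ...lnwire.Message) error {
	pubStr := string(target.SerializeCompressed())

	s.peersMtx.Lock()
	p, connected := s.peersByPub[pubStr]
	s.peersMtx.Unlock()

	if connected {
		// Peer is online: hand the messages to it directly.
		for _, msg := range msgs {
			p.queueMsg(msg, nil)
		}
		return nil
	}

	// Peer is offline: commit the messages to its MessageStore so the
	// retransmitter can deliver them on the next reconnection, skipping
	// routing/discovery traffic which shouldn't be queued.
	store := s.messageStoreForPeer(pubStr) // hypothetical helper
	for _, msg := range msgs {
		if isRoutingMessage(msg) { // hypothetical predicate
			continue
		}
		if err := store.Add(msg); err != nil {
			return err
		}
	}
	return nil
}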

@Roasbeef
Member

Roasbeef commented Mar 17, 2017

Here's an alternative to what's described above:

  • Rather than the fundingManager relying on the existence of a persistent messaging queue, it could instead handle reliable completion of the funding workflow itself.
  • In this case, the fundingManager would gain some persistent state which records whether the final step in the state machine has been completed or not.
  • The final step is reliably sending the FundingLocked message to complete a funding workflow.
  • The fundingManager maintains this state for all funding workflows which enter the final, waiting-for-channel-confirmation state.
  • Upon startup, for all funding workflows in this final limbo state, a channel barrier for the ChannelPoint is created.
  • The fundingManager registers with the server for a notification once the peer is online. Upon dispatch of the notification, the FundingLocked message is sent (a rough sketch of this startup path follows below).
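A minimal sketch of that startup path, assuming hypothetical names (pendingFundingLocked, notifyWhenOnline, markFundingLockedSent, and the rest) for whatever the fundingManager and server eventually expose:

// Sketch only: every identifier below is a hypothetical stand-in, not the
// actual lnd fundingManager API.
func (f *fundingManager) resumeFundingLocked() {
	// pendingFundingLocked would be loaded from the manager's own
	// persistent state: workflows whose final step never completed.
	for _, c := range f.pendingFundingLocked {
		c := c

		// Re-create the channel barrier for this ChannelPoint so that
		// updates are held back until the channel is fully operational.
		f.newChanBarriers[c.chanPoint] = make(chan struct{})

		// Wait for the server to signal that the peer is online, then
		// send FundingLocked and persist that the final step is done.
		online := f.server.notifyWhenOnline(c.peerPub)
		go func() {
			<-online
			f.server.sendToPeer(c.peerPub, c.fundingLocked)
			f.markFundingLockedSent(c.chanPoint)
		}()
	}
}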

@Roasbeef
Member

Decided that what I've described w.r.t the fundingManager is a special case for sub-systems within the codebase atm. I'll continue testing this PR and will implement the revision of the functionality I described above myself.

"to the peer(%v)", len(messages), p)

for _, message := range messages {
// Sending over sendToPeer will cause block because of
Member

Can you insert a logging message here that just logs the MessageCode itself? Thanks!

func (rt *retransmitter) Ack(msg lnwire.Message) error {
switch msg.Command() {

case lnwire.CmdSingleFundingResponse:
Member

For now, all funding messages should be omitted from retransmission other than the FundingLocked message. While testing locally I just hit a bug that causes the funding manager to deadlock if lnd is restarted in the middle of a channel opening that requires more than one confirmation.

Atm, the spec is incorrect. No funding messages should be retransmitted at all until the point at which either side is committed to a funding transaction.

)
case lnwire.CmdCloseComplete:
return rt.remove(
lnwire.CmdCloseRequest,
Member

Atm CloseComplete is never sent within the daemon. Therefore, this entry should be removed. Otherwise, the node will keep sending the same CloseRequest message indefinitely upon each restart. The responding node will simply ignore the message as the channel has already been closed.

lnwire.CmdCloseRequest:
return rt.remove(
lnwire.CmdFundingLocked,
lnwire.CmdRevokeAndAck,
Member

@Roasbeef Roasbeef Mar 17, 2017

For now, all instances of RevokeAndAck should be omitted from retransmission. As it is now, because we still use an "initial revocation window" of 1, peer restarts will cause lnd to send the initial RevokeAndAck twice with the same revocation values. This'll cause the channel to fail down the line as a state transition will re-use the same preimage rather than going to the next leaf node in the tree.

20:52:08 2017-03-16 [INF] PEER: retransmission subsystem resends 1 messages to the peer(020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019)
20:52:08 2017-03-16 [INF] CRTR: Synchronizing channel graph with 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b
20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc42053f2d0)({
 ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
 Revocation: ([32]uint8) (len=32 cap=32) {
  00000000  6a 00 62 86 31 55 b1 4d  8f 20 e6 53 f2 8c f7 78  |j.b.1U.M. .S...x|
  00000010  1c b2 72 d3 07 86 57 2d  5d bc 55 4f b4 a4 c8 a1  |..r...W-].UO....|
 },
 NextRevocationKey: (*btcec.PublicKey)(0xc420318160)({
  Curve: (elliptic.Curve) <nil>,
  X: (*big.Int)(0xc420318180)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
  Y: (*big.Int)(0xc4203181a0)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
 }),
 NextRevocationHash: ([32]uint8) (len=32 cap=32) {
  00000000  46 b5 6c 1c 0e 0d 50 d4  a1 3c 97 c6 8c 8e 5d 6e  |F.l...P..<....]n|
  00000010  15 5b 62 f1 de 12 ec af  4a 11 a2 21 b2 4e a1 89  |.[b.....J..!.N..|
 }
})

......

20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc420559f80)({
 ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
 Revocation: ([32]uint8) (len=32 cap=32) {
  00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 },
 NextRevocationKey: (*btcec.PublicKey)(0xc42053bec0)({
  Curve: (elliptic.Curve) <nil>,
  X: (*big.Int)(0xc42053be40)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
  Y: (*big.Int)(0xc42053be60)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
 }),
 NextRevocationHash: ([32]uint8) (len=32 cap=32) {
  00000000  46 b5 6c 1c 0e 0d 50 d4  a1 3c 97 c6 8c 8e 5d 6e  |F.l...P..<....]n|
  00000010  15 5b 62 f1 de 12 ec af  4a 11 a2 21 b2 4e a1 89  |.[b.....J..!.N..|
 }
})

Member

@Roasbeef Roasbeef Mar 17, 2017

In this state, the state machines of both channels will actually enter a negative feedback cycle, continually failing as the wrong revocation message is being sent over and over again. As a result, the channels are no longer usable after a single restart.

// and may need to be re-established from time to time and reconnection
// introduces doubt as to what has been received such logic is needed to be sure
// that peers are in consistent state in terms of message communication.
type retransmitter struct {
Contributor

just noticed the filename has a typo, should probably be retransmission.go

@Roasbeef
Member

Closing this as it has been replaced by #231. We might possibly integrate some sections of this into the project at a later point though.
