
BOLT#2: Add message retransmission sub-system #156

Closed

Conversation

andrewshvv
Contributor

@andrewshvv andrewshvv commented Mar 2, 2017

Issue: #137

I know that we discussed that funding manager messages shouldn't be included in retransmission, but after thinking about this some more I decided to include them anyway, because the specification is not about us, but rather about convergence with other lightning network clients. If the funding messages are included in the specification, it means that other clients might be built with this logic in mind.

Instead, I believe that we should tolerantly ignore funding messages that were already processed or have expired.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 5 times, most recently from c18e4ec to 480beb6 Compare March 2, 2017 13:19
@andrewshvv andrewshvv changed the title Add a message retransmission sub-system BOLT#2 #137 BOLT#2: Add a message retransmission sub-system #137 Mar 2, 2017
@andrewshvv andrewshvv changed the title BOLT#2: Add a message retransmission sub-system #137 BOLT#2: Add message retransmission sub-system Mar 2, 2017
return nil
}

// Add indexes in additional array, because
Member

As the keys are stored on disk in big-endian order, the sorting isn't necessary here. When one performs an in order scan, the items will be retrieved in chronological order by the sequence number.

Member

See the comment above about re-working the schema to eliminate this sorting step.

basePeerBucketKey = []byte("peermessages")
)

// MessagesStore represents the boltdb storage for messages inside
Member

Keeping with the theme of only storing generic LN data with channeldb (the graph, invoices, etc.), I think this entire file should instead be moved to reside in the root lnd directory.

}

// Generate next sequence number to preserver the message order.
sequence, err := peerBucket.NextSequence()
Member

Hmm, what's implemented here currently isn't quite what we discussed offline. As it is now, a sorting step is inserted before retrieving all the messages for a peer, as the messages are stored in distinct buckets.

Instead, what I originally described was this:

  • A single top-level bucket that maps: index -> code || msg. Concatenating the message code to the stored data allows the db logic to properly parse the wire message without trial and error or needing an additional index.
  • Within this top-level bucket, another bucket would be stored which acts as an index into the top-level bucket. This bucket will be used to locate which messages can be deleted from the log in response to a retrieved ACK message.

The mapping for items in this bucket would be: messageCode -> {index_1, index_2, index_3, etc.}. So when receiving a new message, you check for the existence of the message code in this index bucket, then delete all the indexes from the top-level bucket that are returned.

Similarly, when adding a new message to the top-level bucket, another compile-time constant set of mappings needs to be consulted to determine which message ACKs the message being stored. So in addition to storing it in the top-level bucket, you'd also append to the record for the messageCode mappings.
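To make the proposed schema a bit more concrete, here is a hedged sketch of how it might look on top of boltdb. The bucket names, function names, parameters, and the uint32 encoding of message codes are assumptions for illustration only, not the actual channeldb code:

// Sketch only: bucket names, helpers, and the uint32 code encoding below
// are illustrative, not the actual channeldb schema.
package msgstore

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

var (
	// msgBucketKey maps: index -> code || msg.
	msgBucketKey = []byte("peer-messages")

	// ackIndexBucketKey is nested inside msgBucketKey and maps:
	// messageCode -> {index_1, index_2, ...}. An in-order scan of the
	// parent bucket should skip this key (its cursor value is nil).
	ackIndexBucketKey = []byte("peer-messages-ack-index")
)

// addMessage logs a wire message under the next sequence number and records
// its index under the code that will eventually ACK it (ackedByCode would
// come from the compile-time ACK mapping mentioned above).
func addMessage(db *bolt.DB, msgCode, ackedByCode uint32, payload []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs, err := tx.CreateBucketIfNotExists(msgBucketKey)
		if err != nil {
			return err
		}

		// Big-endian sequence numbers keep an in-order scan in
		// chronological order, so no sorting step is needed on read.
		seq, err := msgs.NextSequence()
		if err != nil {
			return err
		}
		var indexKey [8]byte
		binary.BigEndian.PutUint64(indexKey[:], seq)

		// Prepend the message code so the reader can parse the wire
		// message without trial and error.
		value := make([]byte, 4+len(payload))
		binary.BigEndian.PutUint32(value[:4], msgCode)
		copy(value[4:], payload)
		if err := msgs.Put(indexKey[:], value); err != nil {
			return err
		}

		// Append this index to the record kept for the ACKing code.
		ackIndex, err := msgs.CreateBucketIfNotExists(ackIndexBucketKey)
		if err != nil {
			return err
		}
		var codeKey [4]byte
		binary.BigEndian.PutUint32(codeKey[:], ackedByCode)
		indexes := append([]byte(nil), ackIndex.Get(codeKey[:])...)
		return ackIndex.Put(codeKey[:], append(indexes, indexKey[:]...))
	})
}

// ackMessages deletes every logged message recorded under the code of the
// ACK message that was just received.
func ackMessages(db *bolt.DB, ackCode uint32) error {
	return db.Update(func(tx *bolt.Tx) error {
		msgs := tx.Bucket(msgBucketKey)
		if msgs == nil {
			return nil
		}
		ackIndex := msgs.Bucket(ackIndexBucketKey)
		if ackIndex == nil {
			return nil
		}

		var codeKey [4]byte
		binary.BigEndian.PutUint32(codeKey[:], ackCode)
		indexes := ackIndex.Get(codeKey[:])
		for i := 0; i+8 <= len(indexes); i += 8 {
			if err := msgs.Delete(indexes[i : i+8]); err != nil {
				return err
			}
		}
		return ackIndex.Delete(codeKey[:])
	})
}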

Member

Switching to the schema above eliminates the unnecessary sorting logic and also still retains the message order required to properly perform retransmissions.

lnd_test.go Outdated
@@ -1459,6 +1459,9 @@ func testRevokedCloseRetribution(net *networkHarness, t *harnessTest) {
return bobChannelInfo.Channels[0], nil
}

// Wait for channel to be acquired by router.
time.Sleep(time.Second)
Member

The current set of tests seems to pass relatively reliably without these added sleeps. Rather than adding additional sleeps, with the topology notification code merged in, we can add hooks into the integration testing framework to properly wait for messages to propagate before attempting to dispatch payments through newly opened channels.

@@ -20,50 +20,124 @@ const MessageHeaderSize = 12
// individual limits imposed by messages themselves.
const MaxMessagePayload = 1024 * 1024 * 32 // 32MB

// Code represent the unique identifier of the lnwire command.
type Code uint32
Member

Naming suggestion: MessageCode.

networktest.go Outdated
// CopyLogs copy/dumps Alice and Bob lnd daemon logs with specified period
// of time in temporary directory.
// NOTE: Panics error logs will not be logged.
func (n *networkHarness) CopyLogs(period time.Duration) (func(), <-chan error) {
Member

What's the purpose of this method? It doesn't look to be used anywhere with the PR currently.

Contributor Author

@andrewshvv andrewshvv Mar 13, 2017

I am using it during debugging and I thought it might be useful to others, but essentially this just copies the lnd logs into the temp directory.

retranmission.go Outdated
type MessageStore interface {
// Get returns the sorted set of messages in the order they were
// added in storage.
Get() ([]lnwire.Message, error)
Member

Naming suggestion: GetUnackedMessages.

retranmission.go Outdated
//
// NOTE: The original purpose of creating this interface was in separation
// between the retransmission logic and specifics of storage itself. In case of
// such interface we may create the test storage and register the add,remove,get
Member

Missing spaces after the commas at the end of this sentence.

retranmission.go Outdated
// Is due to the fact of logic of retransmission subsystem, where we
// need remove messages not one by one but in contrary by groups of
// messages.
Remove(codes ...lnwire.Code) error
Member

With the modification to the schema I suggested, I think this method would be changed to something along the lines of an Ack method and instead take a single lnwire.MessageCode.
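For illustration only, the reworked interface might end up looking roughly like this; the names simply follow the naming suggestions elsewhere in this review and aren't a settled API:

// Illustrative sketch of the suggested MessageStore shape, not the actual
// lnd interface.
type MessageStore interface {
	// GetUnackedMessages returns the unacknowledged messages in the
	// order they were added to storage.
	GetUnackedMessages() ([]lnwire.Message, error)

	// Add appends a message to the retransmission log.
	Add(msg lnwire.Message) error

	// Ack removes every logged message that is acknowledged by the given
	// message code, using the compile-time ACK mapping kept in the
	// storage layer.
	Ack(code lnwire.MessageCode) error
}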


// Ack encapsulates the specification logic about which messages should be
// acknowledged by receiving this one.
func (rt *retransmitter) Ack(msg lnwire.Message) error {
Member

With the comment above, this method would be simplified a good bit, as it would perform a single unconditional delete from the database.

The mapping here (what gets deleted on receipt of a message) would be moved into the storage layer, as it would need to be consulted each time a message is written.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 2 times, most recently from 08df3e9 to 31cb691 Compare March 13, 2017 19:46
Member

@Roasbeef Roasbeef left a comment

Nice work on the latest iteration!

This PR is getting pretty close; I'm going to move on to some local testing of the functionality while the latest comments are being addressed.

@@ -30,4 +30,8 @@ var (
ErrNodeAliasNotFound = fmt.Errorf("alias for node not found")

ErrSourceNodeNotSet = fmt.Errorf("source node does not exist")

// ErrPeerMessagesNotFound is returned when no message have been
Member

have been found -> has been found

// MessagesStore represents the boltdb storage for messages inside
// retransmission sub-system.
type MessagesStore struct {
// id is a unique identificator of peer.
Member

id is a unique slice of bytes identifying a peer. This value is typically a peer's identity public key serialized in compressed format

Contributor Author

I rethought the meaning of this field a bit due to the recent changes in the discovery PR. I think it would be better to keep id as something not coupled to a peer at all, but mention that it is usually a compressed pub key.

Member

What changes in the discovery PR? Peers within the network are identified globally by their public keys.

In any case, the comment should be replaced with the first sentence of my suggestion:

id is a unique slice of bytes identifying a peer.

}

return peerBucket.Put(indexBytes, b.Bytes())

Member

Minor nit: there's an extra new line here.

if m.ID != 1 {
t.Fatal("wrong order of message")
}

Member

The test should also assert the deep equality of the message read from disk vs the original message.
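For instance, assuming the test reads the stored messages back into a slice (the variable names here are hypothetical), the assertion could be roughly:

// Hypothetical names: origMsg was written to storage, readMsgs is what the
// store returned on read-back; assumes "reflect" is imported in the test.
if !reflect.DeepEqual(readMsgs[0], origMsg) {
	t.Fatalf("message read from disk doesn't match original: "+
		"got %v, want %v", readMsgs[0], origMsg)
}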

if err != nil && err != ErrPeerMessagesNotFound {
t.Fatalf("can't get the message: %v", err)
} else if len(messages) != 0 {
t.Fatal("wrong lenght of messages")
Member

lenght -> length

lnwire.CmdUpdateFailHTLC,
lnwire.CmdUpdateFufillHTLC,
lnwire.CmdCommitSig,
lnwire.CmdCloseRequest,
Member

If I'm reading the current spec draft correctly, CloseRequest and FundingLocked should be omitted. They're not ACK'd by a RevokeAndAck message.

Contributor Author

Hmm, maybe I am missing something, but from the spec:

funding_locked: acknowledged by update_ messages, commitment_signed, revoke_and_ack or shutdown messages.
shutdown: acknowledged by closing_signed or revoke_and_ack

server.go Outdated
}
p.Disconnect()
Member

Why was the disconnect removed?

Member

If we're unable to create the peer, then it must be removed from the connmgr's set of pending persistent connections, hence the use of Disconnect here.

Honestly, we need to revisit the current connmgr integration for coherency as it was put together rather quickly in order to get the functionality out the door.

Contributor Author

@andrewshvv andrewshvv Mar 15, 2017

In this case p is nil, which causes a panic if an error occurs at this stage. Maybe we should always return an instance of peer from the newPeer function? In that case we could restore the previous logic.

Member

Ahh, nice find! In the future, I'd prefer for fixes like this to be either included in the PR in a distinct commit, or entirely within its own PR. Otherwise, it's easy to miss amidst all the other changes within the PR.

Contributor Author

@andrewshvv andrewshvv Mar 16, 2017

Sure, I will create an additional PR for that.

server.go Outdated
@@ -542,7 +544,7 @@ func (s *server) addPeer(p *peer) {
return
}

// Track the new peer in our indexes so we can quickly look it up either
// Track the new peer in our messages so we can quickly look it up either
Member

Similar comment here about reverting this line diff.

Contributor Author

oops, nice catch!

server.go Outdated
@@ -551,7 +553,7 @@ func (s *server) addPeer(p *peer) {
s.peersByPub[string(p.addr.IdentityKey.SerializeCompressed())] = p
s.peersMtx.Unlock()

// Once the peer has been added to our indexes, send a message to the
// Once the peer has been added to our messages, send a message to the
Member

Similar comment here about reverting this line diff.

@@ -236,7 +235,7 @@ func TestSerializeKidOutput(t *testing.T) {

deserializedKid, err := deserializeKidOutput(&b)
if err != nil {
fmt.Printf(err.Error())
t.Fatalf("can't deserialize kid output: %v", err)
Member

Nice catch! I'd missed this during my initial review.

@andrewshvv andrewshvv force-pushed the retransmission_subsystem branch 2 times, most recently from c37d1ba to 19239c7 Compare March 15, 2017 09:04
@coveralls

Coverage Status

Coverage increased (+0.09%) to 67.801% when pulling 19239c7541bdc368fe75652bd5b75b3b6dd7199c on AndrewSamokhvalov:retransmission_subsystem into d723aad on lightningnetwork:master.

In this commit lnwire message header encode/decode tests
were added. Without them, a newcomer programmer may change the
type inside the message header and spend hours debugging the
integration tests trying to understand why their node can't
start and interact properly.
Issue: lightningnetwork#137

In this commit the retransmission subsystem and boltdb message storage were
added. The retransmission subsystem is described in detail in the BOLT #2
(Message Retransmission) section. This subsystem keeps records
of all messages that were sent to the other peer, waits for the ACK
message to be received from the other side, and after that removes all
acked messages from the storage.
Issue: lightningnetwork#137

In this commit the retransmission subsystem was included in lnd;
now upon peer reconnection we fetch all messages from the message storage
that were not acked and send them again to the remote side.
@andrewshvv
Contributor Author

I have added a PR adding the stable gometalinter, because otherwise Travis will fail.

@coveralls

coveralls commented Mar 16, 2017

Coverage Status

Coverage increased (+0.09%) to 67.801% when pulling f8b2624 on AndrewSamokhvalov:retransmission_subsystem into d723aad on lightningnetwork:master.

@Roasbeef
Member

Roasbeef commented Mar 17, 2017

I've started to test this PR locally and noticed that the way it currently goes about implementing the retransmission is missing a key feature.

The description in the original issue stated that the retransmission sub-system should actually sit between the server and the peer. In the design as originally described, if the peer isn't currently online (sendToPeer can't locate the target peer), then the message would be queued on disk to be retransmitted once the peer comes online again, as it's "outside" the interaction of the peer itself. With this architecture, sub-systems never directly call peer.queueMsg; instead they are passed, directly or indirectly, the server's sendToPeer method, which transparently handles committing the relevant messages to disk if the peer is offline.

Such behavior would allow sub-systems like the fundingManager to opaquely gain access to a reliable messaging stream to the remote peer, regardless of whether the peer is/was online or not.

As an example, let's say we're nearing the completion of a funding workflow. The ultimate block finalizing the channel arrives, the fundingManager notifies the relevant systems, then goes to send the FundingLocked message to the channel counterparty so the channel itself can start to be updated. However, let's say that the channel peer went offline right before the final block was mined. As implemented in this PR, the call to sendToPeer will fail (as the peer isn't online) and the FundingLocked message will never be sent.

This PR is 80% of the way there, functionality wise. To get that last 20%, the following behavior needs to be implemented:

  • When the server is handling sendToPeer, if the target peer isn't online, then the server should write directly to the MessageStore of the target peer.
  • Any routing/discovery messages should be omitted from the behavior above.
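A rough sketch of that fallback path, with hypothetical helper names (messageStoreForPeer, isRoutingMessage) standing in for whatever the final implementation uses; the real sendToPeer signature and server fields may differ:

// Sketch only: helper names are hypothetical and the real signature/fields
// may differ from what's shown here.
func (s *server) sendToPeer(target *btcec.PublicKey, msgs ...lnwire.Message) error {
	pubStr := string(target.SerializeCompressed())

	s.peersMtx.Lock()
	p, connected := s.peersByPub[pubStr]
	s.peersMtx.Unlock()

	if connected {
		// Peer is online: hand the messages to it directly.
		for _, msg := range msgs {
			p.queueMsg(msg, nil)
		}
		return nil
	}

	// Peer is offline: commit the messages to its MessageStore so the
	// retransmitter can deliver them on the next reconnection, skipping
	// routing/discovery traffic which shouldn't be queued.
	store := s.messageStoreForPeer(pubStr) // hypothetical helper
	for _, msg := range msgs {
		if isRoutingMessage(msg) { // hypothetical predicate
			continue
		}
		if err := store.Add(msg); err != nil {
			return err
		}
	}
	return nil
}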

@Roasbeef
Member

Roasbeef commented Mar 17, 2017

Here's an alternative to what's described above:

  • Rather than the fundingManager relying on the existence of a persistent messaging queue, it could instead handle reliable completion of the funding workflow itself.
  • In this case, the fundingManager would gain some persistent state which records whether the final step in the state machine has been completed or not.
  • The final step is reliably sending the FundingLocked message to complete a funding workflow.
  • The fundingManager maintains this state for all funding workflows which enter the final, waiting-for-channel-confirmation state.
  • Upon startup, for all funding workflows in this final limbo state, a channel barrier for the ChannelPoint is created.
  • The fundingManager registers with the server for a notification once the peer is online. Upon dispatch of the notification, the FundingLocked message is sent (a rough sketch of this startup path follows below).
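A minimal sketch of that startup path, assuming hypothetical names (pendingFundingLocked, notifyWhenOnline, markFundingLockedSent, and the rest) for whatever the fundingManager and server eventually expose:

// Sketch only: every identifier below is a hypothetical stand-in, not the
// actual lnd fundingManager API.
func (f *fundingManager) resumeFundingLocked() {
	// pendingFundingLocked would be loaded from the manager's own
	// persistent state: workflows whose final step never completed.
	for _, c := range f.pendingFundingLocked {
		c := c

		// Re-create the channel barrier for this ChannelPoint so that
		// updates are held back until the channel is fully operational.
		f.newChanBarriers[c.chanPoint] = make(chan struct{})

		// Wait for the server to signal that the peer is online, then
		// send FundingLocked and persist that the final step is done.
		online := f.server.notifyWhenOnline(c.peerPub)
		go func() {
			<-online
			f.server.sendToPeer(c.peerPub, c.fundingLocked)
			f.markFundingLockedSent(c.chanPoint)
		}()
	}
}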

@Roasbeef
Member

Decided that what I've described w.r.t the fundingManager is a special case for sub-systems within the codebase atm. I'll continue testing this PR and will implement the revision of the functionality I described above myself.

"to the peer(%v)", len(messages), p)

for _, message := range messages {
// Sending over sendToPeer will cause block because of
Member

Can you insert a logging message here that just logs the MessageCode itself? Thanks!

func (rt *retransmitter) Ack(msg lnwire.Message) error {
switch msg.Command() {

case lnwire.CmdSingleFundingResponse:
Member

For now, all funding messages should be omitted from retransmission other than the FundingLocked message. While testing locally I just hit a bug that causes the funding manager to deadlock if lnd is restarted in the middle of a channel opening that requires more than one confirmation.

Atm, the spec is incorrect. No funding messages should be retransmitted at all until the point at which either side is committed to a funding transaction.

)
case lnwire.CmdCloseComplete:
return rt.remove(
lnwire.CmdCloseRequest,
Member

Atm CloseComplete is never sent within the daemon. Therefore, this entry should be removed. Otherwise, the node will keep sending the same CloseRequest message indefinitely upon each restart. The responding node will simply ignore the message as the channel has already been closed.

lnwire.CmdCloseRequest:
return rt.remove(
lnwire.CmdFundingLocked,
lnwire.CmdRevokeAndAck,
Member

@Roasbeef Roasbeef Mar 17, 2017

For now, all instances of RevokeAndAck should be omitted from retransmission. As it is now, because we still use an "initial revocation window" of 1, peer restarts will cause lnd to send the initial RevokeAndAck twice with the same revocation values. This'll cause the channel to fail down the line as a state transition will re-use the same preimage rather than going to the next leaf node in the tree.

20:52:08 2017-03-16 [INF] PEER: retransmission subsystem resends 1 messages to the peer(020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019)
20:52:08 2017-03-16 [INF] CRTR: Synchronizing channel graph with 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b
20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc42053f2d0)({
 ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
 Revocation: ([32]uint8) (len=32 cap=32) {
  00000000  6a 00 62 86 31 55 b1 4d  8f 20 e6 53 f2 8c f7 78  |j.b.1U.M. .S...x|
  00000010  1c b2 72 d3 07 86 57 2d  5d bc 55 4f b4 a4 c8 a1  |..r...W-].UO....|
 },
 NextRevocationKey: (*btcec.PublicKey)(0xc420318160)({
  Curve: (elliptic.Curve) <nil>,
  X: (*big.Int)(0xc420318180)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
  Y: (*big.Int)(0xc4203181a0)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
 }),
 NextRevocationHash: ([32]uint8) (len=32 cap=32) {
  00000000  46 b5 6c 1c 0e 0d 50 d4  a1 3c 97 c6 8c 8e 5d 6e  |F.l...P..<....]n|
  00000010  15 5b 62 f1 de 12 ec af  4a 11 a2 21 b2 4e a1 89  |.[b.....J..!.N..|
 }
})

......

20:52:08 2017-03-16 [TRC] PEER: writeMessage to 020dbf0df13b994e562c9ac52098b86afd2b2099463370fc78124ab3c88ef87a6b@127.0.0.1:10019: (*lnwire.RevokeAndAck)(0xc420559f80)({
 ChannelPoint: (wire.OutPoint) 397d9ba617b1b2d81a8249e3de3749f4dd2efd792ce34b758ed97326b68bf0b9:0,
 Revocation: ([32]uint8) (len=32 cap=32) {
  00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
  00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
 },
 NextRevocationKey: (*btcec.PublicKey)(0xc42053bec0)({
  Curve: (elliptic.Curve) <nil>,
  X: (*big.Int)(0xc42053be40)(105223291483128089908537415774962877536378315872169081183677829390620736225739),
  Y: (*big.Int)(0xc42053be60)(5542066621571236556856056711647061449395836182811543325992215950193357130663)
 }),
 NextRevocationHash: ([32]uint8) (len=32 cap=32) {
  00000000  46 b5 6c 1c 0e 0d 50 d4  a1 3c 97 c6 8c 8e 5d 6e  |F.l...P..<....]n|
  00000010  15 5b 62 f1 de 12 ec af  4a 11 a2 21 b2 4e a1 89  |.[b.....J..!.N..|
 }
})

Member

@Roasbeef Roasbeef Mar 17, 2017

In this state, the state machines of both channels will actually enter a negative feedback cycle, continually failing as the wrong revocation message is being sent over and over again. As a result, the channels are no longer usable after a single restart.

// and may need to be re-established from time to time and reconnection
// introduces doubt as to what has been received such logic is needed to be sure
// that peers are in consistent state in terms of message communication.
type retransmitter struct {
Contributor

just noticed the filename has a typo, should probably be retransmission.go

@Roasbeef
Member

Closing this as it has been replaced by #231. We might possibly integrate some sections of this into the project at a later point though.
