
query: end completed queries in sendTransaction before timeout #240

Merged: 1 commit, Feb 10, 2022

Conversation

@chappjc (Contributor) commented Dec 31, 2021

(*ChainService).sendTransaction always times out even if each peer does respond. This is because after receiving a getdata message in response to the outgoing inv, the checkResponse closure provided to queryAllPeers neglects to close the peerQuit channel. As a result, sendTransaction always blocks for the full configured BroadcastTimeout.

This change adds the missing close(peerQuit) when a matching response (either a getdata or reject) is received. I have tested this fix in our app, and sends no longer always time out when all peers respond with a getdata.
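
For reference, a minimal sketch of the shape of the fix; the checkResponse signature and the rejections bookkeeping here are approximations of the existing query.go code, not a verbatim excerpt:

checkResponse := func(sp *ServerPeer, resp wire.Message,
	quit chan<- struct{}, peerQuit chan<- struct{}) {

	switch m := resp.(type) {
	// The peer asked for the advertised transaction, which is its
	// reply to our inv, so stop waiting on this peer.
	case *wire.MsgGetData:
		for _, inv := range m.InvList {
			if inv.Hash == txHash {
				numReplied++
				close(peerQuit)
			}
		}

	// The peer rejected the transaction, which is also a final
	// reply from this peer.
	case *wire.MsgReject:
		if m.Hash == txHash {
			rejections[m.Reason]++
			close(peerQuit)
		}
	}
}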

This change also makes a note in the sendTransaction function that the numReplied == len(rejections) check at the end is currently impossible to hit, since numReplied would also need to be incremented when rejections are recorded. I did not fix this, however, since it would potentially start returning non-nil errors for harmless rejections such as "already-in-mempool", "already have transaction", and "transaction already exists" (pushtx.Mempool and pushtx.Confirmed). There are existing TODO comments from @wpaulino (wilmer) to this effect. I did add a debug log, though, to help characterize which rejection messages are actually received in the wild.

Finally, this adds a comment in (*blockManager).fetchFilterFromAllPeers, which also appears to be lacking a close(peerQuit) in the closure it passes to queryAllPeers. I added a debug log there instead. I'm pretty certain the fix is needed here too, but this code is only called from (*blockManager).detectBadPeers, and I haven't hit that code path in my tests yet.

@arshbot left a comment

Thanks for the bugfix @chappjc!

Just did a quick review - is this PR in a reviewable state? There is commented-out code, and notes are jotted about.

@arshbot left a comment

LGTM!

@ellemouton (Member) left a comment

Nice catch @chappjc! 🚀
And thanks for also pointing out the other issues in sendTransaction.

I think that instead of adding comments to point those out and then removing them, we should either fix them all in this PR, or squash this PR down to fixing just the channel-close issue and open a new issue to fix the rest of the bugs. It seems from the overall code change here that you would be happy with option 2?

query.go Outdated
// To ever get (numReplied == len(rejections)), it is necessary
// to also increment numReplied. However, this would be a
// breaking change since common innocuous errors such as Mempool
// and Confirmed do not presently cause a non-nil error return.
Member

I think we should account for those cases then. Would you mind adding that? Something like this:


func (c BroadcastErrorCode) IsCritical() bool {
	return c != Mempool && c != Confirmed
}

Then I think we can add the numReplied++ here, and below we tally the rejections that are not critical, so that we can keep the if numReplied == numCriticalErrors type of logic and only return an error in that case.
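
A rough sketch of that tallying, assuming rejections maps each rejection to the number of peers that sent it (the exact types in query.go may differ):

numCriticalErrors := 0
for rejection, count := range rejections {
	if rejection.Code.IsCritical() {
		numCriticalErrors += count
	}
}

// Only fail the broadcast if every peer that replied rejected the
// transaction for a critical reason.
if numReplied > 0 && numReplied == numCriticalErrors {
	return fmt.Errorf("transaction rejected by all peers: %v",
		rejections)
}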

Member

OK, I see now that the overall effect of the PR is just to close the quit channel and not to include these comments.

So then, rather: squash all the commits into one that just does that. I will open an issue to fix the other things you have pointed out. Nice catch!

@chappjc (Contributor, Author)

Right sorry for the confusion there. I kept the commits "unsquashed" so you could see the issues illustrated. Will squash it down to one commit.

@@ -1118,15 +1124,14 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
// it so we'll return the most rejected error between all of our peers.
//
// TODO(wilmer): This might be too naive, some rejections are more
// critical than others.
// critical than others. e.g. pushtx.Mempool and pushtx.Confirmed are OK.
//
// TODO(wilmer): This does not cover the case where a peer also rejected
// our transaction but didn't send the response within our given timeout
// and certain other cases. Due to this, we should probably decide on a
// threshold of rejections instead.
if numReplied == len(rejections) {
Member

This check actually also doesn't work because rejections is a map, so the check would only work if each rejection error received was unique, since this len returns the number of keys in the map.
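
For illustration, assuming the map is keyed by the rejection error with per-peer counts as values:

// Five peers all rejecting with the same error collapse into one key,
// so len(rejections) stays at 1 while numReplied could be 5.
rejections := map[string]int{
	"already have transaction": 5,
}
fmt.Println(len(rejections)) // prints 1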

@chappjc (Contributor, Author) commented Jan 18, 2022

Good point. They'd all have to be unique errors, so not that helpful. Oh well.
Hopefully we see some rejected transactions logged to get an idea of when and if it's safe to throw an error back and remove the tx from the txn manager. It seems like almost all the time there's just no response if the peer already has it or if it's already mined.

@chappjc force-pushed the send-success branch 2 times, most recently from cbef3a3 to e3f0b4e on January 18, 2022 13:35

@chappjc force-pushed the send-success branch 2 times, most recently from 7611d10 to c61c599 on January 18, 2022 15:43
@chappjc (Contributor, Author) commented Jan 18, 2022

The assumption in the sync tests was indeed that when SendTransaction returned without error, the miner node would be ready to include that transaction in a block. That worked when SendTransaction always timed out, because the timeout gave the peer's out handler time to send the transaction and the node time to accept it (although it was never guaranteed). I hope this update to the test is acceptable. The transient failure without the test update is actually an indication that the fix in this PR is working as intended.

sync_test.go Outdated
@@ -399,6 +399,23 @@ func testRescan(harness *neutrinoHarness, t *testing.T) {
}
}

func waitTx(t *testing.T, node *rpcclient.Client, hash chainhash.Hash) {
Member

cool, this lgtm 👍

Could also move this into an embedded function in the test itself, since it is only used there. We could even copy the pattern of wait.Predicate(...) from the lnd repo and write it like this (up to you, but it's nice that it waits a bit before calling the func the first time; I do like that your way uses time.NewTimer instead of time.After 👍):

waitTx := func(node *rpcclient.Client, hash chainhash.Hash) error {
	exitTimer := time.After(5 * time.Second)
	for {
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer:
			return fmt.Errorf("tx not found after time out")
		default:
		}

		if _, err := node.GetRawTransaction(&hash); err != nil {
			continue
		}

		return nil
	}
}

@chappjc (Contributor, Author) commented Jan 31, 2022

Made it a closure, and incorporated the initial delay as you indicated was nice, but left it as a test helper (fatal in the closure to avoid cluttering the test case flow with err checking boilerplate).
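
A hypothetical shape of the helper as described (names and timings assumed; not the exact code in sync_test.go):

waitTx := func(node *rpcclient.Client, hash chainhash.Hash) {
	t.Helper()

	exitTimer := time.NewTimer(5 * time.Second)
	defer exitTimer.Stop()
	for {
		// Initial and between-poll delay, per the suggestion above.
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer.C:
			t.Fatalf("tx %v not found before timeout", hash)
		default:
		}

		// Done as soon as the node knows about the transaction.
		if _, err := node.GetRawTransaction(&hash); err == nil {
			return
		}
	}
}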

@ellemouton (Member) left a comment

Nice catch on why that test was failing! This LGTM. Left a nit & an optional suggestion. I will rebase & open my pr with the other small fixes once this is in 👍

query.go Outdated
Comment on lines 1102 to 1107
log.Debugf("Transaction %v rejected by peer %v: code = %v, reason = %q",
txHash, sp.Addr(), broadcastErr.Code, broadcastErr.Reason)

close(peerQuit)
Member

nit: pls wrap at 80 chars

@chappjc force-pushed the send-success branch 2 times, most recently from 098e859 to c7fc088 on January 31, 2022 16:42
@lightninglabs-deploy

@chappjc, remember to re-request review from reviewers when ready

query.go Outdated
Comment on lines 1102 to 1103
log.Debugf("Transaction %v rejected by peer %v: code = %v, "+
"reason = %q", txHash, sp.Addr(), broadcastErr.Code,
Member

pls wrap at 80 chars

@chappjc (Contributor, Author)

oof these big tabs are killing me 😉

@ellemouton (Member) left a comment

LGTM pending 1 last line-wrap nit 🔥

@Roasbeef (Member) commented Feb 9, 2022

Needs a rebase!

@Roasbeef (Member) left a comment

LGTM 🎋

(*ChainService).sendTransaction always timed out even if each peer did
respond. This is because after receiving a getdata message in response
to the outgoing inv, the checkResponse closure provided to
queryAllPeers neglected to close the peerQuit channel. As a result,
sendTransaction always blocks for the full configured BroadcastTimeout.

This change adds the missing close(peerQuit) when a matching response
(either a getdata or reject) is received.

Note that, as before this change, a nil error from sendTransaction
does not guarantee that a remote peer actually received the MsgTx that
was queued, since the outgoing tx message is handled asynchronously.
To account for the removed delay, the TestNeutrinoSync unit test is
updated with a waitTx helper to ensure the mining node has the sent
transaction prior to requesting a new block.
@@ -1078,6 +1078,8 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
)

numReplied++

close(peerQuit)
Member

I think this is wrong. Why would we close the connection to the peer if we just got an INV message? That means they want to know about the TX, so we send it to them. Wouldn't we also want to wait for their response before quitting?

Member

Closing the peerQuit channel is just used to indicate that the peer has responded, so that we free up the resource and can use the peer connection for other calls. Closing this channel doesn't close the connection to the peer.
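
A simplified illustration of that (not the upstream code): the per-peer wait inside queryAllPeers only selects on peerQuit to stop waiting on this peer for the current query; the peer connection itself is left alone.

// Returns once this peer has answered (peerQuit closed), the whole
// query is done (quit closed), or the query deadline passes. Nothing
// here touches the peer's connection.
select {
case <-peerQuit:
case <-quit:
case <-timeout:
}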

Member

Ah okay. But are we guaranteed to get further responses from that peer? E.g. if a reject message from that peer comes in later, do we get it in any case, or does it depend on the timing (e.g. is there a chance we might not get it because we already quit the peer here)?

Member

I just tried this out in combination with the lnd test TestLightningWallet/btcwallet/neutrino:publish_transaction. If I remove the close(peerQuit) in the INV message case, I correctly get the reject for a double-spent transaction. If I leave it in, the reject is never received.
There's also a bug in the result interpretation, so I'm going to open a PR anyway. We can discuss this further in that new PR.
