
query: end completed queries in sendTransaction before timeout #240

Merged: 1 commit, Feb 10, 2022

Conversation

@chappjc (Contributor) commented Dec 31, 2021

(*ChainService).sendTransaction always times out even if each peer does respond. This is because after receiving a getdata message in response to the outgoing inv, the checkResponse closure provided to queryAllPeers neglects to close the peerQuit channel. As a result, sendTransaction always blocks for the full configured BroadcastTimeout.

This change adds the missing close(peerQuit) when a matching response (either a getdata or reject) is received. I have tested this fix in our app, and sends no longer always time out when all peers respond with a getdata.
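
For reference, a minimal sketch of the shape of the fix; the checkResponse signature and the rejections bookkeeping here are approximations of the existing query.go code, not a verbatim excerpt:

checkResponse := func(sp *ServerPeer, resp wire.Message,
	quit chan<- struct{}, peerQuit chan<- struct{}) {

	switch m := resp.(type) {
	// The peer asked for the advertised transaction, which is its
	// reply to our inv, so stop waiting on this peer.
	case *wire.MsgGetData:
		for _, inv := range m.InvList {
			if inv.Hash == txHash {
				numReplied++
				close(peerQuit)
			}
		}

	// The peer rejected the transaction, which is also a final
	// reply from this peer.
	case *wire.MsgReject:
		if m.Hash == txHash {
			rejections[m.Reason]++
			close(peerQuit)
		}
	}
}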

This change also makes a note in the sendTransaction function that the numReplied == len(rejections) check at the end is currently impossible to hit, since numReplied would also need to be incremented when rejections are recorded. I did not fix this, however, since it would potentially start returning non-nil errors for harmless rejections such as "already-in-mempool", "already have transaction", and "transaction already exists" (pushtx.Mempool and pushtx.Confirmed). There are existing TODO comments from @wpaulino (wilmer) to this effect. I did add a debug log, though, to help characterize which rejection messages are actually received in the wild.

Finally, this adds a comment in (*blockManager).fetchFilterFromAllPeers, which also appears to be lacking a close(peerQuit) in the closure it passes to queryAllPeers. I added a debug log there instead. I'm pretty certain the fix is needed here too, but this code is only called from (*blockManager).detectBadPeers, and I haven't hit that code path in my tests yet.

@arshbot left a comment

Thanks for the bugfix @chappjc!

Just did a quick review - is this PR in a reviewable state? There is commented-out code, and notes are jotted about.

@arshbot left a comment

LGTM!

@ellemouton (Member) left a comment

Nice catch @chappjc! 🚀
And thanks for also pointing out the other issues in sendTransaction.

I think that instead of adding comments to point those out and then removing them, we should either fix them all in this PR, or squash this PR down to fixing just the channel-close issue and open a new issue to fix the rest of the bugs. It seems from the overall code change here that you would be happy with option 2?

query.go Outdated
// To ever get (numReplied == len(rejections)), it is necessary
// to also increment numReplied. However, this would be a
// breaking change since common innocuous errors such as Mempool
// and Confirmed do not presently cause a non-nil error return.
Member

I think we should account for those cases then. Would you mind adding that? Something like this:


func (c BroadcastErrorCode) IsCritical() bool {
	return c != Mempool && c != Confirmed
}

Then I think we can add the numReplied++ here, and below we tally the rejections that are not critical, so that we can keep the if numReplied == numCriticalErrors type of logic and only return an error in that case.
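
A rough sketch of that tallying, assuming rejections maps each rejection to the number of peers that sent it (the exact types in query.go may differ):

numCriticalErrors := 0
for rejection, count := range rejections {
	if rejection.Code.IsCritical() {
		numCriticalErrors += count
	}
}

// Only fail the broadcast if every peer that replied rejected the
// transaction for a critical reason.
if numReplied > 0 && numReplied == numCriticalErrors {
	return fmt.Errorf("transaction rejected by all peers: %v",
		rejections)
}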

Member

OK, I see now that the overall effect of the PR is just to close the quit channel and not to include these comments.

So then, rather: squash all the commits into one that just does that. I will open an issue to fix the other things you have pointed out. Nice catch!

@chappjc (Contributor, Author)

Right sorry for the confusion there. I kept the commits "unsquashed" so you could see the issues illustrated. Will squash it down to one commit.

@@ -1118,15 +1124,14 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
// it so we'll return the most rejected error between all of our peers.
//
// TODO(wilmer): This might be too naive, some rejections are more
// critical than others.
// critical than others. e.g. pushtx.Mempool and pushtx.Confirmed are OK.
//
// TODO(wilmer): This does not cover the case where a peer also rejected
// our transaction but didn't send the response within our given timeout
// and certain other cases. Due to this, we should probably decide on a
// threshold of rejections instead.
if numReplied == len(rejections) {
Member

This check actually also doesn't work because rejections is a map, so the check would only work if each rejection error received was unique, since this len returns the number of keys in the map.
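
For illustration, assuming the map is keyed by the rejection error with per-peer counts as values:

// Five peers all rejecting with the same error collapse into one key,
// so len(rejections) stays at 1 while numReplied could be 5.
rejections := map[string]int{
	"already have transaction": 5,
}
fmt.Println(len(rejections)) // prints 1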

@chappjc (Contributor, Author) commented Jan 18, 2022

Good point. They'd all have to be unique errors, so not that helpful. Oh well.
Hopefully we see some rejected transactions logged to get an idea of when and if it's safe to throw an error back and remove the tx from the txn manager. It seems like almost all the time there's just no response if the peer already has it or if it's already mined.

@chappjc force-pushed the send-success branch 2 times, most recently from cbef3a3 to e3f0b4e on January 18, 2022 13:35

@chappjc force-pushed the send-success branch 2 times, most recently from 7611d10 to c61c599 on January 18, 2022 15:43
@chappjc (Contributor, Author) commented Jan 18, 2022

The assumption in the sync tests was indeed that when SendTransaction returned without error, the miner node would be ready to include that transaction in a block. That worked when SendTransaction always timed out, because the timeout gave the peer's out handler time to send the transaction and the node time to accept it (although it was never guaranteed). I hope this update to the test is acceptable. The transient failure without the test update is actually an indication that the fix in this PR is working as intended.

sync_test.go Outdated
@@ -399,6 +399,23 @@ func testRescan(harness *neutrinoHarness, t *testing.T) {
}
}

func waitTx(t *testing.T, node *rpcclient.Client, hash chainhash.Hash) {
Member

cool, this lgtm 👍

Could also move this into an embedded function in the test itself, since it is only used there. We could even copy the pattern of wait.Predicate(...) from the lnd repo and write it like this (up to you, but it's nice that it waits a bit before calling the func the first time; I do like that your way uses time.NewTimer instead of time.After 👍):

waitTx := func(node *rpcclient.Client, hash chainhash.Hash) error {
	exitTimer := time.After(5 * time.Second)
	for {
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer:
			return fmt.Errorf("tx not found after time out")
		default:
		}

		if _, err := node.GetRawTransaction(&hash); err != nil {
			continue
		}

		return nil
	}
}

@chappjc (Contributor, Author) commented Jan 31, 2022

Made it a closure, and incorporated the initial delay as you indicated was nice, but left it as a test helper (fatal in the closure to avoid cluttering the test case flow with err checking boilerplate).
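
A hypothetical shape of the helper as described (names and timings assumed; not the exact code in sync_test.go):

waitTx := func(node *rpcclient.Client, hash chainhash.Hash) {
	t.Helper()

	exitTimer := time.NewTimer(5 * time.Second)
	defer exitTimer.Stop()
	for {
		// Initial and between-poll delay, per the suggestion above.
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer.C:
			t.Fatalf("tx %v not found before timeout", hash)
		default:
		}

		// Done as soon as the node knows about the transaction.
		if _, err := node.GetRawTransaction(&hash); err == nil {
			return
		}
	}
}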

@ellemouton (Member) left a comment

Nice catch on why that test was failing! This LGTM. Left a nit & an optional suggestion. I will rebase & open my pr with the other small fixes once this is in 👍

query.go Outdated
Comment on lines 1102 to 1107
log.Debugf("Transaction %v rejected by peer %v: code = %v, reason = %q",
txHash, sp.Addr(), broadcastErr.Code, broadcastErr.Reason)

close(peerQuit)
Member

nit: pls wrap at 80 chars

@chappjc force-pushed the send-success branch 2 times, most recently from 098e859 to c7fc088 on January 31, 2022 16:42
@lightninglabs-deploy

@chappjc, remember to re-request review from reviewers when ready

query.go Outdated
Comment on lines 1102 to 1103
log.Debugf("Transaction %v rejected by peer %v: code = %v, "+
"reason = %q", txHash, sp.Addr(), broadcastErr.Code,
Member

pls wrap at 80 chars

@chappjc (Contributor, Author)

oof these big tabs are killing me 😉

@ellemouton (Member) left a comment

LGTM pending 1 last line-wrap nit 🔥

@Roasbeef (Member) commented Feb 9, 2022

Needs a rebase!

@Roasbeef (Member) left a comment

LGTM 🎋

(*ChainService).sendTransaction always timed out even if each peer did
respond. This is because after receiving a getdata message in response
to the outgoing inv, the checkResponse closure provided to
queryAllPeers neglected to close the peerQuit channel. As a result,
sendTransaction always blocks for the full configured BroadcastTimeout.

This change adds the missing close(peerQuit) when a matching response
(either a getdata or reject) is received.

Note that, as before this change, a nil error from sendTransaction
does not guarantee that a remote peer actually received the MsgTx that
was queued, since the outgoing tx message is handled asynchronously.
To account for the removed delay, the TestNeutrinoSync unit test is
updated with a waitTx helper to ensure the mining node has the sent
transaction prior to requesting a new block.
@@ -1078,6 +1078,8 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
)

numReplied++

close(peerQuit)
Member

I think this is wrong. Why would we close the connection to the peer if we just got an INV message? That means they want to know about the TX, so we send it to them. Wouldn't we also want to wait for their response before quitting?

Member

Closing the peerQuit channel is just used to indicate that the peer has responded, so that we free up the resource and can use the peer connection for other calls. Closing this channel doesn't close the connection to the peer.
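
A simplified illustration of that (not the upstream code): the per-peer wait inside queryAllPeers only selects on peerQuit to stop waiting on this peer for the current query; the peer connection itself is left alone.

// Returns once this peer has answered (peerQuit closed), the whole
// query is done (quit closed), or the query deadline passes. Nothing
// here touches the peer's connection.
select {
case <-peerQuit:
case <-quit:
case <-timeout:
}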

Member

Ah okay. But are we guaranteed to get further responses from that peer? E.g. if a reject message from that peer comes in later, do we get it in any case, or does it depend on the timing (e.g. is there a chance we might not get it because we already quit the peer here)?

Member

I just tried this out in combination with the lnd test TestLightningWallet/btcwallet/neutrino:publish_transaction. If I remove the close(peerQuit) in the INV message case, I correctly get the reject for a double-spent transaction. If I leave it in, the reject is never received.
There's also a bug in the result interpretation, so I'm going to open a PR anyway. We can discuss this further in that new PR.
