query: end completed queries in sendTransaction before timeout #240
Conversation
Thanks for the bugfix @chappjc!
Just did a quick review - is this PR in a reviewable state? There is commented-out code and notes jotted about.
LGTM!
Nice catch @chappjc! 🚀
And thanks for also pointing out the other issues in `sendTransaction`.
I think, instead of adding comments to point those out and then removing them, either we should fix them all in this PR or we should just squash this PR down to fixing just the channel close issue and then open up a new issue to fix the rest of the bugs. Seems from the overall code change here that you would be happy with option 2?
query.go
Outdated
```go
// To ever get (numReplied == len(rejections)), it is necessary
// to also increment numReplied. However, this would be a
// breaking change since common innocuous errors such as Mempool
// and Confirmed do not presently cause a non-nil error return.
```
I think we should account for those cases then. Would you mind adding that? Something like this:
```go
func (c BroadcastErrorCode) IsCritical() bool {
	return c != Mempool && c != Confirmed
}
```
Then I think we can add the `numReplied++` here, and then below tally the critical rejections, so we can keep the `if numReplied == numCriticalErrors`-type logic and only return an error in that case.
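For illustration, here is a rough, self-contained sketch of that direction. The `BroadcastErrorCode` type, its constants, and the shape of the `rejections` map below are placeholders standing in for the real `pushtx` package and `sendTransaction` state, not the actual neutrino code:

```go
package main

import "fmt"

// BroadcastErrorCode stands in for pushtx's rejection codes; the constants
// here are illustrative placeholders only.
type BroadcastErrorCode int

const (
	Mempool BroadcastErrorCode = iota
	Confirmed
	Invalid
)

// IsCritical reports whether a rejection should count toward a failed
// broadcast. Mempool and Confirmed are treated as harmless, as suggested
// above.
func (c BroadcastErrorCode) IsCritical() bool {
	return c != Mempool && c != Confirmed
}

func main() {
	// Assume three peers replied: two said the tx is already in their
	// mempool, one rejected it as invalid.
	numReplied := 3
	rejections := map[BroadcastErrorCode]int{
		Mempool: 2,
		Invalid: 1,
	}

	// Tally only the critical rejections across all replies.
	numCriticalErrors := 0
	for code, count := range rejections {
		if code.IsCritical() {
			numCriticalErrors += count
		}
	}

	// Only treat the broadcast as failed when every reply was a
	// critical rejection.
	if numReplied > 0 && numReplied == numCriticalErrors {
		fmt.Println("broadcast failed: all peers rejected the tx")
		return
	}
	fmt.Println("broadcast treated as successful")
}
```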
ok i see now that the overall effect of the PR is just to close the quit channel and to not include these comments.
So then rather: squash all the commits into 1 that just does that. I will open an issue to fix the other things you have pointed out. Nice catch!
Right sorry for the confusion there. I kept the commits "unsquashed" so you could see the issues illustrated. Will squash it down to one commit.
```
@@ -1118,15 +1124,14 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
	// it so we'll return the most rejected error between all of our peers.
	//
	// TODO(wilmer): This might be too naive, some rejections are more
	// critical than others.
	// critical than others. e.g. pushtx.Mempool and pushtx.Confirmed are OK.
	//
	// TODO(wilmer): This does not cover the case where a peer also rejected
	// our transaction but didn't send the response within our given timeout
	// and certain other cases. Due to this, we should probably decide on a
	// threshold of rejections instead.
	if numReplied == len(rejections) {
```
this check actually also doesn't work because `rejections` is a map... so it would only work if each rejection error received was unique, since this `len` returns the number of keys in the map
Good point. They'd all have to be unique errors, not so helpful. Oh well.
Hopefully we see some rejected transactions logged to get an idea of when and if it's safe to throw an error back and remove the tx from the txn manager. Seems like almost all the time it's just no response if the peer already has it or if it is already mined.
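A tiny, self-contained illustration of that point; the map's key and value types here are made up for the example and are not the actual types tracked by `sendTransaction`:

```go
package main

import "fmt"

func main() {
	// Three peers replied, but two of them sent the same rejection, so
	// the map only ends up with two keys.
	numReplied := 3
	rejections := map[string]int{
		"already-in-mempool": 2,
		"insufficient-fee":   1,
	}

	// len(rejections) counts distinct rejection reasons, not replies,
	// so this comparison only holds when every rejection is unique.
	fmt.Println(numReplied == len(rejections)) // prints false
}
```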
@chappjc force-pushed from cbef3a3 to e3f0b4e.
@chappjc force-pushed from 7611d10 to c61c599.
The assumption in the sync tests was indeed that when …
sync_test.go
Outdated
```
@@ -399,6 +399,23 @@ func testRescan(harness *neutrinoHarness, t *testing.T) {
	}
}

func waitTx(t *testing.T, node *rpcclient.Client, hash chainhash.Hash) {
```
cool, this lgtm 👍

Could also move this into an embedded function in the test itself since it is only used there. And we could even copy the pattern of `wait.Predicate(...)` from the lnd repo and write it like this (up to you, but it's nice that it waits a bit before calling the func the first time; I do like that your way uses `time.NewTimer` instead of `time.After` 👍):
```go
waitTx := func(node *rpcclient.Client, hash chainhash.Hash) error {
	exitTimer := time.After(5 * time.Second)
	for {
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer:
			return fmt.Errorf("tx not found after time out")
		default:
		}

		if _, err := node.GetRawTransaction(&hash); err != nil {
			continue
		}

		return nil
	}
}
```
Made it a closure, and incorporated the initial delay as you indicated was nice, but left it as a test helper (fatal in the closure to avoid cluttering the test case flow with err checking boilerplate).
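For reference, a sketch of what that helper might look like with the initial delay folded in and errors turned into test failures. The timings come from the snippet above; `t.Helper()`, the package name, and the failure message are assumptions, and the code that actually landed in sync_test.go is written as a closure and may differ in detail:

```go
package neutrino_test

import (
	"testing"
	"time"

	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/rpcclient"
)

// waitTx polls the given node until it reports the transaction, and fails
// the test if the tx does not show up before the deadline.
func waitTx(t *testing.T, node *rpcclient.Client, hash chainhash.Hash) {
	t.Helper()

	exitTimer := time.NewTimer(5 * time.Second)
	defer exitTimer.Stop()

	for {
		// Brief pause before every poll, including the first one.
		<-time.After(200 * time.Millisecond)

		select {
		case <-exitTimer.C:
			t.Fatalf("tx %v not found before timeout", hash)
		default:
		}

		if _, err := node.GetRawTransaction(&hash); err != nil {
			continue
		}

		return
	}
}
```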
Nice catch on why that test was failing! This LGTM. Left a nit & an optional suggestion. I will rebase & open my pr with the other small fixes once this is in 👍
query.go
Outdated
log.Debugf("Transaction %v rejected by peer %v: code = %v, reason = %q", | ||
txHash, sp.Addr(), broadcastErr.Code, broadcastErr.Reason) | ||
|
||
close(peerQuit) |
nit: pls wrap at 80 chars
@chappjc force-pushed from 098e859 to c7fc088.
@chappjc, remember to re-request review from reviewers when ready
query.go
Outdated
log.Debugf("Transaction %v rejected by peer %v: code = %v, "+ | ||
"reason = %q", txHash, sp.Addr(), broadcastErr.Code, |
pls wrap at 80 chars
oof these big tabs are killing me 😉
LGTM pending 1 last line-wrap nit 🔥
Needs a rebase!
LGTM 🎋
`(*ChainService).sendTransaction` always timed out even if each peer did respond. This is because after receiving a `getdata` message in response to the outgoing `inv`, the `checkResponse` closure provided to `queryAllPeers` neglected to close the `peerQuit` channel. As a result, `sendTransaction` always blocked for the full configured `BroadcastTimeout`.

This change adds the missing `close(peerQuit)` when a matching response (either a getdata or reject) is received. Note that, as before this change, a non-error return from `sendTransaction` does not guarantee that a remote peer actually received the `MsgTx` that was queued, since the outgoing tx message is handled asynchronously.

To account for the removed delay, the `TestNeutrinoSync` unit test is updated with a `waitTx` helper to ensure the mining node has the sent transaction prior to requesting a new block.
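To make the mechanism concrete, here is a small, self-contained sketch of the pattern this commit describes: a per-peer response check that closes `peerQuit` once a matching getdata or reject arrives, so the caller stops waiting instead of running out the full broadcast timeout. This is only an illustration of the idea; the signature and surrounding plumbing of the real `checkResponse`/`queryAllPeers` code differ.

```go
package main

import (
	"fmt"
	"time"

	"github.com/btcsuite/btcd/wire"
)

func main() {
	// peerQuit signals that this peer has given us a usable answer.
	peerQuit := make(chan struct{})

	// checkResponse mimics the closure handed to queryAllPeers: it is
	// invoked for messages received from the peer while the query is
	// outstanding.
	checkResponse := func(resp wire.Message) {
		switch resp.(type) {
		case *wire.MsgGetData:
			// The peer asked for the advertised tx: a reply, so
			// end this peer's query early.
			fmt.Println("peer requested the tx via getdata")
			close(peerQuit)

		case *wire.MsgReject:
			// The peer rejected the tx: also a final answer.
			fmt.Println("peer rejected the tx")
			close(peerQuit)
		}
	}

	// Simulate the peer answering our inv with a getdata shortly after.
	go func() {
		time.Sleep(50 * time.Millisecond)
		checkResponse(wire.NewMsgGetData())
	}()

	// Without the close() calls above, this select would always fall
	// through to the broadcast timeout, which is the bug being fixed.
	select {
	case <-peerQuit:
		fmt.Println("query finished as soon as the peer responded")
	case <-time.After(2 * time.Second):
		fmt.Println("hit the full broadcast timeout")
	}
}
```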
```
@@ -1078,6 +1078,8 @@ func (s *ChainService) sendTransaction(tx *wire.MsgTx, options ...QueryOption) e
	)

	numReplied++

	close(peerQuit)
```
I think this is wrong. Why would we close the connection to the peer if we just got an INV message? That means they want to know about the TX, so we send it to them. Wouldn't we also want to wait for their response before quitting?
Closing the `peerQuit` channel is just used to indicate that the peer has responded, so that we free up the resource and can use the peer connection for other calls. Closing this channel doesn't close the connection to the peer.
Ah okay. But are we guaranteed to get further responses from that peer? E.g. if a reject message from that peer comes in later, do we get it in any case, or does it depend on the timing (e.g. is there a chance we might not get it because we quit the peer here already)?
I just tried this out in combination with the `lnd` test `TestLightningWallet/btcwallet/neutrino:publish_transaction`. If I remove the `close(peerQuit)` in the INV message case, I correctly get the reject for a double-spent transaction. If I leave it in there, the reject is never received.

There's also a bug in the result interpretation, so I'm going to open a PR anyway. We can discuss this further in that new PR.
`(*ChainService).sendTransaction` always times out even if each peer does respond. This is because after receiving a `getdata` message in response to the outgoing `inv`, the `checkResponse` closure provided to `queryAllPeers` neglects to close the `peerQuit` channel. As a result, `sendTransaction` always blocks for the full configured `BroadcastTimeout`.

This change adds the missing `close(peerQuit)` when a matching response (either a getdata or reject) is received. I have tested this fix in our app, and sends no longer always time out when all peers respond with a getdata.

This change also makes a note in the `sendTransaction` function that the `if numReplied == len(rejections) {` test at the end is currently impossible to hit, since `numReplied` would also need to be incremented when rejections are recorded. I did not fix this, however, since it would potentially start returning non-nil errors for harmless rejections, such as "already-in-mempool", "already have transaction", and "transaction already exists" (`pushtx.Mempool` and `pushtx.Confirmed`). There are existing TODO comments from @wpaulino (wilmer) to this effect. I did add a debug log, though, to help characterize what rejection messages are actually received in the wild.

Finally, this adds a comment in `(*blockManager).fetchFilterFromAllPeers`, which also appears to be lacking a `close(peerQuit)` in its `queryAllPeers` check closure. I added a debug log instead. I'm pretty certain this fix is needed here too, but this code is only called from `(*blockManager).detectBadPeers`, and I haven't hit that in my tests yet today.