store: refine the error handling and retry mechanism for stale read #24956
/run-all-tests
/rebuild
@xhebox @nolouch @Yisaer @djshow832 PTAL
Rest LGTM
store/tikv/region_request.go
Outdated
// Stale Read request will retry the leader or next peer on error,
// so we will exclude the PeerID of the requested peer every time.
// If the new PeerID keeps being the same with the last one, we
// should not continue excluding it to make the opts become bigger.
if ctx.isStaleRead && ctx.lastPeerID != ctx.Peer.GetId() {
	opts = append(opts, WithExcludedPeerIDs([]uint64{ctx.Peer.GetId()}))
}
The fallback strategy differs from what I had in mind. I think that if the tx_scope is local, we should find another store located in the same DC. If the tx_scope is global, we should send the request directly to the leader.
For now, it works like this:
- If the tx_scope is local, it finds other stores located in the same DC according to the matched labels and tries different peers one by one on error. Once all peers have been tried, it retries only on the leader until it fails.
- If the tx_scope is global, it tries different peers one by one on error. Once all peers have been tried, it retries only on the leader until it fails.
It seems that we only need to change the behaviour for global? Once a request fails, it would go directly to the leader rather than trying all the other peers.
I think it's better to send the request to the leader directly when tx_scope is global.
Signed-off-by: JmPotato <[email protected]>
Rest LGTM except the fallback strategy
Rest LGTM
LGTM
Please add TestingT, otherwise the tests in the store/tikv package won't run.
Rest LGTM
// Stale Read request will retry the leader or next peer on error,
// if txnScope is global, we will only retry the leader by using the WithLeaderOnly option,
// if txnScope is local, we will retry both other peers and the leader by the increasing seed.
if ctx.tryTimes < 1 && req != nil && req.TxnScope == oracle.GlobalTxnScope && req.GetStaleRead() {
Why do we need tryTimes and tryTimesLimit?
tryTimes makes sure the WithLeaderOnly option is appended only once, which avoids unnecessary slice reallocation. tryTimesLimit is used by the unit tests to mock different numbers of retries.
I have added it in |
/merge
This pull request has been accepted and is ready to merge. Commit hash: c870d80
@JmPotato: Your PR was out of date, I have automatically updated it for you. At the same time I will also trigger all tests for you: /run-all-tests If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
Signed-off-by: JmPotato [email protected]
What problem does this PR solve?
Part of #21094 and #23271.
Refine the error handling and retry mechanism for stale read.
What is changed and how it works?
- StoreID list during the retry of a stale read request to make sure TiDB will try on different peers.
- RegionNotInitialized and DataIsNotReady error handling.
Check List
Tests
Release note
- No release note.