client: add backoff for member loop
#6995
Skipping CI for Draft Pull Request.
Signed-off-by: husharp <[email protected]>
client/retry/backoff.go
Outdated
}

// ExponentialBackoff Get the exponential backoff duration.
func (rs *BackOffer) exponentialBackoff() time.Duration {
Will we use exponential backoff by default? If a network partition lasts a long time, the client may need more time to recover.
limited by max backoff time...
how long is it?
@@ -239,17 +241,19 @@ func (c *pdServiceDiscovery) updateMemberLoop() {
	ticker := time.NewTicker(memberUpdateInterval)
	defer ticker.Stop()

	bo := retry.InitialBackOffer(updateMemberBackOffBaseTime, updateMemberTimeout)
@lhy1024 for the member loop, the max backoff time is 1 second, which is updateMemberTimeout. PTAL, thx!
Got it.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #6995 +/- ##
==========================================
+ Coverage 74.18% 74.19% +0.01%
==========================================
Files 433 433
Lines 46133 46097 -36
==========================================
- Hits 34225 34203 -22
+ Misses 8887 8879 -8
+ Partials 3021 3015 -6
lgtm! good job.
/merge
@nolouch: It seems you want to merge this PR, I will help you trigger all the tests: /run-all-tests You only need to trigger
This pull request has been accepted and is ready to merge. Commit hash: 2af0f29
/run-cherry-picker
In response to a cherrypick label: new pull request created to branch |
ref tikv#6949 Signed-off-by: ti-chi-bot <[email protected]>
ref #6949 Signed-off-by: husharp <[email protected]> Co-authored-by: husharp <[email protected]>
What problem does this PR solve?
Issue Number: Ref #6949 #6556
What is changed and how does it work?
Check List
Tests
PR Summary
1. Add ready for resp
- Have the goroutine reconnectMemberLoop call updateMember periodically. When calling the ScheduleCheckMemberChanged channel, we need to wait for the goroutine to update members until it is ready or times out.
2. Add backoff mechanism
- The expo function can be used to back off (sleep) when an error is encountered.
Reproduce Step
Enable a failpoint, e.g. to simulate gRPC throttling so that reads from etcd fail:
curl -X PUT -d 'return(10)' http://tc-pd-1.tc-pd-peer.csn-simulator-big-cluster-vd62g.svc:2379/pd/api/v1/fail/github.com/tikv/pd/pkg/etcdutil/SlowEtcdKVGet
Simulate PD losing its leader:
curl -X PUT -d 'return("2346857576170797299")' http://tc-pd-1.tc-pd-peer.csn-simulator-big-cluster-vd62g.svc:2379/pd/api/v1/fail/github.com/tikv/pd/server/exitCampaignLeader
Reproduce Result
The gRPC GetMember request rate stays high:
TiKV side shows:
PR Effect
The number of gRPC GetMember calls was reduced from 3.2k to 170. The remaining number depends on the number of TiDB instances and on the client requests that trigger checkLeader.
For a cluster of 20 TiDB, 3 PD, and 50 TiKV:
170 = (50 * 3 / 3 / 3 [TiKV side] + 20 * 2 [TiDB side]) * 3 [PD Num]
More tests are necessary to ensure that no further issues arise.
Release note