client: add backoff for member loop #6995

Merged on Aug 29, 2023 (7 commits)

HuSharp
Member

@HuSharp HuSharp commented Aug 28, 2023

What problem does this PR solve?

Issue Number: Ref #6949 #6556

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

PR Summary

1. Add a ready signal for the response

  • Have the reconnectMemberLoop goroutine call updateMember periodically. When ScheduleCheckMemberChanged is called, we need to wait for the goroutine to update the members until the update is ready or a timeout is reached.

2. Add a backoff mechanism

  • While waiting for the goroutine to update, an exponential backoff function can be used to sleep between attempts when an error is encountered.

Reproduce Step

  1. Enable a failpoint to simulate a condition such as gRPC throttling, where PD cannot read from etcd.
    curl -X PUT -d 'return(10)' http://tc-pd-1.tc-pd-peer.csn-simulator-big-cluster-vd62g.svc:2379/pd/api/v1/fail/github.com/tikv/pd/pkg/etcdutil/SlowEtcdKVGet

  2. Simulate PD losing its leader.
    curl -X PUT -d 'return("2346857576170797299")' http://tc-pd-1.tc-pd-peer.csn-simulator-big-cluster-vd62g.svc:2379/pd/api/v1/fail/github.com/tikv/pd/server/exitCampaignLeader

Reproduce Result

The number of gRPC GetMember requests stays high:
image

The TiKV side shows:

image

PR Effect

The number of gRPC GetMember calls dropped from 3.2k to 170. The remaining calls scale with the number of TiDB instances and with the client requests that trigger checkLeader.

For a cluster with 20 TiDB, 3 PD, and 50 TiKV instances:
170 = (50 * 3 / 3 / 3 [TiKV side] + 20 * 2 [TiDB side]) * 3 [PD num]

More tests are needed to confirm that no further issues arise.

image

Release note

None.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 28, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lhy1024
  • nolouch

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 28, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Aug 28, 2023
@ti-chi-bot ti-chi-bot bot requested review from lhy1024 and rleungx August 28, 2023 10:59
@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 28, 2023
Signed-off-by: husharp <[email protected]>
Signed-off-by: husharp <[email protected]>
@HuSharp HuSharp marked this pull request as ready for review August 29, 2023 03:04
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2023
client/pd_service_discovery.go Outdated Show resolved Hide resolved
client/retry/backoff.go Outdated Show resolved Hide resolved
}

// ExponentialBackoff Get the exponential backoff duration.
func (rs *BackOffer) exponentialBackoff() time.Duration {
Contributor

Will we use exponential backoff by default? If a network partition lasts a long time, it may need more time to recover.

Member Author

It is limited by the max backoff time...

Contributor

How long is it?

Signed-off-by: husharp <[email protected]>
client/retry/backoff.go Outdated Show resolved Hide resolved
Signed-off-by: husharp <[email protected]>
@ti-chi-bot ti-chi-bot bot added the status/LGT1 Indicates that a PR has LGTM 1. label Aug 29, 2023
@@ -239,17 +241,19 @@ func (c *pdServiceDiscovery) updateMemberLoop() {
ticker := time.NewTicker(memberUpdateInterval)
defer ticker.Stop()

bo := retry.InitialBackOffer(updateMemberBackOffBaseTime, updateMemberTimeout)
Member Author

@lhy1024 for the member loop, the max backoff time is 1 second, which is updateMemberTimeout. PTAL, thx!

Contributor

Got it.

@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Merging #6995 (2af0f29) into master (50368e5) will increase coverage by 0.01%.
Report is 7 commits behind head on master.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6995      +/-   ##
==========================================
+ Coverage   74.18%   74.19%   +0.01%     
==========================================
  Files         433      433              
  Lines       46133    46097      -36     
==========================================
- Hits        34225    34203      -22     
+ Misses       8887     8879       -8     
+ Partials     3021     3015       -6     
Flag Coverage Δ
unittests 74.19% <100.00%> (+0.01%) ⬆️

Signed-off-by: husharp <[email protected]>
Contributor

@nolouch nolouch left a comment

lgtm! good job.

@ti-chi-bot ti-chi-bot bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Aug 29, 2023
@nolouch
Contributor

nolouch commented Aug 29, 2023

/merge

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 29, 2023

@nolouch: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Aug 29, 2023

This pull request has been accepted and is ready to merge.

Commit hash: 2af0f29

@ti-chi-bot ti-chi-bot bot added the status/can-merge Indicates a PR has been approved by a committer. label Aug 29, 2023
@ti-chi-bot ti-chi-bot bot merged commit 9a574ed into tikv:master Aug 29, 2023
19 checks passed
@HuSharp HuSharp deleted the add_back_off branch August 29, 2023 13:28
@nolouch nolouch mentioned this pull request Aug 30, 2023
7 tasks
@nolouch
Contributor

nolouch commented Aug 31, 2023

/run-cherry-picker

@nolouch nolouch added the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Aug 31, 2023
@nolouch
Contributor

nolouch commented Aug 31, 2023

/run-cherry-picker

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #7020.

ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this pull request Aug 31, 2023
ti-chi-bot bot pushed a commit that referenced this pull request Sep 1, 2023
Labels
affects-7.1 needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
4 participants