
kvclient(ticdc): fix kvclient takes too long time to recover (#3612) #3660

Conversation

ti-chi-bot
Member

This is an automated cherry-pick of #3612

close #3191
close flaky test in kvclient: #2694 #3302 #2349 #2688 #2747

What problem does this PR solve?

  1. When a TiKV node that holds about 10k region leaders fails, we found it takes more than 30 minutes to recover.
  2. We consider this abnormal, because PD and TiKV only need about 30 seconds to mark a node as 'disconnected' and elect new leaders.

What is changed and how it works?

  1. Decrease the retry count when a new stream fails to establish, so that other regions can retry elsewhere ASAP.
  2. Remove the PartialClone call when a region fails.

Reason

  • Expectation
    We used a TiKV node holding about 3k region leaders as the comparison object. After testing both the normal and abnormal cases, we got the following results:
(1) ~1min for all 3k regions to finish the initial scan and receive events (normal case)
(2) >20min for all 3k regions to recover their streams (TiKV node failure)

But PD and TiKV only need about 30s to mark a node as 'disconnected' and elect new leaders, so a reasonable recovery timespan is about:

       1min30s (normal case + region failover) ~ 5min (retry control in CDC)
  • Since the probability of network jitter >> temporary TiKV downtime >> permanent TiKV downtime, we still need some retry logic. So we
decrease the retry count when a new stream fails to establish, to make other regions retry ASAP.
  • When a region fails and calls RegionCache.OnRegionFail, it will (1) mark the store as 'needcheck' and asynchronously fetch the store's new state, and (2) move the region's leader to another peer for the next attempt and fetch the new leader back. This mechanism suits our case, but we found that all failed regions were still connecting to the old store, which was quite strange. So we
remove the wrong PartialClone logic used when a region fails in region_worker.

Result

3min30s to recover 3k regions after a node failure (after this PR), compared to >20min (before)

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to update key monitor metrics in both TiCDC document and official document

Release note

Fix the issue that the kv client takes too long to recover after a TiKV node failure.

@ti-chi-bot
Member Author

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/cherry-pick-not-approved labels Nov 29, 2021
@ti-chi-bot
Member Author

@ti-chi-bot: This cherry pick PR is for a release branch and has not yet been approved by release team.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick, it must first be approved by the collaborators.

AFTER it has been approved by collaborators, please ping the release team in a comment to request a cherry pick review.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot added component/kv-client TiKV kv log client component. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/LGT2 Indicates that a PR has LGTM 2. type/cherry-pick-for-release-5.0 This PR is cherry-picked to release-5.0 from a source PR. labels Nov 29, 2021
@maxshuang
Contributor

/invite

@maxshuang
Contributor

/run-kafka-integration-test
/run-integration-test

@overvenus overvenus added this to the v5.0.7 milestone Jan 11, 2022
@ti-chi-bot
Member Author

@ti-chi-bot: PR needs rebase.


@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 20, 2022