tikv: invalidate store's regions when send store fail #11344

Merged (6 commits) on Jul 24, 2019

Conversation

lysu
Contributor

@lysu lysu commented Jul 19, 2019

What problem does this PR solve?

When one TiKV store holds many regions and that store goes down, the next query has to wait for a "dial error: no route to host" failure on each region. A query that touches many regions (e.g. select count(*) from table) therefore becomes very slow.

What is changed and how it works?

Invalidate all of a store's regions at once by adding a storeFail counter to the store, while keeping the region cache lock-free:

  • the store's storeFail is incremented when a send to that store fails
  • each region keeps a snapshot, failEpochs, of the fail counter for each of its peers
  • when a region's storeFails[i] != store.fail, the query refills the region cache
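
The three points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Store` and `Region` types, `newRegion`, `markSendFail`, and `valid` are hypothetical names, and only the idea of comparing a per-region snapshot against the store's fail counter comes from the description.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Store is a hypothetical stand-in for a TiKV store entry in the region cache.
type Store struct {
	addr string
	fail uint32 // bumped atomically whenever a send to this store fails
}

// Region snapshots each peer store's fail counter at the time it is cached.
type Region struct {
	stores     []*Store
	storeFails []uint32
}

// newRegion caches a region and records the current fail counter of every peer store.
func newRegion(stores ...*Store) *Region {
	r := &Region{stores: stores, storeFails: make([]uint32, len(stores))}
	for i, s := range stores {
		r.storeFails[i] = atomic.LoadUint32(&s.fail)
	}
	return r
}

// markSendFail records a failed send; no per-region work and no lock is needed.
func (s *Store) markSendFail() { atomic.AddUint32(&s.fail, 1) }

// valid reports whether the snapshot for peer idx still matches its store's counter.
func (r *Region) valid(idx int) bool {
	return r.storeFails[idx] == atomic.LoadUint32(&r.stores[idx].fail)
}

func main() {
	s := &Store{addr: "tikv-1:20160"} // address is made up for the example
	r1, r2 := newRegion(s), newRegion(s)
	fmt.Println(r1.valid(0), r2.valid(0)) // both regions start out valid

	s.markSendFail() // one failed send invalidates every region on the store
	fmt.Println(r1.valid(0), r2.valid(0))

	r3 := newRegion(s) // a reloaded region takes a fresh snapshot
	fmt.Println(r3.valid(0))
}
```

A single atomic increment on the store is enough to make every cached region on that store look stale, so the cost of a store failure no longer grows with the number of regions.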

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • Schrodinger WIP

Code changes

  • Impl

Side effects

  • Increased code complexity

Related changes

  • Need to cherry-pick to the release branch


@lysu lysu added the type/enhancement, component/tikv, and status/WIP labels Jul 19, 2019
@lysu
Contributor Author

lysu commented Jul 19, 2019

/run-all-tests

@codecov

codecov bot commented Jul 19, 2019

Codecov Report

Merging #11344 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #11344   +/-   ##
===========================================
  Coverage   81.7662%   81.7662%           
===========================================
  Files           424        424           
  Lines         92137      92137           
===========================================
  Hits          75337      75337           
  Misses        11490      11490           
  Partials       5310       5310

@@ -368,7 +385,7 @@ func (c *RegionCache) OnSendFail(bo *Backoffer, ctx *RPCContext, scheduleReload
 	tikvRegionCacheCounterWithSendFail.Inc()
 	r := c.getCachedRegionWithRLock(ctx.Region)
 	if r != nil {
-		c.switchNextPeer(r, ctx.PeerIdx)
+		c.switchNextPeer(r, ctx.PeerIdx, err)
Contributor

I think we can create a new error type, ReConnectionFailure, and invalidate regions only for that error.

Contributor Author

Yes, I'm looking into how Go's network package generates errors; after that we can fix this TODO: https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R965

store/tikv/region_cache.go: two outdated review threads, resolved
@lysu lysu marked this pull request as ready for review July 23, 2019 07:00
@lysu lysu removed the status/WIP label Jul 23, 2019
@lysu lysu requested review from tiancaiamao and coocood July 23, 2019 08:42
@lysu
Copy link
Contributor Author

lysu commented Jul 23, 2019

/run-all-tests


if err != nil { // TODO: refine err, only do this for some errors.
	s := rs.stores[rs.workStoreIdx]
	epoch := rs.storeFails[rs.workStoreIdx]
Contributor

Should this be atomic.load? Or maybe a lock would be better. I'm not sure whether any other threads can access storeFails concurrently.

Contributor Author

@lysu lysu Jul 23, 2019

Good catch, but atomic.load isn't needed here because r.getStore already does that; the whole rs needs to follow the copy-on-write idiom.

So https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R973 has a bug: we should not +1 rs.storeFails[rs.workStoreIdx] here, because after trying other peers the region may come back to the current idx, and we need to reload the region at that time.
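
A rough sketch of the copy-on-write idiom mentioned here, under assumptions: the snapshot is published through an `atomic.Value` (needs Go 1.17+ for `CompareAndSwap`), readers load it once and never mutate it, and writers clone, modify, and swap. The type and method bodies below are simplified and hypothetical; only the names `getStore`, `switchNextPeer`, `workStoreIdx`, and `storeFails` echo the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// regionStore is an immutable snapshot; it is never modified after being published.
type regionStore struct {
	workStoreIdx int
	storeFails   []uint32
}

// clone returns a deep copy so the published snapshot stays untouched.
func (rs *regionStore) clone() *regionStore {
	fails := make([]uint32, len(rs.storeFails))
	copy(fails, rs.storeFails)
	return &regionStore{workStoreIdx: rs.workStoreIdx, storeFails: fails}
}

type region struct {
	store atomic.Value // always holds a *regionStore
}

// getStore does the atomic load; callers can then read the snapshot's fields freely.
func (r *region) getStore() *regionStore { return r.store.Load().(*regionStore) }

// switchNextPeer moves to the next peer by publishing a modified copy.
// storeFails is copied unchanged, so a stale snapshot is still detected
// if the region later comes back to the same index.
func (r *region) switchNextPeer(currentIdx int) bool {
	old := r.getStore()
	if old.workStoreIdx != currentIdx {
		return false // another goroutine already switched
	}
	next := old.clone()
	next.workStoreIdx = (currentIdx + 1) % len(next.storeFails)
	return r.store.CompareAndSwap(old, next)
}

func main() {
	r := &region{}
	r.store.Store(&regionStore{storeFails: []uint32{0, 0, 0}})

	snap := r.getStore() // a reader's view taken before the switch
	fmt.Println(r.switchNextPeer(0))
	fmt.Println(snap.workStoreIdx, r.getStore().workStoreIdx) // old snapshot unchanged, new one advanced
	fmt.Println(r.switchNextPeer(0))                          // stale index, so no second switch
}
```

Because every field of a snapshot is fixed once it is published, a reader that obtained it through `getStore` can index `storeFails` without a further atomic load or lock.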

@hicqu
Contributor

hicqu commented Jul 23, 2019

PTAL @coocood

lysu added 3 commits July 23, 2019 17:11
no need to +1 for workIdx; this region needs to be reloaded if trying other peers fails and it comes back to the current idx
@coocood coocood changed the title from "tikv: invalid store's regions when send store fail" to "tikv: invalidate store's regions when send store fail" Jul 23, 2019
@coocood
Member

coocood commented Jul 23, 2019

LGTM

@hicqu hicqu merged commit 2b251d1 into pingcap:master Jul 24, 2019
@sre-bot
Contributor

sre-bot commented Jul 24, 2019

cherry pick to release-3.0 failed
