tikv: invalidate store's regions when send store fail #11344

lysu · 2019-07-19T14:56:56Z

What problem does this PR solve?

When one TiKV store holds many regions and that store goes down, the next query has to wait for a "dial error: no route to host" failure for each region. A following query that touches many regions (e.g. select count(*) from table) will be very slow.

What is changed and how it works?

Invalidate all of a store's regions at once by introducing a storeFail counter on the store, while keeping the region cache lock-free:

the store's storeFail is incremented by 1 when a send to that store fails
each region keeps a snapshot, failEpochs, of the fail counter for each of its peers
when a region's storeFails[i] != store.fail, the query refills the region cache
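
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Store` and `Region` types, `NewRegion`, `MarkSendFail`, and `Valid` are hypothetical names, and the real region cache carries much more state.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Store is a simplified stand-in for a TiKV store entry in the cache.
type Store struct {
	addr string
	fail uint32 // bumped atomically each time a send to this store fails
}

// Region is a simplified cached region: one store per peer, plus the
// fail counter observed for each store when the region was cached.
type Region struct {
	stores     []*Store
	storeFails []uint32
}

// NewRegion snapshots the current fail counter of every peer's store.
func NewRegion(stores ...*Store) *Region {
	r := &Region{stores: stores, storeFails: make([]uint32, len(stores))}
	for i, s := range stores {
		r.storeFails[i] = atomic.LoadUint32(&s.fail)
	}
	return r
}

// MarkSendFail records a send failure. Every region holding an older
// snapshot for this store becomes stale without being visited.
func (s *Store) MarkSendFail() {
	atomic.AddUint32(&s.fail, 1)
}

// Valid reports whether the snapshot for peer i still matches its store.
func (r *Region) Valid(i int) bool {
	return r.storeFails[i] == atomic.LoadUint32(&r.stores[i].fail)
}

func main() {
	s1, s2 := &Store{addr: "store-1"}, &Store{addr: "store-2"}
	a, b := NewRegion(s1, s2), NewRegion(s1)

	s1.MarkSendFail() // one failed send to store-1

	fmt.Println(a.Valid(0), a.Valid(1), b.Valid(0)) // false true false
	fmt.Println(NewRegion(s1).Valid(0))             // true: a reloaded region takes a fresh snapshot
}
```

The point of the design is that a single counter increment invalidates every region on the failed store, so a query does not pay one dial timeout per region.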

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
Schrodinger WIP

Code changes

Impl

Side effects

Increased code complexity

Related changes

Need to cherry-pick to the release branch

lysu · 2019-07-19T14:57:13Z

/run-all-tests

codecov · 2019-07-19T15:01:44Z

Codecov Report

Merging #11344 into master will not change coverage.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master     #11344   +/-   ##
===========================================
  Coverage   81.7662%   81.7662%           
===========================================
  Files           424        424           
  Lines         92137      92137           
===========================================
  Hits          75337      75337           
  Misses        11490      11490           
  Partials       5310       5310

hicqu · 2019-07-22T03:55:51Z

store/tikv/region_cache.go

@@ -368,7 +385,7 @@ func (c *RegionCache) OnSendFail(bo *Backoffer, ctx *RPCContext, scheduleReload
 	tikvRegionCacheCounterWithSendFail.Inc()
 	r := c.getCachedRegionWithRLock(ctx.Region)
 	if r != nil {
-		c.switchNextPeer(r, ctx.PeerIdx)
+		c.switchNextPeer(r, ctx.PeerIdx, err)


I think we can create a new error type, ReConnectionFailure, and invalidate regions only for that error.

Yes, I'm looking into how Go's network package generates errors; after that we can fix this TODO: https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R965

lysu · 2019-07-23T08:45:24Z

/run-all-tests

hicqu · 2019-07-23T08:53:30Z

store/tikv/region_cache.go

+
+	if err != nil { // TODO: refine err, only do this for some errors.
+		s := rs.stores[rs.workStoreIdx]
+		epoch := rs.storeFails[rs.workStoreIdx]


Should this be an atomic load? Or maybe a lock would be better. I'm not sure whether any other threads can access storeFails concurrently.

Good catch, but no atomic load is needed here, because r.getStore already does that; the whole rs needs to follow the copy-on-write idiom.

So https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R973 has a bug: we should not add 1 to rs.storeFails[rs.workStoreIdx] here, because after trying the other peers the region may come back to the current index, and we need to reload the region at that time.
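
The copy-on-write idiom mentioned in this reply might look roughly like the sketch below. It is an assumption-laden illustration rather than the PR's code: it uses `atomic.Pointer` from Go 1.19 for brevity, and the `clone` helper and the simplified `switchNextPeer` signature are hypothetical. Readers load the snapshot once and never see a partial update; writers copy, modify the copy, and publish it with a compare-and-swap. The stale `storeFails` entry is deliberately left untouched, so the region still looks invalid if the work index later rotates back to the failed peer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// regionStore is an immutable snapshot; it is never modified once published.
type regionStore struct {
	workStoreIdx int
	storeFails   []uint32
}

// clone returns a deep copy that a writer may modify privately.
func (rs *regionStore) clone() *regionStore {
	fails := make([]uint32, len(rs.storeFails))
	copy(fails, rs.storeFails)
	return &regionStore{workStoreIdx: rs.workStoreIdx, storeFails: fails}
}

// Region publishes its current snapshot through an atomic pointer.
type Region struct {
	store atomic.Pointer[regionStore]
}

// switchNextPeer moves the work index past failedIdx. It does not touch
// storeFails, so a stale snapshot for the failed peer stays stale.
func (r *Region) switchNextPeer(failedIdx int) {
	for {
		old := r.store.Load()
		if old.workStoreIdx != failedIdx {
			return // another goroutine already switched
		}
		next := old.clone()
		next.workStoreIdx = (old.workStoreIdx + 1) % len(old.storeFails)
		if r.store.CompareAndSwap(old, next) {
			return
		}
	}
}

func main() {
	r := &Region{}
	r.store.Store(&regionStore{storeFails: []uint32{0, 0, 0}})

	before := r.store.Load()
	r.switchNextPeer(0)
	after := r.store.Load()

	fmt.Println(before.workStoreIdx, after.workStoreIdx) // 0 1
	fmt.Println(after.storeFails)                        // [0 0 0]
}
```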

hicqu · 2019-07-23T08:55:17Z

PTAL @coocood

No need to +1 for workIdx; the region needs to be reloaded if trying the other peers fails and it comes back to the current idx.

coocood · 2019-07-23T11:20:47Z

LGTM

sre-bot · 2019-07-24T05:23:10Z

cherry pick to release-3.0 failed

lysu added type/enhancement The issue or PR belongs to an enhancement. component/tikv status/WIP labels Jul 19, 2019

jackysp added the status/all tests passed label Jul 21, 2019

overvenus reviewed Jul 21, 2019

hicqu reviewed Jul 22, 2019

hicqu reviewed Jul 22, 2019

lysu marked this pull request as ready for review July 23, 2019 07:00

lysu removed the status/WIP label Jul 23, 2019

lysu requested review from tiancaiamao and coocood July 23, 2019 08:42

lysu added the needs-cherry-pick-3.0 label Jul 23, 2019

hicqu reviewed Jul 23, 2019

lysu added 3 commits July 23, 2019 17:11

tikv: invalid store's regions when send store fail

02dca27

address comment

350a1f0

no need +1 for workIdx, this region need be reload if try other peer fail and back to current idx

fix CI

94f8283

coocood changed the title ~~tikv: invalid store's regions when send store fail~~ tikv: invalidate store's regions when send store fail Jul 23, 2019

lysu and others added 2 commits July 23, 2019 20:11

address comment

bd6385e

Merge branch 'master' into dev-store-fail-next

2887079

hicqu approved these changes Jul 24, 2019

Merge branch 'master' into dev-store-fail-next

101483d

hicqu merged commit 2b251d1 into pingcap:master Jul 24, 2019

lysu mentioned this pull request Jul 29, 2019

tikv: invalidate store's regions when send store fail (#11344) #11498

Merged

jackysp pushed a commit that referenced this pull request Jul 30, 2019

tikv: invalidate store's regions when send store fail (#11344) (#11498)

aeeeb15

lysu mentioned this pull request Apr 1, 2020

tikv: refine region cache error handle #15989

Closed
