tikv: invalidate store's regions when send store fail #11344
Conversation
/run-all-tests
Codecov Report

|          | master   | #11344   | +/- |
|----------|----------|----------|-----|
| Coverage | 81.7662% | 81.7662% |     |
| Files    | 424      | 424      |     |
| Lines    | 92137    | 92137    |     |
| Hits     | 75337    | 75337    |     |
| Misses   | 11490    | 11490    |     |
| Partials | 5310     | 5310     |     |
@@ -368,7 +385,7 @@ func (c *RegionCache) OnSendFail(bo *Backoffer, ctx *RPCContext, scheduleReload
 	tikvRegionCacheCounterWithSendFail.Inc()
 	r := c.getCachedRegionWithRLock(ctx.Region)
 	if r != nil {
-		c.switchNextPeer(r, ctx.PeerIdx)
+		c.switchNextPeer(r, ctx.PeerIdx, err)
I think we can create a new error type ReConnectionFailure, and only invalidate regions for that kind of error.
Yes, I'm looking into how Go's network stack generates these errors; after that we can fix this TODO: https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R965
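For illustration, a minimal sketch (not code from this PR) of how such a connection-level failure might be classified in Go. It assumes the send error surfaces either as a `*net.OpError` from dialing or as a gRPC `Unavailable` status; the `ReConnectionFailure` type is the reviewer's suggested name, not an existing type in the codebase:

```go
package example

import (
	"errors"
	"net"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ReConnectionFailure is the error type suggested in the review (hypothetical):
// it wraps errors that indicate the whole store is unreachable, not just one request.
type ReConnectionFailure struct {
	Err error
}

func (e *ReConnectionFailure) Error() string { return "reconnection failure: " + e.Err.Error() }
func (e *ReConnectionFailure) Unwrap() error { return e.Err }

// classifySendError wraps store-level connection failures so callers can decide
// whether to invalidate all regions cached for the store.
func classifySendError(err error) error {
	var opErr *net.OpError
	if errors.As(err, &opErr) && opErr.Op == "dial" {
		return &ReConnectionFailure{Err: err} // e.g. "dial error: no route to host"
	}
	if status.Code(err) == codes.Unavailable {
		return &ReConnectionFailure{Err: err}
	}
	return err
}
```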
/run-all-tests
if err != nil { // TODO: refine err, only do this for some errors.
	s := rs.stores[rs.workStoreIdx]
	epoch := rs.storeFails[rs.workStoreIdx]
Should this be an atomic load? Or maybe a lock would be better. I'm not sure whether any other goroutines can access storeFails concurrently.
Good catch, but this doesn't need an atomic load, because r.getStore already does that; the whole rs must follow the copy-on-write idiom.

So https://github.com/pingcap/tidb/pull/11344/files#diff-708f6242b27e2b7bcf0e905e9b0eacf2R973 has a bug: we should not +1 rs.storeFails[rs.workStoreIdx] here, because after trying other peers the region may come back to the current idx, and we need to reload the region at that time.

PTAL @coocood
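For context, a rough sketch of the copy-on-write idiom described above. The type and field names (region, regionStore, workStoreIdx, storeFails) are modeled on the diff; the actual struct layout in the PR may differ. The snapshot is loaded atomically once via getStore, after which its fields can be read without further atomics, and writers clone the snapshot before swapping it in:

```go
package example

import (
	"sync/atomic"
	"unsafe"
)

// regionStore is an immutable snapshot (hypothetical layout); its fields are
// never mutated in place.
type regionStore struct {
	workStoreIdx int32
	storeFails   []uint32 // per-peer store fail epochs captured when the snapshot was built
}

type region struct {
	store unsafe.Pointer // *regionStore, swapped atomically (copy-on-write)
}

// getStore atomically loads the current snapshot, so callers can read its
// fields (including storeFails) without extra atomics or locks.
func (r *region) getStore() *regionStore {
	return (*regionStore)(atomic.LoadPointer(&r.store))
}

// compareAndSwapStore publishes a new snapshot; writers must clone first.
func (r *region) compareAndSwapStore(old, new *regionStore) bool {
	return atomic.CompareAndSwapPointer(&r.store, unsafe.Pointer(old), unsafe.Pointer(new))
}

// clone copies the snapshot so a writer can modify it before swapping it back.
func (rs *regionStore) clone() *regionStore {
	fails := make([]uint32, len(rs.storeFails))
	copy(fails, rs.storeFails)
	return &regionStore{workStoreIdx: rs.workStoreIdx, storeFails: fails}
}
```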
No need to +1 the entry for workIdx; this region needs to be reloaded if trying the other peers fails and we come back to the current idx.
LGTM
Cherry pick to release-3.0 failed
What problem does this PR solve?

When one TiKV store holds many regions and that store goes down, the next query has to wait for a "dial error: no route to host" for every region; if a following query needs many regions (e.g. select count(*) from table), it will be very slow.

What is changed and how it works?
Make all of a store's regions invalid when sending to the store fails (see the sketch below):
- introduce a storeFail epoch on the store to invalidate the store's regions while keeping the region cache's lock-free feel
- snapshot the failEpochs for each region's peers
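A simplified sketch of this mechanism, under the assumption that the store carries a monotonically increasing fail epoch and each cached region snapshots the epochs of its peers' stores; the field and function names here are illustrative, not necessarily those used in the PR:

```go
package example

import "sync/atomic"

// Store carries a monotonically increasing fail epoch (illustrative name);
// it is bumped once per send failure against the store, never reset.
type Store struct {
	fail uint32
}

// RegionStore is the region's copy-on-write snapshot: it remembers the fail
// epoch of every peer's store at the time the region was loaded.
type RegionStore struct {
	workStoreIdx int32
	stores       []*Store
	storeFails   []uint32
}

// markStoreSendFail is called when sending to the store fails with a
// connection-level error: only the store's epoch moves, the snapshot does not.
func markStoreSendFail(s *Store) {
	atomic.AddUint32(&s.fail, 1)
}

// needReload reports whether the region must be reloaded before use: any
// mismatch between a store's current epoch and the snapshot means that store
// failed since the region was cached, so all its regions go stale at once.
func (rs *RegionStore) needReload() bool {
	for i, s := range rs.stores {
		if atomic.LoadUint32(&s.fail) != rs.storeFails[i] {
			return true
		}
	}
	return false
}
```

With a scheme like this, one failed store invalidates all of its cached regions at once instead of each region timing out on its own dial.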
Check List

Tests
Code changes
Side effects
Related changes
This change is