Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ScanRegions total retry time is too short #5230

Closed
overvenus opened this issue Apr 21, 2022 · 3 comments · Fixed by #5269 or #5499
Closed

ScanRegions total retry time is too short #5230

overvenus opened this issue Apr 21, 2022 · 3 comments · Fixed by #5269 or #5499
Assignees
Labels

Comments

@overvenus
Copy link
Member

overvenus commented Apr 21, 2022

What did you do?

Run tpcc prepare on 4 node TiKV cluster. Create a changefeed to replication upstream write.

What did you expect to see?

No replication interruption.

What did you see instead?

Replication is interrupted due to following errors, which is caused by tikv/tikv#12410

808147:[2022/04/18 14:07:40.868 +08:00] [WARN] [client.go:1020] [ScanRegions] [span="[7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa)"] [regions="[{\"id\":33977192,\"start_key\":\"dIAAAAAAABj/Q19ygAAABK7/nHi9AAAAAAD6\",\"end_key\":\"dIAAAAAAABj/RQAAAAAAAAD4\",\"region_epoch\":{\"conf_ver\":40838,\"version\":22856},\"peers\":[{\"id\":34144322,\"store_id\":11},{\"id\":34144658,\"store_id\":1},{\"id\":34144780,\"store_id\":9}]}]"] [error="[CDC:ErrRegionsNotCoverSpan]regions not completely left cover span, span [7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa) regions: [id:33977192 start_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377C_r\\200\\000\\000\\004\\256\\377\\234x\\275\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377E\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:40838 version:22856 > peers:<id:34144322 store_id:11 > peers:<id:34144658 store_id:1 > peers:<id:34144780 store_id:9 > ]"] [errorVerbose="[CDC:ErrRegionsNotCoverSpan]regions not completely left cover span, span [7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa) regions: [id:33977192 start_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377C_r\\200\\000\\000\\004\\256\\377\\234x\\275\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377E\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:40838 version:22856 > peers:<id:34144322 store_id:11 > peers:<id:34144658 store_id:1 > peers:<id:34144780 store_id:9 > ]\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:1019\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:1001\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func3.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:527\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
...
808238:[2022/04/18 14:07:44.956 +08:00] [WARN] [client.go:1020] [ScanRegions] [span="[7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa)"] [regions="[{\"id\":33977192,\"start_key\":\"dIAAAAAAABj/Q19ygAAABK7/nHi9AAAAAAD6\",\"end_key\":\"dIAAAAAAABj/RQAAAAAAAAD4\",\"region_epoch\":{\"conf_ver\":40838,\"version\":22856},\"peers\":[{\"id\":34144322,\"store_id\":11},{\"id\":34144658,\"store_id\":1},{\"id\":34144780,\"store_id\":9}]}]"] [error="[CDC:ErrRegionsNotCoverSpan]regions not completely left cover span, span [7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa) regions: [id:33977192 start_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377C_r\\200\\000\\000\\004\\256\\377\\234x\\275\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377E\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:40838 version:22856 > peers:<id:34144322 store_id:11 > peers:<id:34144658 store_id:1 > peers:<id:34144780 store_id:9 > ]"] [errorVerbose="[CDC:ErrRegionsNotCoverSpan]regions not completely left cover span, span [7480000000000018ff435f7280000004aeff9a4e430000000000fa, 7480000000000018ff435f730000000000fa) regions: [id:33977192 start_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377C_r\\200\\000\\000\\004\\256\\377\\234x\\275\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\030\\377E\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:40838 version:22856 > peers:<id:34144322 store_id:11 > peers:<id:34144658 store_id:1 > peers:<id:34144780 store_id:9 > ]\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:156\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions.func1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:1019\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:1001\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func3.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:527\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
808298:[2022/04/18 14:07:48.987 +08:00] [ERROR] [processor.go:1457] ["error on running processor"] [capture=10.19.85.9:8300] [changefeed=v20210714] [processor=339c9d4e-41ec-46f7-bd53-cc70912d68f8] [error="[CDC:ErrReachMaxTry]reach maximum try: 100"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 100\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:156\ngithu
b.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:65\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).divideAndSendEventFeedToRegions\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:1001\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func3.1\n\tgithub.com/pingcap/ticdc@/cdc/kv/client.go:527\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]

tiflow/cdc/kv/client.go

Lines 930 to 967 in c6966a4

retryErr := retry.Do(ctx, func() error {
scanT0 := time.Now()
bo := tikv.NewBackoffer(ctx, tikvRequestMaxBackoff)
regions, err = s.client.regionCache.BatchLoadRegionsWithKeyRange(bo, nextSpan.Start, nextSpan.End, limit)
scanRegionsDuration.Observe(time.Since(scanT0).Seconds())
if err != nil {
return cerror.WrapError(cerror.ErrPDBatchLoadRegions, err)
}
metas := make([]*metapb.Region, 0, len(regions))
for _, region := range regions {
if region.GetMeta() == nil {
err = cerror.ErrMetaNotInRegion.GenWithStackByArgs()
log.Warn("batch load region",
zap.Stringer("span", nextSpan), zap.Error(err),
zap.String("changefeed", s.client.changefeed),
)
return err
}
metas = append(metas, region.GetMeta())
}
if !regionspan.CheckRegionsLeftCover(metas, nextSpan) {
err = cerror.ErrRegionsNotCoverSpan.GenWithStackByArgs(nextSpan, metas)
log.Warn("ScanRegions",
zap.Stringer("span", nextSpan),
zap.Reflect("regions", metas), zap.Error(err),
zap.String("changefeed", s.client.changefeed),
)
return err
}
log.Debug("ScanRegions",
zap.Stringer("span", nextSpan),
zap.Reflect("regions", metas),
zap.String("changefeed", s.client.changefeed))
return nil
}, retry.WithBackoffMaxDelay(50), retry.WithMaxTries(100), retry.WithIsRetryableErr(cerror.IsRetryableError))
if retryErr != nil {
return retryErr
}

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

4.0.15

Upstream TiKV version (execute tikv-server --version):

4.0.15

TiCDC version (execute cdc version):

4.0.15
@overvenus
Copy link
Member Author

Find a new case reproduce the issue. That leader election takes about 20s, it is still possible for ScanRegion timeout given the 25s total retry time.

│[2022/05/20 12:48:48.329 +08:00] [INFO] [pd.rs:814] ["try to batch split region"] [task=batch_split] [region="id: 101200 start_key: 7480000000000000FF3F5F72E000000002FF1EBB370000000000FA end_key: 7480000000000000FF3F5F72E800000000FF0009710000000000FA region_epoch { conf│
│[2022/05/20 12:48:49.958 +08:00] [INFO] [peer.rs:2713] ["insert new region"] [region="id: 307900 start_key: 7480000000000000FF3F5F72E000000002FF1EBB370000000000FA end_key: 7480000000000000FF3F5F72E000000002FF1EF65A0000000000FA region_epoch { conf_ver: 5 version: 1762 } │
│[2022/05/20 12:48:50.232 +08:00] [INFO] [peer.rs:2713] ["insert new region"] [region="id: 307900 start_key: 7480000000000000FF3F5F72E000000002FF1EBB370000000000FA end_key: 7480000000000000FF3F5F72E000000002FF1EF65A0000000000FA region_epoch { conf_ver: 5 version: 1762 } │
│[2022/05/20 12:48:50.539 +08:00] [INFO] [peer.rs:2713] ["insert new region"] [region="id: 307900 start_key: 7480000000000000FF3F5F72E000000002FF1EBB370000000000FA end_key: 7480000000000000FF3F5F72E000000002FF1EF65A0000000000FA region_epoch { conf_ver: 5 version: 1762 } │
│[2022/05/20 12:49:10.331 +08:00] [INFO] [raft.rs:1200] ["became leader at term 6"] [term=6] [raft_id=307902] [region_id=307900]                                                                                                                                               │

overvenus added a commit to overvenus/ticdc that referenced this issue Jun 20, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jun 20, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jun 20, 2022
@nongfushanquan
Copy link
Contributor

/label affects-5.4

overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jun 21, 2022
overvenus added a commit to overvenus/ticdc that referenced this issue Jun 21, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jun 21, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Jun 23, 2022
@seiya-annie
Copy link

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants