
kv client takes too long to recover when a TiKV node shuts down #3191

Closed
amyangfei opened this issue Oct 29, 2021 · 0 comments · Fixed by #3612
Labels
affects-4.0, affects-5.0, area/ticdc (Issues or PRs related to TiCDC), component/kv-client (TiKV kv log client component), severity/major, type/bug (The issue is confirmed as a bug)

Comments


amyangfei commented Oct 29, 2021

What did you do?

  1. Set up a TiDB cluster with 5 or more TiKV nodes and a TiCDC replication task; the replicated table should have a large region count, such as 100k regions or more.
  2. Shut down one of the TiKV nodes without evicting its leader regions first.
  3. Observe how long it takes for the changefeed to return to normal.

What did you expect to see?

No response

What did you see instead?


The kv client takes 23 minutes to reconnect all leader regions that belong to the shut-down TiKV node; it wastes too much time retrying to establish a gRPC connection with the shut-down TiKV store.

[2021/10/29 15:26:37.361 +08:00] [INFO] [client.go:377] ["establish stream to store failed, retry later"] [addr=172.16.7.55:20161] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.7.55:20161: connect: connection refused\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.7.55:20161: connect: connection refused\""] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.16.7.55:20161: connect: connection refused\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing 
dial tcp 172.16.7.55:20161: connect: connection refused\"\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/[email protected]/normalize.go:302\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:376\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:347\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).requestRegionToStore\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:735\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func2\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:519\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1371"]
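For illustration only, here is a minimal Go sketch (not the TiCDC implementation) of how bounding each dial attempt with a short timeout and capping the backoff keeps a dead store from blocking the caller for long; the function name, timeouts, and attempt limit below are all made up for this example:

```go
package kvretry

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialStore tries to establish a gRPC connection to addr, bounding each
// attempt with a short timeout and giving up after maxAttempts, so a store
// that refuses connections fails fast instead of being retried for minutes.
func dialStore(ctx context.Context, addr string, maxAttempts int) (*grpc.ClientConn, error) {
	backoff := 100 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		dialCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
		conn, err := grpc.DialContext(dialCtx, addr,
			grpc.WithTransportCredentials(insecure.NewCredentials()),
			grpc.WithBlock(), // block until connected or the per-attempt timeout fires
		)
		cancel()
		if err == nil {
			return conn, nil
		}
		fmt.Printf("dial %s failed (attempt %d/%d): %v\n", addr, attempt, maxAttempts, err)

		// Wait before the next attempt, doubling the delay up to a cap.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
	return nil, fmt.Errorf("store %s still unreachable after %d attempts", addr, maxAttempts)
}
```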

The kv client should take the availability of the TiKV store into consideration, similar to the active store mechanism in the region cache.
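As a rough sketch of that idea, the kv client could remember which store addresses recently refused connections and skip dialing them until a cooldown expires. The type and method names below (storeHealth, markDown, markUp, shouldSkip) and the cooldown parameter are hypothetical and are not the existing TiCDC or region-cache API:

```go
package kvretry

import (
	"sync"
	"time"
)

// storeHealth remembers which TiKV store addresses recently failed to
// connect, so callers can skip dialing them until a cooldown expires.
// This is only a sketch of the "consider store availability" idea from
// this issue, not the actual TiCDC or region-cache code.
type storeHealth struct {
	mu       sync.Mutex
	downedAt map[string]time.Time
	cooldown time.Duration
}

func newStoreHealth(cooldown time.Duration) *storeHealth {
	return &storeHealth{
		downedAt: make(map[string]time.Time),
		cooldown: cooldown,
	}
}

// markDown records a failed connection attempt to addr.
func (s *storeHealth) markDown(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.downedAt[addr] = time.Now()
}

// markUp clears the record once a connection to addr succeeds again.
func (s *storeHealth) markUp(addr string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.downedAt, addr)
}

// shouldSkip reports whether addr failed so recently that dialing it again
// now would almost certainly just waste time.
func (s *storeHealth) shouldSkip(addr string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, ok := s.downedAt[addr]
	return ok && time.Since(t) < s.cooldown
}
```

A caller that fails to dial a store would call markDown and reroute or delay requests for the affected regions, then call markUp once a connection to that store succeeds again.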

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.2.1

TiCDC version (execute cdc version):

master@pingcap/ticdc@0a8390a
