
table sink stucks after cdc owner switch #9583

Closed

fubinzh opened this issue Aug 15, 2023 · 7 comments

Labels
area/ticdc Issues or PRs related to TiCDC. found/automation Bugs found by automation cases severity/moderate type/bug The issue is confirmed as a bug.

Comments


fubinzh commented Aug 15, 2023

What did you do?

  1. create changefeed
  2. run sysbench workload
  3. Inject cdc owner failure twice at 08:44 and 09:14
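
For anyone trying to reproduce this, the steps above roughly correspond to commands like the following (a hypothetical sketch; the PD address, sink URI, and sysbench parameters are assumptions, not the exact setup used by the automation case):

```
# 1. Create the changefeed (addresses and names are placeholders).
cdc cli changefeed create \
  --pd=http://<pd-host>:2379 \
  --sink-uri="mysql://root:@<downstream-tidb>:4000/" \
  --changefeed-id="owner-switch-sync"

# 2. Run a sysbench write workload against the upstream cluster.
sysbench oltp_insert --mysql-host=<upstream-tidb> --mysql-port=4000 \
  --mysql-user=root --tables=16 --table-size=1000000 --time=3600 run

# 3. Kill the CDC owner process twice (owner failure injection) while the workload runs.
```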

What did you expect to see?

CDC task should be OK

What did you see instead?

The table sink sometimes gets stuck, and eventually the changefeed fails with: "changefeed is in unretryable state, please check the error message, and you should manually handle it: table sink stuck".


[2023/08/14 08:46:53.019 +00:00] [WARN] [manager.go:984] ["Table checkpoint is stuck too long, will restart the sink backend"] [namespace=default] [changefeed=owner-switch-sync] [span={table_id:319,start_key:7480000000000001ff3f5f720000000000fa,end_key:7480000000000001ff3f5f730000000000fa}] [checkpointTs="{\"Mode\":0,\"Ts\":443548182204121088,\"BatchID\":18446744073709551615}"] [stuckCheck=150] [factoryVersion=1]
[2023/08/14 08:46:53.019 +00:00] [INFO] [manager.go:384] ["Sink manager tries to put an sink error"] [namespace=default] [changefeed=owner-switch-sync] [skipped=false] [error="table sink stuck"]
[2023/08/14 08:46:53.019 +00:00] [WARN] [manager.go:263] ["Sink manager backend sink fails"] [namespace=default] [changefeed=owner-switch-sync] [factoryVersion=1] [error="table sink stuck"] [errorVerbose="table sink stuck\ngithub.com/pingcap/tiflow/cdc/processor/sinkmanager.(*SinkManager).GetTableStats\n\tgithub.com/pingcap/tiflow/cdc/processor/sinkmanager/manager.go:992\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).GetTableSpanStatus\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:333\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpan).getAndUpdateTableSpanState\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:57\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.func1\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:322\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll.(*BtreeMap[...]).Ascend.func2\n\tgithub.com/pingcap/tiflow/pkg/spanz/btree_map.go:110\ngithub.com/google/btree.(*node[...]).iterate\n\tgithub.com/google/[email protected]/btree_generic.go:522\ngithub.com/google/btree.(*node[...]).iterate\n\tgithub.com/google/[email protected]/btree_generic.go:510\ngithub.com/google/btree.(*BTreeG[...]).Ascend\n\tgithub.com/google/[email protected]/btree_generic.go:779\ngithub.com/pingcap/tiflow/pkg/spanz.(*BtreeMap[...]).Ascend\n\tgithub.com/pingcap/tiflow/pkg/spanz/btree_map.go:109\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*tableSpanManager).poll\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/table.go:315\ngithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent.(*agent).Tick\n\tgithub.com/pingcap/tiflow/cdc/scheduler/internal/v3/agent/agent.go:214\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:586\ngithub.com/pingcap/tiflow/cdc/processor.(*processor).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/processor.go:474\ngithub.com/pingcap/tiflow/cdc/processor.(*managerImpl).Tick\n\tgithub.com/pingcap/tiflow/cdc/processor/manager.go:141\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:290\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:577\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:388\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1650"]

...

[2023/08/14 09:26:06.244 +00:00] [ERROR] [feed_state_manager.go:434] ["processor reports an error"] [namespace=default] [changefeed=owner-switch-sync] [captureID=f07dfb1f-f029-4a38-811a-23b898717b10] [error="{\"time\":\"2023-08-14T09:26:06.225683566Z\",\"addr\":\"upstream-ticdc-2.upstream-ticdc-peer.cdc-testbed-tps-1891603-1-66.svc:8301\",\"code\":\"CDC:ErrChangefeedUnretryable\",\"message\":\"[CDC:ErrChangefeedUnretryable]changefeed is in unretryable state, please check the error message, and you should manually handle it: table sink stuck\"}"]

Versions of the cluster

CDC version

["Welcome to Change Data Capture (CDC)"] [release-version=v7.4.0-alpha] [git-hash=dd8da8e725052e3abb8be12513f74ebb5e479a22] [git-branch=heads/refs/tags/v7.4.0-alpha] [utc-build-time="2023-08-12 11:36:23"] [go-version="go version go1.21.0 linux/amd64"] [failpoint-build=false]
@fubinzh fubinzh added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Aug 15, 2023

fubinzh commented Aug 15, 2023

cdc log:
ticdc-1_1891603.log


hicqu commented Aug 16, 2023

[2023/08/14 08:46:53.019 +00:00] [WARN] [manager.go:984] ["Table checkpoint is stuck too long, will restart the sink backend"] [namespace=default] [changefeed=owner-switch-sync] [span={table_id:319,start_key:7480000000000001ff3f5f720000000000fa,end_key:7480000000000001ff3f5f730000000000fa}] [checkpointTs="{\"Mode\":0,\"Ts\":443548182204121088,\"BatchID\":18446744073709551615}"] [stuckCheck=150] [factoryVersion=1]

In the log, the TSO can be parsed to 2023-08-14 16:33:57.827, so I guess the root cause is that the sink is really slow. That can cause the DML sink to be destroyed and re-created, which makes the sink even slower and the situation worse.
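
For context, the stuck checkpointTs in that log line can be decoded into wall-clock time: a TiDB TSO keeps the physical timestamp (milliseconds since the Unix epoch) in the high bits and an 18-bit logical counter in the low bits. A minimal standalone sketch (not TiCDC code, just the standard TSO layout):

```go
package main

import (
	"fmt"
	"time"
)

// A TiDB TSO packs a physical Unix timestamp in milliseconds into the high bits
// and a logical counter into the low 18 bits.
const logicalBits = 18

func main() {
	// Checkpoint TSO taken from the "Table checkpoint is stuck too long" log line above.
	ts := uint64(443548182204121088)

	physicalMs := int64(ts >> logicalBits)   // milliseconds since the Unix epoch
	logical := ts & ((1 << logicalBits) - 1) // per-millisecond logical counter

	fmt.Println(time.UnixMilli(physicalMs).UTC(), "logical:", logical)
	// Output: 2023-08-14 08:33:57.827 +0000 UTC, i.e. 16:33:57.827 in UTC+8.
}
```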


fubinzh commented Aug 17, 2023

This issue was seen twice during daily CI and has not reproduced since.
Is there any workaround if it happens? @hicqu
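
For what it's worth, one possible manual mitigation until a fix lands (a hedged sketch, not an official recommendation): once the changefeed enters the failed, unretryable state and the downstream has caught up, it can be resumed with the cdc CLI. The stuckCheck=150 window in the log corresponds to the sink advance timeout, which newer TiCDC versions expose as a changefeed sink option (advance-timeout-in-sec), so raising it may reduce unnecessary sink restarts when the sink is genuinely slow.

```
# Resume the failed changefeed manually.
# The changefeed id is taken from the logs above; the PD address is a placeholder.
cdc cli changefeed resume --changefeed-id=owner-switch-sync --pd=http://<pd-host>:2379
```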


fubinzh commented Aug 17, 2023

/found automation
/severity moderate

@ti-chi-bot ti-chi-bot bot added found/automation Bugs found by automation cases severity/moderate labels Aug 17, 2023
@nongfushanquan

fixed by #9747

@nongfushanquan

/close

@ti-chi-bot ti-chi-bot bot closed this as completed Sep 21, 2023

ti-chi-bot bot commented Sep 21, 2023

@nongfushanquan: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
