How to define stability of TiCDC?

As a distributed system, TiCDC should continuously provide service with predictable replication lag under any reasonable situation, such as a single TiCDC node failure, a single upstream TiKV node failure, a single upstream PD node failure, a planned rolling upgrade/restart of TiCDC or of the upstream TiDB cluster, a temporary network partition between one TiCDC node and the other TiCDC nodes, and so on. TiCDC should recover the replication lag by itself quickly and tolerate these different resilience cases.
Expected replication lag SLO under different cases
| Category | Case Description | Expected Behavior |
| --- | --- | --- |
| Planned Operations | Rolling upgrade/restart of TiCDC | replication lag < 5s |
|  | Scale-in/scale-out of TiCDC | replication lag < 5s |
|  | Rolling upgrade/restart of upstream PD | replication lag < 5s |
|  | Rolling upgrade/restart of upstream TiKV | replication lag < 10s |
|  | Scale-in/scale-out of upstream TiDB | replication lag < 5s |
|  | Rolling upgrade/restart of downstream Kafka brokers | begin to sink as soon as Kafka resumes |
|  | Rolling upgrade/restart of downstream MySQL/TiDB | begin to sink as soon as MySQL/TiDB resumes |
| Unplanned Failures | Single TiCDC node (random one) permanent failure | replication lag < 1 min |
|  | Single TiCDC node temporary failure for 5 minutes | replication lag < 1 min |
|  | PD leader permanent failure, or temporary failure for 5 minutes | replication lag < 5s |
|  | Network partition between one TiCDC node and the PD leader for 5 minutes | replication lag < 5s |
|  | Network partition between one TiCDC node and other TiCDC nodes | replication lag < 5s |
Principle of prioritizing TiCDC stability issues
We deal with TiCDC stability issues according to the following priorities:
- If an issue is related to data correctness or data completeness, it is top priority (P0): we must fix it ASAP and cherry-pick the fix to other LTS versions.
- If an issue causes the replication/changefeed to get stuck or fail in a way that TiCDC cannot recover from by itself, we treat it as P1, because such an issue has to be handled manually by us or by users.
- If an issue causes the replication lag to increase unexpectedly by a large amount, it is not as critical as a P1 issue, but it still breaches the SLO of TiCDC; we treat it as P2.
- Enhancements, such as reducing resource usage, ..., are treated as P3.
Tasks Tracking
- TIMESTAMP value #10393 @zhangjinpeng87
- RowChangedEvent #10386 (potential OOM issue) @lidezhu