How to define stability of TiCDC?

As a distributed system, TiCDC should continuously provide service with predictable replication lag under any reasonable situation, such as a single TiCDC node failure, a single upstream TiKV node failure, a single upstream PD node failure, a planned rolling upgrade/restart of TiCDC or of the upstream TiDB cluster, a temporary network partition between one TiCDC node and the other TiCDC nodes, and so on. TiCDC should recover the replication lag by itself quickly and tolerate these different resilience cases.
Expected replication lag SLO under different cases
| Category | Case Description | Expected Behavior |
| --- | --- | --- |
| Planned Operations | Rolling upgrade/restart of TiCDC | replication lag < 5s |
|  | Scale-in/scale-out of TiCDC | replication lag < 5s |
|  | Rolling upgrade/restart of upstream PD | replication lag < 5s |
|  | Rolling upgrade/restart of upstream TiKV | replication lag < 10s |
|  | Scale-in/scale-out of upstream TiDB | replication lag < 5s |
|  | Rolling upgrade/restart of downstream Kafka brokers | begin to sink as soon as Kafka resumes |
|  | Rolling upgrade/restart of downstream MySQL/TiDB | begin to sink as soon as MySQL/TiDB resumes |
| Unplanned Failures | Single TiCDC node (random one) permanent failure | replication lag < 1 min |
|  | Single TiCDC node temporary failure for 5 minutes | replication lag < 1 min |
|  | PD leader permanent failure, or temporary failure for 5 minutes | replication lag < 5s |
|  | Network partition between one TiCDC node and the PD leader for 5 minutes | replication lag < 5s |
|  | Network partition between one TiCDC node and other TiCDC nodes | replication lag < 5s |
Principle of prioritizing TiCDC stability issues
We deal with TiCDC stability issues according to the following priorities:
- If an issue is related to data correctness or data completeness, it is top priority (P0): we must fix it ASAP and cherry-pick the fix to other LTS versions.
- If an issue causes the replication/changefeed to get stuck or fail in a way that TiCDC cannot recover from by itself, we treat it as P1, because such an issue has to be handled manually by us or by users.
- If an issue causes the replication lag to increase unexpectedly by a large amount, it is not as critical as a P1 issue, but it still breaches the SLO of TiCDC; we treat it as P2.
- Enhancements, such as reducing resource usage, ..., are treated as P3.
Tasks Tracking
- TIMESTAMP value #10393 @zhangjinpeng87
- RowChangedEvent #10386 (potential OOM issue) @lidezhu