Explanation of DevOps metric "repl_connect_status" #2689

cheniujh · 2024-05-31T09:27:11Z

cheniujh
May 31, 2024
Collaborator

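Since PR #2638 was merged, a Slave that reconnects to the Master after a master-slave timeout may stay in the "connecting" state for longer than before. The connection is still expected to succeed, but master_link_status reads down in the meantime, and a user who wants the actual synchronization state can only find it in the logs, which is inconvenient.

PR #2656 adds a new master-slave synchronization metric, `repl_connect_status`, to the info command so that users and external monitoring systems can read Pika's synchronization state in real time.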

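1 Retrieval

Send the `info` or `info replication` command to a Slave, and the Slave returns the `repl_connect_status` state of its master-slave synchronization.

Sending `info` to the Master does not return this metric.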

2 Format

Each DB has its own `repl_connect_status` value, separated by `\n`.

[Example output: screenshot not preserved]

3 repl_connect_status

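`repl_connect_status` takes the following values:

| Value | Explanation |
| --- | --- |
| `no_connect` | Master and slave are disconnected, or there is no master-slave connection |
| `try_to_incr_sync` | Transient state: the slave is in the TryConnect state and is about to send a TrySync request to establish the master-slave connection |
| `try_to_full_sync` | Transient state: the slave is in the TryDBSync state and is about to send a TryDBSync request to start full synchronization |
| `syncing_full` | Master and slave are performing full synchronization |
| `connecting` | Usually a transient state, but in extreme cases it can last from tens of seconds to several minutes: the slave stays in this state from the moment it sends the connection request (TrySync Req) until it has received and processed the master's TrySync Resp |
| `connected` | The master-slave connection is established and incremental synchronization through Binlog is in progress |
| `error` | The slave is in an error state and needs manual investigation |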

Relationship between repl_connect_status and master_link_status: master_link_status is up only when the `repl_connect_status` of every DB is connected. If any DB is in another state, master_link_status is down.

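4 Operations

An external monitoring and alerting system can send the `info` or `info replication` command to a Slave to read its synchronization state in real time.

When master_link_status is down, repl_connect_status can be used to decide whether to raise an alarm. Some suggested alarm strategies:

- Allow repl_connect_status to stay in connecting for a while, but no longer than 5 minutes. Since PR 2638, reconnecting after a master-slave timeout can take some time in extreme cases where the slave is severely blocked. Specifically, PR 2638 changed the Slave's handling of TrySync Resp from asynchronous to synchronous. If the Slave is blocked by a Binlog task, its handling of TrySync Resp is delayed, and repl_connect_status stays at connecting during that time. Only after the block clears does the Slave process the TrySync Resp and finish establishing the connection, moving from connecting to connected. How long the Slave stays in connecting therefore depends on whether it was blocked while connecting. In the vast majority of cases the Slave is not blocked and connecting is a transient state. In the extreme case, connecting lasts as long as the slave stays blocked.
- If repl_connect_status is syncing_full, master and slave are performing full synchronization. How long this phase takes depends on several factors (DB size, network bandwidth, and so on), so the alarm threshold for it should be set from operational experience.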

cheniujh · 2024-05-31T10:47:39Z

cheniujh
May 31, 2024
Collaborator Author

To make it easier for operations personnel to determine the Pika master-slave status, and because after the merge of PR #2638, the Slave may remain in the "connecting" state for a longer time after a timeout, PR #2656 adds a new master-slave metric "repl_connect_status".

1 Metric Acquisition Method: Send the `info` or `info replication` command to a Slave to obtain this value. The metric appears in the Slave's output. Sending the `info` command to the Master will not retrieve it.
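For a monitoring script, the retrieval step above could be sketched as follows. This is a minimal sketch, not Pika's own tooling: it assumes the Slave accepts the Redis wire protocol (RESP) on a plain TCP port without authentication, and the helper names `encode_command` and `fetch_info_replication` are hypothetical. It returns the raw reply text and does not assume anything about how `repl_connect_status` is laid out inside it.

```python
import socket


def encode_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    chunks = [b"*" + str(len(parts)).encode() + b"\r\n"]
    for part in parts:
        data = part.encode("utf-8")
        chunks.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(chunks)


def fetch_info_replication(host: str, port: int, timeout: float = 3.0) -> str:
    """Send `info replication` and return the raw reply text.

    Assumption: no password is required. An error reply (for example
    from a protected instance) raises RuntimeError.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_command("info", "replication"))
        reader = sock.makefile("rb")
        header = reader.readline()  # expected form: b"$<length>\r\n"
        if not header.startswith(b"$"):
            raise RuntimeError("unexpected reply: %r" % header)
        length = int(header[1:].strip())
        payload = reader.read(length)  # blocks until the full bulk string arrives
        return payload.decode("utf-8", errors="replace")
```

A call would look like `fetch_info_replication("127.0.0.1", 9221)`, where the address and port are placeholders for the Slave being monitored.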

2 Metric Format: Each DB has its own `repl_connect_status` value, separated by `\n`.

[Example output: screenshot not preserved]

3 Value Range of `repl_connect_status`

| Value | Explanation |
| --- | --- |
| `no_connect` | Master and slave are disconnected, or there is no master-slave connection |
| `try_to_incr_sync` | Transient state: the slave is in the TryConnect state and is about to send a TrySync request to establish the master-slave connection |
| `try_to_full_sync` | Transient state: the slave is in the TryDBSync state and is about to send a TryDBSync request for full synchronization |
| `syncing_full` | Master and slave are performing full synchronization |
| `connecting` | Usually a transient state, but in extreme cases it can last from tens of seconds to several minutes: the slave stays in this state from the moment it sends the connection request (TrySync Req) until it has received and processed the master's TrySync Resp |
| `connected` | The master-slave connection is established and incremental synchronization through Binlog is in progress |
| `error` | The slave is in an error state and intervention is required |

4 Operational Monitoring Recommendations:

4.1 Association between repl_connect_status and master_link_status: master_link_status is up only when the repl_connect_status of every DB is connected. If any DB is not in this state, master_link_status is down.
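The rule in 4.1 can be written as a small function, which may help when a monitoring system wants to see which DBs are holding the link down. This is only an illustration of the stated rule, not Pika's implementation: the function names and the `db_states` mapping (DB name to `repl_connect_status` value) are hypothetical, and treating an empty mapping as down is an assumption.

```python
from typing import List, Mapping


def derive_master_link_status(db_states: Mapping[str, str]) -> str:
    """Return "up" only when every DB reports "connected"."""
    # Assumption: with no DB reported there is nothing to call "up".
    if not db_states:
        return "down"
    if all(state == "connected" for state in db_states.values()):
        return "up"
    return "down"


def dbs_not_connected(db_states: Mapping[str, str]) -> List[str]:
    """List the DBs whose state keeps master_link_status down."""
    return sorted(db for db, state in db_states.items() if state != "connected")
```

For example, with `{"db0": "connected", "db1": "connecting"}` the derived status is down, and `dbs_not_connected` points at `db1`.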

4.2 Alarm Strategy Suggestions (how to use repl_connect_status to decide whether to alarm when master_link_status is down):

- Allow repl_connect_status to stay in the connecting state for a while, but no longer than 5 minutes. Since the merge of PR 2638, re-establishing the connection after a timeout may take some time in extreme cases where the slave is severely blocked: the slave cannot return to the connected state until it unblocks. Until then, even if the slave has already received the response to its connection request, its repl_connect_status remains connecting. The connecting state lasts as long as the slave stays blocked.
- If repl_connect_status is syncing_full, master and slave are performing full synchronization. The duration of this stage depends on various factors (DB size, network bandwidth, etc.), so the alarm threshold for it should be chosen case by case.
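The suggestions in 4.2 could be encoded roughly like this. It is a sketch under stated assumptions, not a recommended policy: only the 5-minute limit for connecting, the need for intervention on error, and the operator-chosen limit for syncing_full come from this post. The function name, the `seconds_in_state` input (how long the monitoring system has observed the current value), and the `other_limit_s` grace period for the remaining states are hypothetical.

```python
CONNECTING_LIMIT_S = 5 * 60  # from the post: connecting should not exceed 5 minutes


def should_alert(
    status: str,
    seconds_in_state: float,
    full_sync_limit_s: float,
    other_limit_s: float = 60.0,
) -> bool:
    """Decide whether one DB's repl_connect_status warrants an alarm."""
    if status == "connected":
        return False
    if status == "error":
        # The value table says this state needs intervention.
        return True
    if status == "connecting":
        return seconds_in_state > CONNECTING_LIMIT_S
    if status == "syncing_full":
        # Limit chosen by the operator from DB size, bandwidth, and experience.
        return seconds_in_state > full_sync_limit_s
    # Assumption: no_connect, try_to_incr_sync, try_to_full_sync, and any
    # unknown value get a short grace period that is not taken from the post.
    return seconds_in_state > other_limit_s
```

A monitoring loop would call this once per DB, for instance `should_alert("connecting", 120, full_sync_limit_s=3600)`, and alarm if any DB returns True.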
