To make it easier for operations personnel to determine Pika's master-slave status, PR #2656 adds a new master-slave metric, "repl_connect_status". The metric is needed because, since PR #2638 was merged, a Slave may stay in the "connecting" state for longer after a timeout.
1 Metric Acquisition Method: Send the `info` or `info replication` command to the Slave. The Master's info output does not include this metric.
| Metric Value Range | Explanation |
|---|---|
| no_connect | Master-slave disconnected/no master-slave connection |
| try_to_incr_sync | Instant state: The slave is in TryConnect state, about to send a TrySync request to establish a master-slave connection |
| try_to_full_sync | Instant state: The slave is in TryDBSync state, about to send a TryDBSync request for full synchronization |
| syncing_full | Master and slave nodes are performing full synchronization |
| connecting | Mostly an instant state, but in extreme cases, this state may last from tens of seconds to several minutes: After the slave sends a connection request (TrySync Req), before receiving and processing the master node's TrySync Resp reply, the slave remains in this state |
| connected | Master-slave connection has been established, and master-slave are performing incremental synchronization through Binlog |
| error | Slave node status error, intervention required |
4 Operational Monitoring Recommendations:
4.1 Association between `repl_connect_status` and `master_link_status`: `master_link_status` is `up` only when the `repl_connect_status` of every DB is `connected`. If any DB is not in this state, `master_link_status` is `down`.
4.2 Alarm Strategy Suggestions (when `master_link_status` is `down`, how to use `repl_connect_status` to decide whether to alarm):
- Allow `repl_connect_status` to stay in the `connecting` state for a period, but no more than 5 minutes. Since PR 2638 was merged, in some extreme cases where the slave is severely blocked, re-establishing the connection after a timeout may take some time: the slave cannot return to the `connected` state until it unblocks. Until then, even if the slave has received the connection request response, its `repl_connect_status` remains `connecting`. The duration of the `connecting` state depends on how long the slave was blocked.
- If `repl_connect_status` is `syncing_full`, the master and slave are performing full synchronization. The duration of this stage depends on various factors (DB size, network bandwidth, etc.), so the alarm threshold for it should be set case by case.
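The alarm suggestions above can be sketched as a small decision function. This is only an illustration, not code from Pika or from either PR: the function name `should_alarm`, the `seconds_in_state` argument, and the optional `full_sync_limit_s` threshold are hypothetical, and states that the suggestions do not mention are left to site policy.

```python
from typing import Optional

CONNECTING_LIMIT_S = 5 * 60  # the 5-minute ceiling suggested above


def should_alarm(status: str, seconds_in_state: float,
                 full_sync_limit_s: Optional[float] = None) -> bool:
    """Decide whether one DB's repl_connect_status warrants an alarm."""
    if status == "connected":
        return False
    if status == "error":
        # The value table marks this state as requiring intervention.
        return True
    if status == "connecting":
        # Tolerated for a while, since a blocked slave delays the handshake.
        return seconds_in_state > CONNECTING_LIMIT_S
    if status == "syncing_full":
        # No universal limit exists; alarm only if the operator chose one.
        return (full_sync_limit_s is not None
                and seconds_in_state > full_sync_limit_s)
    # Other states are not covered by the suggestions; this sketch
    # (hypothetically) does not alarm on them.
    return False
```

A monitor would call this once per DB whenever `master_link_status` is `down`, passing how long that DB has been observed in its current state.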
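Since PR #2638 was merged, a Slave that reconnects to its Master after a master-slave timeout may stay in the "connecting" state for longer. The connection is still expected to succeed, but in the meantime master_link_status reads down, and a user who wants to know the data synchronization state can only dig through the logs, which is inconvenient.
In PR #2656 we added a new master-slave data synchronization metric, "repl_connect_status", to the info command output, so that users and external monitoring systems can obtain Pika's data synchronization state in real time.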
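1 Acquisition
Send the `info` or `info replication` command to the Slave, and the Slave returns the master-slave synchronization state "repl_connect_status". Sending the info command to the Master does not return this metric.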
2 Format
Each DB has its own "repl_connect_status" value, and the per-DB values are separated by \n.
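3 repl_connect_status
The possible values of "repl_connect_status" are no_connect, try_to_incr_sync, try_to_full_sync, syncing_full, connecting, connected, and error.
Association between repl_connect_status and master_link_status: master_link_status is up only when the "repl_connect_status" of every DB is connected. If any DB is not in that state, master_link_status is down.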
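The rule relating the two metrics can be written out as a short function. This is a minimal sketch of the rule as stated, not Pika's actual implementation: the name `derive_master_link_status` and the dict-of-DB-states input are hypothetical, and treating an empty mapping as "down" is an assumption made here.

```python
from typing import Mapping


def derive_master_link_status(db_states: Mapping[str, str]) -> str:
    """Return "up" only if every DB reports repl_connect_status == connected."""
    # Assumption: with no DBs reported, the link is treated as "down".
    all_connected = bool(db_states) and all(
        state == "connected" for state in db_states.values()
    )
    return "up" if all_connected else "down"
```

For example, `derive_master_link_status({"db0": "connected", "db1": "connecting"})` yields `"down"`, because a single DB outside the connected state is enough to bring the aggregate down.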
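4 Operations
An external monitoring and alarm system can send the `info` or `info replication` command to the Slave to obtain the Slave's data synchronization state in real time. When master_link_status is down, repl_connect_status can be used to decide whether to raise an alarm. Some suggested alarm strategies follow:
Allow repl_connect_status to stay in the connecting state for a while, but no longer than 5 minutes.
Since PR 2638, in some extreme cases where the slave is severely blocked, reconnecting after a master-slave timeout may take some time. Specifically, PR 2638 changed the Slave's handling of the TrySync Resp from asynchronous to synchronous. If the Slave is blocked by a Binlog task, its handling of the TrySync Resp is delayed, and during that time the Slave's repl_connect_status stays at connecting. Only once the Slave is unblocked does it process the TrySync Resp and complete the master-slave connection (the Slave then moves from connecting to connected). How long the Slave stays in connecting therefore depends on whether it is blocked while the connection is being established. In the vast majority of cases the Slave is not blocked and connecting is an instantaneous state. If an extreme case does block the Slave, it may stay in connecting for a long time, and the duration then depends on how long the Slave was blocked.
If repl_connect_status is syncing_full, the master and slave are performing full synchronization. The duration of this stage depends on various factors (DB size, network bandwidth, etc.), so the alarm threshold for staying in this state should be chosen based on operational experience.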
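Applying time-based thresholds like these requires the monitor to know how long each DB has been in its current state, which the metric itself does not report. One way to track that on the monitoring side is sketched below. The class `StateTracker` and its `observe` method are hypothetical helpers, not part of Pika, and the duration is measured from the first poll that saw the state, so it can undercount by up to one polling interval.

```python
import time
from typing import Dict, Optional, Tuple


class StateTracker:
    """Remember when each DB was first seen in its current repl_connect_status."""

    def __init__(self) -> None:
        # db name -> (status, monotonic timestamp of first observation)
        self._seen: Dict[str, Tuple[str, float]] = {}

    def observe(self, db: str, status: str,
                now: Optional[float] = None) -> float:
        """Record one poll result; return seconds spent in `status` so far."""
        now = time.monotonic() if now is None else now
        previous = self._seen.get(db)
        if previous is None or previous[0] != status:
            # New DB or state change: restart the clock.
            self._seen[db] = (status, now)
            return 0.0
        return now - previous[1]
```

The returned duration can then be compared against the 5-minute ceiling for connecting, or against a site-specific limit for syncing_full.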