fix: add a user-friendly repl metric "repl_connect_status" in the resp of info command #2656
Since #2638 was merged (it changed the processing of the TrySync response from asynchronous to synchronous), a slave may stay in the WaitReply state for a while (in some extreme scenarios, as long as 1-2 minutes). While the slave is in the WaitReply state, the "master_link_status" metric reported by the info command is down, which might trigger monitoring alerts set by operations personnel.
A more granular metric is therefore needed so that users and operations personnel can tell what is going on when master_link_status is down. This PR adds the following monitoring metric:
metric name: repl_connect_status
value range: {no_connect, try_to_incr_sync, try_to_full_sync, syncing_full, connecting, connected, error}
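Because of the merge of #2638 (which changed the handling of the TrySync response from asynchronous to synchronous), a slave may remain in the WaitReply state for some time (in some extreme cases, 1-2 minutes). While the slave is in the "WaitReply" state, the "master_link_status" metric obtained through the info command shows down, which may trigger monitoring alerts set by operations personnel.
A more fine-grained metric is therefore needed so that users and operations personnel can understand the actual situation when master_link_status is down. That is why this PR adds the following monitoring metric:
Metric name: repl_connect_status
Value range: {no_connect, try_to_incr_sync, try_to_full_sync, syncing_full, connecting, connected, error}
For how to use this value, see Discussion #2689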
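
As a rough illustration of how a monitoring script might combine the two metrics, here is a minimal Python sketch. It assumes the info command returns Redis-style `key:value` lines. The alerting policy, the `TRANSIENT_STATES` grouping, and the function names `parse_info` and `should_alert` are hypothetical and are not taken from this PR or from the linked discussion, which remains the authoritative guidance on how to interpret each value.

```python
# Values listed in the PR description for repl_connect_status.
KNOWN_STATES = {
    "no_connect", "try_to_incr_sync", "try_to_full_sync",
    "syncing_full", "connecting", "connected", "error",
}

# Assumption (not from the PR): these states mean a sync attempt is still
# in progress, so a "down" link is treated as temporary and not alerted on.
TRANSIENT_STATES = {
    "try_to_incr_sync", "try_to_full_sync", "syncing_full", "connecting",
}


def parse_info(text):
    """Parse 'key:value' lines, skipping blanks and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def should_alert(info_text):
    """Hypothetical policy: alert only when the link is down and the
    finer-grained status does not indicate a sync in progress."""
    fields = parse_info(info_text)
    if fields.get("master_link_status") != "down":
        return False
    state = fields.get("repl_connect_status")
    if state not in KNOWN_STATES:
        # Metric missing or unrecognized: fall back to alerting on "down".
        return True
    return state not in TRANSIENT_STATES
```

With this policy, a slave reporting `master_link_status:down` together with `repl_connect_status:connecting` would not raise an alert, while the same link state with `repl_connect_status:error` would.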