-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cluster: maitain heartbeart between nodes in the same cluster #91
Conversation
bugs: Register member slice maybe panic
…er, ensure cluster health. 2.master process upgrade or reload process, members through heartbeat Register node,members not reload.
examples/cluster/main.go
Outdated
@@ -167,7 +170,10 @@ func runChat(args *cli.Context) error { | |||
|
|||
// Register session closed callback | |||
session.Lifetime.OnClosed(chat.OnSessionClosed) | |||
|
|||
//go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to remove this code block instead of commenting them.
cluster/node.go
Outdated
@@ -54,6 +54,7 @@ type Options struct { | |||
IsWebsocket bool | |||
TSLCertificate string | |||
TSLKey string | |||
OnUnregister func(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer to use UnregisterCallback
or UnregisterHook
. And I think the callback parameter should be node information instead of only server address.
cluster/member.go
Outdated
|
||
type Member struct { | ||
isMaster bool | ||
memberInfo *clusterpb.MemberInfo | ||
|
||
heartbeatLastAt time.Time // cluster member report heartbeat time to the master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the redundant blank line (L32) and its name should be lastHeartbeatAt
cluster/clusterpb/cluster.pb.go
Outdated
@@ -88,7 +88,8 @@ type RegisterRequest struct { | |||
sizeCache protoimpl.SizeCache | |||
unknownFields protoimpl.UnknownFields | |||
|
|||
MemberInfo *MemberInfo `protobuf:"bytes,1,opt,name=memberInfo,proto3" json:"memberInfo,omitempty"` | |||
MemberInfo *MemberInfo `protobuf:"bytes,1,opt,name=memberInfo,proto3" json:"memberInfo,omitempty"` | |||
IsHeartbeat bool `protobuf:"varint,2,opt,name=isHeartbeat,proto3" json:"isHeartbeat,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not define a new HeartbeatRequest
and make them decoupled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😄I just want to do this with minimal modification, and if I don't mind adding a new RPC interface, I'll define a HeartbeatRequest RPC. At the same time, I will also modify the above questions to maintain the same style.
@bbdshow Thanks for your awesome contribution. There are some suggestions to make this PR better.
Again, thanks very much for your pull request to make nano better. |
@lonng Thank you very much for your suggestion. I have modified it and submitted it |
cluster/cluster.go
Outdated
} | ||
|
||
// as UnregisterCallback arg | ||
unregisterMember := *c.members[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we don't need this variable.
if c.currentNode.UnregisterCallback != nil {
c.currentNode.UnregisterCallback(*c.members[index])
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, since there is no possibility of exception, I can advance the method to delMeber
cluster/node.go
Outdated
|
||
// cluster mode | ||
if !n.IsMaster { | ||
n.once.Do(n.sendHeartbeat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not move this line into L187?
cluster/node.go
Outdated
@@ -412,3 +423,43 @@ func (n *Node) CloseSession(_ context.Context, req *clusterpb.CloseSessionReques | |||
} | |||
return &clusterpb.CloseSessionResponse{}, nil | |||
} | |||
|
|||
// ticker send heartbeat register info to master | |||
func (n *Node) sendHeartbeat() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to rename this method into keepalive
@bbdshow Another three commented left, please address them and the rest looks good to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
What problem does this PR solve?
1.Master与Members之间通过RPC调动建立心跳。让Master有主动踢掉异常节点的能力。
2.通过节点上报心跳信息,可以让Master重启或更新后,第一时间知道注册信息,其他节点无需通过重启再次注册。
3.最终保持 集群健康,和解耦 节点间的发布顺序与关联。方便多样式部署。
4.hook一个 OnUnregister fn,当节点异常做一些处理,比如报警。
What is changed and how it works?
1.通过在 Register proto + IsHeartbeat 字段,走不同的注册逻辑。记录心跳时间,每次 心跳注册 时检查一下所有节点的上报时间。
2.普通member定时向master调用心跳注册。
关于测试:
启动 1 master 2Gate 3 chat,
1chat可以开一个 mock exception exit 的方法。此时 master 会有主动 剔除异常 chat 的行为,具体可看日志。
2.当master重启,模拟master更新操作, 其他节点不用重启, master 的集群信息会自动恢复