Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: maitain heartbeart between nodes in the same cluster #91

Merged
merged 8 commits into from
Nov 11, 2022

Conversation

bbdshow
Copy link
Contributor

@bbdshow bbdshow commented Nov 11, 2022

What problem does this PR solve?

1.Master与Members之间通过RPC调动建立心跳。让Master有主动踢掉异常节点的能力。
2.通过节点上报心跳信息,可以让Master重启或更新后,第一时间知道注册信息,其他节点无需通过重启再次注册。
3.最终保持 集群健康,和解耦 节点间的发布顺序与关联。方便多样式部署。
4.hook一个 OnUnregister fn,当节点异常做一些处理,比如报警。

What is changed and how it works?

1.通过在 Register proto + IsHeartbeat 字段,走不同的注册逻辑。记录心跳时间,每次 心跳注册 时检查一下所有节点的上报时间。
2.普通member定时向master调用心跳注册。

关于测试:
启动 1 master 2Gate 3 chat,
1chat可以开一个 mock exception exit 的方法。此时 master 会有主动 剔除异常 chat 的行为,具体可看日志。
2.当master重启,模拟master更新操作, 其他节点不用重启, master 的集群信息会自动恢复

bbdshow added 5 commits November 10, 2022 00:44
bugs: Register member slice maybe panic
…er, ensure cluster health.

2.master process upgrade or reload process, members through heartbeat Register node,members not reload.
@@ -167,7 +170,10 @@ func runChat(args *cli.Context) error {

// Register session closed callback
session.Lifetime.OnClosed(chat.OnSessionClosed)

//go func() {
Copy link
Owner

@lonng lonng Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to remove this code block instead of commenting them.

cluster/node.go Outdated
@@ -54,6 +54,7 @@ type Options struct {
IsWebsocket bool
TSLCertificate string
TSLKey string
OnUnregister func(string)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to use UnregisterCallback or UnregisterHook. And I think the callback parameter should be node information instead of only server address.


type Member struct {
isMaster bool
memberInfo *clusterpb.MemberInfo

heartbeatLastAt time.Time // cluster member report heartbeat time to the master
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the redundant blank line (L32) and its name should be lastHeartbeatAt

@@ -88,7 +88,8 @@ type RegisterRequest struct {
sizeCache protoimpl.SizeCache
unknownFields protoimpl.UnknownFields

MemberInfo *MemberInfo `protobuf:"bytes,1,opt,name=memberInfo,proto3" json:"memberInfo,omitempty"`
MemberInfo *MemberInfo `protobuf:"bytes,1,opt,name=memberInfo,proto3" json:"memberInfo,omitempty"`
IsHeartbeat bool `protobuf:"varint,2,opt,name=isHeartbeat,proto3" json:"isHeartbeat,omitempty"`
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not define a new HeartbeatRequest and make them decoupled?

Copy link
Contributor Author

@bbdshow bbdshow Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄I just want to do this with minimal modification, and if I don't mind adding a new RPC interface, I'll define a HeartbeatRequest RPC. At the same time, I will also modify the above questions to maintain the same style.

@lonng lonng changed the title 利用master节点维护各member之间的心跳 cluster: maitain heartbeart between nodes in the same cluster Nov 11, 2022
@lonng
Copy link
Owner

lonng commented Nov 11, 2022

@bbdshow Thanks for your awesome contribution.

There are some suggestions to make this PR better.

  1. It's better to define a Heartbeat interface in MasterService instead of reusing Register interface to make the semantic clear.
  2. The OnUnregister callback name should rename to UnregisterCallback to make its name clear.
  3. I think it's better to spawn a dedicated goroutine to check member heartbeat expiration instead of in every heartbeat request handler.

Again, thanks very much for your pull request to make nano better.

@bbdshow
Copy link
Contributor Author

bbdshow commented Nov 11, 2022

@lonng Thank you very much for your suggestion. I have modified it and submitted it

}

// as UnregisterCallback arg
unregisterMember := *c.members[index]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we don't need this variable.

	if c.currentNode.UnregisterCallback != nil {
		c.currentNode.UnregisterCallback(*c.members[index])
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since there is no possibility of exception, I can advance the method to delMeber

cluster/node.go Outdated

// cluster mode
if !n.IsMaster {
n.once.Do(n.sendHeartbeat)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not move this line into L187?

cluster/node.go Outdated
@@ -412,3 +423,43 @@ func (n *Node) CloseSession(_ context.Context, req *clusterpb.CloseSessionReques
}
return &clusterpb.CloseSessionResponse{}, nil
}

// ticker send heartbeat register info to master
func (n *Node) sendHeartbeat() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to rename this method into keepalive

@lonng
Copy link
Owner

lonng commented Nov 11, 2022

@bbdshow Another three commented left, please address them and the rest looks good to me.

Copy link
Owner

@lonng lonng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lonng lonng merged commit abcf7c9 into lonng:master Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants