Bugfix: Resolve a deadlock in cluster memberlist maintenance #4469

Merged
merged 1 commit
Dec 13, 2022

Commits on Dec 13, 2022

  1. Bugfix: Resolve a deadlock in cluster memberlist maintenance

    The issue is that several Antrea Agents run out of memory in a large-scale cluster, and
    we observe that the memory usage of a failing Antrea Agent increases continuously,
    from 400 MB to 1.8 GB in less than 24 hours.
    
    After profiling the Agent's memory and call stacks, we find that most of the memory is
    taken by Node resources received by the Node informer's watch function. From the
    goroutines, we find a deadlock:
    1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which is blocked
       acquiring "Memberlist.nodeLock".
    2. Memberlist has received a Node Join/Leave message sent by another Agent and holds
       "Memberlist.nodeLock". It is blocked sending the message to "Cluster.nodeEventsCh",
       whose consumer is also blocked.
    
    The issue is more likely to happen in a large-scale setup. Although Antrea gives
    "Cluster.nodeEventsCh" a buffer of 1024 messages, a large number of Nodes in the
    cluster can fill the channel before the Agent finishes sending out the Member join
    messages for the existing Nodes; a minimal sketch of the resulting cycle is shown below.
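    
    Below is a minimal, hypothetical Go sketch of the lock-vs-channel cycle described
    above. The names (cluster, nodeLock, nodeEventsCh, notify, run) only mirror the
    entities mentioned in this commit message and are not Antrea's actual code; the
    channel buffer is shrunk to one slot so the deadlock is immediate.
    
    package main
    
    import (
        "fmt"
        "sync"
        "time"
    )
    
    // Hypothetical stand-ins for the entities named above, not Antrea's real types.
    type cluster struct {
        nodeLock     sync.Mutex
        nodeEventsCh chan string // one-slot buffer so the deadlock is immediate
    }
    
    // notify models memberlist delivering a Node Join/Leave message: it holds
    // nodeLock while pushing the event into nodeEventsCh.
    func (c *cluster) notify(event string) {
        c.nodeLock.Lock()
        defer c.nodeLock.Unlock()
        c.nodeEventsCh <- event // blocks when the buffer is full, still holding nodeLock
    }
    
    // run models Cluster.Run: the join step needs nodeLock before the loop that
    // drains nodeEventsCh ever starts.
    func (c *cluster) run() {
        c.nodeLock.Lock() // never returns: notify() holds the lock and cannot release it
        c.nodeLock.Unlock()
        for e := range c.nodeEventsCh {
            fmt.Println("handled", e)
        }
    }
    
    func main() {
        c := &cluster{nodeEventsCh: make(chan string, 1)}
        go func() {
            c.notify("node-1") // fills the one-slot buffer
            c.notify("node-2") // blocks on the send while holding nodeLock
        }()
        time.Sleep(100 * time.Millisecond) // let the sender fill the buffer first
        c.run()                            // Go runtime reports: all goroutines are asleep - deadlock!
    }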
    
    To resolve the issue, this patch removes the unnecessary call to Memberlist.Join()
    in Cluster.Run(), since Join is also invoked by the "NodeAdd" event triggered by the
    NodeInformer.
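    
    Continuing the hypothetical sketch above, this is the shape of the fix: run() no
    longer takes nodeLock up front, so the consumer starts draining nodeEventsCh
    immediately and notify() can always complete its send and release the lock; the
    join itself is left to the NodeAdd event path, as described above.
    
    // Replaces run() from the sketch above; same hypothetical cluster type.
    func (c *cluster) run() {
        for e := range c.nodeEventsCh { // consumer is live from the start
            fmt.Println("handled", e)
        }
    }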
    
    Signed-off-by: wenyingd <[email protected]>
    wenyingd committed Dec 13, 2022
    Commit a1690de