Bugfix: Resolve a deadlock in cluster memberlist maintenance #4469
Conversation
The issue is that several Antrea Agents run out of memory in a large-scale cluster, and we observe that the memory of the failed Antrea Agents increases continuously, from 400MB to 1.8G in less than 24 hours. After profiling the Agent memory and call stacks, we find that most of the memory is taken by Node resources received by the Node informer's watch function. From the goroutines, we find a deadlock:
1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which blocks while trying to acquire "Memberlist.nodeLock".
2. Memberlist has received a Node Join/Leave message sent by another Agent and holds "Memberlist.nodeLock". It blocks while sending a message to "Cluster.nodeEventsCh", whose consumer is also blocked.
The issue may happen in a large-scale setup. Although Antrea buffers 1024 messages in "Cluster.nodeEventsCh", a large number of Nodes in the cluster can fill the channel before the Agent finishes sending out the Member join messages for the existing Nodes.
To resolve the issue, this patch removes the unnecessary call to Memberlist.Join() in Cluster.Run(), since it is also called by the "NodeAdd" event triggered by the NodeInformer. Signed-off-by: wenyingd <[email protected]>
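To illustrate the failure mode, here is a minimal, self-contained sketch of the deadlock pattern described above. It is not Antrea's actual code; the names (nodeLock, nodeEventsCh) only mirror the ones in the description, and the buffer size is shrunk to 1 so the hang reproduces immediately.

```go
// Minimal sketch of the described deadlock: a producer holds a lock while
// sending to a full channel, and the only consumer is blocked waiting for
// that same lock. Running this prints Go's "all goroutines are asleep" error.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var nodeLock sync.Mutex
	nodeEventsCh := make(chan string, 1) // tiny buffer so the channel fills immediately
	lockHeld := make(chan struct{})

	// Stand-in for memberlist's gossip handler: it takes nodeLock and then
	// pushes Node join/leave events into nodeEventsCh.
	go func() {
		nodeLock.Lock()
		defer nodeLock.Unlock()
		close(lockHeld)
		for i := 0; i < 3; i++ {
			// Blocks once the buffer is full, while still holding nodeLock.
			nodeEventsCh <- fmt.Sprintf("node-event-%d", i)
		}
	}()

	<-lockHeld // ensure the other goroutine already holds the lock

	// Stand-in for Cluster.Run(): it needs nodeLock (as Memberlist.Join()
	// does) before it ever starts draining nodeEventsCh.
	nodeLock.Lock() // deadlock: the holder is stuck sending to the full channel
	nodeLock.Unlock()

	for ev := range nodeEventsCh { // never reached
		fmt.Println("handled", ev)
	}
}
```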
Codecov Report
@@ Coverage Diff @@
## main #4469 +/- ##
==========================================
- Coverage 65.89% 65.79% -0.11%
==========================================
Files 402 402
Lines 57247 57227 -20
==========================================
- Hits 37723 37650 -73
- Misses 16822 16881 +59
+ Partials 2702 2696 -6
LGTM
/test-all
LGTM. If I understand correctly, this issue can also be addressed by starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) in the Cluster.Run() function.
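For comparison, here is a sketch of that alternative ordering, reusing the same hypothetical setup as the deadlock example above (again, not the actual Antrea code): the consumer of nodeEventsCh is started before anything that needs nodeLock, so the producer can always drain the channel and eventually release the lock.

```go
// Same producer as the deadlock sketch, but the channel consumer is started
// before we try to take nodeLock, so this program runs to completion.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var nodeLock sync.Mutex
	nodeEventsCh := make(chan string, 1)
	lockHeld := make(chan struct{})
	done := make(chan struct{})

	// Producer: holds nodeLock while sending events, then closes the channel.
	go func() {
		nodeLock.Lock()
		defer nodeLock.Unlock()
		close(lockHeld)
		for i := 0; i < 3; i++ {
			nodeEventsCh <- fmt.Sprintf("node-event-%d", i)
		}
		close(nodeEventsCh)
	}()

	// Key difference: the consumer goroutine runs before we need nodeLock,
	// so the channel keeps draining and the producer can release the lock.
	go func() {
		for ev := range nodeEventsCh {
			fmt.Println("handled", ev)
		}
		close(done)
	}()

	<-lockHeld
	nodeLock.Lock() // succeeds once the producer finishes sending and unlocks
	nodeLock.Unlock()
	<-done
}
```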
Good job and excellent debugging. The "Memberlist.Join()" in "Cluster.Run()" ensures that all Nodes join the memberlist. I have a question here: when you deploy Antrea in an existing cluster, will Memberlist.Join() still be called for the existing Nodes?
Yes, it is still called by the NodeInformer when a "NodeAdd" event is watched.
LGTM
Although starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) could also avoid the deadlock, removing the unnecessary call is the simpler fix.
Agreed. Removing unnecessary calls should be a better approach.
@wenyingd please backport the fix.