Bugfix: Resolve a deadlock in cluster memberlist maintenance #4469
Conversation
The issue is that several Antrea Agents run out of memory in a large-scale cluster, and we observe that the memory of the failed Antrea Agents increases continuously, from 400MB to 1.8G in less than 24 hours. After profiling the Agent memory and call stacks, we find that most of the memory is taken by Node resources received by the Node informer's watch function. From the goroutines, we find a deadlock:
1. Function "Cluster.Run()" is stuck calling "Memberlist.Join()", which blocks while trying to acquire "Memberlist.nodeLock".
2. Memberlist has received a Node Join/Leave message sent by another Agent and holds "Memberlist.nodeLock". It blocks while sending a message to "Cluster.nodeEventsCh", whose consumer is also blocked.
The issue may happen in a large-scale setup. Although Antrea buffers 1024 messages in "Cluster.nodeEventsCh", a large number of Nodes in the cluster can fill the channel before the Agent finishes sending out the Member join messages for the existing Nodes.
To resolve the issue, this patch removes the unnecessary call to Memberlist.Join() in Cluster.Run(), since it is also called by the "NodeAdd" event triggered by the NodeInformer. Signed-off-by: wenyingd <[email protected]>
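To illustrate the failure mode, here is a minimal, self-contained sketch of the deadlock pattern described above. It is not Antrea's actual code; the names (nodeLock, nodeEventsCh) only mirror the ones in the description, and the buffer size is shrunk to 1 so the hang reproduces immediately.

```go
// Minimal sketch of the described deadlock: a producer holds a lock while
// sending to a full channel, and the only consumer is blocked waiting for
// that same lock. Running this prints Go's "all goroutines are asleep" error.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var nodeLock sync.Mutex
	nodeEventsCh := make(chan string, 1) // tiny buffer so the channel fills immediately
	lockHeld := make(chan struct{})

	// Stand-in for memberlist's gossip handler: it takes nodeLock and then
	// pushes Node join/leave events into nodeEventsCh.
	go func() {
		nodeLock.Lock()
		defer nodeLock.Unlock()
		close(lockHeld)
		for i := 0; i < 3; i++ {
			// Blocks once the buffer is full, while still holding nodeLock.
			nodeEventsCh <- fmt.Sprintf("node-event-%d", i)
		}
	}()

	<-lockHeld // ensure the other goroutine already holds the lock

	// Stand-in for Cluster.Run(): it needs nodeLock (as Memberlist.Join()
	// does) before it ever starts draining nodeEventsCh.
	nodeLock.Lock() // deadlock: the holder is stuck sending to the full channel
	nodeLock.Unlock()

	for ev := range nodeEventsCh { // never reached
		fmt.Println("handled", ev)
	}
}
```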
Codecov Report
@@ Coverage Diff @@
## main #4469 +/- ##
==========================================
- Coverage 65.89% 65.79% -0.11%
==========================================
Files 402 402
Lines 57247 57227 -20
==========================================
- Hits 37723 37650 -73
- Misses 16822 16881 +59
+ Partials 2702 2696 -6
LGTM
/test-all
LGTM. If I understand correctly, this issue can also be addressed by starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) in the Cluster.Run() function.
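For comparison, here is a sketch of that alternative ordering, reusing the same hypothetical setup as the deadlock example above (again, not the actual Antrea code): the consumer of nodeEventsCh is started before anything that needs nodeLock, so the producer can always drain the channel and eventually release the lock.

```go
// Same producer as the deadlock sketch, but the channel consumer is started
// before we try to take nodeLock, so this program runs to completion.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var nodeLock sync.Mutex
	nodeEventsCh := make(chan string, 1)
	lockHeld := make(chan struct{})
	done := make(chan struct{})

	// Producer: holds nodeLock while sending events, then closes the channel.
	go func() {
		nodeLock.Lock()
		defer nodeLock.Unlock()
		close(lockHeld)
		for i := 0; i < 3; i++ {
			nodeEventsCh <- fmt.Sprintf("node-event-%d", i)
		}
		close(nodeEventsCh)
	}()

	// Key difference: the consumer goroutine runs before we need nodeLock,
	// so the channel keeps draining and the producer can release the lock.
	go func() {
		for ev := range nodeEventsCh {
			fmt.Println("handled", ev)
		}
		close(done)
	}()

	<-lockHeld
	nodeLock.Lock() // succeeds once the producer finishes sending and unlocks
	nodeLock.Unlock()
	<-done
}
```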
Good job and excellent debugging. The "Memberlist.Join()" in "Cluster.Run()" ensures that all Nodes join the memberlist. I have a question here: when you deploy Antrea in an existing cluster, will Memberlist.Join() still be called for the existing Nodes?
Yes, it is still called by the NodeInformer when a "NodeAdd" event is watched.
LGTM
Although starting a goroutine to handle the events from c.nodeEventsCh before calling c.mList.Join(members) could also avoid the deadlock, removing the unnecessary call is the simpler fix.
Agreed. Removing unnecessary calls should be a better approach.
@wenyingd please backport the fix.