Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing NPC heartbeat because of interfering Pod updates #725

Closed
Cytrian opened this issue Apr 29, 2019 · 2 comments
Closed

Missing NPC heartbeat because of interfering Pod updates #725

Cytrian opened this issue Apr 29, 2019 · 2 comments

Comments

@Cytrian
Copy link
Contributor

Cytrian commented Apr 29, 2019

We are experiencing regular kube-router failures. The health check sometimes fails.
I could nail it down to this situation:

Normally network_policy_controller Run() executes Sync() every 5 minutes, regulated by a Ticker.
After executing Sync() it sends a NPC heartbeat.

On pod updates, OnPodUpdate() also executes Sync().

If OnPodUpdates() executes Sync() just before the heartbeat is expected by healthcheck and Sync() takes too long, the health_controller misses a heartbeat.

TLDR: Run() cannot execute Sync() when Sync() is already started by OnPodUpdate().

Here's a shortened/commented log of this situation.

# Sync() started by Run() at 13:11:37
I0429 13:11:37.919713       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543497919691345
I0429 13:11:38.696038       1 network_services_controller.go:1499] Successfully synced iptables masquerad rule
I0429 13:11:38.915911       1 network_services_controller.go:1163] IPVS servers and services are synced to desired state
I0429 13:11:38.915972       1 network_services_controller.go:799] sync ipvs services took 217.201384ms
I0429 13:11:38.918723       1 health_controller.go:74] Received heartbeat from NSC
I0429 13:11:39.826326       1 network_routes_controller.go:259] Syncing ipsets
I0429 13:11:40.950845       1 network_routes_controller.go:278] Performing periodic sync of service VIP routes
I0429 13:11:40.951166       1 network_routes_controller.go:282] Performing periodic sync of pod CIDR routes
I0429 13:11:40.960030       1 health_controller.go:74] Received heartbeat from NRC
I0429 13:11:48.838850       1 network_policy_controller.go:340] Iptables chains in the filter table are synchronized with the network policies.
I0429 13:11:48.838873       1 network_policy_controller.go:276] Syncing network policy chains took 10.916588158s
I0429 13:11:55.630715       1 network_policy_controller.go:231] sync iptables took 17.711018071s

# heartbeat sent in Run() at 13:11:55. Next heartbeat is expected at 14:16:55
I0429 13:11:55.630735       1 health_controller.go:74] Received heartbeat from NPC
I0429 13:15:05.057334       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-controller-cron-1556543700-7pwf9
I0429 13:15:05.057356       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543705057352328
I0429 13:15:16.061213       1 network_policy_controller.go:340] Iptables chains in the filter table are synchronized with the network policies.
I0429 13:15:16.061249       1 network_policy_controller.go:276] Syncing network policy chains took 11.00116339s
I0429 13:15:22.800685       1 network_policy_controller.go:231] sync iptables took 17.743325555s

I0429 13:15:22.800721       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-argo-wf-5fjj6-2888147321

# Sync() executed by OnPodUpdate() at 13:15:22
I0429 13:15:22.800731       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543722800726665
[..]
I0429 13:15:41.365030       1 network_policy_controller.go:231] sync iptables took 18.564299125s

I0429 13:15:41.365053       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-controller-cron-1556543700-7pwf9

# Sync() executed by OnPodUpdate() at 13:15:41
I0429 13:15:41.365059       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543741365056198
[..]
I0429 13:16:00.172668       1 network_policy_controller.go:231] sync iptables took 18.807607407s

I0429 13:16:00.172683       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-argo-wf-5fjj6-2888147321

# Sync() executed by OnPodUpdate() at 13:16:00
I0429 13:16:00.172689       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543760172685862
[..]
I0429 13:16:21.597458       1 network_policy_controller.go:231] sync iptables took 21.424766695s

I0429 13:16:21.597484       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-argo-wf-5fjj6-2888147321

# Sync() executed by OnPodUpdate() at 13:16:21
I0429 13:16:21.597493       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543781597488423
[..]
I0429 13:16:48.448123       1 network_policy_controller.go:231] sync iptables took 26.850629554s

I0429 13:16:48.448146       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-argo-wf-5fjj6-1980496625

# Sync() executed by OnPodUpdate() at 13:16:48. This is 7 seconds before NPC heartbeat is expected.
I0429 13:16:48.448153       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543808448149200
I0429 13:17:00.957408       1 network_policy_controller.go:340] Iptables chains in the filter table are synchronized with the network policies.
I0429 13:17:00.957456       1 network_policy_controller.go:276] Syncing network policy chains took 12.505761274s
I0429 13:17:19.886845       1 network_policy_controller.go:231] sync iptables took 31.438681669s
I0429 13:17:19.886990       1 network_policy_controller.go:173] Received update to pod: stage-intelligence/flink-fluber-stage-argo-wf-5fjj6-1980496625
I0429 13:17:19.887011       1 network_policy_controller.go:234] Starting sync of iptables with version: 1556543839887000979

# Sync() is still being executed, was started by OnPodUpdate(). Heartbeat is now missed. Run() cannot run now.
E0429 13:17:34.419930       1 health_controller.go:111] Network Policy Controller heartbeat missed

I0429 13:17:37.383474       1 network_policy_controller.go:340] Iptables chains in the filter table are synchronized with the network policies.
I0429 13:17:37.383718       1 network_policy_controller.go:276] Syncing network policy chains took 17.482993466s
E0429 13:17:37.419882       1 health_controller.go:111] Network Policy Controller heartbeat missed
E0429 13:17:37.419989       1 health_controller.go:111] Network Policy Controller heartbeat missed
E0429 13:17:40.419919       1 health_controller.go:111] Network Policy Controller heartbeat missed
E0429 13:17:42.419869       1 health_controller.go:111] Network Policy Controller heartbeat missed
I0429 13:17:43.126256       1 health_controller.go:161] Shutting down health controller
I0429 13:17:43.126271       1 health_controller.go:178] Shutting down HealthController RunCheck
E0429 13:17:43.126306       1 health_controller.go:150] Health controller error: http: Server closed
I0429 13:17:59.728160       1 network_policy_controller.go:231] sync iptables took 39.841153808s

To fix this, I suggest calling healthcheck.SendHeartBeat(healthChan, "NPC") in OnPodUpdate() and OnNetworkPolicyUpdate() and OnNamespaceUpdate()

ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats during policy sync to prevent missing heartbeats
while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats during policy sync to prevent missing heartbeats
while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats during policy sync to prevent missing heartbeats
while running iptables commands.

This commit is the first part to make kube-router faster and
more robust when applying network policies.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats during policy sync to prevent missing heartbeats
while running iptables commands.

This commit is the first part to make kube-router faster and
more robust when applying network policies.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats during policy sync to prevent missing heartbeats
while running iptables commands.

This commit is the first part to make kube-router faster and
more robust when applying network policies.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats before starting policy sync to prevent missing
heartbeats while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
ticpu pushed a commit to ticpu/kube-router that referenced this issue May 31, 2019
In reference to issue cloudnativelabs#725, we modified kube-router to send
heartbeats before starting policy sync to prevent missing
heartbeats while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
murali-reddy pushed a commit that referenced this issue Jun 24, 2019
In reference to issue #725, we modified kube-router to send
heartbeats before starting policy sync to prevent missing
heartbeats while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
@ticpu
Copy link
Contributor

ticpu commented Jun 24, 2019

This should be fixed with the pull request #741 merged, if you want to give it a try.

@murali-reddy
Copy link
Member

closing the issue as its addressed by #741. @Cytrian please reopen if you still see the issue.

Cytrian pushed a commit to yieldlab/kube-router that referenced this issue Jul 9, 2019
Missing NPC heartbeat because of interfering Pod updates

commit 94fd7b6
murali-reddy pushed a commit that referenced this issue Jul 25, 2019
In reference to issue #725, we modified kube-router to send
heartbeats before starting policy sync to prevent missing
heartbeats while running iptables commands.

Signed-off-by: Jérôme Poulin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants