Network traffic up for 20 minutes after a server restart #135

Closed
indrekj opened this issue Nov 27, 2019 · 0 comments · Fixed by #136
Comments


indrekj commented Nov 27, 2019

We're using Phoenix Presence (latest version on the master branch) in a Kubernetes setup with multiple pods running.
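
For context, a typical Presence setup (following the standard Phoenix docs; module names are placeholders, not taken from this issue) looks roughly like this:

```elixir
# Illustrative sketch only: a minimal Phoenix.Presence module.
# `MyAppWeb` and `MyApp.PubSub` are placeholder names.
defmodule MyAppWeb.Presence do
  use Phoenix.Presence,
    otp_app: :my_app,
    pubsub_server: MyApp.PubSub
end
```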

We've noticed that every time we restart a pod or do a rolling update, network traffic stays elevated for about 20 minutes.

I was able to replicate it in our beta environment when I had 10K online connections and I restarted one pod:
[Screenshot: network traffic graph, 2019-11-27 18:05]
As you can see, traffic went up around 11:53 and came back down around 12:14.

I think it's related to the `permdown_period` setting, which defaults to 20 minutes. I tried to replicate this with just the phoenix_pubsub library, without a web server, but wasn't able to. EDIT: It is related. If I change it to 10 minutes, network traffic is only elevated for 10 minutes.
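
For reference, a minimal sketch of how the grace period can be tuned where the tracker is started. This assumes your Presence module forwards its start options (including `:permdown_period`, a Phoenix.Tracker option given in milliseconds) to the underlying tracker, which may vary by Phoenix version:

```elixir
# Hypothetical child spec in the application supervision tree.
# Whether :permdown_period can be passed through the Presence child spec
# like this is an assumption for this sketch.
children = [
  {MyAppWeb.Presence, [permdown_period: :timer.minutes(10)]}
]
```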

I also ran tcpdump inside one pod to see where the traffic was coming from and going to. It was all between the presence servers themselves; I think these are the state synchronization messages.

Do you have any suggestions on what to look for or how to gather more information?

indrekj added a commit to indrekj/phoenix_pubsub that referenced this issue Nov 28, 2019
Scenario in which Node2 is replaced by Node3 (this is basically a
rolling update):
1. Node1 and Node2 are up and synced.
2. Kill Node2 (Node1 starts the permdown grace period for Node2).
3. Spawn Node3.
4. Node1 sends a heartbeat that includes clocks for Node1 & Node2.
5. Node3 receives the heartbeat. It sees that Node1's clock dominates its
   own because it includes Node2's clock, so it requests a transfer from Node1.
6. Node1 sends a transfer ack to Node3.
7. Node3 uses `State#extract` to process the transfer payload, which
   discards the Node2 values.
8. It all starts again from step 4 on the next heartbeat.

This loop between steps 4 and 8 lasts until Node1's permdown period for
Node2 triggers and Node2 is no longer included in the heartbeat clocks.

The solution here is not to include down replicas in the heartbeat
notifications; a rough sketch of that idea follows below.

This fixes phoenixframework#135
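
For illustration of that last point, here is a minimal sketch of what "don't include down replicas in the heartbeat clocks" could look like. The module, function, and field names are assumptions made for this sketch, not the actual phoenix_pubsub internals or the code in the linked fix:

```elixir
defmodule HeartbeatClockSketch do
  @moduledoc """
  Hypothetical sketch: build the clock map advertised in a heartbeat while
  excluding replicas that are in their permdown grace period, so a freshly
  started node never sees a clock that "dominates" only because of a
  permanently gone replica.
  """

  # `context` is assumed to be a map of replica name => logical clock, and
  # `down_replicas` a list of replica names currently considered down.
  def heartbeat_clocks(%{context: context, down_replicas: down_replicas}) do
    down = MapSet.new(down_replicas)

    context
    |> Enum.reject(fn {replica, _clock} -> MapSet.member?(down, replica) end)
    |> Map.new()
  end
end

# Example: Node2 is down, so its clock entry is no longer advertised.
HeartbeatClockSketch.heartbeat_clocks(%{
  context: %{node1: 5, node2: 3},
  down_replicas: [:node2]
})
# => %{node1: 5}
```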
urmastalimaa pushed a commit to indrekj/phoenix_pubsub that referenced this issue Nov 29, 2019
indrekj added a commit to indrekj/phoenix_pubsub that referenced this issue Dec 4, 2019
indrekj added a commit to salemove/phoenix_pubsub that referenced this issue Dec 4, 2019
chrismccord pushed a commit that referenced this issue Jan 7, 2020