Network traffic up for 20 minutes after a server restart #135

Closed
indrekj opened this issue Nov 27, 2019 · 0 comments · Fixed by #136
Comments


indrekj commented Nov 27, 2019

We're using Phoenix Presence (latest version on the master branch) in a Kubernetes setup with multiple pods running.
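
For context, a typical Presence setup (following the standard Phoenix docs; module names are placeholders, not taken from this issue) looks roughly like this:

```elixir
# Illustrative sketch only: a minimal Phoenix.Presence module.
# `MyAppWeb` and `MyApp.PubSub` are placeholder names.
defmodule MyAppWeb.Presence do
  use Phoenix.Presence,
    otp_app: :my_app,
    pubsub_server: MyApp.PubSub
end
```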

We've noticed that every time we restart a pod or do a rolling update, network traffic stays elevated for about 20 minutes.

I was able to replicate it in our beta environment when I had 10K online connections and I restarted one pod:
[Screenshot: network traffic graph, 2019-11-27 18:05]
As you can see, traffic went up around 11:53 and came back down around 12:14.

I think it's related to the `permdown_period` setting, which defaults to 20 minutes. I tried to replicate this with just the phoenix_pubsub library, without a web server, but wasn't able to. EDIT: It is related. If I change it to 10 minutes, network traffic is only elevated for 10 minutes.
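
For reference, a minimal sketch of how the grace period can be tuned where the tracker is started. This assumes your Presence module forwards its start options (including `:permdown_period`, a Phoenix.Tracker option given in milliseconds) to the underlying tracker, which may vary by Phoenix version:

```elixir
# Hypothetical child spec in the application supervision tree.
# Whether :permdown_period can be passed through the Presence child spec
# like this is an assumption for this sketch.
children = [
  {MyAppWeb.Presence, [permdown_period: :timer.minutes(10)]}
]
```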

I also ran tcpdump inside one pod to see where the traffic was coming from and going to. It was all between the presence servers themselves; I think these are the state synchronization messages.

Do you have any suggestions on what to look for or how to gather more information?

indrekj added a commit to indrekj/phoenix_pubsub that referenced this issue Nov 28, 2019
Scenario in which Node2 is replaced by Node3 (this is basically a
rolling update):
1. Node1 and Node2 are up and synced.
2. Kill Node2 (Node1 starts the permdown grace period for Node2).
3. Spawn Node3.
4. Node1 sends a heartbeat that includes clocks for Node1 & Node2.
5. Node3 receives the heartbeat. It sees that Node1's clock dominates its
   own because it includes Node2's clock, so it requests a transfer from Node1.
6. Node1 sends a transfer ack to Node3.
7. Node3 uses `State#extract` to process the transfer payload, which
   discards the Node2 values.
8. It all starts again from step 4 on the next heartbeat.

This loop between steps 4 and 8 lasts until Node1's permdown period for
Node2 triggers and Node2 is no longer included in the heartbeat clocks.

The solution here is not to include down replicas in the heartbeat
notifications; a rough sketch of that idea follows below.

This fixes phoenixframework#135
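
For illustration of that last point, here is a minimal sketch of what "don't include down replicas in the heartbeat clocks" could look like. The module, function, and field names are assumptions made for this sketch, not the actual phoenix_pubsub internals or the code in the linked fix:

```elixir
defmodule HeartbeatClockSketch do
  @moduledoc """
  Hypothetical sketch: build the clock map advertised in a heartbeat while
  excluding replicas that are in their permdown grace period, so a freshly
  started node never sees a clock that "dominates" only because of a
  permanently gone replica.
  """

  # `context` is assumed to be a map of replica name => logical clock, and
  # `down_replicas` a list of replica names currently considered down.
  def heartbeat_clocks(%{context: context, down_replicas: down_replicas}) do
    down = MapSet.new(down_replicas)

    context
    |> Enum.reject(fn {replica, _clock} -> MapSet.member?(down, replica) end)
    |> Map.new()
  end
end

# Example: Node2 is down, so its clock entry is no longer advertised.
HeartbeatClockSketch.heartbeat_clocks(%{
  context: %{node1: 5, node2: 3},
  down_replicas: [:node2]
})
# => %{node1: 5}
```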
urmastalimaa pushed a commit to indrekj/phoenix_pubsub that referenced this issue Nov 29, 2019
indrekj added a commit to indrekj/phoenix_pubsub that referenced this issue Dec 4, 2019
indrekj added a commit to salemove/phoenix_pubsub that referenced this issue Dec 4, 2019
chrismccord pushed a commit that referenced this issue Jan 7, 2020