This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Memory leak/OOM with "Received update for IP range I own" messages in log #3659

Closed
sferrett opened this issue Jul 12, 2019 · 16 comments

Comments

@sferrett

sferrett commented Jul 12, 2019

What did you expect to happen?

Memory usage of the weave process is expected to be stable and not grow unbounded over time.

What happened?

I had a stable 2.5.0 weave network in my Kubernetes 1.9 cluster of about 100 nodes. Weave was initially installed by kops and had a memory limit of 200MB set. There were no occurrences of "Received update for IP range I own" in the log files, and memory usage for the weave pods in the cluster had been very stable for weeks.

As part of refactoring some services, about 30 nodes were removed from the cluster (bringing the cluster size down to 71 nodes). After this, the memory usage of the weave pods started growing until it exceeded the memory limit, at which point the pod was OOM-killed and restarted. These restarts cause a brief disruption on the node where they occur. At this time the "Received update for IP range I own" message started appearing in the logs (although not from all pods; this nuance was not discovered until later).

After looking at some related tickets and such here (#3650, #3600, #2797), the following actions were taken:

  • The "status ipam" output was checked and seen to have a lot of "unreachable" peers listed in it
  • The unreachable nodes listed by "status ipam" were removed with rmpeer on one node, though this did not fix all the unreachables on all the nodes, the process of listing and removing unreachables was done on a couple of other systems before all systems were showing all 71 nodes in the list and all as reachable.
  • updated to 2.5.2 as there were some related looking tickets mentioned in that release
  • increased the memory limit so that OOM killing might happen less frequently (from 200mb to 1gb)
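For reference, that list-and-remove step was done by hand; below is only a rough sketch of what such a loop looks like, assuming the usual weave-net DaemonSet namespace/label and the /home/weave/weave path inside the container, which may differ in other installs.

#!/bin/bash
# Sketch only: list the peers that one weave pod's "status ipam" flags as
# unreachable, then reclaim their ranges with rmpeer. The kube-system
# namespace, the name=weave-net label and the /home/weave/weave path match a
# typical weave-net DaemonSet but may differ in other installs.
set -eu

pod=$(kubectl get pods -n kube-system -l name=weave-net | tail -n +2 | awk 'NR==1 {print $1}')

# Peer names (the part before the "(nickname)") of entries flagged unreachable.
unreachable=$(kubectl exec -n kube-system "${pod}" -c weave -- \
  /home/weave/weave --local status ipam | awk -F'[(]' '/unreachable/ {print $1}')

for peer in ${unreachable}; do
  echo "Reclaiming space owned by ${peer}"
  kubectl exec -n kube-system "${pod}" -c weave -- \
    /home/weave/weave --local rmpeer "${peer}"
done

Note that rmpeer for a given dead peer should only be run from a single live peer, which is why this is scoped to one pod.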

Weave pods continue to grow in memory usage; the new 2.5.2 pods have not hit their 1GB limit yet but look to be heading that way. The "Received update for IP range I own" messages are still being seen; however, on closer inspection these messages are only coming from 3 of the 71 pods.

How to reproduce it?

Have a working kubernetes cluster and delete some nodes out of it.

Anything else we need to know?

Versions:


        Version: 2.5.2 (up to date; next check at 2019/07/12 18:43:12)

        Service: router
       Protocol: weave 1..2
           Name: ea:38:6f:58:7b:81(ip-10-32-124-236.us-west-2.compute.internal)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 71
    Connections: 71 (70 established, 1 failed)
          Peers: 71 (with 4966 established, 4 pending connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 100.96.0.0/11
  DefaultSubnet: 100.96.0.0/11
admin@ip-10-32-92-49:~$ docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:09:56 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:09:56 2017
 OS/Arch:      linux/amd64
 Experimental: false
Linux ip-10-32-92-49 4.4.121-k8s #1 SMP Sun Mar 11 19:39:47 UTC 2018 x86_64 GNU/Linux
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

These are the logs from one of the weave pods showing the "Received update for IP range I own" messages: weave-net-q56hl.log
This is the pprof/heap output for the above node
weave-net-q56hl.heap.gz
This is status ipam from the above node
weave-net-q56hl.ipam.txt
This is status peers from the above node
weave-net-q56hl.peers.txt

These are the logs from one of the weave pods not showing that message:
weave-net-9t7d8.log
This is the pprof/heap output for the above node weave-net-9t7d8.heap.gz

And here's a picture showing the history of memory usage from these pods
Screen Shot 2019-07-12 at 9 20 16 AM

@sferrett
Author

Here are the log files from the other two weave pods producing the "Received update for IP range I own" log messages - only 3 of the 71 pods are producing this message (the other is in the original comment)

weave-net-67hrw.log
weave-net-cqfc9.log

@murali-reddy
Contributor

murali-reddy commented Jul 15, 2019

@sferrett thanks for reporting the issue and sharing the logs. The pprof heap output files seem to be corrupted; I cannot use them.

Looking at the logs, there is continuous activity: connections getting established to the peers and then subsequently being shut down due to a conflict in the IPAM entries. There might be a resource leak during connection shutdown. Please share the heap output and I will investigate further.

cat weave-net-q56hl.log | grep "connection fully established\|connection shutting down due to" | more
INFO: 2019/07/12 14:48:45.096844 ->[10.32.118.152:6783|1a:81:c0:7d:65:8d(ip-10-32-118-152.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 1a:81:c0:7d:65:8d(ip-10-32-118-152.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:48:45.102204 ->[10.32.118.152:47487|1a:81:c0:7d:65:8d(ip-10-32-118-152.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:48:45.104220 ->[10.32.116.209:38935|c2:57:fe:eb:f5:1b(ip-10-32-116-209.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:48:45.105973 ->[10.32.122.86:11183|ee:b4:d4:3e:29:08(ip-10-32-122-86.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to ee:b4:d4:3e:29:08(ip-10-32-122-86.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:48:45.106002 ->[10.32.121.63:6783|66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:48:45.114811 ->[10.32.121.63:22137|66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:48:45.115645 ->[10.32.122.86:6783|ee:b4:d4:3e:29:08(ip-10-32-122-86.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:48:51.961639 ->[10.32.122.167:10584|36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11
INFO: 2019/07/12 14:48:52.063684 ->[10.32.122.167:6783|36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:48:52.071075 ->[10.32.122.167:6783|36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:48:52.147730 ->[10.32.122.167:56584|36:0e:50:ea:19:dd(ip-10-32-122-167.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:48:59.451552 ->[10.32.122.154:6783|d2:3b:d6:be:05:46(ip-10-32-122-154.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11
INFO: 2019/07/12 14:48:59.535330 ->[10.32.122.154:56632|d2:3b:d6:be:05:46(ip-10-32-122-154.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:49:04.569976 ->[10.32.121.63:22137|66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11
INFO: 2019/07/12 14:49:04.578200 ->[10.32.121.63:6783|66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal) added to 9e:53:50:e6:1e:31(ip-10-32-117-193.us-west-2.compute.internal)
INFO: 2019/07/12 14:49:04.625605 ->[10.32.121.63:64725|66:d8:18:5e:d8:28(ip-10-32-121-63.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:49:09.368974 ->[10.32.121.110:54777|42:fe:cc:22:06:be(ip-10-32-121-110.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11
INFO: 2019/07/12 14:49:09.373978 ->[10.32.122.86:6783|ee:b4:d4:3e:29:08(ip-10-32-122-86.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11
INFO: 2019/07/12 14:49:09.501779 ->[10.32.122.86:48605|ee:b4:d4:3e:29:08(ip-10-32-122-86.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:49:09.876308 ->[10.32.121.110:6783|42:fe:cc:22:06:be(ip-10-32-121-110.us-west-2.compute.internal)]: connection fully established
INFO: 2019/07/12 14:49:11.879385 ->[10.32.125.220:6783|f2:84:e6:76:02:c8(ip-10-32-125-220.us-west-2.compute.internal)]: connection shutting down due to error: Received update for IP range I own at 100.96.96.0 v10: incoming message says owner e2:54:77:2d:03:f5 v11

@itskingori

This happened to us last week on our production cluster ... 👇

Screenshot 2019-07-15 at 11 21 24

We noticed that there were two groups of IPAM lists with overlapping entries so it wasn't a complete split-brain. We bit the bullet and terminated all the nodes in one group to recover. It was brutal.

After looking at the memory profile, a 200MB limit makes sense for the weave-npc container but not for the weave container. We decided to increase the memory limit to 300MB.

@sferrett
Author

sferrett commented Jul 15, 2019

@murali-reddy - Here's how I'm generating the heap profile: curl 'http://localhost:6784/debug/pprof/heap' > weave-net-6bgmx.dump

Then I gzip the file. Here's another one attached; it seems different in size, and the method I used to copy it from the system that generated it was slightly different, so please let me know if this is a good one or not. It's from one of the more-rapidly growing instances that got a restart over the weekend:
weave-net-6bgmx.heap.gz
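For reference, a profile captured this way can also be poked at locally with the standard Go pprof tooling. This is only a sketch, assuming a reasonably recent Go toolchain that can read the raw profile without the weave binary:

# Fetch the heap profile from the weave router's debug endpoint and look at
# the biggest allocators.
curl -s 'http://localhost:6784/debug/pprof/heap' > weave-net.heap
go tool pprof -top weave-net.heap        # text summary of the top allocations
go tool pprof -http=:8080 weave-net.heap # interactive browser UI / flame graph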

@sferrett
Author

sferrett commented Jul 15, 2019

@itskingori - interesting. I ran the following on my prd cluster, which is having this symptom, and I also see there are two distinct sets of "status ipam" hosts returned from the cluster, each comprising about 50% of the nodes:

# Generate a sorted list of hosts-only from ipam output per weave node
$ for s in `kubectl get pod | grep weave-net | awk '{ print $1 }'`; do kubectl exec -ti $s -c weave -- ./weave --local status ipam | awk '{ print $1 }' | sort > $s.ipam; done

# See how many different variants of that list there are - this shows there are two different checksums for the files, and there are 38 of one and 33 of the other.
$ sum *.ipam | awk '{ print $1 }' | sort | uniq -c
     38 02398
     33 57603

Here's the "02398" list
weave-ipam-02398.txt
Here's the "57603" list
weave-ipam-57603.txt

The lists are identical except that the "57603" list has an additional entry in it:

$ diff /host-user/weave-ipam-57603.txt /host-user/weave-ipam-02398.txt 
56d55
< be:f4:ea:cd:a5:9a(ip-10-32-127-95.us-west-2.compute.internal)

I wonder if I can do something with that one host that's in one list but not the other? Or if I do indeed need to restart all the systems in one of the lists....

When you say you terminated the hosts in one of the groups, did you do anything to the remaining systems or just terminate the nodes in one of the lists? Also, did the nodes that were terminated have anything special done to them beforehand (such as running weave reset or similar)? I'm also curious if you did them one at a time, and somehow observed that they came up with a consistent ipam list? Basically I think I need to do the same thing here; I just want to make sure that deleting those nodes won't just increase the amount of breakage, since node removal seems to be what precipitated all of this in the first place.

Thanks!

@sferrett
Author

sferrett commented Jul 16, 2019

OK, so I have this condition happening in two different clusters; I guess the good news is that I can try things in one and compare.

Anyhow - in one of the clusters I did the following:

  • Identified the weave pods into two groups based on what their 'status ipam' output was showing. There were generally two "views" of the ipam data, and each node had one or the other of them.
  • Chose one "view" as the correct one (the one that contained two of the three kube masters), and on each node in the "wrong" list did the following (rough sketch of the commands below):
    - Removed the weave db file
    - Deleted/restarted the weave pod

After the above, the pods' CPU usage dropped, the continual messages about connections being established and then dropped ceased, and the memory growth for all of the pods (even the ones that were not restarted) looks to have stopped. All the pods now also return the same details in their 'status ipam' output.
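Concretely, the per-node reset was roughly along these lines. This is only a sketch, assuming the default weave-net DaemonSet layout where the database sits under a hostPath mount at /var/lib/weave on the host; check your manifest, as the path, file name and node-access method may differ:

#!/bin/bash
# Sketch only: reset one node's weave state by removing the persisted weave
# database and recycling the weave pod, which the DaemonSet then recreates so
# it re-learns its state from the remaining (correct) peers.
set -eu

node=$1

# Find the weave pod running on that node.
pod=$(kubectl get pods -n kube-system -l name=weave-net -o wide | awk -v n="${node}" '$0 ~ n {print $1; exit}')

# Remove the weave db file on the host (via ssh here; adjust to your access
# method -- the file name below is the usual default but may differ).
ssh "${node}" 'sudo rm -f /var/lib/weave/weave-netdata.db'

# Delete the pod; the DaemonSet brings up a fresh one with a clean slate.
kubectl delete pod -n kube-system "${pod}"

The key point is that the db removal and the pod deletion happen together, so the fresh pod cannot reload the conflicting state.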

So - this sounds like somewhat of a success

However, I am not sure what the ramifications are of deleting the weave db file on those nodes and just restarting weave rather than the entire node. I am somewhat concerned about any ill effects on the pods still running on these nodes where weave was restarted, and whether any potential issue has been set up due to not having them restart (i.e. a full node restart).
If anyone knows what negative side effects might arise from deleting that db file and restarting weave, but not restarting other pods on the system, I'd appreciate the insight.

Barring negative feedback on the above, I will be looking to do the same in the other cluster soon.

@itskingori

@itskingori - interesting. I ran the following on my prd cluster, which is having this symptom, and I also see there are two distinct sets of "status ipam" hosts returned from the cluster, each comprising about 50% of the nodes:

# Generate a sorted list of hosts-only from ipam output per weave node
$ for s in `kubectl get pod | grep weave-net | awk '{ print $1 }'`; do kubectl exec -ti $s -c weave -- ./weave --local status ipam | awk '{ print $1 }' | sort > $s.ipam; done

# See how many different variants of that list there are - this shows there are two different checksums for the files, and there are 38 of one and 33 of the other.
$ sum *.ipam | awk '{ print $1 }' | sort | uniq -c
     38 02398
     33 57603

Yes, this is more or less the position we were in. I'll borrow your commands as they seem similar to (if not better than) what I have to look through weave 👇

#!/bin/bash

set -u

context=$1
command=$2

# All weave-net pods in the given cluster context.
pods=$(kubectl get pods -n=kube-system -lname=weave-net --context=${context} | tail -n+2 | awk '{print $1}')

# Run the given "weave --local" subcommand in each pod; ${command} is left
# unquoted so that e.g. "status ipam" expands into two arguments.
for pod in ${pods}
do
  kubectl exec -n=kube-system --context=${context} ${pod} -- /home/weave/weave --local ${command}
done

I use something like ./script.sh <cluster> "status ipam" to loop through all the weave pods.

I wonder if I can do something with that one host that's in one list but not the other? Or if I do indeed need to restart all the systems in one of the lists....

Weave works by sharing state by consensus; the fact that there are two states breaks Weave, and you need to bite the bullet and get rid of the group of weave pods that have the inconsistent state. There are two ways to do this:

  1. Terminate all the nodes with bad state at the same time.
  2. Terminate weave pods with bad state while deleting their database (that stays on the host) at the same time.

I ran with no. 1 because this was production, so recovery was critical, and I didn't have time to write a script for no. 2. I pretty much figured out which group to terminate and terminated them all at the same time. I'm guessing it's better to terminate weave while deleting the database on the host, so that the new pod starts from a clean slate and gets its state from the other, correct weave pods.

When you say you terminated the hosts in one of the groups, did you do anything to the remaining systems or just terminate the nodes in one of the lists?

I didn't do anything to the remaining group as their state was similar and correct. Once you get rid of the bad weave pods, everything goes back to normal and the state is now consistent among those remaining ... the problem goes away, and the cluster heals.

Also, did the nodes that were terminated have anything special done to them beforehand (such as running weave reset or similar)?

I didn't do anything to the ones remaining. The fact that they had similar state was all I needed.

I'm also curious if you did them one at a time, and somehow observed that they came up with a consistent ipam list?

I terminated them all at the same time because I could not risk them sharing state any longer. I figured doing it one by one might not work, because a new pod might get its state from another bad one. I wanted all the 'bad' ones gone and only the 'good' ones left to share state.

Basically I think I need to do the same thing here; I just want to make sure that deleting those nodes won't just increase the amount of breakage, since node removal seems to be what precipitated all of this in the first place.

It didn't cause more breakage for me, other than the time spent waiting for the autoscaling group to replace the nodes I had just terminated, and the disruption to the apps on those nodes as new pods came up.

@murali-reddy
Contributor

so please let me know if this is a good one or not. It's from one of the more-rapidly growing instances that got a restart over the weekend

@sferrett I am able to use the recent heap output. It does look like the continuous establishing and shutting down of connections is causing memory use to grow between Go GC runs, or there is a memory leak. Will investigate further.

(pprof) top
Showing nodes accounting for 137.61MB, 90.98% of 151.24MB total
Dropped 54 nodes (cum <= 0.76MB)
Showing top 10 nodes out of 64
      flat  flat%   sum%        cum   cum%
   63.01MB 41.66% 41.66%    73.61MB 48.67%  github.com/weaveworks/weave/router.(*OverlaySwitch).PrepareConnection
   30.51MB 20.17% 61.83%    30.51MB 20.17%  runtime.malg
   14.50MB  9.59% 71.42%    19.02MB 12.58%  github.com/weaveworks/weave/vendor/github.com/weaveworks/mesh.(*gossipSender).run
   10.11MB  6.68% 78.10%    10.11MB  6.68%  github.com/weaveworks/weave/router.NewNonEncryptor (inline)

@itskingori

However, I am not sure what the ramifications are of deleting the weave db file on those nodes and just restarting weave rather than the entire node. I am somewhat concerned about any ill effects on the pods still running on these nodes where weave was restarted, and whether any potential issue has been set up due to not having them restart (i.e. a full node restart). If anyone knows what negative side effects might arise from deleting that db file and restarting weave, but not restarting other pods on the system, I'd appreciate the insight.

No ramifications, from my understanding. Anyone from weave can correct me if I'm wrong, but when you delete the database and terminate weave, it just recreates it. In my opinion, this is the right approach, and I took the other approach out of the need for a quick win. If you have a script for terminating weave pods and clearing the databases, please share it. 😅

@murali-reddy
Contributor

Yes, there are no ramifications. For now these manual steps will reconcile the state where there were IPAM conflicts. In the 2.6 release, IPAM conflicts are resolved automatically (#3637).

@itskingori

@sferrett Now that you're on 2.5.2 and have been able to get the cluster stable, could you continue monitoring the IPAM lists to see if they remain consistent (assuming you're scaling up and down)? I really struggled to stay stable on that version for a couple of days (it would eventually get borked after scaling up and down), so I reverted to 2.5.1 as an experiment and have been stable since. I could be drawing the wrong conclusions, or there may be something unique about my setup (highly unlikely since I use kops) 🤷‍♂ ... it would be good to have another data point.

@sferrett
Author

@murali-reddy here's another dump from one of the processes that is just about to surpass its 1GB heap limit, in case that's also useful.
weave-net-fllrr.heap.gz

@sferrett
Author

@itskingori - I will keep an eye on it for sure. We're not doing a huge amount of scaling at the moment; however, we did just do some refactoring of instance types, so a fairly large number of hosts were created and deleted, which is what set the current condition in motion. There are some more similar adjustments still outstanding, so I will check and let you know how things look as that happens.

Also, thanks very much for your feedback and detail - it sounds like your process to recover normal operation was very similar to mine. The only difference seems to be that you rebooted/terminated the nodes, whereas I just recycled the weave pod on the node and left the other pods and the node itself alone. We'll be doing the same process in our prd cluster today, so if I cook up a noteworthy script I will share it here. Cheers -

@sferrett
Author

OK, so since the weave nodes became consistent, things have been stable with no memory growth or OOM issues. I have done a minor amount of scaling (perhaps up/down by 10 nodes or so) and things have remained consistent throughout on 2.5.2.

I wrote a quick script, verify-weave.sh, which goes through all weave pods in the cluster, computes a checksum of the 'status ipam' list, and tells you if any weave pods disagree on their peer list. We used this to identify the different groups of pods in the cluster and decide which we wanted to preserve and which we wanted to reset/restart. Another quick script, bump-weave.sh, was then used to remove the db file and restart the weave pod for those we wished to reset.

Both of those scripts were written quickly to address a particular condition here, so they are not intended as portable or good examples of coding, but they may be useful to someone, so here they are.
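For anyone who can't grab the attachments, the core of the verify step is essentially the earlier checksum one-liner wrapped up in a loop; a minimal sketch along those lines, with the same namespace/label and /home/weave/weave assumptions as before:

#!/bin/bash
# Minimal sketch of the verify step: checksum each weave pod's sorted
# "status ipam" peer list and report how many distinct views exist.
# A healthy cluster prints exactly one checksum line.
set -eu

tmpdir=$(mktemp -d)

for pod in $(kubectl get pods -n kube-system -l name=weave-net | tail -n +2 | awk '{print $1}'); do
  kubectl exec -n kube-system "${pod}" -c weave -- \
    /home/weave/weave --local status ipam | awk '{print $1}' | sort > "${tmpdir}/${pod}.ipam"
done

echo "Distinct ipam views (count, checksum):"
sum "${tmpdir}"/*.ipam | awk '{print $1}' | sort | uniq -c

rm -rf "${tmpdir}"

The reset step would then be along the lines of the per-node db-removal/pod-restart sketch in the earlier comment, run against each pod in the "wrong" group.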

@itskingori , @murali-reddy - thank you both for your help on this.

@itskingori

itskingori commented Jul 25, 2019

@sferrett Thanks for the update! Will try out 2.5.2 again and check out your scripts

@bboreham
Contributor

Note we found a leak which matches the OOM symptom here: #3807
