
Memory usage keeps increasing unbounded when running backups on a schedule #2069

Closed
betta1 opened this issue Nov 15, 2019 · 23 comments
Labels: Bug, Needs info (Waiting for information)

@betta1 (Contributor) commented Nov 15, 2019

What steps did you take and what happened:
We're running backups on a schedule and have observed that the memory usage of the Velero Pod keeps increasing. We see backups and restores start failing once the memory usage reaches the memory limit of the Velero Pod (we're using velero install's default mem limit of 256Mi). Is there a possible memory leak since the memory usage of the Velero Pod keeps increasing unbounded when these scheduled backups run?

This issue is closely related to #780. Below is a Grafana dashboard tracking the memory usage of the Velero pod. Memory usage keeps increasing until the mem limit of 256Mi is reached and at this point we observe backups and restores start failing due to OOM issues.

[Grafana dashboard screenshot: velero-ns-metrics]
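For context, our backups are driven by Velero schedules; a schedule of this kind can be created along these lines (the name, cron expression, and namespace filter here are illustrative, not our exact configuration):

$ # Illustrative only: create a schedule that takes a backup every 6 hours.
$ velero schedule create example-schedule \
    --schedule="0 */6 * * *" \
    --include-namespaces example-app
$ # Each run creates a Backup object named <schedule-name>-<timestamp>.
$ velero backup get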

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Environment:

  • Velero version (use velero version): v1.2.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
@skriss (Contributor) commented Nov 15, 2019

@betta1 do you know if this is a new issue with v1.2 or whether it existed in v1.1 as well? If you have this info, it would help with debugging.

@skriss (Contributor) commented Nov 15, 2019

Also - can you do the following:

  1. Run kubectl -n velero port-forward deployment/velero 6060:6060
  2. Run go tool pprof http://localhost:6060/debug/pprof/heap
  3. Run top and dump the output here (see the condensed sketch after this list)
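For reference, the same capture can be condensed into a non-interactive form like this (the port-forward is backgrounded here purely for illustration; pprof also saves the fetched profile under ~/pprof/):

$ # Forward the Velero pprof port (6060) in the background, or use a second terminal.
$ kubectl -n velero port-forward deployment/velero 6060:6060 &
$ # Print the top heap allocators and exit, instead of entering interactive mode.
$ go tool pprof -top http://localhost:6060/debug/pprof/heap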

@betta1 (Contributor, Author) commented Nov 15, 2019

I think it existed in v1.1 as well since we had scheduled backups running fine for a week or so then we started seeing intermittent failures both for backups and restores.

@Frank51 (Contributor) commented Nov 15, 2019

Hey Kriss,

I am working with Antony on this issue right now. Please take a look at the pprof output below.

$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/cheny48/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: velero
Type: inuse_space
Time: Nov 15, 2019 at 9:21am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 81.31MB, 93.66% of 86.81MB total
Showing top 10 nodes out of 131
      flat  flat%   sum%        cum   cum%
   61.70MB 71.08% 71.08%    77.16MB 88.89%  compress/flate.NewWriter
   14.96MB 17.24% 88.32%    14.96MB 17.24%  compress/flate.(*compressor).initDeflate
       1MB  1.15% 89.47%        1MB  1.15%  runtime.malg
    0.64MB  0.73% 90.20%     0.64MB  0.73%  github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).frameScratchBuffer
    0.50MB  0.58% 90.78%     0.50MB  0.58%  github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
    0.50MB  0.58% 91.36%     0.50MB  0.58%  compress/flate.(*huffmanEncoder).generate
    0.50MB  0.58% 91.93%     0.50MB  0.58%  github.com/vmware-tanzu/velero/pkg/restore.prioritizeResources
    0.50MB  0.58% 92.51%     0.50MB  0.58%  compress/flate.newHuffmanBitWriter
    0.50MB  0.58% 93.09%     0.50MB  0.58%  fmt.Sprintf
    0.50MB  0.58% 93.66%        1MB  1.15%  github.com/vmware-tanzu/velero/pkg/restore.resolveActions
(pprof) 

I'll run more backups and provide the output once it hits the memory limit.

@Frank51 (Contributor) commented Nov 19, 2019

Hi Kriss,

This is the output when Velero pod reaches its memory limit.

$ go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
File: velero
Type: inuse_space
Time: Nov 19, 2019 at 11:46am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 209.56MB, 93.23% of 224.78MB total
Dropped 104 nodes (cum <= 1.12MB)
Showing top 10 nodes out of 85
      flat  flat%   sum%        cum   cum%
  153.37MB 68.23% 68.23%   188.05MB 83.66%  compress/flate.NewWriter
   34.18MB 15.21% 83.44%    34.18MB 15.21%  compress/flate.(*compressor).initDeflate
       5MB  2.23% 85.67%        5MB  2.23%  fmt.Sprintf
       4MB  1.78% 87.45%        4MB  1.78%  github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*ClientConn).newStream
    3.50MB  1.56% 89.00%     3.50MB  1.56%  github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
       3MB  1.34% 90.34%        3MB  1.34%  reflect.New
       2MB  0.89% 91.23%     5.50MB  2.45%  github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneRequest
    1.50MB  0.67% 91.90%     1.50MB  0.67%  runtime.malg
    1.50MB  0.67% 92.56%     1.50MB  0.67%  github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
    1.50MB  0.67% 93.23%     1.50MB  0.67%  net/url.parse

skriss added this to the v1.3 milestone Dec 3, 2019
skriss self-assigned this Dec 9, 2019
@skriss (Contributor) commented Dec 9, 2019

@Frank51 @betta1 do you have any more info on this issue since the last update? I'm going to spend some time digging into it this week.

@Frank51 (Contributor) commented Dec 9, 2019

> @Frank51 @betta1 do you have any more info on this issue since the last update? I'm going to spend some time digging into it this week.

So far we don't have any other info about that.

@skriss (Contributor) commented Dec 9, 2019

Are you using pre/post hooks in your environment? As best I can tell, this issue is related to that code, and it looks like if I disable hooks, the leak doesn't exist.

@Frank51 (Contributor) commented Dec 9, 2019

> Are you using pre/post hooks in your environment? As best I can tell, this issue is related to that code, and it looks like if I disable hooks, the leak doesn't exist.

Could you give more detail about what hooks mean here? I think I've only set the memory limit to 256 MB.

@skriss (Contributor) commented Dec 9, 2019

A hook in Velero is a command you run in a workload pod before/after backing it up, typically to quiesce/unquiesce the application - see https://velero.io/docs/v1.2.0/hooks/.

If you're using the nginx-example workload, it has hooks configured. Basically, any pod with the annotation key pre.hook.backup.velero.io/command has a hook.
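For illustration, hooks are configured purely through pod annotations, so a pod could be given one with something along these lines (the namespace, pod name, container, and freeze command are example values similar to the nginx-example, not a prescription):

$ # Example only: add a pre-backup hook that freezes a filesystem before the backup.
$ kubectl -n nginx-example annotate pod/<pod-name> \
    pre.hook.backup.velero.io/container=fsfreeze \
    pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/log/nginx"]'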

@Frank51 (Contributor) commented Dec 9, 2019

> A hook in Velero is a command you run in a workload pod before/after backing it up, typically to quiesce/unquiesce the application - see https://velero.io/docs/v1.2.0/hooks/.
>
> If you're using the nginx-example workload, it has hooks configured. Basically, any pod with the annotation key pre.hook.backup.velero.io/command has a hook.

So does that mean that if I disable the hook before running the container in the pod, this memory leak issue will be solved?

@skriss (Contributor) commented Dec 9, 2019

If I disable hooks, then I don't see the memory leak issue. Just wondering if you can confirm the same for debugging purposes.

@skriss (Contributor) commented Dec 9, 2019

It seems like this leak is coming from either the Kube libraries or the https://github.com/docker/spdystream dependency that's used for executing hooks. SPDY is deprecated, so we need to look into whether there's another approach to implementing this.

carlisia added the Needs info (Waiting for information) label and removed the Waiting for info label Dec 14, 2019
@betta1 (Contributor, Author) commented Jan 23, 2020

@skriss we were not running any pre/post hooks when we hit this issue. One possible way to reproduce this is to frequently run backups and restores on a schedule -- we observed that the memory usage of the Velero Pod keeps increasing until it reaches the memory limit of the Pod, at which point backups/restores start consistently failing.

@skriss (Contributor) commented Jan 30, 2020

> we were not running any pre/post hooks when we hit this issue.

Hmm, OK -- the signs I saw in the profiles pointed to an issue with the hook code, but sounds like that shouldn't be the case. We'll have to dig further.

@ashish-amarnath (Member) commented Jan 30, 2020

Hi @betta1 and @Frank51 👋
I did some more digging into this but I wasn't able to find a memory leak. Here is what I ran:
I had Velero running on a KinD cluster with the velero-plugin-for-aws, using S3 as the object store.

In the experiment, the created backups had a tarball size of ~100K: a full backup of the cluster, including 5 NGINX deployments as workload.
There were no volumes that needed to be backed up.
There were no restore operations running during my experiment.

As a part of my experiment, backups were created every 10s for 30m.

These are some of the go_memstats_ metrics I scraped from the Prometheus metrics endpoint while the experiment was running.

go_memstats_alloc_bytes
go_memstats_alloc_bytes_total
go_memstats_buck_hash_sys_bytes
go_memstats_frees_total
go_memstats_gc_cpu_fraction
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_objects
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_last_gc_time_seconds
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_other_sys_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads

Further, while the experiment was running, heap-related metrics from pprof were also gathered to inspect the top memory allocators. The following was the output:

$  go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: velero
Type: inuse_space
Time: Jan 29, 2020 at 4:55pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 5529.90kB, 100% of 5529.90kB total
Showing top 10 nodes out of 72
      flat  flat%   sum%        cum   cum%
 1805.17kB 32.64% 32.64%  1805.17kB 32.64%  compress/flate.NewWriter
  650.62kB 11.77% 44.41%   650.62kB 11.77%  bytes.makeSlice
  513.56kB  9.29% 53.70%   513.56kB  9.29%  github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/api/meta.(*DefaultRESTMapper).AddSpecific
  512.17kB  9.26% 62.96%   512.17kB  9.26%  github.com/vmware-tanzu/velero/vendor/golang.org/x/net/http2.(*clientConnReadLoop).handleResponse
  512.17kB  9.26% 72.22%   512.17kB  9.26%  github.com/vmware-tanzu/velero/vendor/k8s.io/apimachinery/pkg/util/net.CloneHeader
  512.14kB  9.26% 81.48%   512.14kB  9.26%  github.com/vmware-tanzu/velero/vendor/k8s.io/api/autoscaling/v2beta1.init.ializers
  512.05kB  9.26% 90.74%   512.05kB  9.26%  strings.(*Builder).WriteString
  512.01kB  9.26%   100%   512.01kB  9.26%  github.com/vmware-tanzu/velero/vendor/github.com/prometheus/client_golang/prometheus.NewGaugeVec
         0     0%   100%   902.59kB 16.32%  archive/tar.(*Writer).Flush
         0     0%   100%   902.59kB 16.32%  archive/tar.(*Writer).WriteHeader

The above output indicates that compress/flate.NewWriter is one of the biggest memory allocators. However, this by itself is not indicative of a memory leak.

From the metrics the following conclusions were drawn:

  • No memory leak was found in the part of the code exercised by the experiment.
  • Memory consumption of Velero increases with the number of backup objects present in the cluster. The same likely holds for the number of restore objects.

To that effect, following up on what you have been experiencing, can you please clarify the following:

  1. How many backup and restore objects do you have in your cluster?
  2. What is the size of the backup tarballs?
  3. If you are continuing to experience this problem, can you please run the below script to help us gather more metrics?
$ # setup port-forwarding for the velero metric server address
$ kubectl --namespace velero port-forward deploy/velero 8085:8085
$ # while the port forward is active, in another terminal window
$ # scrape the above mentioned metrics into a file, every 5s, for maybe a couple of hours
$ while true; do curl http://localhost:8085/metrics | grep -E '^velero_backup_tarball_size_bytes|^go_memstats|^go_threads|^process_resident_memory_bytes|^process_virtual_memory_bytes' >> ~/velero-go-memstats.txt; echo "---" >> ~/velero-go-memstats.txt; sleep 5; done

and please share the ~/velero-go-memstats.txt file with us so that we can find out what's going on.

ashish-amarnath self-assigned this Feb 3, 2020
skriss modified the milestones: v1.3, v1.4 Feb 4, 2020
@betta1 (Contributor, Author) commented Feb 5, 2020

Hi @ashish-amarnath, I think I've found a way to reproduce this. I ran restores every 10s for ~25 mins and observed that the Velero Pod's memory usage reached the default memory limit of 256MiB. I'm running Velero v1.2 on a K8s cluster on AWS using velero-plugin-for-aws v1.0.0; below are the steps I ran:

kubectl create ns nginx
kubectl run nginx --image=nginx -n nginx

velero create backup nginx-backup --include-namespaces nginx --snapshot-volumes=false

while true; do velero restore create --from-backup=nginx-backup --namespace-mappings nginx:nginx-$(date +%s); sleep 10; done

After running ~230 restores, I started seeing some plugin timeout errors:

k logs deploy/velero -n velero | grep error
time="2020-02-05T16:21:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error=**"timeout while waiting for plugin to start**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:27:37Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="fork/exec /plugins/velero-plugin-for-aws: **cannot allocate memory**" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:29:07Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:36:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:38:18Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:39:48Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:44:08Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:45:12Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:47:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:48:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:52:26Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: Get https://***.s3.amazonaws.com/?delimiter=%2F&list-type=2&prefix=velero-mem%2Fbackups%2F: net/http: TLS handshake timeout" error.file="/go/src/github.com/vmware-tanzu/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:308" error.function="main.(*ObjectStore).ListCommonPrefixes" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T16:52:45Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:54:02Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:54:27Z" level=error msg="Error uploading restore results to backup storage" controller=restore error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/restore_controller.go:502"
time="2020-02-05T16:55:33Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:57:03Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T16:58:55Z" level=error msg="Error listing backups in backup store" backupLocation=default controller=backup-sync error="rpc error: code = Unavailable desc = transport is closing" logSource="pkg/controller/backup_sync_controller.go:174"
time="2020-02-05T17:00:25Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:03:52Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:05:22Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:06:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:08:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:09:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:11:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:12:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:14:23Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"
time="2020-02-05T17:15:53Z" level=error msg="Error getting backup store for this location" backupLocation=default controller=backup-sync error="timeout while waiting for plugin to start" logSource="pkg/controller/backup_sync_controller.go:167"

Heap-related metrics from pprof:

go tool pprof http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 8:58am (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 191.20MB, 95.02% of 201.22MB total
Dropped 89 nodes (cum <= 1.01MB)
Showing top 10 nodes out of 42
      flat  flat%   sum%        cum   cum%
  155.13MB 77.09% 77.09%   189.20MB 94.03%  compress/flate.NewWriter
   32.07MB 15.94% 93.03%    32.07MB 15.94%  compress/flate.(*compressor).initDeflate
    1.50MB  0.75% 93.78%     1.50MB  0.75%  github.com/vmware-tanzu/velero/vendor/github.com/json-iterator/go.(*Iterator).ReadString
       1MB   0.5% 94.27%     1.50MB  0.75%  github.com/vmware-tanzu/velero/pkg/plugin/clientmgmt.(*restartableProcess).getByKindAndNameLH
       1MB   0.5% 94.77%        2MB  0.99%  compress/flate.newHuffmanBitWriter
    0.50MB  0.25% 95.02%     1.50MB  0.75%  github.com/vmware-tanzu/velero/pkg/util/collections.NewIncludesExcludes
         0     0% 95.02%    34.07MB 16.93%  compress/flate.(*compressor).init
         0     0% 95.02%   189.20MB 94.03%  compress/gzip.(*Writer).Write
         0     0% 95.02%   196.71MB 97.75%  github.com/vmware-tanzu/velero/pkg/controller.(*genericController).Run.func2
         0     0% 95.02%   196.71MB 97.75%  github.com/vmware-tanzu/velero/pkg/controller.(*genericController).processNextWorkItem
(pprof)

I've uploaded the velero-go-memstats.txt to https://gist.github.com/betta1/b8198b56d20311279899f4f68a06060d; please also find the Velero logs at https://gist.github.com/betta1/d6d6039af3df9bfe8378adfbcc852741.

Grafana metrics while restores are running:
[Grafana dashboard screenshot: velero-ns-metrics]

@ashish-amarnath (Member) commented:

Thanks @betta1! I'll take a look at this.

@ashish-amarnath (Member) commented Feb 5, 2020

@betta1, Thank you for all the information on this.
From the metrics you shared, there definitely seems to be a memory leak somewhere.
[screenshot: memory metrics graph]
Also

  155.13MB 77.09% 77.09%   189.20MB 94.03%  compress/flate.NewWriter

is indicating the same.

To figure out where this leak may be coming from, I re-ran your test for about 1h, but my pprof top looks different from what you saw and was also changing over time, showing compress/flate.NewWriter as the top allocator only sometimes.

Furthermore, the memory footprint of my test is on the order of KB, as opposed to MB in your case.

(pprof) top
Showing nodes accounting for 5339.04kB, 83.91% of 6363.09kB total
Showing top 10 nodes out of 107

At the times when compress/flate.NewWriter was among the top allocators, looking at the svg output indicated that the code path leading to the NewWriter call was a log statement:

[screenshot: pprof svg call graph]

So here is my ask:

  1. I am using an image built from master as of today, quay.io/ashish_amarnath/velero:02052020-00. Can you please confirm the version of Velero you are running, and can you please try to run your experiment using an image built from master? If you run my image, please be sure to do so in a non-critical environment.
  2. Can you please share the /home/betta1/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.029.pb.gz pprof profile file? That will help investigate this further.
  3. What log level are you running velero at?

@ashish-amarnath (Member) commented Feb 6, 2020

Update:
I was able to repro this leak by re-running the experiment using image: velero/velero:v1.2.0
Looking at the pprof output:

$ go tool pprof --source_path ~/go/src/github.com/vmware-tanzu/velero/  http://localhost:6060/debug/pprof/heap
Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in ~/pprof/pprof.velero.alloc_objects.alloc_space.inuse_objects.inuse_space.012.pb.gz
File: velero
Type: inuse_space
Time: Feb 5, 2020 at 10:12pm (PST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 148.85MB, 99.33% of 149.85MB total
Dropped 29 nodes (cum <= 0.75MB)
Showing top 10 nodes out of 31
      flat  flat%   sum%        cum   cum%
  117.23MB 78.23% 78.23%   143.84MB 95.99%  compress/flate.NewWriter
   25.11MB 16.75% 94.99%    25.11MB 16.75%  compress/flate.(*compressor).initDeflate
    4.01MB  2.68% 97.66%     4.01MB  2.68%  compress/flate.(*huffmanEncoder).generate
    1.50MB  1.00% 98.66%     1.50MB  1.00%  compress/flate.newHuffmanBitWriter
       1MB  0.67% 99.33%        1MB  0.67%  runtime.malg
         0     0% 99.33%     4.01MB  2.68%  compress/flate.(*Writer).Close
         0     0% 99.33%     4.01MB  2.68%  compress/flate.(*compressor).close
         0     0% 99.33%     4.01MB  2.68%  compress/flate.(*compressor).deflate
         0     0% 99.33%    26.61MB 17.76%  compress/flate.(*compressor).init
         0     0% 99.33%     4.01MB  2.68%  compress/flate.(*compressor).writeBlock
(pprof) svg
Generating report in profile005.svg
(pprof) list runValidatedRestore
Total: 149.85MB
ROUTINE ======================== github.com/vmware-tanzu/velero/pkg/controller.(*restoreController).runValidatedRestore in pkg/controller/restore_controller.go
         0   148.35MB (flat, cum) 99.00% of Total
         .          .    453:	volumeSnapshots, err := info.backupStore.GetBackupVolumeSnapshots(restore.Spec.BackupName)
         .          .    454:	if err != nil {
         .          .    455:		return errors.Wrap(err, "error fetching volume snapshots metadata")
         .          .    456:	}
         .          .    457:
         .   143.84MB    458:	restoreLog.Info("starting restore")
         .          .    459:
         .          .    460:	var podVolumeBackups []*velerov1api.PodVolumeBackup
         .          .    461:	for i := range podVolumeBackupList.Items {
         .          .    462:		podVolumeBackups = append(podVolumeBackups, &podVolumeBackupList.Items[i])
         .          .    463:	}
         .          .    464:	restoreReq := pkgrestore.Request{
         .          .    465:		Log:              restoreLog,
         .          .    466:		Restore:          restore,
         .          .    467:		Backup:           info.backup,
         .          .    468:		PodVolumeBackups: podVolumeBackups,
         .          .    469:		VolumeSnapshots:  volumeSnapshots,
         .          .    470:		BackupReader:     backupFile,
         .          .    471:	}
         .   512.07kB    472:	restoreWarnings, restoreErrors := c.restorer.Restore(restoreReq, actions, c.snapshotLocationLister, pluginManager)
         .          .    473:	restoreLog.Info("restore completed")
         .          .    474:
         .     4.01MB    475:	if logReader, err := restoreLog.done(c.logger); err != nil {
         .          .    476:		restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error getting restore log reader: %v", err))
         .          .    477:	} else {
         .          .    478:		if err := info.backupStore.PutRestoreLog(restore.Spec.BackupName, restore.Name, logReader); err != nil {
         .          .    479:			restoreErrors.Velero = append(restoreErrors.Velero, fmt.Sprintf("error uploading log file to backup storage: %v", err))
         .          .    480:		}
(pprof)

From the above output, the leak is coming from https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458

@betta1
Considering this leak didn't show up in master, can you please try running the test using an image built from master?
Furthermore, as https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458 hasn't changed since v1.2.0, I suspect one of our dependencies has fixed this. I will find out which change in master, as compared to v1.2.0, fixed it.

@betta1 (Contributor, Author) commented Feb 6, 2020

Hi @ashish-amarnath, yes, you're right: I'm also not able to reproduce this with the most recent release, v1.3.0-beta.1. I'll run the tests some more and post updates here; so far it looks like memory is released back to the OS after backup/restore, unlike before.

@ashish-amarnath (Member) commented:

@betta1 Thanks for confirming.

@ashish-amarnath (Member) commented:

Closing the loop on this.
The leak was fixed as part of PR #2201:
https://github.com/vmware-tanzu/velero/pull/2201/files#diff-07ac758a0abc3d782b2530a9ad08ae5cL63-L67

Without this fix, during every restore, the program would wait forever for the errChan to be closed, causing the gzip writer to be held in memory. This code path executed on every restore, which lines up with our observation of the leak when performing restores in a loop.
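For anyone wanting to double-check the fix in their own environment, a rough verification (reusing the metrics port-forward and restore loop from earlier in this thread) is to watch the process resident memory metric while restores run and after they stop; with the fix, it should come back down rather than climbing indefinitely:

$ kubectl -n velero port-forward deploy/velero 8085:8085 &
$ # Sample the Velero process RSS from the Prometheus metrics endpoint every 30s.
$ while true; do date; curl -s http://localhost:8085/metrics | grep '^process_resident_memory_bytes'; sleep 30; done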

Going to mark this issue as fixed.
