Memory usage keeps increasing unbounded when running backups on a schedule #2069
@betta1 do you know if this is a new issue with v1.2, or whether it existed in v1.1 as well? If you have this info, it would help with debugging.
Also, can you capture heap profile (pprof) output from the Velero pod while the memory usage is climbing, and share it here?
I think it existed in v1.1 as well, since we had scheduled backups running fine for a week or so, and then we started seeing intermittent failures for both backups and restores.
Hey Kriss, I am working with Antony on this issue right now. Please take a look at the below output of pprof.
I'll run more backups and provide the output once it hits the memory limit.
Hi Kriss, this is the output when the Velero pod reaches its memory limit:
Are you using pre/post hooks in your environment? As best I can tell, this issue is related to that code, and it looks like the leak doesn't exist if I disable hooks.
Could you give more detail about what hooks are? I think I just set the memory limit to 256 MB.
A hook in Velero is a command you run in a workload pod before/after backing it up, typically to quiesce/unquiesce the application - see https://velero.io/docs/v1.2.0/hooks/. If you're using the nginx-example workload, it has hooks configured. Basically, any pod with the annotation key `pre.hook.backup.velero.io/command` (or its `post.hook.backup.velero.io` counterpart) set will have hooks run against it during backup.
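For example, this is roughly how a pre-backup hook gets attached to a pod (a sketch based on the linked docs; the namespace, pod name, and freeze command are illustrative):

```sh
# Attach a pre-backup hook: Velero execs this command in the named
# container just before backing the pod up (annotation keys per the
# v1.2 hooks docs).
kubectl -n nginx-example annotate pod/nginx-deployment-abc123 \
  pre.hook.backup.velero.io/container=fsfreeze \
  pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/log/nginx"]'

# Removing the annotations (trailing "-") disables the hook again:
kubectl -n nginx-example annotate pod/nginx-deployment-abc123 \
  pre.hook.backup.velero.io/container- \
  pre.hook.backup.velero.io/command-
```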
So, does that mean that if I disable the hook before running the container in the pod, this memory leak issue will be solved?
If I disable hooks, then I don't see the memory leak issue. Just wondering if you can confirm the same for debugging purposes.
It seems like this leak is coming from either the Kube libraries or the …
@skriss we were not running any pre/post hooks when we hit this issue. One possible way to reproduce this is to frequently run backups and restores on a schedule -- we observed that the memory usage of the Velero Pod keeps increasing until it reaches the memory limit of the Pod, at which point backups/restores start consistently failing.
Hmm, OK -- the signs I saw in the profiles pointed to an issue with the hook code, but it sounds like that shouldn't be the case. We'll have to dig further.
Hi @betta1 and @Frank51 👋 In my experiment, the created backups had a backup tarball size of ~100K: a full backup of the cluster, including 5 deployments of NGINX as workload. As part of the experiment, backups were created every 10s for 30m, using a loop along the lines sketched below. These are some of the stats from that run: …
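(A sketch of the load loop; the backup-name prefix and namespace are illustrative, not my exact commands:)

```sh
# Create a backup every 10s for 30m (180 backups total).
for i in $(seq 1 180); do
  velero backup create "stress-bk-${i}" --include-namespaces nginx-example
  sleep 10
done
```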
Further, while the experiment was running, I also collected heap-related metrics from the Velero process: …
The above output indicates that memory was not growing unbounded in my run. From the metrics, the following conclusions were drawn: …
To that effect, following up on what you have been experiencing, can you please clarify the following: … and please share the Go memstats output from your runs.
Hi @ashish-amarnath, I think I've been able to find a way to reproduce this. I ran restores every 10s for ~25 mins and observed the Velero Pod's memory usage reach the default memory limit of 256MiB. I'm running Velero v1.2 on a K8s cluster on AWS with velero-plugin-for-aws v1.0.0. Below are the steps I ran:
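Roughly, the loop looked like this (a sketch; the backup name, restore-name prefix, and iteration count are illustrative):

```sh
# One backup of the sample workload, then repeated restores from it.
velero backup create nginx-backup --include-namespaces nginx-example --wait

# ~230 restores, one every 10s.
for i in $(seq 1 230); do
  velero restore create "stress-restore-${i}" --from-backup nginx-backup
  sleep 10
done
```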
After running ~230 restores, I started seeing some plugin timeout errors: …
Heap-related metrics from the pod at that point: …
I've uploaded velero-go-memstats.txt to https://gist.github.com/betta1/b8198b56d20311279899f4f68a06060d; please also find the Velero logs at https://gist.github.com/betta1/d6d6039af3df9bfe8378adfbcc852741.
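In case anyone wants to pull the same numbers, this is roughly how the memstats can be gathered (a sketch, assuming the Velero server exposes Go's standard net/http/pprof handler on localhost:6060 inside the pod):

```sh
# Forward the profiler port from the Velero pod.
kubectl -n velero port-forward deployment/velero 6060:6060 &

# Top heap allocation sites.
go tool pprof -top http://localhost:6060/debug/pprof/heap

# Plain-text heap dump; the tail is a runtime.MemStats listing
# (HeapAlloc, HeapInuse, HeapReleased, NumGC, ...).
curl -s 'http://localhost:6060/debug/pprof/heap?debug=1' | tail -n 40
```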
Thanks @betta1! I'll take a look at this.
@betta1, thank you for all the information on this. The memstats output you shared is indicating the same. To figure out where this leak may be coming from, I tried to re-run your test for about 1h, but my pprof output didn't show anything similar. Furthermore, the memory footprint of my test is in the order of …

At the times when … So here is my ask: …
Update: from the above output, the leak is coming from https://github.com/vmware-tanzu/velero/blob/v1.2.0/pkg/controller/restore_controller.go#L458. @betta1, I wasn't able to reproduce this on the most recent release -- can you confirm whether that's the case for you as well?
Hi @ashish-amarnath, yes, you're right: I'm not able to reproduce this with the most recent release, v1.3.0-beta.1, either. I'll run the tests some more and post updates here; so far it looks like memory is released back to the OS after each backup/restore, unlike before.
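For anyone else who wants to try the beta, I switched the server image roughly like this (a sketch, assuming the default `velero` namespace and deployment/container names):

```sh
# Point the existing deployment at the v1.3.0-beta.1 image and wait
# for the rollout to finish.
kubectl -n velero set image deployment/velero velero=velero/velero:v1.3.0-beta.1
kubectl -n velero rollout status deployment/velero
```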
@betta1 Thanks for confirming.
Closing the loop on this. Without this fix, during restores the program would wait forever for … Going to mark this issue as fixed.
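If the blocked waits pile up, they show up as an ever-growing goroutine count; a quick way to check (a sketch, same profiler assumption as above):

```sh
# First line of the goroutine profile is "goroutine profile: total N";
# with a leak of this kind, N keeps climbing across restores.
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -n 1
```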
What steps did you take and what happened:
We're running backups on a schedule and have observed that the memory usage of the Velero Pod keeps increasing. Backups and restores start failing once memory usage reaches the limit of the Velero Pod (we're using velero install's default memory limit of 256Mi). Is there a possible memory leak, given that the Pod's memory usage grows unbounded while these scheduled backups run?
This issue is closely related to #780. Below is a Grafana dashboard tracking the memory usage of the Velero pod: memory usage keeps increasing until the limit of 256Mi is reached, at which point we observe backups and restores failing due to OOM issues.
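Our schedules are created along these lines (a sketch; the schedule name and cron expression are illustrative, not our exact config):

```sh
# Hourly backups; Velero creates a Backup object per run.
velero schedule create hourly-backup --schedule="0 * * * *"

# Watch the Velero pod's memory climb between runs.
kubectl -n velero top pod
```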
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

- `kubectl logs deployment/velero -n velero`
- `velero backup describe <backupname>` or `kubectl get backup/<backupname> -n velero -o yaml`
- `velero backup logs <backupname>`
- `velero restore describe <restorename>` or `kubectl get restore/<restorename> -n velero -o yaml`
- `velero restore logs <restorename>`
Environment:

- Velero version (use `velero version`): v1.2.0
- Velero features (use `velero client config get features`):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):