source-controller OOM events #303

robparrott · 2021-02-24T00:12:06Z

Describe the bug

When registering FluxCD to a repository in GitLab Enterprise, I am seeing OOM kills on the source-controller pod. Removing the 1GB memory limit fixes the issue.

To Reproduce

I believe registering FluxCD on a repository with some level of complexity is enough to reproduce this.

Expected behavior

The source-controller pod should not be killed and restarted repeatedly.

Additional context

Kubernetes version: 1.19
Git provider: gitlab self-hosted
Container registry provider: gitlab/ECR

Below please provide the output of the following commands:

flux --version : flux version 0.8.0
flux check
► checking prerequisites
✔ kubectl 1.19.3 >=1.18.0
✔ Kubernetes 1.19.6-eks-49a6c0 >=1.16.0
► checking controllers

✔ source-controller: healthy
► ghcr.io/fluxcd/source-controller:v0.8.1
✔ kustomize-controller: healthy
► ghcr.io/fluxcd/kustomize-controller:v0.8.1
✔ helm-controller: healthy
► ghcr.io/fluxcd/helm-controller:v0.7.0
✔ notification-controller: healthy
► ghcr.io/fluxcd/notification-controller:v0.8.0
✔ all checks passed
kubectl -n <namespace> get all
kubectl -n flux-system get all
NAME                                           READY   STATUS             RESTARTS   AGE
pod/helm-controller-6946b6dc7f-5nr8q           1/1     Running            0          9m34s
pod/kustomize-controller-55dfcdfd58-xj25c      1/1     Running            0          10h
pod/notification-controller-649754966b-2677x   1/1     Running            0          10h
pod/source-controller-597cc769b-lp6w4          0/1     CrashLoopBackOff   5          6m23s

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/notification-controller   ClusterIP   10.100.114.245   <none>        80/TCP    10h
service/source-controller         ClusterIP   10.100.185.20    <none>        80/TCP    10h
service/webhook-receiver          ClusterIP   10.100.198.200   <none>        80/TCP    10h

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/helm-controller           1/1     1            1           10h
deployment.apps/kustomize-controller      1/1     1            1           10h
deployment.apps/notification-controller   1/1     1            1           10h
deployment.apps/source-controller         0/1     1            0           10h

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/helm-controller-6779d46d69           0         0         0       10h
replicaset.apps/helm-controller-6946b6dc7f           1         1         1       9m34s
replicaset.apps/kustomize-controller-55dfcdfd58      1         1         1       10h
replicaset.apps/notification-controller-649754966b   1         1         1       10h
replicaset.apps/source-controller-555d4f9d6          0         0         0       10h
replicaset.apps/source-controller-597cc769b          1         1         0       10h




kubectl -n <namespace> logs deploy/source-controller

-- various without errors until killed ---

kubectl -n <namespace> logs deploy/kustomize-controller

-- various ---

{"level":"info","ts":"2021-02-24T00:06:40.724Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"istio-system","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.811Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:06:41.815Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"error","ts":"2021-02-24T00:06:41.825Z","logger":"controller.kustomization","msg":"Reconciliation failed after 1.059192016s, next try in 5m0s","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"podinfo","namespace":"flux-system","revision":"master/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7","error":"failed to download artifact from http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz, error: Get \"http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/podinfo/e43ebfa5bf4b87c46f2e1db495eb571cd398e2f7.tar.gz\": dial tcp 10.100.185.20:80: connect: connection refused"}
{"level":"info","ts":"2021-02-24T00:06:41.843Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.833Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.834Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:07:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.853Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.855Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:08:41.863Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.872Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.874Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:09:41.875Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.893Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"calico","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"kafka","namespace":"flux-system"}
{"level":"info","ts":"2021-02-24T00:10:41.895Z","logger":"controller.kustomization","msg":"Source is not ready, artifact not found","reconciler group":"kustomize.toolkit.fluxcd.io","reconciler kind":"Kustomization","name":"bookinfo","namespace":"flux-system"}

robparrott · 2021-02-24T00:21:33Z

Changing the source-controller deployment resources stanza as follows:

        resources:
          limits:
            cpu: 1000m
            #memory: 1Gi
          requests:
            cpu: 50m
            #memory: 64Mi

addresses the issue

mahmoud-abdelhafez · 2021-05-05T14:03:45Z

I had the same issue, but in my case increasing the memory limit to 2Gi mitigated it.
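
For reference, a minimal sketch of how the limit could be raised through a Kustomize patch rather than by editing the Deployment by hand. It assumes a standard bootstrap layout (a flux-system/kustomization.yaml that lists gotk-components.yaml and gotk-sync.yaml) and that the controller container is named manager. Both are assumptions, not details taken from this thread, and 2Gi is simply the value reported above, not a recommendation.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Strategic merge patch that only overrides the memory limit
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
      spec:
        template:
          spec:
            containers:
              - name: manager   # assumed container name
                resources:
                  limits:
                    memory: 2Gi
```

Keeping the override in Git this way means it should survive a later regeneration of gotk-components.yaml.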

hihellobolke · 2021-08-24T04:56:21Z

I am seeing OOMs with 2Gi and I am on v0.14.1.

thomasroot · 2021-08-25T08:32:12Z

Same here on flux2 version 0.16.2. Increasing the memory limits to 2Gi mitigated the issue.

runningman84 · 2021-08-26T11:56:15Z

This issue seems to be linked to #192.
Our clusters also suffer from this issue; we see memory usage of 1-2GB.

Generally speaking it is strange that a service which just downloads some files from other repos consumes so much memory.

kav · 2021-09-02T20:14:43Z

I was able to trigger this issue by putting interval: 1d in my helm repository spec. Happy to file separately if needed but trying to limit the issue count on source controller OOM

hiddeco · 2021-09-02T21:28:37Z

As with any workload on Kubernetes, the right resource limit configuration highly depends on what you are making the source-controller do (and you may thus have to increase it).

Helm-related operations, for example, are resource intensive because at present we haven't found the right optimization path to work with repository index files without loading them into memory in full (due to certain constraints around the unmarshalling of YAML).

Combined with the popularity of some solutions like Artifactory, which likes to stuff as much as possible in a single index (in some cases resulting in a file of >100MB), and the fact that the reconciliation of resources is isolated, resource usage exceeding the defaults can be expected.

Another task that can be resource intensive is the packaging of a Helm chart from a Git source, because Helm first loads all the chart data into an object in memory (including all files, and the files of the dependencies), before writing it to disk.

For a fun experiment: check the current resources your CI worker nodes have (or ask around), or monitor the resource usage of various helm commands on your local machine, and then take into account that the controller does this in parallel with multiple workers, for multiple resources.

> Generally speaking it is strange that a service which just downloads some files from other repos consumes so much memory.

The controller does much more than just downloading files, and I think you are oversimplifying or underestimating the inner workings of the controller, and ignoring the fact that it has several features that perform composition tasks, etc. In addition, to ensure proper isolation of e.g. credentials, most Git things are done in memory as well.

> I was able to trigger this issue by putting interval: 1d in my helm repository spec. Happy to file separately if needed but trying to limit the issue count on source controller OOM

Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.

Lastly, we are continuously looking into ways to reduce the footprint of our controllers, and I can already tell you some paths have been identified (and are actively worked on) to help reduce it.

Do however always keep in mind that while the YAML creates simple looking and composable abstractions, there will always be processes behind it that actually execute the task, and that the hardware of your local development machine often outperforms most containers.

kav · 2021-09-02T21:30:31Z

> Your Helm index likely is simply too big, or your resource limit settings are too low, see explanation above.

No, it appears 1d is simply not valid, per the log. Sorry, I should have included that:

E0902 19:20:30.626842       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1beta1.HelmRepository: failed to list *v1beta1.HelmRepository: v1beta1.HelmRepositoryList.Items: []v1beta1.HelmRepository: v1beta1.HelmRepository.Spec: v1beta1.HelmRepositorySpec.Timeout: Interval: unmarshalerDecoder: time: unknown unit "d" in duration "1d", error found in #10 byte of ...|rval":"1d","timeout"|..., bigger context ...|0-4596-8543-9d6d4b573433"},"spec":{"interval":"1d","timeout":"60s","url":"https://raw.githubusercont|...

hiddeco · 2021-09-02T21:35:42Z

That is expected, as 1d is simply invalid.

> There is no definition for units of Day or larger to avoid confusion across daylight savings time zone transitions.

https://pkg.go.dev/time#pkg-constants

> A duration string is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m". Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

https://pkg.go.dev/time#ParseDuration
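
As an illustration, a one-day interval can be written in hours instead. The manifest below is only a sketch: the resource name and URL are hypothetical placeholders, and the apiVersion is assumed from the v1beta1 kind that appears in the error log above.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: example-charts              # hypothetical name
  namespace: flux-system
spec:
  interval: 24h                     # "1d" is rejected: Go durations have no day unit
  timeout: 60s
  url: https://charts.example.com   # placeholder URL
```

Units can also be combined, so a value such as 1h30m is valid as well.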

kav · 2021-09-02T21:39:59Z

Yes, sure, but it synchronized that change from the repository into the HelmRepository resource and then OOMed the source-controller trying to read the HelmRepository. I backed out the change in Git but then had to manually edit the HelmRepository object, since the source-controller was hung. I'm not saying it should support days, just that this is a footgun. If it's not supported, I would have expected the HelmRepository to fail validation on the sync.

hiddeco · 2021-09-03T13:12:46Z

@kav can you please move this into a separate issue? I did a small test yesterday evening and was indeed able to apply a resource with an invalid interval format, but the cluster I was testing on wasn't running any controllers at the time so I wasn't able to validate the crash.

mkoertgen · 2022-05-12T10:39:49Z

I'm having the same OOMKilled issue and, with the information from #192, pinned it down to the large Bitnami Helm repository, whose index file alone is 13.4M.

stefanprodan · 2022-05-12T10:44:22Z

For large Helm repository index files, you can enable caching to reduce the memory footprint of source-controller, docs here: https://fluxcd.io/docs/cheatsheets/bootstrap/#enable-helm-repositories-caching
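
A sketch of what enabling the cache can look like as a Kustomize patch in flux-system/kustomization.yaml. The flag names follow the linked documentation as I understand it and should be checked against the source-controller version you run. The values are illustrative only, and the assumption that the controller is the first container in the pod spec is mine.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # JSON 6902 patch appending the cache flags to the controller args
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-max-size=10        # illustrative: number of cached indexes
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-ttl=60m            # illustrative: lifetime of a cached index
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --helm-cache-purge-interval=5m  # illustrative: how often expired entries are purged
```

A reasonable starting point is probably a max size close to the number of HelmRepository objects in the cluster, adjusted after watching the cache metrics.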

mkoertgen · 2022-05-12T11:07:19Z

Thanks for the documentation link @stefanprodan. That was helpful.

Removing the Bitnami Helm repositories from redundant namespaces brought the memory footprint down to 190M, though it still peaks every 10 minutes (the Helm repository update interval).

I will check on enabling helm-caching. Thanks again, much appreciated.

mkoertgen · 2022-05-12T11:29:48Z

I needed to update from 0.28 to 0.30 for the Helm cache arguments to be available.

gotk_cache_events_total looks good so far. Will observe the mem-footprint but for now seems to solve the issue, at least for me.

Thanks again.

mkoertgen · 2022-05-12T12:11:37Z

Looks much better with helm-caching enabled

stefanprodan · 2022-05-12T12:20:48Z

Yeap that's consistent with what I'm seeing on my test clusters, using source-controller cache brought the memory from 2GB down to 200MB.

stefanprodan · 2024-03-04T13:35:16Z

The documentation on enabling Helm caching is now here: https://fluxcd.io/flux/installation/configuration/vertical-scaling/#enable-helm-repositories-caching

stefanprodan transferred this issue from fluxcd/flux2 Feb 24, 2021

stefanprodan mentioned this issue May 26, 2021

Update Git packages #365

Merged

onelapahead mentioned this issue Oct 12, 2021

Is there some way to make resource settings configurable within the provider? fluxcd/terraform-provider-flux#122

Closed

famousgarkin mentioned this issue Oct 26, 2021

OOM crash in source-controller pod fluxcd/flux2#991

Closed

apatelGWS added a commit to apatelGWS/flux2-kustomize-helm-example that referenced this issue Feb 21, 2022

Update gotk-components.yaml

45f0ba6

updated source-controller deployment according to this issue: fluxcd/source-controller#303

apatelGWS added a commit to apatelGWS/flux2-kustomize-helm-example that referenced this issue Feb 21, 2022

Update gotk-components.yaml

1c444a4

fluxcd/source-controller#303 (comment)

stefanprodan closed this as completed Mar 4, 2024
