Controller stops reconciling, needs restart #282

stefanprodan · 2021-12-17T13:57:19Z

Having an automation that should reconcile every 7 minutes:

apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageUpdateAutomation
metadata:
  name: flux-system
  namespace: flux-system
spec:
  git:
    checkout:
      ref:
        branch: master
    commit:
      author:
        email: [email protected]
        name: me
      messageTemplate: '{{range .Updated.Images}}{{println .}}{{end}}'
    push:
      branch: master
  interval: 7m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  update:
    path: ./staging
    strategy: Setters

The reconciliation stoped two days ago for unknown reasons:

$ date
Fri Dec 17 16:25:48 EET 2021

$ flux get image update

NAME       	READY	MESSAGE                                                     	LAST RUN                 	SUSPENDED
flux-system	True 	no updates made; last commit 8574614 at 2021-12-14T22:47:08Z	2021-12-15T08:15:01-07:00	False

The text was updated successfully, but these errors were encountered:

jwerre · 2021-12-17T15:58:24Z

This coincided with outages at AWS in us-west-2 where I'm running Kubernetes on EKS. My entire cluster went down on that day for about 10 minutes and when it came back up I ran into this issue.

bondido · 2021-12-30T11:10:50Z

Hi,
While waiting for permanent fix, are there any advice how to detect "stucked" image-automation-controller case?
Are there any metrics or logs we should monitor?
We have been suffering the issue (~couple of times per week) for some time and the only alert is from users who pushed their images to container registry and didn't notice cluster deployment for quite some time.

Another observation is that in most cases restarting image-automation-controller is sufficient, but there where two times when we needed to also restart source-controller.

jwerre · 2021-12-30T16:42:42Z

@bondido how about something like this:

kubectl get pod -n flux-system

NAME                                          READY   STATUS    RESTARTS   AGE
helm-controller-96dc99bfd-p9g4q               1/1     Running   0          9d
image-automation-controller-64c955c59-ckbft   1/1     Running   0          9d  <<< ------
image-reflector-controller-55fb7f765d-cr8zn   1/1     Running   0          9d
kustomize-controller-7bc878f768-xf2xb         1/1     Running   0          9d
notification-controller-945795558-q8nht       1/1     Running   0          9d
source-controller-65665fd68f-n7qqz            1/1     Running   0          9d

bondido · 2021-12-30T18:01:37Z

@jwerre
I'm afraid it won't help, as long running controller is no problem.
The problem is that it's running but not doing it's job and not reflecting it in any log or metric (I still hope I missed something here :-) )

jwerre · 2021-12-31T16:32:37Z

Ah yes.. I had the same problem you need to restart the pod e.g.:

kubectl delete pods image-automation-controller-64c955c59-ckbft -n flux-system

bondido · 2022-01-03T15:47:55Z

Sure @jwerre , I know. Thanks :-)
As I mentioned earlier, sometime restarting image-automation-controller is not enough and we need to restart source-controller pod as well.

As temporary kind of "automatic mitigation" we introduced a scheduled external script checking .status.lastAutomationRunTime of imageupdateautomation resource (

image-automation-controller/config/crd/bases/image.toolkit.fluxcd.io_imageupdateautomations.yaml

Line 170 in 041018f

lastAutomationRunTime:

) and invoking 'kubectl pod delete' commands if it seems to stuck for too long.

We'd love to see this bug fixed soon, anyway :-)

jwerre · 2022-01-04T15:37:47Z

@bondido, I misunderstood your question. Sorry for the confusion.

squaremo · 2022-01-05T16:16:14Z

sometime restarting image-automation-controller is not enough and we need to restart source-controller pod as well.

@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?

bondido · 2022-01-05T16:48:38Z

@bondido Do you mean that source-controller has the same problem; or, that to get image-automation-controller to start working again, you needed to restart source-controller?

We have to restart source-controller for image-automation-controller to start working. In fact we restart both - first image-automation-controller, and if we don't see any improvement in a couple of minutes time - source-controller.

So far, we haven't try to restart just source-controller.

squaremo · 2022-01-05T17:01:56Z

@bondido Thanks for elaborating! On the face of it, I wouldn't expect restarting source-controller to have any effect on image-automation-controller, because it works independently: it only coincidentally refers to the same GitRepository objects, and doesn't alter anything at the upstream git repository (that might "unlock" image-automation-controller). Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?

stefanprodan · 2022-01-05T17:58:05Z

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

bondido · 2022-01-05T21:40:39Z

Do you have good evidence that restarting source-controller is exactly what unblocks image-automation-controller; or could it be a sort of "reliable coincidence"?

I can't be 100% sure as I couldn't get to any logs or metric confirming what was actually happening.

At first two cases restarting image-automation-controller was enough and new images were applied to cluster just seconds after the restart.
But then - on the third occurence - we restarted image-automation-controller and nothing happened for over 10 minutes. So it was just blind shot by my colleague to restart source-controller, after which everything started working.

The situation repeated exactly like the above for one more time.

bondido · 2022-01-05T21:43:46Z

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

You mean checking the directory contents inside "stucked" controller pod, don't you?
(in "working OK" pod I've checked it and it's empty).

jwerre · 2022-01-06T15:30:01Z

@jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

ls -lah /tmp
total 0
drwxrwsrwx 2 root 1337  6 Jan  6 15:22 .
drwxr-xr-x 1 root root 17 Jan  4 16:20 ..

du -sh /tmp/*
du: cannot access '/tmp/AlTest1.err': No such file or directory
du: cannot access '/tmp/AlTest1.out': No such file or directory
du: cannot access '/tmp/adobegc.log': No such file or directory
du: cannot access '/tmp/com.apple.launchd.yGWoPxlEXv': No such file or directory
du: cannot access '/tmp/com.google.Keystone': No such file or directory
du: cannot access '/tmp/fseventsd-uuid': No such file or directory
du: cannot access '/tmp/powerlog': No such file or directory
command terminated with exit code 1

jwerre · 2022-01-06T15:31:10Z

I should mention that I haven't had any problems since I restated the controller pod the first time.

ahisette · 2022-01-10T09:38:22Z

Hello,

can you please exec into the controller pod and see if there is anything left in tmp by running `ls -lah /tmp

In my case, on stuck controller, in /tmp , I have a directory named like GitRepository source of frozen ImageUpdateAutomation.

And a simple restart of the automation controller is enough to unblock the frozen ImageUpdateAutomation.

kingdonb · 2022-01-11T18:42:00Z

I've gone to some lengths to try reproducing this issue, I ran image-automation-controller with a larger than average gitrepo (stuffed with several mp4 video files), and ramped up all of the unfavorable network conditions (packet loss, latency) with Chaos Mesh, reconfigured liveness checks so that image-automation-controller wouldn't be restarted due to network reasons, (which was tricky because it actually needs the network in order to perform the leader election)

With all webhooks configured as receivers for image and git events to make sure everything happens quickly after each commit/image release, ran this for several hours with updates every 45 seconds, and I wasn't able to get the image-automation-controller into any stuck or hanging state. I was able to cause it to stop working due to heavy packet loss, but nothing I did seemed to induce any sort of hanging behavior. (When the unfavorable conditions abated, the controller always recovered and went back to committing and pushing changes for me.)

If anyone knows what type of network issue or abnormal response from GitHub triggers the condition, then surely I can reproduce it and make progress on this issue, but right now I have not made significant progress on it.

bondido · 2022-01-20T11:00:57Z

@bondido @jwerre can you please exec into the controller pod and see if there is anything left in tmp by running ls -lah /tmp and du -sh /tmp/*. Thanks!

After over 16 days since last problems, controller has just "stuck" on one of our clusters.
The output of above commands run on the stucked pod is:

total 12K
drwxrwsrwx 3 root       1337 4.0K Jan 20 09:49 .
drwxr-xr-x 1 root       root 4.0K Jan  3 09:24 ..
drwx--S--- 6 controller 1337 4.0K Jan 20 09:49 flux-system-flux-system56439449

3.0M	/tmp

ahisette · 2022-01-21T12:59:01Z

An other example, on a cluster with 16 differents ressources ImageUpdateAutomation, I have 3 of them "stuck".

controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ ls -alh
total 20K
drwxrwsrwx 5 root       1337 4.0K Jan 21 12:37 .
drwxr-xr-x 1 root       root 4.0K Jan 17 09:21 ..
drwx--S--- 3 controller 1337 4.0K Jan 20 08:55 xxxx-xxxx-flux1428284464
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 xxxx-xxxx-flux3351794846
drwx--S--- 3 controller 1337 4.0K Jan 20 10:17 yyyy-yyyy-flux2915045790

controller@image-automation-controller-65ccc7d5c6-7mgv9:/tmp$ du -sh /tmp/*
2.0M	/tmp/xxxx-xxxx-flux1428284464
696K	/tmp/xxxx-xxxx-flux3351794846
3.5M	/tmp/yyyy-yyyy-flux2915045790

Can this image from #297 resolve this issue ?
Because I sometimes have errors like Unable to clone: Connection timed out or Unable to clone: failed to start SSH session: Unable to exchange encryption keys

stefanprodan · 2022-01-21T13:06:29Z

@ahisette yes the libgit2 timeout callback could be the reason, please try out the image from #297 and see if it problem goes away.

ahisette · 2022-02-01T08:56:52Z

Hello,
I put new image on 2 of 4 clusters, after one week of testing.
I have no good news : One freeze for one cluster with new image and also one freeze for one "old" image.
With the same symptoms.

nobody@image-automation-controller-84bfbc78dc-89bqv:/$ ls -alh /tmp
total 12K
drwxrwsrwx 3 root   1337 4.0K Feb  1 08:54 .
drwxr-xr-x 1 root   root 4.0K Jan 24 13:25 ..
drwx--S--- 8 nobody 1337 4.0K Feb  1 07:13 xxxx-xxxx-flux2375737287

nobody@image-automation-controller-84bfbc78dc-89bqv:/$ du -sh /tmp/*
5.9M	/tmp/xxxx-xxxx-flux2375737287

hiddeco · 2022-02-07T13:43:28Z

With the release of Flux v0.26.2, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed some of the issues as described to have vanished (and confirmed by others as per fluxcd/source-controller#439 (comment)).

raress96 · 2022-02-08T06:35:32Z

@hiddeco Installed yesterday the new flux and today pushed an image, it seems to also work for me now.

pjbgf · 2022-03-07T17:01:12Z

I managed to reproduce this locally. I am running against the latest IAC version v0.20.1, with a single SSH github repository using libgit2 with 1 minute intervals and the same for timeouts.

Just to help further investigations I will relay some of my observations/assumptions here as I progress.

By analysing the pprof endpoint, I noticed that the time difference in minutes between last IAC reconciliation log message, seems to match the running time of the thread/goroutine below. Leading me to think that libgit2 git_remote_fetch is hanging. This was previously reported upstream and fixed a few years back. However, the issues may be correlated.

goroutine 332 [syscall, 4209 minutes, locked to thread]:
github.com/libgit2/git2go/v33._Cfunc_git_remote_fetch(0x7f37efdf97c0, 0xc0006e9290, 0xc00073cd00, 0x0)
	_cgo_gotypes.go:6954 +0x4c
github.com/libgit2/git2go/v33.(*Remote).Fetch.func2(0xc00073cd00, 0xc00010fe60, 0x40e414, 0xab)
	github.com/libgit2/git2go/[email protected]/remote.go:1044 +0xa7
github.com/libgit2/git2go/v33.(*Remote).Fetch(0xc00056e720, {0xc000f98ec0, 0xc001331e60, 0x1}, 0x2, {0x0, 0x50b8a6})
	github.com/libgit2/git2go/[email protected]/remote.go:1044 +0x1e9
github.com/fluxcd/image-automation-controller/controllers.fetch({0x1fc1c40, 0xc00056e600}, {0xc0013c6690, 0x25}, {0xc000aac370, 0x4}, {0xc001331e60, {0xc000dbef80, 0xc000676200}})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:743 +0x32e
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc0003f2680, {0x1fc1c78, 0xc0012cad80}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x30}}})
	github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:270 +0x1b15
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00010c000, {0x1fc1c78, 0xc0012cacc0}, {{{0xc0001867e0, 0x1c92880}, {0xc0001867d0, 0x414fb4}}})
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00010c000, {0x1fc1bd0, 0xc0003f2540}, {0x1bb1ea0, 0xc0011343a0})
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00010c000, {0x1fc1bd0, 0xc0003f2540})
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x357

The container is still operational (I can exec into it) and other goroutines seem to be working as expected.

For an automatic restart, users could leverage the workqueue_longest_running_processor_seconds metric whilst we try to get to the bottom of the issue. Here's a dump of some of the metrics for my hung IAC:

workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 256535.347854138
# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# TYPE workqueue_queue_duration_seconds histogram
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-08"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-07"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1e-06"} 0
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-06"} 4
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="9.999999999999999e-05"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.001"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.01"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="0.1"} 83
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="1"} 84
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="10"} 86
workqueue_queue_duration_seconds_bucket{name="imageupdateautomation",le="+Inf"} 91
workqueue_queue_duration_seconds_sum{name="imageupdateautomation"} 1000.9602119180001
workqueue_queue_duration_seconds_count{name="imageupdateautomation"} 91

pjbgf · 2022-03-22T17:50:54Z

The image-automation controller version v0.21.0 introduces an experimental transport that fixes the issue in which the controller stops working in some specific scenarios.

The experimental transport needs to be opted-in by setting the environment variable EXPERIMENTAL_GIT_TRANSPORT to true in the controller's Deployment. Once this feature has been tested extensively it may later become enabled by default.

Due to changes on other Flux components, it is recommended that all components are deployed on their latest versions. The recommended approach is via flux bootstrap using the flux cli version v0.28.0 which will be released tomorrow.

It would be great if users experiencing this issue could test it again with the experimental transport enabled and let us know whether the issue persists.

maxbrunet · 2022-03-22T19:13:55Z

Hi @pjbgf, thank you for the update, I deployed v0.21.0 as you were posting your message, and unfortunately it seems to persist.

Sample metrics:

workqueue_longest_running_processor_seconds{name="imageupdateautomation"} 3302.405028565
workqueue_queue_duration_seconds_bucket{le="+Inf", name="imageupdateautomation"} 3

EXPERIMENTAL_GIT_TRANSPORT is set:

$ kubectl --namespace=flux-system exec -ti image-automation-controller-7995f48c77-g99qd -- \
  printenv EXPERIMENTAL_GIT_TRANSPORT
true

Version:

$ kubectl --namespace=flux-system get pod image-automation-controller-7995f48c77-g99qd \
  --output=jsonpath='{.spec.containers[?(@.name=="manager")].image}'
ghcr.io/fluxcd/image-automation-controller:v0.21.0

Nothing specific in the logs. Our interval is 1m and it does commit on restarts.
The latest working version for us is still v0.19.0.

pjbgf · 2022-03-22T19:26:04Z

@maxbrunet thank you for the quick response. Would you be able to collect a profile and share either through here or slack please?

maxbrunet · 2022-03-23T18:24:54Z

Here is the output of /debug/pprof/goroutine?debug=2: https://gist.github.com/maxbrunet/8e304510d506e83be6ea2bb8a78ab42b

pjbgf · 2022-03-29T12:37:04Z

@maxbrunet thank you again for testing and providing the details so promptly.
We have since fixed a few more issues in the new transport and released a new minor version. Would you be able to give it another try and confirm whether that fixes the problems you were experiencing?

Here's more information on how to test: fluxcd/source-controller#636 (comment)

maxbrunet · 2022-03-29T18:32:00Z

Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup:

image-automation-controller - panic trace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1471573]

goroutine 520 [running]:
github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc000623450)
        github.com/fluxcd/[email protected]/pkg/git/libgit2/managed/ssh.go:268 +0x93
github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404da6, 0xc0002f1a00)
        github.com/libgit2/git2go/[email protected]/transport.go:409 +0x6f
github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc000343120, 0x7ffa2c44e680, 0x7ffa2c42b520, 0xc00087f040)
        _cgo_gotypes.go:3244 +0x4c
github.com/libgit2/git2go/v33.Clone.func3(0xc000d1d938, 0x6, 0xc0006c4c70, 0x14eefc5)
        github.com/libgit2/git2go/[email protected]/clone.go:43 +0x91
github.com/libgit2/git2go/v33.Clone({0xc0009bb500, 0xc0008ab2c0}, {0xc0000479e0, 0x28}, 0xc0007639e0)
        github.com/libgit2/git2go/[email protected]/clone.go:43 +0x19e
github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc0006c4c10, {0x1fd7b40, 0xc0008ab2c0}, {0xc0000479e0, 0x28}, {0xc0009bb500, 0x3a}, 0x0)
        github.com/fluxcd/[email protected]/pkg/git/libgit2/checkout.go:64 +0x22d
github.com/fluxcd/image-automation-controller/controllers.cloneInto({0x1fd7b40, 0xc0008ab2c0}, {0xc0009d93b0, {0xc0009bb500, 0x22}}, 0xc0008ab260, {0xc0000479e0, 0x28})
        github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:547 +0x1a3
github.com/fluxcd/image-automation-controller/controllers.(*ImageUpdateAutomationReconciler).Reconcile(0xc000772140, {0x1fd7b78, 0xc0009d6e40}, {{{0xc000d1d960, 0x1ca56a0}, {0xc000d1d950, 0x30}}})
        github.com/fluxcd/image-automation-controller/controllers/imageupdateautomation_controller.go:283 +0x1805
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc0001622c0, {0x1fd7b78, 0xc0009d6db0}, {{{0xc000d1d960, 0x1ca56a0}, {0xc000d1d950, 0x414fd4}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001622c0, {0x1fd7ad0, 0xc000772080}, {0x1bc41e0, 0xc000465940})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001622c0, {0x1fd7ad0, 0xc000772080})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x357

source-controller - panic trace

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a7cf53]

goroutine 529 [running]:
github.com/fluxcd/source-controller/pkg/git/libgit2/managed.(*sshSmartSubtransport).Close(0xc00036bc70)
        github.com/fluxcd/source-controller/pkg/git/libgit2/managed/ssh.go:268 +0x93
github.com/libgit2/git2go/v33.smartSubtransportCloseCallback(0x404e06, 0xc0009d9d40)
        github.com/libgit2/git2go/[email protected]/transport.go:409 +0x6f
github.com/libgit2/git2go/v33._Cfunc_git_clone(0xc0003512d0, 0x7ff2fa7a30e0, 0x7ff2fa7a3120, 0xc0008ef380)
        _cgo_gotypes.go:3244 +0x4c
github.com/libgit2/git2go/v33.Clone.func3(0xc000866e44, 0x6, 0xc00061aec0, 0x1b4db45)
        github.com/libgit2/git2go/[email protected]/clone.go:43 +0x91
github.com/libgit2/git2go/v33.Clone({0xc000c7fd40, 0xc000304ae0}, {0xc0004b0800, 0x36}, 0xc0005657a0)
        github.com/libgit2/git2go/[email protected]/clone.go:43 +0x19e
github.com/fluxcd/source-controller/pkg/git/libgit2.(*CheckoutBranch).Checkout(0xc00061ae60, {0x27ca660, 0xc000304ae0}, {0xc0004b0800, 0x36}, {0xc000c7fd40, 0x3a}, 0x0)
        github.com/fluxcd/source-controller/pkg/git/libgit2/checkout.go:64 +0x22d
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcileSource(0xc00095f3b0, {0x27ca698, 0xc000771e90}, 0xc000876200, 0xc0009335f0, 0x18, {0xc0004b0800, 0x36})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:404 +0x99f
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).reconcile(0x2834958, {0x27ca698, 0xc000771e90}, 0xc000876200, {0xc00110dbe8, 0x4, 0x32000040e494})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:244 +0x3d5
github.com/fluxcd/source-controller/controllers.(*GitRepositoryReconciler).Reconcile(0xc00095f3b0, {0x27ca698, 0xc000771e90}, {{{0xc000450f40, 0x2384b60}, {0xc000450f20, 0x30}}})
        github.com/fluxcd/source-controller/controllers/gitrepository_controller.go:205 +0x4bb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000118bb0, {0x27ca698, 0xc000771e00}, {{{0xc000450f40, 0x2384b60}, {0xc000450f20, 0x415034}}})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x26f
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000118bb0, {0x27ca5f0, 0xc000a3bf00}, {0x2226280, 0xc000635340})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x33e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000118bb0, {0x27ca5f0, 0xc000a3bf00})
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x357

GitRepository + Secret

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  gitImplementation: libgit2
  interval: 1m0s
  ref:
    branch: master
  url: ssh://[email protected]/my-workspace/my-repo.git
  secretRef:
    name: flux-git-credentials

apiVersion: v1
kind: Secret
metadata:
  name: flux-git-credentials
  namespace: flux-system
stringData:
  identity: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----
  known_hosts: bitbucket.org ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAubiN81eDcafrgMeLzaFPsw2kNvEcqTKl/VqLat/MaB33pZy0y3rJZtnqwR2qOOvbwKZYKiEO1O6VqNEBxKvJJelCq0dTXWT5pbO2gDXC6h6QDX
CaHo6pOHGPUy+YBaGQRGuSusMEASYiWunYN0vCAI8QaXnWMXNMdFP3jHAJH0eDsoiGnLPBlBp4TNm6rYI74nMzgz3B9IikW4WVK+dc8KZJZWYjAuORU3jc1c/NPskD2ASinf8v3xnfXeukU0sJ5N6m5E8VLjObPEO+mN2t/FZTMZLiFqPW
c/ALSqnMnnhwrNi2rbfg/rd/IpL8Le3pSBne8+seeFVBoGqzHM9yXw==
type: Opaque

FYI I had not realized I needed to change spec.gitImplementation to libgit2 before.

pjbgf · 2022-05-12T20:18:39Z

@maxbrunet we have made some improvements that may fix the issue you are experiencing. I have a release candidate for source-controller (below) that is based on a pending PR.

Can you please test the image below and let us know whether that fixes your problem?
ghcr.io/fluxcd/source-controller:rc-6d517589

maxbrunet · 2022-05-12T21:51:57Z

Hey @pjbgf, no, sorry, I used Flux with my previous employer, and I am not working with it at the moment

pjbgf · 2022-05-27T22:23:54Z

@maxbrunet no worries, thank you for all the help so far.

pjbgf · 2022-05-27T22:25:01Z

This should be fixed as part of the managed transport improvements and the enforcement of context timeouts.
Release candidates for both controllers:

ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
ghcr.io/fluxcd/image-automation-controller:rc-48bcca59

pjbgf · 2022-06-09T11:48:43Z

Closing this for lack of activity. Similarly reported issues have been confirmed to be fixed.

Now with Managed Transport enforcing timeouts for Git operations, this should be resolved.

If it reoccurs, given the sheer amount of changes that happened on the Git implementation in the last 6 months, we are better off creating a new issue, linking back to this one.

mickeahlinder · 2022-06-17T14:48:34Z

Hi @pjbgf, I have tried to deploy the latest versions, image-automation-controller v0.21.2 and source-controller v0.22.4, but they both panic shortly after startup:

image-automation-controller - panic trace
source-controller - panic trace
GitRepository + Secret
FYI I had not realized I needed to change spec.gitImplementation to libgit2 before.

@maxbrunet how do you obtain these panic traces, I've portforwared pprof endpoints and cat get dubug-info at request. But how to get information when the process panics?

maxbrunet · 2022-06-17T17:28:13Z

Stacktraces are dump in the logs when the process panics, you can get the logs from the last restart with kubectl logs --previous POD_NAME

mickeahlinder · 2022-06-17T17:32:52Z

Stacktraces are dump in the logs when the process panics, you can get the logs from the last restart with kubectl logs --previous POD_NAME

Ah, ok, thank you.

stefanprodan added the bug Something isn't working label Dec 17, 2021

squaremo self-assigned this Jan 5, 2022

stefanprodan mentioned this issue Jan 6, 2022

image-automation-controller not reconnecting after operation timed out #209

Closed

1 task

kingdonb mentioned this issue Jan 7, 2022

Image update automation not committing the resolved version to git #159

Open

pjbgf self-assigned this Mar 7, 2022

pjbgf mentioned this issue Mar 7, 2022

Image Automation controller silently stops working #286

Closed

1 task

pjbgf mentioned this issue Mar 11, 2022

Experimental managed transport for libgit2 operations fluxcd/source-controller#606

Merged

pjbgf mentioned this issue Mar 20, 2022

Add experimental managed transport for libgit2 operations #326

Merged

pjbgf mentioned this issue Mar 29, 2022

Git Overall Improvements fluxcd/flux2#2593

Closed

35 tasks

pjbgf added this to the GA milestone May 13, 2022

pjbgf closed this as completed Jun 9, 2022

0xStarcat mentioned this issue May 29, 2024

Flux 2.2, image-automation-controller stops reconciling; needed restart #689

Open

Controller stops reconciling, needs restart #282

Controller stops reconciling, needs restart #282

Comments

stefanprodan commented Dec 17, 2021 • edited Loading

jwerre commented Dec 17, 2021

bondido commented Dec 30, 2021

jwerre commented Dec 30, 2021 • edited Loading

bondido commented Dec 30, 2021

jwerre commented Dec 31, 2021

bondido commented Jan 3, 2022

jwerre commented Jan 4, 2022

squaremo commented Jan 5, 2022

bondido commented Jan 5, 2022

squaremo commented Jan 5, 2022

stefanprodan commented Jan 5, 2022

bondido commented Jan 5, 2022

bondido commented Jan 5, 2022

jwerre commented Jan 6, 2022

jwerre commented Jan 6, 2022

ahisette commented Jan 10, 2022 • edited Loading

kingdonb commented Jan 11, 2022 • edited Loading

bondido commented Jan 20, 2022

ahisette commented Jan 21, 2022

stefanprodan commented Jan 21, 2022

ahisette commented Feb 1, 2022

hiddeco commented Feb 7, 2022

raress96 commented Feb 8, 2022

pjbgf commented Mar 7, 2022 • edited Loading

pjbgf commented Mar 22, 2022

maxbrunet commented Mar 22, 2022 • edited Loading

pjbgf commented Mar 22, 2022

maxbrunet commented Mar 23, 2022 • edited Loading

pjbgf commented Mar 29, 2022

maxbrunet commented Mar 29, 2022

pjbgf commented May 12, 2022

maxbrunet commented May 12, 2022

pjbgf commented May 27, 2022

pjbgf commented May 27, 2022

pjbgf commented Jun 9, 2022

mickeahlinder commented Jun 17, 2022

maxbrunet commented Jun 17, 2022

mickeahlinder commented Jun 17, 2022

stefanprodan commented Dec 17, 2021 •

edited

Loading

jwerre commented Dec 30, 2021 •

edited

Loading

ahisette commented Jan 10, 2022 •

edited

Loading

kingdonb commented Jan 11, 2022 •

edited

Loading

pjbgf commented Mar 7, 2022 •

edited

Loading

maxbrunet commented Mar 22, 2022 •

edited

Loading

maxbrunet commented Mar 23, 2022 •

edited

Loading