Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ImageRepository manifests ignoring spec.secretRef changes #2286

Closed
1 task done
stvnksslr opened this issue Jan 12, 2022 · 22 comments
Closed
1 task done

ImageRepository manifests ignoring spec.secretRef changes #2286

stvnksslr opened this issue Jan 12, 2022 · 22 comments

Comments

@stvnksslr
Copy link

Describe the bug

We noticed this issue after updating to v0.25.1 and the issue is not currently effecting one of our other clusters that is on v0.24.1

when making changes to our image repository manifests we noticed that despite the reconciliation passing without issue the spec.secretRef field was not effected example.

Git Version

apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: some-app
  namespace: flux-system
spec:
  image: <ECR_URL>/some-app
  interval: 5m

Cluster Version

apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: some-app
  namespace: flux-system
spec:
  image: <ECR_URL>/some-app
  interval: 5m
  secretRef:
    name: ecr-credentials

Steps to reproduce

  1. add a spec.secretRef section to an existing ImageRepository manifest
  2. commit to git
  3. watch reconciliation pass successfully
  4. remove field
  5. watch reconciliation pass successfully
  6. see that spec.secretRef has not been removed

Expected behavior

I expect that when removing the spec.secretRef for the sync process to remove it on the cluster as well or error if there is a reason it cannot be edited/applied.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v.0.25.1

Flux check

► checking prerequisites
✔ Kubernetes 1.21.5 >=1.19.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.15.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.19.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.15.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.19.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.20.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.20.1
✔ all checks passed

Git provider

gitlab

Container Registry provider

ECR

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@kingdonb
Copy link
Member

kingdonb commented Jan 12, 2022

I have done a test to try to confirm this report, and I was able to confirm so far the issue seems to exist at least as far back as Flux 0.24.1

In my case I removed a secretRef from a GitRepository in a way that should have definitely made it invalid, but it synced OK. When I deleted the GitRepository from the cluster and let it sync again from an upstream Kustomization, I found it was definitely invalid:

Applied on a fresh cluster with 0.24.1 (or even just deleting the earlier GitRepository first) the first of these commits, I would have expected to remove the fields which Flux applied in earlier versions, but they are not removed. The second commit is definitely invalid according to the --dry-run=server apply, but it gets accepted without issue and the Kustomization that applies it says:

flux-system           	True 	Applied revision: staging/cdc97dfdea7f5b8d9ae5262ade98b86d16e62a04   	staging/cdc97dfdea7f5b8d9ae5262ade98b86d16e62a04   	False

If I delete the GitRepository from the cluster and let it sync from flux-system again, Flux confirms this manifest has an error and it emits events showing that the next reconciled sync was prevented from successfully completing, like I'd expect.

$ flux reconcile ks flux-system
► annotating Kustomization flux-system in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✗ Kustomization reconciliation failed: GitRepository/flux-system/flux-sync dry-run failed, reason: Invalid, error: GitRepository.source.toolkit.fluxcd.io "flux-sync" is invalid: spec.secretRef.name: Required value

I'm going to start checking older versions of Flux since around 0.18 since that's when server side apply was first used and see if they all mirror this behavior, or if I can narrow down by bisecting which change actually introduced this behavior, since it does not seem like correct behavior and I'm assuming this information will be helpful now that I've reproduced an issue.

Thanks for your report @stvnksslr I'll let you know what I find out 👍

@kingdonb
Copy link
Member

There is definitely a regression between 0.18.3 and 0.24.1, I'm working on narrowing it down now.

In Flux 0.18.3, this commit definitely removes the secretRef from the GitRepository as I'd expect it to.

@kingdonb
Copy link
Member

I found an error in my testing approach, something has changed and I no longer seem to have a repro for this issue now that I corrected my error. Even on the latest version, where I thought I was able to reproduce this issue easily on the first try, now it does not seem to be throwing a problem at all.

The GitRepository updates and removes its secretRef, then errors out predictably as you'd expect any SSH GitRepos to fail if they did not have any SSH key attached.

I'll spend some more time on it to see if I can see what I was seeing again, since it would be very helpful to have a clean repro for some of the server side apply issues we've seen reported for a while now.

@kingdonb
Copy link
Member

Hmm, it seems the issue reproduces cleanly on a new v0.25.1 installation, as in fresh new cluster which has only just had v0.25.1 installed on it, without upgrading from any earlier version. If I create a GitRepository or some other resource with a secretRef (or I presume any other optional field that can be omitted, as I have no reason to think this issue is specific to secretRef), then commit a new version of the manifest which is missing the secretRef, I do not see the field disappear from the cluster-based representation of the resource after the latest manifest has been applied by Flux.

If I upgrade from one version to the next, the issue does not reproduce reliably anymore. I am still trying to isolate the first version that had this issue. It takes some time because I have to destroy the entire cluster and try again with each suspect version on a fresh cluster, almost seems as though it matters which version of Flux is upgraded to which version of Flux.

@kingdonb
Copy link
Member

v0.23.0 and prior does not seem to have this bug. I can bootstrap Flux fresh on a new cluster with this version, then remove the secretRef, then edit the live resource on the cluster and see the spec.secretRef section has been removed as soon as Flux synced. Later versions fail to remove the secret ref.

The issue that is described here is surfacing for me seems to be reproduced in v0.24.0 and on, all suffer from the issue described here when installed on a fresh cluster. But for some reasons, couldn't be reproduced in every cluster.

@stefanprodan
Copy link
Member

stefanprodan commented Jan 13, 2022

This is related to fluxcd/kustomize-controller#486 you can't delete fields added by kubectl. You need to clear the managed fields first.

kubectl -n flux-system patch imagerepository some-app --type=json -p='[{"op": "replace", "path": "/metadata/managedFields", "value": [{}]}]'

@yebyen
Copy link
Contributor

yebyen commented Jan 13, 2022

The repro was done on a fresh K8s 1.23.1 cluster which has never been touched by kubectl - with only Flux 0.25.1 ever installed on it - I'll try it again today to confirm for sure, but this doesn't jibe with the managed fields explanation if it's supposed to have been fixed in 0.25

@stefanprodan
Copy link
Member

if it's supposed to have been fixed in 0.25

The managed fields issue is still opened, but we’ve updated the Kubernetes packages which could improve the situation.

@kingdonb
Copy link
Member

kingdonb commented Jan 13, 2022

Wouldn't users only experience this bug if they were upgrading from an older version of Flux (earlier than SSA) to a newer one? I'm not understanding how I would have come into any fields managed by kubectl, when this resource was added by Flux and synced through the gitrepo on the latest version of Flux, and this cluster has never been touched by kubectl except for completely unrelated kubectl apply to add the sops-gpg after the flux-system kustomization and namespace.

I have been able to confirm again that K8s 1.23.1 with Flux 0.25.1, never operated on any earlier version, apparently has this issue and the managedFields patch approach to solving it offered above has no effect, as it appears the resource has no managedFields to patch and remove when it is opened with kubectl edit --show-managed-fields.

@stefanprodan
Copy link
Member

Can you post here the object with the managed fields when it gets in that state where removing the spec.secretRef from Git has no effect?

@kingdonb
Copy link
Member

kingdonb commented Jan 13, 2022

There are no managed fields...

(I pasted some managed fields a moment ago, I grabbed them from a different resource.)

Not sure how this gitrepo came to be managed by Flux but without any managedFields

Some other resources, like traefik-crds, do have managedFields set on them. But this resource for some reason does not.

@kingdonb
Copy link
Member

kingdonb commented Jan 13, 2022

Here is the resource in context that comes up missing any managedFields when it's applied by Flux from this repo: https://github.com/kingdonb/bootstrap-repo/blob/staging/clusters/moo-cluster/flux-system/flux-sync.yaml

(I have it deployed from the flux-system Kustomization)

$ kubectl get gitrepo flux-sync --show-managed-fields -oyaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  annotations:
    reconcile.fluxcd.io/requestedAt: 2022-01-13 13:02:40.004571749 +0000 UTC m=+855.413796794
  creationTimestamp: "2022-01-13T12:48:22Z"
  finalizers:
  - finalizers.fluxcd.io
  generation: 3
  labels:
    kustomize.toolkit.fluxcd.io/name: flux-system
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: flux-sync
  namespace: flux-system
  resourceVersion: "9661"
  uid: 1f008a26-d297-4060-a774-dfb849055e8b
spec:
  gitImplementation: go-git
  interval: 15m
  ref:
    branch: staging
  secretRef:
    name: flux-system
  timeout: 20s
  url: ssh://[email protected]/kingdonb/bootstrap-repo
status:
  artifact:
    checksum: ab7ee12f9b6e85c8a458a04f1ce5ffadb108fce717692e65106604624ffe24e9
    lastUpdateTime: "2022-01-13T13:02:40Z"
    path: gitrepository/flux-system/flux-sync/d99a276ca098b4f06b2caca0d0fe23aada2a5d0f.tar.gz
    revision: staging/d99a276ca098b4f06b2caca0d0fe23aada2a5d0f
    url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-sync/d99a276ca098b4f06b2caca0d0fe23aada2a5d0f.tar.gz
  conditions:
  - lastTransitionTime: "2022-01-13T12:49:08Z"
    message: 'Fetched revision: staging/d99a276ca098b4f06b2caca0d0fe23aada2a5d0f'
    reason: GitOperationSucceed
    status: "True"
    type: Ready
  lastHandledReconcileAt: 2022-01-13 13:02:40.004571749 +0000 UTC m=+855.413796794
  observedGeneration: 3
  url: http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/flux-sync/latest.tar.gz

@stefanprodan
Copy link
Member

I found the issue, what happens is that image-reflector-controller acts like kubectl when it adds finalizers and becomes the manager of all the fields inside .spec, this means the Kubernetes API will not allow kustomize-controller to delete any fields from the spec because some other controller manages them as well.

@stefanprodan
Copy link
Member

To solve this issue we need to replace everywhere (in all our controllers) client.Update with client.Patch when we add finalizers, e.g. https://github.com/fluxcd/source-controller/blob/main/controllers/gitrepository_controller.go#L104

@kingdonb
Copy link
Member

I can submit PRs with this simple change for all repos to close this issue, 🌮

That's a very obscure problem and I'm glad you could see it. I should be able to confirm pretty quickly if this change in source-controller fixes the issue I'm finding.

@kingdonb
Copy link
Member

Thank you, I'm able to confirm the new source-controller doesn't take control of the resources in the same way with the finalizer changes, and a fresh cluster bootstrapped with this better source-controller removes the secretRef when it is taken out of the source manifest, 🎉 there are some things to figure out to close the book on this,

I'm going to be looking at:

to understand whatever I need to know to tell users how they can get out of this pickle if they've encountered it already.

@kingdonb
Copy link
Member

It looks like fluxcd/pkg#209 and fluxcd/source-controller#542 together are not quite yet enough to liberate resources which were once "update"d to set their finalizers by source controller, I did the upgrade test and used upgraded versions of both controllers. @stefanprodan FYI

I deployed the rc-408a889a kustomize controller from fluxcd/kustomize-controller#527 and a source-controller with the latest main branch including 542, and the issue remains present on the cluster if it was bootstrapped between 0.24.0 and 0.25.2, I'm still not sure how to clear the ownership from the resource so that Flux can manage it again.

We will need something else to tell users who hit this issue, or ideally something we can bake into the next release so they don't have to fix it on every given resource for themselves. (We are likely to get some questions from users after they merge this fix, however we can accomplish that, as latent changes that may have been stuck unapplied are resolved.)

@stefanprodan
Copy link
Member

stefanprodan commented Jan 13, 2022

We could write a flux fix command that does a proper cleanup of all the toolkit resources in a cluster. For all the other resources applied with kustomize-controller (back when it used kubectl), PR 527 should fix it.

But first we need to get rid of Update in all controllers.

@kingdonb
Copy link
Member

kingdonb commented Jan 14, 2022

I made some PRs to stop using Update when applying finalizers in Helm Controller and Kustomize Controller, mirroring your source-controller PR, which both appear to be the only controllers besides source-controller that do this for their finalizers. (Thanks for merging them 👍 )

Is it only finalizer Update that we need to worry about, or all instances of Update that modify any resources at all?

@stefanprodan
Copy link
Member

stefanprodan commented Jan 15, 2022

Is it only finalizer Update that we need to worry about, or all instances of Update that modify any resources at all?

All updates besides the one that finalizes a resource before deletion.

@jmriebold
Copy link

Hey all, looks like this is fixed as of v0.26.0. Upgraded this morning and then removed the fields and everything worked as expected. Thanks @kingdonb and @stefanprodan!

@stefanprodan
Copy link
Member

Great news! Thanks @jmriebold for confirming the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants