Support Argocd #763

Closed
fabry00 opened this issue Mar 3, 2023 · 14 comments · Fixed by #851
@fabry00

fabry00 commented Mar 3, 2023

We are deploying Kong using ArgoCD.
It's working OK, but the only problem is with the migration jobs.
Since they do not get automatically deleted, the next time I upgrade the chart, ArgoCD fails because the Job fields are immutable.

I already tried configuring the ArgoCD hooks argocd.argoproj.io/hook: PreSync and argocd.argoproj.io/hook-delete-policy: HookSucceeded, but I had to revert them because I also had to add them to all the dependencies as well as to my custom resources deployed alongside the Kong Helm chart.
So far I prefer to keep the Helm chart as it is and delete the Job manually before upgrading.
It would be great if we could somehow reshape the Helm chart in a way that solves this issue.

Here is an example of the logs (I upgraded the chart from 2.14 to 2.16):

Failed sync attempt to dad5bdcdd039d6e8585fe55297453aad4fd650d8: one or more objects failed to apply, reason: Job.batch "controlplane-dev-kong-init-migrations" is invalid: spec.template: Invalid value: ...: field is immutable (retried 5 times).

@slavogiez
Contributor

+1
We're using ArgoCD too; we had to disable the migration jobs and run them manually with some scripts during Kong upgrades.
Having the migration jobs work properly with ArgoCD would help simplify things.

@rainest
Contributor

rainest commented Mar 10, 2023

Which ArgoCD version are you running?

https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies suggests that you may need to check with ArgoCD, since:

Note that if no deletion policy is specified, ArgoCD will automatically assume BeforeHookCreation rules.

It's not entirely clear if they effectively honor other Helm hooks. We don't place any delete hooks on the init job because it only runs at install, whereas the others spawn every upgrade. argoproj/argo-cd#355 (comment) suggests that ArgoCD should mimic that behavior even though they use helm template.

Even if they do ignore that, however, the default behavior described by the docs should handle the problem, since ArgoCD should just delete the old hook Job before creating the new one, even if no delete policy is specified.

@fabry00 can you clarify what you did and why it required modifying other resources? Intuitively, even if the above default functionality is broken, we'd work around it by adding the argocd.argoproj.io/hook-delete-policy: HookSucceeded annotation to the init-migrations Job. Can you confirm that adding that to the job annotations somehow breaks other resources?
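
For reference, here is a minimal values.yaml sketch of that workaround, using the chart's shared migrations.jobAnnotations field (so it applies to every migration Job, not just init-migrations):

# Sketch: ask ArgoCD to delete the hook Jobs once they succeed
migrations:
  jobAnnotations:
    argocd.argoproj.io/hook-delete-policy: HookSucceeded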

@rainest rainest self-assigned this Mar 10, 2023
@fabry00
Author

fabry00 commented Mar 13, 2023

Dear @rainest ,

I am using ArgoCD 2.6.

What I noticed is that there is slightly different behavior between Argo's hooks and Helm's hooks.
If I remember right, the install job runs on every sync with ArgoCD if I don't use Argo's policies, while with Helm, as you said, the install job runs only on the first install and does not run on upgrades.

If I add the hooks, the job runs before everything else (even with HookSucceeded) and then enters an error state because none of its dependencies can be found.
By dependencies I mean, for example, all the secrets and certificates that I had configured in the Helm chart (external secrets, config maps, postgres, pgpool). That meant I had to add the hook policies (e.g. sync-wave) to all of those as well.
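
For illustration, each of those dependencies ended up needing something along these lines (the resource and names here are just placeholders from my setup):

# Illustrative only: a dependency the migration job reads, annotated so it
# is created before the hook job runs
apiVersion: v1
kind: Secret
metadata:
  name: mysecret                        # placeholder name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-5"  # create before the migration job
type: Opaque
data: {}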

If needed, I'll try to reproduce it again.
Let me know.

cheers

@rainest
Contributor

rainest commented Mar 14, 2023

Yeah, seeing the actual failure output from the Kubernetes API will help me understand exactly how it's breaking, so if you can reproduce the problem with that annotation in place and post the output, that'll help us determine what changes the chart needs.

Aside from that, we do have a hidden toggle: you can set migrations.init=false to disable the job. I'm unsure what workarounds you have now, but setting that after the initial install might be an easier way to deal with the conflict.
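
In values.yaml terms that's just:

# Hidden toggle: skip rendering the init migrations Job
migrations:
  init: false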

@LeszekBlazewski

Hey @rainest,

I stumbled upon this issue recently when deploying the Kong Helm chart with the Postgres DB enabled via ArgoCD v2.6.6, and I can help with the needed investigation.

There are 3 main issues when deploying the Kong Helm chart via ArgoCD:

  1. Initial installation (where the Kong Postgres DB is empty). ArgoCD always syncs everything that has hook annotations first and only then applies the rest of the manifests. This means that on the initial installation the pre-upgrade migrations job will run and fail, since the Kong bootstrap (init migrations) must be executed first. ArgoCD won't proceed to the init migrations because they are not part of any hooks, and we end up in a broken state. This can be addressed by providing annotations via the migrations.jobAnnotations field (in this case argocd.argoproj.io/hook: PreSync), but as you know that sets the value for all of the jobs, and we certainly don't want to run post-migrations before the rest of the manifests are synced. This could be addressed by per-job annotations in values.yaml.

  2. Related to the way ArgoCD interprets Helm hooks. The Helm hook annotations currently present in the Kong Helm chart work properly, but at a minimum the following annotation must be added to the serviceAccount annotations (snippet from values.yaml):


kong:
  deployment:
    serviceAccount:
      annotations:
        argocd.argoproj.io/hook: PreSync

Otherwise the serviceAccount is not created in time and the migrations cannot run successfully (ArgoCD always syncs everything with hook annotations first and then applies the rest of the manifests; the same root cause as in the 1st point, just a missing resource rather than ordering). The annotation above forces the serviceAccount to be created in time, which fixes the missing service account for all of the migration jobs. As you can see, this fix spills over to all resources used by the migration jobs that carry Helm hook annotations, for example a Secret the migration jobs need in order to access the DB. Any resource used by the migration jobs also needs the argocd.argoproj.io/hook: PreSync annotation, because ArgoCD will not sync the remaining resources until the required hooks pass.

  3. The init-migrations job, which is not part of any hooks and uses this condition: https://github.com/Kong/charts/blob/main/charts/kong/templates/migrations.yaml#L2. I did some digging and I understand why the init-migrations job is not part of any Helm hooks ([kong] rework migration job configuration #102). As stated in the ArgoCD docs (https://argo-cd.readthedocs.io/en/stable/user-guide/helm/):

(screenshot of the ArgoCD Helm documentation)

Which brings us to the issue: there is no install/upgrade distinction for ArgoCD. As you are probably aware, ArgoCD uses the helm template ... command to render the manifests and then does a plain kubectl apply on them, so the check if .Release.IsInstall always evaluates to true by default (see the simplified sketch after this list). I validated that by enabling the Postgres values for Kong, rendering the manifests with the commands below, and checking the difference between the outputs:

helm template kong kong/kong --values ./values.yaml
helm template --is-upgrade kong kong/kong --values ./values.yaml

where values.yaml contains the necessary settings to enable the Postgres DB integration for Kong. The diff shows that the second command does not render the init-migrations Job manifest, so on a Helm upgrade the Job would be removed and everything would work. With ArgoCD this is not the case, because the --is-upgrade flag is never passed: ArgoCD does not differentiate between install and upgrade the way Helm does.

The above is painful, especially if you maintain your own Kong image (based on the official one): whenever you bump the image tag, you end up with a broken sync, because ArgoCD tries to modify the image tag of the init-migrations job, whose spec, as stated above, is immutable.
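
To make issue 3 concrete, here is a simplified sketch (not the chart's actual template; the real condition is in the migrations.yaml file linked above) of why such a guard behaves differently under plain helm template:

# Simplified illustration only: a Job guarded by .Release.IsInstall is
# rendered by helm template (IsInstall defaults to true) but skipped by
# helm template --is-upgrade, so ArgoCD always renders and re-applies it.
{{- if .Release.IsInstall }}
apiVersion: batch/v1
kind: Job
metadata:
  name: example-kong-init-migrations   # illustrative name
# ... rest of the Job manifest ...
{{- end }}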

I wonder how this issue could be addressed because right now I am using the following workaround to get kong in a healthy state:

  1. I set the argocd.argoproj.io/hook: PreSync annotation on the serviceAccount and all other resources required by the migration jobs.
  2. Only on the initial deployment, I set migrations.preUpgrade=false, because that prevents the issue described in point 1 above.
  3. I deploy Kong with the above settings.
  4. I set migrations.preUpgrade=true and the hidden value migrations.init=false. This addresses points 2 and 3.
  5. I deploy Kong again with the new settings.
  6. Now I can freely modify values.yaml, change the image version, etc., and everything works as expected. Nonetheless, you can see the above is pretty painful to remember.

What could be done to address this:

  1. In order to fix point 1, we would need per-job annotations, not just a single jobAnnotations field. I think this has already been brought up in Add annotations for initmigrations job #219 and the idea was rejected, so I'm not sure there is any hope for this (a hypothetical shape is sketched after this list).
  2. Regarding point 2, I don't think there is much we can do in the chart itself. As long as there is a way to pass annotations to the objects needed by the migration Jobs, we can add argocd.argoproj.io/hook: PreSync there, and that should help. Maybe a comment somewhere in the docs could help newcomers trying to deploy this chart via ArgoCD.
  3. For point 3, I am not sure any mechanism exists that could make this work with ArgoCD while replicating the current behaviour of running init migrations just once. The Kong init-migrations job is still idempotent, right? Meaning there is no harm if I run it on each sync (deployment) of Kong? It would simply log that the DB is already bootstrapped and complete without error, right? In that case, if we had per-job annotations, we could add argocd.argoproj.io/hook: PreSync and argocd.argoproj.io/hook-delete-policy: BeforeHookCreation to it via values, and that would solve the issue, though the job would run on each sync.
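
Purely as a sketch of what such per-job annotations could look like if they existed (this is NOT a current chart option; the initJobAnnotations key below is invented for illustration):

# Hypothetical values shape, not an existing chart value (per-job
# annotations were rejected in #219); the key name is made up.
migrations:
  init: true
  initJobAnnotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation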

I hope I explained everything clearly. To troubleshoot this issue, it's best to set up an ArgoCD instance and start deploying the Kong Helm chart on it; that clearly surfaces all of the issues, one after another, when trying to get this working.

Let me know if you need more insights; for the time being, one can use the steps above to make it work.

@rainest
Contributor

rainest commented Mar 29, 2023

@LeszekBlazewski Do you know of any reason you ever wouldn't want PreSync and BeforeHookCreation on the init job?

If the issue is that it won't work via Job annotations because adding it there would then make the migrations finish job run concurrently with migrations upgrade (which would probably break things), we can probably just hard-code the Argo annotations into the init Job. That should work without expanding the config surface. If you fork the chart and make those template changes, do you see any issues if you run through a fresh new install sync and then several other syncs after?
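
To be clear about what that would mean, the hard-coded variant would put roughly the following on the init Job's metadata (a sketch, not the actual template change):

# Sketch of hard-coding the Argo hook annotations on the init Job
# (illustrative, not the chart's real template)
apiVersion: batch/v1
kind: Job
metadata:
  name: example-kong-init-migrations   # illustrative name
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
# ... rest of the Job spec unchanged ...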

What remains unclear is the original report, where adding the delete annotation apparently did not clear the collision issue. @fabry00 I'm still unsure whether that's indeed what you observed: did you maybe add it to migrations.annotations instead of migrations.jobAnnotations?

Short of ArgoCD adding some better support for lifecycle to its Helm module, @LeszekBlazewski 's listed workaround (disabling jobs and setting Argo annotations manually) is essentially what we recommend. We can document that more clearly, since it seems more or less vetted across this and other issue reports.

The reason the init Job does not have a hook is because of the built-in Postgres chart: Postgres can't spawn during pre-install, so we have to spawn the init Job during regular install for it to not fail.

I would actually like to remove the Postgres subchart, as it's more trouble than it's worth (it imposes weird constraints like the above on chart development), but that requires approval beyond me and wouldn't happen any time soon if it is approved. We originally added it as a quick way to deploy simple self-contained demo installs, but recommend managing Postgres separately for any production instance. Now that we have DB-less support for most features, that's a more attractive demo install path, but removing the subchart would be a breaking change, so it's not something we can do quickly.

The migrations commands are indeed all idempotent: if there's nothing for them to do, they simply exit successfully with a message saying they did nothing. It's fine if they get deployed unnecessarily.

@fabry00
Author

fabry00 commented Mar 30, 2023

@rainest my main problem with the job annotations is that they force me to add the hooks to all the other dependencies as well.
I have an umbrella helm chart that deploys everything: Kong, PGPool and Postgres on Openshift.
I also store my secrets as external secrets in Hashicorp Vault.

So, to be able to have the job run with hooks, I have to set the hooks for:

  • most of the Kong secrets (postgres passwords, certificates, license, admin and portal session config)
  • pgpool secrets
  • pgpool deployment
  • pgpool service
  • postgres secrets
  • postgres service
  • postgres statefulset
  • postgres pvc
    and so on and so forth

basically almost everything that Kong needs as a dependency to start.
I also need to add the sync-wave annotation in order to have them created before the job, something like:

annotations:
    ....
    argocd.argoproj.io/sync-wave: "-5"

Example of my values.yaml

admin_gui_url:
  valueFrom:
    configMapKeyRef:
      name: myconfigmap
      key: manager_ui_url
admin_api_uri:
  valueFrom:
    configMapKeyRef:
      name: myconfigmap
      key: admin_api_uri
portal_api_url:
  valueFrom:
    configMapKeyRef:
      name: myconfigmap
      key: portal_api_url
portal_gui_host:
  valueFrom:
    configMapKeyRef:
      name: myconfigmap
      key: portal_gui_host
portal_session_conf:
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: portal_session_conf

password:
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: password
pg_user:
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: user
pg_password:
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: password
pg_database:
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: database
extraConfigMaps:
  - name: root-ca-cert
    mountPath: /etc/ca/
  - name: ocp-ca-cert
    mountPath: /etc/ocp/service-ca/
enterprise:
  enabled: true
  license_secret: mysecret
.....

myconfigmap and mysecret are secrets and configmaps created by ExternalSecrets

My helm chart is something like this:

apiVersion: v2
appVersion: 1.0.0
name: my-kong-controlplane
dependencies:
  - name: postgres
    version: 1.0.0
    repository: file:./charts/postgres
  - name: pgpool
    version: 1.0.0
    repository: file:./charts/pgpool
  - name: controlplane
    version: 1.0.0
    repository: file:./charts/controlplane-dependencies
  - name: kong
    version: 2.16.5
    repository: https://artifactory...../helm-virtual
description: Kong Control Plane
engine: gotpl
version: 1.0.0

So all the charts (postgres, pgpool and controlplane-dependencies: external secrets, routes, configmaps, ...) have to be ready before the migration job starts.

@rainest
Contributor

rainest commented Mar 30, 2023

Not much we can help with there, I'm afraid: we can't really account for the surrounding environment, and our chart expects the prerequisite resources to be deployed separately. Unless Helm were to add some dependency management system (and Argo were to add support for it), the basic install order provided by hooks doesn't offer enough control to deploy all of those when they're needed.

My best recommendation is to split those charts, so that you can deploy the environment's prerequisite components first and the components that use them after.

@LeszekBlazewski

@rainest

@LeszekBlazewski Do you know of any reason you ever wouldn't want PreSync and BeforeHookCreation on the init job?

I can't come up with a reason why I wouldn't want the PreSync annotation on the init job, assuming it runs only once, during the initial Kong deployment.

Let me run some tests based on the PreSync annotation added to the init job and I will get back to you with the results.

@rainest
Contributor

rainest commented Jun 5, 2023

@LeszekBlazewski checking in now that I'm back from vacation, did you see any issues or things that weren't addressed by those annotations?

Moving this into our scheduled work, since I think the impact is valuable enough that we should just go ahead and try it, and start testing on ArgoCD internally if possible.

@LeszekBlazewski

Hi @rainest,

I am really, really sorry for the late reply; I've been super busy recently. We did a fair amount of testing (both clean new deployments and updates to existing environments) with the Kong Helm chart (version 2.14.0), and it turned out that successful Kong provisioning via ArgoCD requires the following:

  1. If there are any additional resources required by the DB migration jobs (for example the ExternalSecret k8s resource that creates a Secret with the DB password used by Kong), they need the argocd.argoproj.io/hook: PreSync annotation, since otherwise the resources won't be created in time and the pre-sync job will fail.
  2. Only on the initial deployment (once, when the first sync runs), we have to set the following params:
migrations:
    init: true
    preUpgrade: false
    postUpgrade: true

The above allows the whole Kong Helm chart to sync successfully and executes the init DB scripts.
3. Once the above is successfully deployed and the DB is bootstrapped, we can adjust the migrations parameters back to:

migrations:
    init: false
    preUpgrade: true
    postUpgrade: true

and keep it that way at all times. I have run different deployments, and the above ensures that all of the migration jobs run in the proper ArgoCD sync stages when performing the sync operation.

After all that testing, it looks like Kong works fine with ArgoCD; it's just the two-step deployment process that is required due to the nature of how the init job works.

I think it could be worth improving the docs with a specific ArgoCD section, or a GitOps section overall, that includes the deployment steps described above, plus maybe documenting the hidden migrations.init toggle, since one has to do some searching to find it (of course with a proper warning that for standard installations this param should be kept as true).

Do you want me to open a PR for the doc changes described above?

@tiagoskaneta

tiagoskaneta commented Jul 4, 2023

Having to first set preUpgrade: false and only after completion change it to preUpgrade: true is a real blocker on my side.

In my organization, Argo is set up to automatically create new apps (based on a git repo), so it's impractical to have to make that change manually. Unfortunately, that's what we end up doing, and every new application requires a manual step.

I understand the init: false that @LeszekBlazewski mentioned above would at least allow us not to have to manually delete the init-migrations job, but in any case avoiding the extra step would be ideal.

@tiagoskaneta

So I tested the following configuration:

deployment:
  serviceAccount:
    annotations:
      argocd.argoproj.io/hook: PreSync
migrations:
  preUpgrade: false
  postUpgrade: true
  jobAnnotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded

The Argo hooks do correctly delete the migration jobs (after the sync is completed). So it at least allows changing the preUpgrade and postUpgrade values without having to manually delete the init-migrations job.

The PreSync hook does help as well; the issue is that (as already mentioned) it requires all custom resources to be marked as such (namely the ConfigMaps for the plugins), which is quite bothersome and difficult to maintain.

@rainest I'm assuming that there is not a whole lot that can be done to allow the pre-upgrade job to complete successfully when the init job has never been executed in the first place? Given the constraints that argo introduces, of course, where we cannot really differentiate install and upgrade syncs.

@rainest
Contributor

rainest commented Aug 10, 2023

We now have tooling to create Argo-managed test environments easily, so we can replicate issues and test fixes locally.

Chart 2.26 contains the results of our testing, which indicate that the easiest way to handle the Jobs is to run both in the Sync phase, with a delete before creation.
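
Expressed with Argo's annotation keys, that corresponds to roughly the following on both migration Jobs (a sketch inferred from this comment, not copied from the 2.26 templates; the chart may reach the same phases via Helm hook annotations that Argo maps internally):

# Inferred sketch of the 2.26 behavior: both migration Jobs run as
# Sync-phase hooks, and the previous hook Job is deleted before a new
# one is created.
metadata:
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation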

The init Job was already effectively running during Sync, but it will no longer block syncs due to its immutable fields. It'll be clutter after the initial install, but won't cause issues. Disabling it via values.yaml after install will avoid the clutter.

We'd ideally like to run the upgrade job before patching the upgraded Deployment revision, but starting it at the same time doesn't cause any obvious issues in testing. The upgraded Pods will block initially until migrations complete, but will eventually come online.

Sync waves would let us control the order, but I'm unsure if they're compatible with hooks (we still need to delete the Jobs), and they have the same limitation as the PreUpgrade hook, where we need to manage them on all prerequisite resources. Since making upgrade a Sync hook appears viable, we want to try that in the wild first. If we get further issue reports, we'll revisit more complex options.

Please file a new issue if you're still seeing issues with upgrades, or other non-upgrade Argo issues, after upgrading to chart 2.26.
