Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bootstrapper: retry helm chart installation #1151

Merged
merged 3 commits into from
Feb 9, 2023

Conversation

derpsteb
Copy link
Member

@derpsteb derpsteb commented Feb 7, 2023

Proposed change(s)

Sometimes cluster initilization fails on Azure with the error: Error: init call: rpc error: code = Internal desc = initializing cluster: installing cert-manager: helm install: failed post-install: timed out waiting for the condition.

This PR introduces a retry loop that aborts after 5 mins for all helm installations.
An alternative solution would be to wait after konnectivity installation, however this approach seems more robust to me and might mitigate other unknown race conditions.

Judging from the logs and the frequency that we see this bug 9/10 installation should not run into the retry case.
For those that do, installation should work on the second try (in the logs the api check becomes healthy seconds after the first try is aborted).

Background Information

The timeout errors are most likely connected to konnectivty agents not being ready when starting
the installation. This may become a problem during installation when test-requests are sent to the webhook that cert-manager installs. If the sender is on a different node than the webhook pod and the sender's konnectivity-agent is not ready, the request will fail.

During installation we can see the following error messages on control plane node 2:

  • cert-manager-startupapicheck: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "[https://cert-manager-webhook.kube-system.svc:443/mutate?timeout=10s":](https://cert-manager-webhook.kube-system.svc/mutate?timeout=10s%22:) No agent available
  • konnectivity-agent: cannot connect once" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate signed by unknown authority

The cert-manager-webhook pod is scheduled on node 0.

The problem resolves itself after <1min and cert-manager-startupapicheck reports: The cert-manager API is ready.

Opensearch query.

Checklist

  • Update docs
  • Add labels (e.g., for changelog category)
  • Link to Milestone

Motivation for this change are intermittent
timeout errors while installing cert-manager.
@derpsteb derpsteb added the bug fix Fixing a bug label Feb 7, 2023
@derpsteb derpsteb requested a review from malt3 February 7, 2023 11:03
@derpsteb derpsteb requested a review from 3u13r as a code owner February 7, 2023 11:03
@netlify
Copy link

netlify bot commented Feb 7, 2023

Deploy Preview for constellation-docs canceled.

Name Link
🔨 Latest commit 082feda
🔍 Latest deploy log https://app.netlify.com/sites/constellation-docs/deploys/63e25cece668d00008b3322a

@derpsteb derpsteb merged commit 8b7979c into main Feb 9, 2023
@derpsteb derpsteb deleted the feat/retry-certmanager-install branch February 9, 2023 08:05
msanft pushed a commit that referenced this pull request Feb 13, 2023
Motivation for this change are intermittent
timeout errors while installing cert-manager.

Co-authored-by: Paul Meyer <[email protected]>
@katexochen katexochen added the needs backport This PR needs to be backported to a previous release label Feb 22, 2023
derpsteb added a commit that referenced this pull request Feb 22, 2023
Motivation for this change are intermittent
timeout errors while installing cert-manager.

Co-authored-by: Paul Meyer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix Fixing a bug needs backport This PR needs to be backported to a previous release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants