bootstrapper: retry helm chart installation #1151
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Proposed change(s)
Sometimes cluster initilization fails on Azure with the error:
Error: init call: rpc error: code = Internal desc = initializing cluster: installing cert-manager: helm install: failed post-install: timed out waiting for the condition
.This PR introduces a retry loop that aborts after 5 mins for all helm installations.
An alternative solution would be to wait after konnectivity installation, however this approach seems more robust to me and might mitigate other unknown race conditions.
Judging from the logs and the frequency that we see this bug 9/10 installation should not run into the retry case.
For those that do, installation should work on the second try (in the logs the api check becomes healthy seconds after the first try is aborted).
Background Information
The timeout errors are most likely connected to konnectivty agents not being ready when starting
the installation. This may become a problem during installation when test-requests are sent to the webhook that cert-manager installs. If the sender is on a different node than the webhook pod and the sender's konnectivity-agent is not ready, the request will fail.
During installation we can see the following error messages on control plane node 2:
failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "[https://cert-manager-webhook.kube-system.svc:443/mutate?timeout=10s":](https://cert-manager-webhook.kube-system.svc/mutate?timeout=10s%22:) No agent available
cannot connect once" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate signed by unknown authority
The
cert-manager-webhook
pod is scheduled on node 0.The problem resolves itself after <1min and
cert-manager-startupapicheck
reports:The cert-manager API is ready
.Opensearch query.
Checklist