
bootstrapper: mitigate timeout issue during Cilium deployment #1403

Merged
merged 7 commits into from
Mar 15, 2023

Conversation

Nirusu
Contributor

@Nirusu Nirusu commented Mar 10, 2023

Proposed change(s)

  • Split FixCilium into WaitForCilium and FixCilium
  • Put both directly after the Cilium installation, so we get a distinct error when Cilium doesn't come up instead of the same `context deadline exceeded` error from cert-manager
  • Set a timeout of 20 minutes for WaitForCilium (pulling can sometimes be slow with ghcr.io; in local testing the worst case was occasionally ~16 minutes) -> This should likely be reduced after the next Cilium update if/when we switch to repositories with more consistent performance.
  • Set a timeout of 10 minutes for the Helm install instead of 5 minutes (cert-manager can sometimes take longer)
  • Track duration of Cilium and cert-manager install (since both can stall)
  • Minor fixes
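The split described above can be sketched as follows. This is a minimal, self-contained illustration of the pattern, not the actual Constellation code: `waitForCiliumReady` and `fixCilium` are hypothetical stand-ins for the real Kubernetes calls, and the point is the separate timeout contexts and the duration tracking.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForCiliumReady is a hypothetical stand-in for polling the Cilium
// DaemonSet until it is healthy. Here it simulates readiness after a delay.
func waitForCiliumReady(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Millisecond): // simulate Cilium becoming ready
		return nil
	case <-ctx.Done():
		return fmt.Errorf("waiting for Cilium: %w", ctx.Err())
	}
}

// fixCilium is a hypothetical stand-in for restarting the Cilium pod.
func fixCilium(ctx context.Context) error {
	return ctx.Err() // no-op in this sketch
}

func installCilium() error {
	// Generous 20-minute budget for the wait: image pulls from ghcr.io
	// were observed to take up to ~16 minutes in the worst case.
	waitCtx, cancelWait := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancelWait()

	start := time.Now()
	if err := waitForCiliumReady(waitCtx); err != nil {
		return err // distinct error: Cilium never came up
	}
	// Track how long the install actually took, to derive better timeouts later.
	fmt.Printf("Cilium became ready after %s\n", time.Since(start).Round(time.Millisecond))

	// A fresh context for the restart step, so a slow wait cannot
	// silently eat the budget of the follow-up fix.
	fixCtx, cancelFix := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancelFix()
	return fixCilium(fixCtx)
}

func main() {
	if err := installCilium(); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because each phase gets its own `context.WithTimeout`, a failure in the wait phase produces an error that names Cilium explicitly rather than surfacing later as a generic deadline error from an unrelated Helm release.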

Generally, the issue is the following:

  • We try to install too many things at once almost immediately after the cluster comes up, so when cert-manager fails, there are ~10 Pods that still need to be created, scheduled, and pull their images
  • The node can still be tainted as uninitialized for a while shortly after the cluster comes up
  • Repositories can be slow sometimes (e.g. hit a bad PoP from a CDN - definitely happens with ghcr.io)
  • Sometimes Kubernetes takes time to schedule things

Often it's not that things are completely broken - they are just occasionally slow. Given that we depend on external resources, this is not too surprising (though certainly annoying).

I hope we can use OpenSearch to capture the duration of successful installs and derive good timeouts from that data.

I chose a middle ground of 20 minutes and 10 minutes. We could also set them higher if they still cause trouble and/or we want to gather more statistics on them. 20 minutes for Cilium is already generous (only chosen because of the ghcr.io issue); 10 minutes for Helm might still be a bit too short for worst-case scenarios.

In the long run, however, I would like us to refactor the bootstrapper to handle the Helm installations after we return the kubeconfig to the client. Often things just take longer, or they can be fixed manually.

Maybe we can also disable the cert-manager API health check? Not sure how that works though with the dependencies on the operator. Maybe @derpsteb can comment on that?

@Nirusu Nirusu added the bug fix Fixing a bug label Mar 10, 2023
@Nirusu Nirusu requested a review from derpsteb March 10, 2023 18:34
@Nirusu Nirusu requested a review from 3u13r as a code owner March 10, 2023 18:34
@edgelesssys edgelesssys deleted a comment from netlify bot Mar 10, 2023
@Nirusu Nirusu requested a review from katexochen as a code owner March 11, 2023 07:15
Member

@daniel-weisse daniel-weisse left a comment


If we are already logging the install time for cert-manager, why not also log the time for all the other helm deployments?

bootstrapper/internal/kubernetes/kubernetes.go (review thread, outdated, resolved)
@Nirusu
Contributor Author

Nirusu commented Mar 13, 2023

Mainly because Cilium and cert-manager are the only blocking ones, since they require a healthy state. The other installs go through quickly - they don't wait for their deployments to be fully functional.

But I could refactor the cert-manager duration into the general Helm install code. Cilium needs extra treatment nevertheless.
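Folding the cert-manager duration tracking into the general Helm install path could look something like this. This is a sketch under assumptions: `timedInstall` and its callback type are hypothetical names, not the actual bootstrapper API.

```go
package main

import (
	"fmt"
	"time"
)

// installFunc is a hypothetical stand-in for installing a single Helm release.
type installFunc func() error

// timedInstall wraps any release install and reports how long it took,
// so every deployment gets a duration measurement, not just cert-manager.
func timedInstall(name string, install installFunc) error {
	start := time.Now()
	err := install()
	fmt.Printf("install of %s took %s (err=%v)\n",
		name, time.Since(start).Round(time.Millisecond), err)
	return err
}

func main() {
	_ = timedInstall("cert-manager", func() error {
		time.Sleep(5 * time.Millisecond) // simulate a blocking install
		return nil
	})
}
```

A generic wrapper like this would give duration data for all releases (as suggested in the review above), while Cilium would still need its separate wait/fix treatment.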

.github/actions/constellation_create/action.yml (3 review threads, outdated, resolved)
Member

@3u13r 3u13r left a comment


LGTM

@derpsteb
Member

derpsteb commented Mar 14, 2023

I really look forward to having this bug gone :D
Thanks for investigating the issue further!

What is the reason for moving the WaitForCilium code into its own function?

e2e manual run 🟢 by Nirusu
another one because I saw the above one only after starting the second run.

@Nirusu
Contributor Author

Nirusu commented Mar 14, 2023

I really look forward to having this bug gone :D

Unfortunately it won't be fully gone but it should be a bit better.

What is the reason for moving the WaitForCilium code into its own function?

No super important reason, mainly:

  • to split the functionality (since it had two logical components: waiting and restarting the Pod)
  • to have different contexts for waiting and killing (otherwise you could theoretically pass the wait but fail the rest in an unfortunate situation, or have a hidden embedded context as before). Also, only the first part needs the logger for warnings.
  • it also helps with measuring the time this way.

It could still be one function, but I thought it is cleaner this way.

@Nirusu
Contributor Author

Nirusu commented Mar 14, 2023

@katexochen Can you re-review please? It's stuck on change requested otherwise.
Will squash some commits together then and merge this and see how everything behaves in the e2e tests.

@Nirusu Nirusu merged commit 70ca69f into main Mar 15, 2023
@Nirusu Nirusu deleted the ref/bootstrapper-timeouts branch March 15, 2023 17:36
@derpsteb derpsteb added the needs backport This PR needs to be backported to a previous release label Mar 24, 2023
@katexochen katexochen changed the title bootstrapper: try to mitigate timeout issues bootstrapper: mitigate timeout issue during Cilium deployment Apr 4, 2023