
bootstrapper: mitigate timeout issue during Cilium deployment #1403

Merged
merged 7 commits into from
Mar 15, 2023

Conversation

Nirusu
Contributor

@Nirusu Nirusu commented Mar 10, 2023

Proposed change(s)

  • Split FixCilium into WaitForCilium and FixCilium
  • Put both directly after the Cilium installation, so we get a distinct error when Cilium doesn't come up instead of the same `context deadline exceeded` error from cert-manager
  • Set a timeout of 20 minutes for WaitForCilium (pulling can sometimes be slow with ghcr.io; in local testing the worst case was occasionally ~16 minutes) -> This should likely be reduced after the next Cilium update if/when we switch to repositories with more consistent performance.
  • Set a timeout of 10 minutes for the Helm install instead of 5 minutes (cert-manager can sometimes take longer)
  • Track duration of Cilium and cert-manager install (since both can stall)
  • Minor fixes
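The split described above can be sketched as follows. This is a minimal, self-contained illustration of the pattern, not the actual Constellation code: `waitForCiliumReady` and `fixCilium` are hypothetical stand-ins for the real Kubernetes calls, and the point is the separate timeout contexts and the duration tracking.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForCiliumReady is a hypothetical stand-in for polling the Cilium
// DaemonSet until it is healthy. Here it simulates readiness after a delay.
func waitForCiliumReady(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Millisecond): // simulate Cilium becoming ready
		return nil
	case <-ctx.Done():
		return fmt.Errorf("waiting for Cilium: %w", ctx.Err())
	}
}

// fixCilium is a hypothetical stand-in for restarting the Cilium pod.
func fixCilium(ctx context.Context) error {
	return ctx.Err() // no-op in this sketch
}

func installCilium() error {
	// Generous 20-minute budget for the wait: image pulls from ghcr.io
	// were observed to take up to ~16 minutes in the worst case.
	waitCtx, cancelWait := context.WithTimeout(context.Background(), 20*time.Minute)
	defer cancelWait()

	start := time.Now()
	if err := waitForCiliumReady(waitCtx); err != nil {
		return err // distinct error: Cilium never came up
	}
	// Track how long the install actually took, to derive better timeouts later.
	fmt.Printf("Cilium became ready after %s\n", time.Since(start).Round(time.Millisecond))

	// A fresh context for the restart step, so a slow wait cannot
	// silently eat the budget of the follow-up fix.
	fixCtx, cancelFix := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancelFix()
	return fixCilium(fixCtx)
}

func main() {
	if err := installCilium(); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because each phase gets its own `context.WithTimeout`, a failure in the wait phase produces an error that names Cilium explicitly rather than surfacing later as a generic deadline error from an unrelated Helm release.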

Generally, the issue is the following:

  • We try to install too many things at once almost immediately after the cluster comes up, so when cert-manager fails, there are ~10 Pods that still need to be created, scheduled, and pull their images
  • The node can still be tainted as uninitialized for a while shortly after the cluster comes up
  • Repositories can be slow sometimes (e.g. hit a bad PoP from a CDN - definitely happens with ghcr.io)
  • Sometimes Kubernetes takes time to schedule things

Often it's not that things are completely broken - they are just occasionally slow. Given that we depend on external resources, this is not too surprising (though certainly annoying).

I hope we can use OpenSearch to capture the duration of successful installs and derive good timeouts from that data.

I chose a middle ground of 20 minutes and 10 minutes. We could also set them higher if they still cause trouble and/or we want to gather more statistics on them. 20 minutes for Cilium is already generous (only chosen because of the ghcr.io issue); 10 minutes for Helm might still be a bit too short for worst-case scenarios.

In the long run, however, I would like us to refactor the bootstrapper to handle the Helm installations after we return the kubeconfig to the client. Often things just take longer, or they can be fixed manually.

Maybe we can also disable the cert-manager API health check? Not sure how that works though with the dependencies on the operator. Maybe @derpsteb can comment on that?

@Nirusu Nirusu added the bug fix Fixing a bug label Mar 10, 2023
@Nirusu Nirusu requested a review from derpsteb March 10, 2023 18:34
@Nirusu Nirusu requested a review from 3u13r as a code owner March 10, 2023 18:34
@edgelesssys edgelesssys deleted a comment from netlify bot Mar 10, 2023
@Nirusu Nirusu requested a review from katexochen as a code owner March 11, 2023 07:15
Member

@daniel-weisse daniel-weisse left a comment


If we are already logging the install time for cert-manager, why not also log the time for all the other helm deployments?

bootstrapper/internal/kubernetes/kubernetes.go (review thread, outdated, resolved)
@Nirusu
Contributor Author

Nirusu commented Mar 13, 2023

Mainly because Cilium and cert-manager are the only blocking ones, since they require a healthy state. The other installs go through quickly - they don't wait for their deployments to be fully functional.

But I could refactor the cert-manager duration into the general Helm install code. Cilium needs extra treatment nevertheless.
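Folding the cert-manager duration tracking into the general Helm install path could look something like this. This is a sketch under assumptions: `timedInstall` and its callback type are hypothetical names, not the actual bootstrapper API.

```go
package main

import (
	"fmt"
	"time"
)

// installFunc is a hypothetical stand-in for installing a single Helm release.
type installFunc func() error

// timedInstall wraps any release install and reports how long it took,
// so every deployment gets a duration measurement, not just cert-manager.
func timedInstall(name string, install installFunc) error {
	start := time.Now()
	err := install()
	fmt.Printf("install of %s took %s (err=%v)\n",
		name, time.Since(start).Round(time.Millisecond), err)
	return err
}

func main() {
	_ = timedInstall("cert-manager", func() error {
		time.Sleep(5 * time.Millisecond) // simulate a blocking install
		return nil
	})
}
```

A generic wrapper like this would give duration data for all releases (as suggested in the review above), while Cilium would still need its separate wait/fix treatment.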

.github/actions/constellation_create/action.yml (3 review threads, outdated, resolved)
Member

@3u13r 3u13r left a comment


LGTM

@derpsteb
Member

derpsteb commented Mar 14, 2023

I really look forward to having this bug gone :D
Thanks for investigating the issue further!

What is the reason for moving the WaitForCilium code into its own function?

e2e manual run 🟢 by Nirusu
another one because I saw the above one only after starting the second run.

@Nirusu
Contributor Author

Nirusu commented Mar 14, 2023

I really look forward to having this bug gone :D

Unfortunately it won't be fully gone but it should be a bit better.

What is the reason for moving the WaitForCilium code into its own function?

No super important reason, mainly:

  • to split the functionality (since it had two logical components: waiting and restarting the Pod)
  • to have different contexts for waiting and killing (otherwise you could theoretically pass the wait but fail the rest in an unfortunate situation, or have a hidden embedded context as before). Also, only the first part needs the logger for warnings.
  • it also helps with measuring the time this way.

It could still be one function, but I thought it is cleaner this way.

@Nirusu
Contributor Author

Nirusu commented Mar 14, 2023

@katexochen Can you re-review please? It's stuck on change requested otherwise.
Will squash some commits together then and merge this and see how everything behaves in the e2e tests.

@Nirusu Nirusu merged commit 70ca69f into main Mar 15, 2023
@Nirusu Nirusu deleted the ref/bootstrapper-timeouts branch March 15, 2023 17:36
@derpsteb derpsteb added the needs backport This PR needs to be backported to a previous release label Mar 24, 2023
@katexochen katexochen changed the title bootstrapper: try to mitigate timeout issues bootstrapper: mitigate timeout issue during Cilium deployment Apr 4, 2023