bootstrapper: mitigate timeout issue during Cilium deployment #1403
Conversation
If we are already logging the install time for cert-manager, why not also log the time for all the other helm deployments?
Mainly because Cilium and cert-manager are the only blocking ones, since they require a healthy state. The other installs go through quickly; they don't wait for their deployments to be fully functional. But I could refactor the cert-manager duration logging into the general Helm install code. Cilium needs extra treatment nevertheless.
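For illustration, a minimal sketch of what folding the duration logging into the shared Helm install path could look like, assuming the installs go through the Helm v3 Go SDK; the function name, logging, and parameters below are invented for this example and are not taken from the PR:

```go
package helm

import (
	"context"
	"log"
	"time"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
)

// installRelease runs a single Helm install and logs how long it took.
// Only releases installed with Wait=true (here: Cilium and cert-manager)
// block until their workloads are healthy, so they dominate install time.
func installRelease(ctx context.Context, cfg *action.Configuration, ch *chart.Chart,
	name, namespace string, vals map[string]interface{}, wait bool, timeout time.Duration,
) error {
	install := action.NewInstall(cfg)
	install.ReleaseName = name
	install.Namespace = namespace
	install.Wait = wait
	install.Timeout = timeout

	start := time.Now()
	_, err := install.RunWithContext(ctx, ch, vals)
	log.Printf("helm install of %s finished after %s (err: %v)", name, time.Since(start), err)
	return err
}
```

With a helper like this, every release would get the same timing log line, while only the blocking releases would pass a long timeout and `wait=true`.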
LGTM
I really look forward to having this bug gone :D What is the reason for moving the e2e manual run? 🟢 by Nirusu
Unfortunately it won't be fully gone, but it should be a bit better.
No super important reason, mainly:
It could also still be one function, but I thought it is cleaner this way.
@katexochen Can you re-review please? It's stuck on change requested otherwise.
Proposed change(s)
Generally, the issue is the following:
Often it's not that things are completely broken - they are just occasionally slow. Given that we depend on external resources, this is not too surprising (though certainly annoying).
I hope we can use OpenSearch to capture the duration of successful installs and derive good timeouts from them.
I chose a middle ground of 20 minutes and 10 minutes. We could also set them higher in case they still cause trouble and/or we want to run more statistics on them. 20 minutes for Cilium is already generous (only doing this because of the ghcr.io issue); 10 minutes for Helm might still be a bit too short for worst-case scenarios.
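As a rough sketch of that middle ground (the constants and helper below are hypothetical and not the code added in this PR):

```go
package helm

import "time"

// Hypothetical per-release timeouts mirroring the values described above.
const (
	ciliumInstallTimeout  = 20 * time.Minute // generous, mainly because of slow ghcr.io pulls
	defaultInstallTimeout = 10 * time.Minute // may still be tight in worst-case scenarios
)

// timeoutFor picks the install timeout for a release by name.
func timeoutFor(releaseName string) time.Duration {
	if releaseName == "cilium" {
		return ciliumInstallTimeout
	}
	return defaultInstallTimeout
}
```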
In the long run, however, I wish we could refactor the bootstrapper to handle the Helm installations after we return the kubeconfig to the client. Often things just take longer, or they can be fixed manually.
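Purely as an illustration of that wished-for flow (setUpControlPlane and installHelmReleases are made-up placeholders, not existing bootstrapper functions):

```go
package bootstrapper

import (
	"context"
	"log"
	"time"
)

// initCluster hands the kubeconfig back to the client first and finishes the
// Helm installs in the background, so a slow release no longer fails init.
func initCluster(ctx context.Context) ([]byte, error) {
	kubeconfig, err := setUpControlPlane(ctx) // hypothetical helper
	if err != nil {
		return nil, err
	}

	go func() {
		// Detach from the RPC context: the installs may outlive the init call.
		bgCtx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
		defer cancel()
		if err := installHelmReleases(bgCtx); err != nil { // hypothetical helper
			log.Printf("background helm installs failed: %v", err)
		}
	}()

	return kubeconfig, nil
}

// Hypothetical stand-ins for the real bootstrapper steps.
func setUpControlPlane(ctx context.Context) ([]byte, error) { return []byte("kubeconfig"), nil }
func installHelmReleases(ctx context.Context) error         { return nil }
```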
Maybe we can also disable the cert-manager API health check? I'm not sure how that would work with the dependencies on the operator, though. Maybe @derpsteb can comment on that?