Webhook server could not answer in time #30

Closed
nyikesda opened this issue Aug 12, 2021 · 14 comments · Fixed by #165 or #280

Comments

@nyikesda
Contributor

nyikesda commented Aug 12, 2021

REPRODUCTION

  • cert-manager and its CRDs are deployed
  • the webhook and operator helm charts are deployed
  • the webhook's and the operator's pods are in the ready state
  • the attached Vertica descriptor is applied with the kubectl command: kubectl -n <namespace> apply -f <path-to-the-attached-file>
  • the following error is raised by the Kubernetes API:
Error from server (InternalError): error when creating "test-upgrade-vertica/test_vertica_oper.yaml": Internal error occurred: failed calling webhook "vverticadb2.kb.io": Post "https://verticadb-webhook-webhook-service.analytical-processing-database-precodereview-1081.svc:443/validate-vertica-com-v1beta1-verticadb?timeout=10s": dial tcp 10.99.146.222:443: connect: connection refused
  • I called the same command again and again, and it succeeded on the 3rd attempt

The strange thing is that the webhook received the Vertica descriptor and sent a response (see the logs below). My guess is that there was a timeout between the webhook and the Kubernetes API server, because I had to wait at least 1.5 seconds to get any response, but I do not know how to verify this in more detail. If it was a timeout, it could be caused by incorrect connection-pool handling in the webhook or a misconfiguration in the kube-rbac-proxy.

webhook manager container log:

2021-08-11T10:54:35.861Z        DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-vertica-com-v1beta1-verticadb", "UID": "896bae9c-6be2-4259-8064-174092b89ce3", "kind": "vertica.com/v1beta1, Kind=VerticaDB", "resource": {"group":"vertica.com","version":"v1beta1","resource":"verticadbs"}}
2021-08-11T10:54:35.862Z        INFO    verticadb-resource      default {"name": "verticadb-upgrade-test"}
2021-08-11T10:54:35.863Z        DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-vertica-com-v1beta1-verticadb", "code": 200, "reason": "", "UID": "896bae9c-6be2-4259-8064-174092b89ce3", "allowed": true}
@spilchen
Collaborator

There is a lag between deploying the cert-manager and its ability to hand out certs. There are some steps that are outlined here to make sure the cert-manager is operational: https://cert-manager.io/docs/installation/verify/#manual-verification

Can you add this step to your deployment to see if that solves the issue?

We automated this wait in the following script: https://github.com/vertica/vertica-kubernetes/blob/main/scripts/wait-for-cert-manager-ready.sh
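
For reference, a minimal sketch of that kind of wait (the namespace and deployment names assume a default cert-manager installation; the linked script is the authoritative version):

```sh
# Block until the three cert-manager deployments are Available before
# installing anything that requests certificates.
for deploy in cert-manager cert-manager-cainjector cert-manager-webhook; do
  kubectl wait --for=condition=Available --timeout=180s \
    -n cert-manager deployment/"$deploy"
done
```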

@nyikesda
Contributor Author

Hi @spilchen ,
Sorry for the late response, I was on vacation.
I forgot to mention that I checked those pods as well; the cert-manager was in the ready state.
I also checked the generated certificate, and it was injected into the created ValidatingWebhookConfiguration and the MutatingWebhookConfiguration.

@spilchen
Collaborator

How did you run these two steps in your repro?

  • webhook and operator helm chart are deployed
  • the webhook's and the operator's pods are in ready state

@nyikesda
Contributor Author

Attached helm charts:
helm-charts.zip

Deploy steps:

  • helm install vertica-webhook <path-to-the-webhook-folder> --namespace vertica
  • helm install vertica-operator <path-to-the-operator-folder> --namespace vertica

Ready state check:

  • kubectl get pod -n vertica

Based on the output, the operator and the webhook were ready.
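
A scripted alternative to eyeballing the output, assuming hypothetical app labels on the pods (substitute whatever labels the charts actually set):

```sh
# Block until the operator and webhook pods report the Ready condition.
kubectl wait --for=condition=Ready --timeout=120s \
  -n vertica pod -l app=verticadb-operator
kubectl wait --for=condition=Ready --timeout=120s \
  -n vertica pod -l app=verticadb-webhook
```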

@spilchen
Collaborator

This is an issue with the operator-sdk framework we are using. There was a PR that went into controller-runtime that will help alleviate this (kubernetes-sigs/controller-runtime#1588). It provides a true health check that makes sure the webhook server is up and running. This only went into controller-runtime in July, in the v0.9.3 release; for comparison, the framework we currently use is on v0.7.2. And there was a minor fix for it that went into the v0.9.6 release. So I'm a bit reluctant to move controller-runtime up to pick this change up, as I don't want to destabilize things. Is this a super urgent problem that needs to be fixed?

@nyikesda
Contributor Author

Do you mean that /readyz gives a false-positive response? That could be a serious issue. I will try some scenarios and get back to you.

@spilchen
Collaborator

The /readyz probe just tells you whether the pod is running; it doesn't tell you whether anything is listening on the webhook port. The listener is set up shortly after the pod starts, and until that happens there is a timing window during which any incoming webhook request fails.
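
One way to observe this window, sketched with curl against port-forwards (the pod name is illustrative, and the ports are the typical kubebuilder defaults of 8081 for the health probe and 9443 for the webhook; the actual values depend on the manager's configuration):

```sh
# In one terminal, forward both ports from the operator pod:
kubectl -n vertica port-forward pod/<operator-pod> 8081:8081 9443:9443

# Immediately after startup, the health probe already answers...
curl -s http://localhost:8081/readyz     # -> ok
# ...while the webhook port can still refuse connections:
curl -ks https://localhost:9443/         # -> connection refused
```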

@nyikesda
Contributor Author

Hi @spilchen,
I have replayed my scenarios with the latest master version of the vertica operator, and the issue could not be reproduced.
That said, it would be preferable for the pod to report ready only once the webhook is actually listening.

@nyikesda
Contributor Author

So I would like to keep this issue open for the controller-runtime version increase. It is not urgent.

spilchen added a commit that referenced this issue Sep 27, 2021
The e2e tests hit a failure because we had tried to create a VerticaDB before
the webhook was fully up. This is a known issue (#30) with the webhook. We are
going to work around this for now by adding a wait script when the tests issue
make deploy.
@harisokanovic
Contributor

Hi @spilchen, @nyikesda,

Is it possible to work around this issue in helm?

My use case: Installing a VerticaDB resource in a helm chart. I'd like to install verticadb-operator via a dependency, but doing so seems to trigger this issue. A clean installation fails with the following error:

Error: failed to create resource: Internal error occurred: failed calling webhook "mverticadb.kb.io": Post "https://verticadb-operator-webhook-service.nopvertica.svc:443/mutate-vertica-com-v1beta1-verticadb?timeout=10s": no endpoints available for service "verticadb-operator-webhook-service"

@spilchen
Collaborator

spilchen commented Mar 4, 2022

It isn't helm-based, but we have a script that works around this issue in our development environment (scripts/wait-for-webhook.sh).

However, we are in the process of upgrading the go packages in #165. This will bring in a new controller-runtime that properly implements a health check and should resolve this issue.
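
For anyone who can't use the script directly, a polling loop along the same lines can be run from a pod inside the cluster (a sketch, not the script itself; the service name is taken from the error message above):

```sh
# Poll the webhook endpoint until it accepts TLS connections; any HTTP
# status (even 4xx) means the listener is up, while connection-level
# failures keep us waiting. Give up after ~60s.
SVC=verticadb-operator-webhook-service.nopvertica.svc
for i in $(seq 1 60); do
  if curl -ks --max-time 2 -o /dev/null "https://${SVC}:443/"; then
    echo "webhook is up"; exit 0
  fi
  sleep 1
done
echo "timed out waiting for webhook" >&2; exit 1
```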

@harisokanovic
Contributor

Thanks. I already tried polling the webhook in a pre-install job. However, the way helm merges chart dependencies causes my job to run before any operator resources are deployed, so it just stalls until it times out.

@spilchen
Collaborator

spilchen commented Mar 9, 2022

We are still seeing cases where helm --wait returns but the webhook still isn't 100% ready. Reopening this issue so that we can investigate further. The scripted wait was added back in #169.

@spilchen spilchen reopened this Mar 9, 2022
@harisokanovic
Contributor

I see that as well. Our current solution is to run the aforementioned pre-install job in our chart and install both charts from Terraform sequentially instead of via helm dependencies.
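
Sequencing the installs outside helm's dependency mechanism can also be sketched in plain shell (the chart, release, and namespace names are illustrative):

```sh
# 1. Install the operator chart and wait for its resources to be created.
helm install verticadb-operator vertica-charts/verticadb-operator \
  -n vertica --create-namespace --wait

# 2. Poll the webhook from inside the cluster before proceeding, since
#    --wait alone is not always sufficient (see above).
kubectl -n vertica run webhook-probe --rm -i --restart=Never \
  --image=curlimages/curl -- sh -c \
  'until curl -ks --max-time 2 -o /dev/null \
     https://verticadb-operator-webhook-service.vertica.svc:443/; do sleep 1; done'

# 3. Only now install the chart that contains the VerticaDB resources.
helm install my-app ./my-app-chart -n vertica
```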
