
Webhooks get context deadline exceeded #1094

Open
mbrancato opened this issue Sep 4, 2024 · 19 comments

@mbrancato

mbrancato commented Sep 4, 2024

Pulling this in from Slack for tracking. After some time I've been unable to reliably resolve this. I am running the VM Operator on 6+ clusters but only seeing this frequently on a few of them, and I could not identify any correlation as to why.

After upgrading to v0.47.3 using the helm chart (chart version 0.34.7), I'm getting these errors from server-side apply:

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

I tried disabling the vmagent webhook, and then I started to get the same error with other resources like vmalert. There are no errors in the logs that seem directly related.

I've been able to confirm that we can connect to the service; it just seems not to respond to the webhook POSTs.
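
For reference, one quick way to verify reachability from inside the cluster is to POST to the webhook endpoint from a throwaway pod; the URL is taken from the error message above, and any HTTP response at all (even an error) confirms the service is accepting connections. This is only a sketch; the curl image and flags are assumptions, not part of the original report.

# Probe the webhook endpoint from a temporary in-cluster pod (self-signed cert, hence -k)
kubectl -n monitoring run webhook-probe --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -vk --max-time 15 \
  "https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent"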

@AndrewChubatiuk
Contributor

Hey @mbrancato,
what tool are you using for chart deployment?

@mbrancato
Author

@AndrewChubatiuk we are using the helm-controller, which under the hood uses real helm. It's not like ArgoCD, which templates charts; I've run into multiple issues in the past with templating of charts. The helm-controller uses HelmRelease custom resources to define releases.

% helm -n flux-system status victoria-metrics-operator
NAME: victoria-metrics-operator
LAST DEPLOYED: Wed Sep  4 17:30:47 2024
NAMESPACE: monitoring
STATUS: deployed
REVISION: 14
TEST SUITE: None
NOTES:
victoria-metrics-operator has been installed. Check its status by running:
  kubectl --namespace monitoring get pods -l "app.kubernetes.io/instance=victoria-metrics-operator"

@AndrewChubatiuk
Contributor

Are you using the default values for the chart?

@mbrancato
Author

Here is my whole release definition @AndrewChubatiuk. What is under spec.values is basically the values.yaml.

I added admissionWebhooks.enabled: false yesterday after the upgrade, just to stop getting errors from a Slack integration during reconciliation attempts.

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: victoria-metrics-operator
  namespace: flux-system
spec:
  chart:
    spec:
      chart: victoria-metrics-operator
      sourceRef:
        kind: HelmRepository
        name: victoria-metrics
      version: 0.34.7
  interval: 1m0s
  targetNamespace: monitoring
  releaseName: victoria-metrics-operator
  timeout: "600s"

  values:
    image:
      tag: v0.47.3
    crd:
      create: true

    # Disabled webhooks that were added somewhere around 0.34.0 due to context timeouts
    # https://github.com/VictoriaMetrics/operator/issues/1094
    admissionWebhooks:
      enabled: false

    replicaCount: 1

    rbac:
      create: true

    operator:
      disable_prometheus_converter: false
      prometheus_converter_add_argocd_ignore_annotations: false
      enable_converter_ownership: false
      useCustomConfigReloader: false

    serviceAccount:
      create: true

    extraArgs:
      loggerFormat: json
      loggerJSONFields: "ts:timestamp,msg:message,level:severity"

    resources:
      limits:
        cpu: 120m
        memory: 320Mi
      requests:
        cpu: 80m
        memory: 120Mi

@mbrancato
Author

And here is the same data read back from helm:

% helm -n flux-system get values victoria-metrics-operator       
USER-SUPPLIED VALUES:
admissionWebhooks:
  enabled: false
crd:
  create: true
extraArgs:
  loggerFormat: json
  loggerJSONFields: ts:timestamp,msg:message,level:severity
image:
  tag: v0.47.3
operator:
  disable_prometheus_converter: false
  enable_converter_ownership: false
  prometheus_converter_add_argocd_ignore_annotations: false
  useCustomConfigReloader: false
rbac:
  create: true
replicaCount: 1
resources:
  limits:
    cpu: 120m
    memory: 320Mi
  requests:
    cpu: 80m
    memory: 120Mi
serviceAccount:
  create: true

@AndrewChubatiuk
Contributor

Can it be related to a firewall or other connectivity issues? Are other webhooks working properly?

@mbrancato
Author

Other admission webhooks work fine. I triple-checked: we're running GKE, so we do have ports 8443 and 9443 open from the master nodes. The failure is a context deadline exceeded rather than a connection error, so I'm tempted to test running it with a much longer timeout somehow and see if it ever completes. It succeeds on some of our clusters, but about half fail with this error.
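
For example, bumping the admission webhook timeout to the API server's 30-second maximum can be done by patching the ValidatingWebhookConfiguration directly. This is only a sketch; the configuration name matches the one mentioned later in this thread, and the webhook index (0) is an assumption.

# Raise timeoutSeconds for the first webhook entry to the 30s maximum allowed by the API server.
# The configuration name and webhook index are assumptions; list them first with:
#   kubectl get validatingwebhookconfigurations
kubectl patch validatingwebhookconfiguration victoria-metrics-operator-admission \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 30}]'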

@AndrewChubatiuk
Contributor

From the error message you've posted:

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

the webhook service is on port 443. Do you have that port open as well?

@AndrewChubatiuk
Contributor

Hey @mbrancato,
please try the latest operator version; the webhook port on the service was changed to 9443 there.
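
A quick way to confirm which port the webhook Service actually exposes (the Service name and namespace are taken from the error message above):

# Print name, port and targetPort for each port on the webhook Service
kubectl -n monitoring get svc victoria-metrics-operator \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{" -> "}{.targetPort}{"\n"}{end}'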

@mbrancato
Author

@AndrewChubatiuk

I first re-enabled the webhook in a test cluster and saw the same behavior. I then edited the webhook config and upped the timeout to 30 seconds. That too failed to help.

So I upgraded the operator chart to 0.34.8, but I saw no change. I'm still getting these errors; it is not consistent, but it is frequent. Our kustomize operator does a server-side dry run every minute or two, but it might be 2 minutes between these errors, or it might be 10 minutes.

Internal error occurred: failed calling webhook "vmagent.victoriametrics.com": failed to call webhook: Post "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s": context deadline exceeded

I also tried deleting the webhook config:
kubectl delete validatingwebhookconfigurations victoria-metrics-operator-admission
And then I forced a helm upgrade and let it be re-created, and restarted the operator pod to let it re-register webhooks. Admittedly that "helped", but the issue does not fully go away.

Is there any way to add request logging to the webhooks? I looked and there didn't seem to be any debug logs set.

I did confirm in firewall logs that the traffic and calls to the webhook are allowed. I didn't change any rules; I just enabled logging. Note the pod IP: 10.28.16.160

% kubectl -n monitoring get pod -l app.kubernetes.io/instance=victoria-metrics-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP             NODE                                       NOMINATED NODE   READINESS GATES
victoria-metrics-operator-546c6b487f-bknbz   1/1     Running   0          32m   10.28.16.160   gke-cluster-vpc1-midmem-v4-dcf4d3cf-bbm1   <none>           <none>

And the allow log from the master node to that pod:
[screenshot: firewall allow log from the master node to the pod]

I do think the traffic is hitting the operator, but that it periodically gets stuck and does not respond. It is very inconsistent; for example, it hasn't happened in 25 minutes. I will leave the upgraded chart and firewall logging enabled in my test environment for now.

@AndrewChubatiuk
Contributor

AndrewChubatiuk commented Sep 11, 2024

We're using the helm lookup function to see if a secret with webhook certificates is already present in the K8s cluster, so this might be the reason for the issue you're having: kustomize doesn't communicate with a cluster during template rendering, so it might recreate the secret on each run. The solution for you is to either disable webhooks or use cert-manager.
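
If you go the cert-manager route, the relevant chart values look roughly like this (a sketch; the key names assume recent versions of the victoria-metrics-operator chart and should be checked against its values.yaml):

admissionWebhooks:
  enabled: true
  certManager:
    # Have cert-manager issue and rotate the webhook certificate instead of
    # relying on the chart's lookup-generated self-signed secret.
    enabled: true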

@mbrancato
Author

mbrancato commented Sep 11, 2024

@AndrewChubatiuk kustomize is only used to define the HelmRelease resource (a custom resource). It really is helm that runs the upgrade.

I wouldn't expect a context error from certificate errors or connection failures. I would only expect that if a firewall silently dropped packets, but that doesn't apply here.

edit: kustomize is also used to apply the VMAgent resources etc.; that is what triggers the failing webhooks.

edit2: just to clarify - there is no templating going on with helm like Argo does.

@mbrancato
Author

As an update, I haven't seen any more alerts of dry-run applies on VMAgent and other resources causing failed webhooks. The process that seems to work (sketched below) is to:

  • delete the existing validating webhook config
  • perform the helm upgrade
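
Roughly, that looks like the following; the webhook configuration name is the one from my earlier comment, and the flux step assumes the release is driven by Flux's helm-controller as described above:

# 1. Drop the existing validating webhook config
kubectl delete validatingwebhookconfigurations victoria-metrics-operator-admission

# 2. Force the helm upgrade so the chart re-creates it
flux reconcile helmrelease victoria-metrics-operator -n flux-system

# (optional) restart the operator pod so it re-registers its webhook handlers
kubectl -n monitoring delete pod -l app.kubernetes.io/instance=victoria-metrics-operator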

@mbrancato
Author

mbrancato commented Sep 13, 2024

@AndrewChubatiuk I tried deploying again, and I'm still getting a lot of timeouts.

The good news is I'm able to manually reproduce the slow validation using curl. The performance here varies a lot, and the timeout alerts I'm getting are fairly infrequent, but it's enough to be a problem.

curl -k -X POST "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s" \
  -H "Content-Type: application/json" \
  -d '{
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
      "uid": "b0d0cc74-565a-4d0e-acd6-eb2b82e41126",
      "kind": {
        "group": "operator.victoriametrics.com",
        "version": "v1beta1",
        "kind": "VMAgent"
      },
      "resource": {
        "group": "operator.victoriametrics.com",
        "version": "v1beta1",
        "resource": "vmagents"
      },
      "namespace": "monitoring",
      "operation": "CREATE",
      "object": {
        "apiVersion": "operator.victoriametrics.com/v1beta1",
        "kind": "VMAgent",
        "metadata": {
          "name": "vmagent",
          "namespace": "monitoring"
        },
        "spec": {
          "image": {
            "tag": "v1.103.0"
          },
          "selectAllByDefault": true,
          "scrapeInterval": "30s",
          "replicaCount": 1,
          "shardCount": 4,
          "logFormat": "json",
          "extraArgs": {
            "memory.allowedPercent": "40",
            "promscrape.cluster.memberLabel": "vmagent",
            "remoteWrite.forceVMProto": "true,true",
            "loggerJSONFields": "ts:timestamp,msg:message,level:severity",
            "promscrape.maxScrapeSize": "32MB"
          },
          "resources": {
            "requests": {
              "cpu": "5",
              "memory": "10Gi"
            }
          },
          "remoteWrite": [
            {
              "url": "http://myvm1:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "inlineUrlRelabelConfig": [
                { "action": "drop_metrics", "regex": "^jsjdsds_[^:]*" },
                { "action": "drop_metrics", "regex": "^jsdujf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^igsifjsf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^ufshufs_$" },
                { "action": "drop_metrics", "regex": "^uwyfdjf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^jsfhdhyf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^event_handler_[^:]*$" },
                { "action": "drop_metrics", "regex": "^sfuhfe_[^:]*$" },
                { "action": "drop_metrics", "regex": "^node_[^:]*$" },
                { "action": "drop_metrics", "regex": "^adbfjwd_[^:]*$" },
                { "action": "drop_metrics", "regex": "^oiufjwdh_[^:]*$" }
              ]
            },
            {
              "url": "http://myvm2:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "streamAggrConfig": {
                "keepInput": false,
                "rules": [
                  {
                    "match": [ "{__name__=~\"^sdfsf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "id", "name", "instance", "pod", "node" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^ifsufhdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "uid", "pod", "container_id" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^iwefjsdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^sdfwefw_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance", "controller_pod" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^ysdfsf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^kdodjf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^hdjksdf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "sum_samples", "max", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^dsdjfwe_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^lssd_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^jwewjefjwf_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total", "total_prometheus" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  },
                  {
                    "match": [ "{__name__=~\"^hdfhwfw_.+\"}" ],
                    "interval": "30s",
                    "outputs": [ "total" ],
                    "without": [ "pod", "instance" ],
                    "staleness_interval": "5m"
                  }
                ]
              }
            }
          ],
          "remoteWriteSettings": {
            "flushInterval": "15s",
            "queues": 32
          },
          "externalLabels": {
            "rfwfewf": "ifwefwf",
            "idfjugu": "jfugrs",
            "sdiwenf": "dsfegwef"
          },
          "serviceAccountName": "my-svc-acct",
          "containers": [
            {
              "name": "urnferjsdf",
              "image": "lfdnwe/urnferjsdf:0.0.2",
              "env": [
                { "name": "FOO_OSDFSDF", "value": "juef" },
                { "name": "FOO_OISDFJ", "value": "oijfogjw" },
                { "name": "FOO_TWGSDF", "value": "urbweyfb" },
                { "name": "FOO_WGSSDJG", "value": "nsbhqkf" },
                { "name": "FOO_JQWJGTB", "value": "iusdfuh" }
              ]
            }
          ]
        }
      },
      "dryRun": true
    }
  }'

Timing this with curl, it's pretty slow:

{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","response":{"uid":"b0d0cc74-565a-4d0e-acd6-eb2b82e41126","allowed":true,"status":{"metadata":{},"code":200}}}

real	0m9.470s
user	0m0.011s
sys	0m0.006s
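
As an aside, curl can report its own timing without wrapping the command in time, via --write-out; the payload file name below is just a placeholder for the AdmissionReview JSON above:

# Same request, but let curl print its timing breakdown (admission-review.json is a placeholder)
curl -sk -X POST "https://victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmagent?timeout=10s" \
  -H "Content-Type: application/json" \
  -d @admission-review.json \
  -o /dev/null \
  -w 'connect: %{time_connect}s  tls: %{time_appconnect}s  total: %{time_total}s\n'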

If I drop this from the remoteWrite section, it becomes extremely fast to validate:

            {
              "url": "http://myvm1:8080/insert/0/prometheus/api/v1/write",
              "sendTimeout": "4m",
              "inlineUrlRelabelConfig": [
                { "action": "drop_metrics", "regex": "^jsjdsds_[^:]*" },
                { "action": "drop_metrics", "regex": "^jsdujf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^igsifjsf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^ufshufs_$" },
                { "action": "drop_metrics", "regex": "^uwyfdjf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^jsfhdhyf_[^:]*$" },
                { "action": "drop_metrics", "regex": "^event_handler_[^:]*$" },
                { "action": "drop_metrics", "regex": "^sfuhfe_[^:]*$" },
                { "action": "drop_metrics", "regex": "^node_[^:]*$" },
                { "action": "drop_metrics", "regex": "^adbfjwd_[^:]*$" },
                { "action": "drop_metrics", "regex": "^oiufjwdh_[^:]*$" }
              ]
            },

Timing after the block is removed:

{"kind":"AdmissionReview","apiVersion":"admission.k8s.io/v1","response":{"uid":"9dfd4ec3-587a-40d6-968c-9dafe3b286cb","allowed":true,"status":{"metadata":{},"code":200}}}

real	0m0.018s
user	0m0.009s
sys	0m0.003s

@f41gh7
Collaborator

f41gh7 commented Sep 13, 2024

Thanks for the detailed investigation!

It's related to VictoriaMetrics issue VictoriaMetrics/VictoriaMetrics#6911

@hagen1778
Contributor

@mbrancato could you try this vmagent image built from VictoriaMetrics/VictoriaMetrics#6934 with the regex fix?

@mbrancato
Author

@mbrancato could you try this vmagent image built from VictoriaMetrics/VictoriaMetrics#6934 with the regex fix?

@hagen1778 is vmagent the correct component to update here? The webhook targets the operator binary, and that appears to be what is slow.

@hagen1778
Contributor

I'm confused. @AndrewChubatiuk @f41gh7 does the operator just parse those regexps from relabeling configs, or does it do some matching as well?

@f41gh7
Collaborator

f41gh7 commented Sep 16, 2024

I'm confused. @AndrewChubatiuk @f41gh7 does the operator just parse those regexps from relabeling configs, or does it do some matching as well?

The operator performs logical validation of relabel config rules (aka dry-run). It reuses vmagent's config parsing libraries.
