Azure large cluster deployment fails due to Kubernetes dashboard #1828

Closed
jefflill opened this issue Jul 22, 2023 · 2 comments
jefflill (Collaborator) commented Jul 22, 2023

Our large Azure test cluster deployment is failing because the kubernetes-dashboard pod never becomes ready. The cluster is still running on runner-01.
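
For reference, the pod description below was presumably captured with plain kubectl; the pod name comes from the output itself and the replica-set hash suffix will differ on a fresh deployment:

    # describe the failing dashboard pod in the neon-system namespace
    kubectl describe pod kubernetes-dashboard-b5688559d-6q2h6 -n neon-system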

Name:                 kubernetes-dashboard-b5688559d-6q2h6
Namespace:            neon-system
Priority:             900003000
Priority Class Name:  neon-app
Node:                 worker-0/10.100.0.13
Start Time:           Sat, 22 Jul 2023 00:44:13 +0000
Labels:               app.kubernetes.io/component=kubernetes-dashboard
                      app.kubernetes.io/instance=kubernetes-dashboard
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=kubernetes-dashboard
                      app.kubernetes.io/version=2.4.0
                      helm.sh/chart=kubernetes-dashboard-5.0.4
                      pod-template-hash=b5688559d
                      security.istio.io/tlsMode=istio
                      service.istio.io/canonical-name=kubernetes-dashboard
                      service.istio.io/canonical-revision=2.4.0
Annotations:          cni.projectcalico.org/containerID: 29ca23aebe21514a06a890cd3f5af787918bba22a72ecf039a12822d26c82cc4
                      cni.projectcalico.org/podIP: 10.254.43.6/32
                      cni.projectcalico.org/podIPs: 10.254.43.6/32
                      k8s.v1.cni.cncf.io/networks: istio-cni
                      kubectl.kubernetes.io/default-container: kubernetes-dashboard
                      kubectl.kubernetes.io/default-logs-container: kubernetes-dashboard
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
                      sidecar.istio.io/interceptionMode: REDIRECT
                      sidecar.istio.io/status:
                        {"initContainers":["istio-validation"],"containers":["istio-proxy"],"volumes":["workload-socket","workload-certs","istio-envoy","istio-dat...
                      traffic.sidecar.istio.io/excludeInboundPorts: 15020
                      traffic.sidecar.istio.io/includeInboundPorts: *
                      traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status:               Running
IP:                   10.254.43.6
IPs:
  IP:           10.254.43.6
Controlled By:  ReplicaSet/kubernetes-dashboard-b5688559d
Init Containers:
  istio-validation:
    Container ID:  cri-o://521e19dd6e8fa0eef150f3b7cdb5dbd6c162d612b79f552f5ef6f7226ef25cb4
    Image:         registry.neon.local/neonkube/proxyv2:1.14.1-distroless
    Image ID:      ghcr.io/neonkube-stage/proxyv2@sha256:74d5c9da2e70004a7d217ee0000fb24b19b49291d9076b04f92ad6f64ed390dc
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 22 Jul 2023 00:44:13 +0000
      Finished:     Sat, 22 Jul 2023 00:44:13 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     10m
      memory:  32Mi
    Environment:
      SECRET_TTL:  2160h
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rfqsv (ro)
Containers:
  istio-proxy:
    Container ID:  cri-o://3ebf4eb5ca3d8117e6a7b3ec0d6f099920ab9eef37ee0da8deb000774bc84092
    Image:         registry.neon.local/neonkube/proxyv2:1.14.1-distroless
    Image ID:      ghcr.io/neonkube-stage/proxyv2@sha256:74d5c9da2e70004a7d217ee0000fb24b19b49291d9076b04f92ad6f64ed390dc
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --log_as_json
      --concurrency
      2
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 22 Jul 2023 01:58:43 +0000
      Finished:     Sat, 22 Jul 2023 01:59:48 +0000
    Ready:          False
    Restart Count:  16
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   32Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.neon-ingress.svc:15012
      POD_NAME:                      kubernetes-dashboard-b5688559d-6q2h6 (v1:metadata.name)
      POD_NAMESPACE:                 neon-system (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      PROXY_CONFIG:                  {"discoveryAddress":"istiod.neon-ingress.svc:15012","tracing":{"openCensusAgent":{"address":"grafana-agent-node.neon-monitor:4320","context":["W3C_TRACE_CONTEXT","GRPC_BIN","CLOUD_TRACE_CONTEXT","B3"]},"sampling":1},"proxyMetadata":{"SECRET_TTL":"2160h"}}
                                     
      ISTIO_META_POD_PORTS:          [
                                         {"name":"http","containerPort":9090,"protocol":"TCP"}
                                         ,{"name":"http-metrics","containerPort":8000,"protocol":"TCP"}
                                         ,{"containerPort":8000,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     kubernetes-dashboard,dashboard-metrics-scraper
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      kubernetes-dashboard
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/neon-system/deployments/kubernetes-dashboard
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
      SECRET_TTL:                    2160h
      ISTIO_KUBE_APP_PROBERS:        {"/app-health/dashboard-metrics-scraper/livez":{"httpGet":{"path":"/","port":8000,"scheme":"HTTP"},"timeoutSeconds":30},"/app-health/kubernetes-dashboard/livez":{"httpGet":{"path":"/","port":9090,"scheme":"HTTP"},"timeoutSeconds":30}}
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rfqsv (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
  kubernetes-dashboard:
    Container ID:  cri-o://460c40add5f0dff5e6783088b9da975cf866f09e2e3c9b0471a878f579e3c63e
    Image:         registry.neon.local/neonkube/kubernetesui-dashboard:v2.6.0
    Image ID:      ghcr.io/neonkube-stage/kubernetesui-dashboard@sha256:3e9d42bf2ed2d98f65b8583d200248ef88f1e3d1839d011572abeca1fa68f2e1
    Ports:         9090/TCP, 8000/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --namespace=neon-system
      --enable-insecure-login
      --sidecar-host=http://127.0.0.1:8000
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sat, 22 Jul 2023 01:59:48 +0000
      Finished:     Sat, 22 Jul 2023 01:59:48 +0000
    Ready:          False
    Restart Count:  16
    Limits:
      memory:  128Mi
    Requests:
      memory:     64Mi
    Liveness:     http-get http://:15020/app-health/kubernetes-dashboard/livez delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rfqsv (ro)
  dashboard-metrics-scraper:
    Container ID:   cri-o://99e4d5ac5e5fb4b5f4f44f1311efbecbbf9efd1f3b31c176989e9e6d6e890c61
    Image:          registry.neon.local/neonkube/kubernetesui-metrics-scraper:v1.0.7
    Image ID:       ghcr.io/neonkube-stage/kubernetesui-metrics-scraper@sha256:408f4ab2019db4831558df05d7a5aff63aa787981041c47a03e4cb7d93686d8e
    Port:           8000/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 22 Jul 2023 02:00:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sat, 22 Jul 2023 01:54:23 +0000
      Finished:     Sat, 22 Jul 2023 01:55:13 +0000
    Ready:          True
    Restart Count:  24
    Liveness:       http-get http://:15020/app-health/dashboard-metrics-scraper/livez delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rfqsv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kubernetes-dashboard-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-certs
    Optional:    false
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-rfqsv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 30s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 30s
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Pulled               76m   kubelet            Container image "registry.neon.local/neonkube/proxyv2:1.14.1-distroless" already present on machine
  Normal   Created              76m   kubelet            Created container istio-validation
  Normal   Started              76m   kubelet            Started container istio-validation
  Normal   Scheduled            76m   default-scheduler  Successfully assigned neon-system/kubernetes-dashboard-b5688559d-6q2h6 to worker-0
  Warning  FailedPostStartHook  75m   kubelet            Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "kubernetes-dashboard-b5688559d-6q2h6_neon-system(1f7bbc81-9285-4caa-bc07-678f8489dcb6)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp [::1]:15021: connect: connection refused
, message: "2023-07-22T00:44:14.698394Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-07-22T00:45:14.812853Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\n"
  Normal   Created              75m                kubelet  Created container kubernetes-dashboard
  Normal   Started              75m                kubelet  Started container dashboard-metrics-scraper
  Normal   Created              75m (x2 over 76m)  kubelet  Created container istio-proxy
  Normal   Started              75m (x2 over 76m)  kubelet  Started container istio-proxy
  Normal   Created              75m                kubelet  Created container dashboard-metrics-scraper
  Normal   Pulled               75m (x2 over 76m)  kubelet  Container image "registry.neon.local/neonkube/proxyv2:1.14.1-distroless" already present on machine
  Normal   Started              75m                kubelet  Started container kubernetes-dashboard
  Normal   Pulled               75m                kubelet  Container image "registry.neon.local/neonkube/kubernetesui-metrics-scraper:v1.0.7" already present on machine
  Warning  Unhealthy            74m (x3 over 74m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Killing              74m (x2 over 75m)  kubelet  FailedPostStartHook
  Warning  FailedPostStartHook  74m                kubelet  Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "kubernetes-dashboard-b5688559d-6q2h6_neon-system(1f7bbc81-9285-4caa-bc07-678f8489dcb6)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp [::1]:15021: connect: connection refused
, message: "2023-07-22T00:45:20.729196Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-07-22T00:46:20.844607Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\n"
  Normal   Pulled               74m (x2 over 75m)  kubelet  Container image "registry.neon.local/neonkube/kubernetesui-dashboard:v2.6.0" already present on machine
  Warning  FailedPostStartHook  71m                kubelet  Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "kubernetes-dashboard-b5688559d-6q2h6_neon-system(1f7bbc81-9285-4caa-bc07-678f8489dcb6)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp [::1]:15021: connect: connection refused
, message: "2023-07-22T00:48:18.412256Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-07-22T00:49:18.526012Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\n"
  Warning  Unhealthy            56m (x26 over 74m)  kubelet  Liveness probe failed: Get "http://10.254.43.6:15020/app-health/dashboard-metrics-scraper/livez": dial tcp 10.254.43.6:15020: connect: connection refused
  Warning  FailedPostStartHook  51m                 kubelet  Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "kubernetes-dashboard-b5688559d-6q2h6_neon-system(1f7bbc81-9285-4caa-bc07-678f8489dcb6)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp [::1]:15021: connect: connection refused
, message: "2023-07-22T01:08:46.421693Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-07-22T01:09:46.543681Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\n"
  Warning  BackOff              6m31s (x325 over 74m)  kubelet  Back-off restarting failed container
  Warning  FailedPostStartHook  66s (x8 over 44m)      kubelet  (combined from similar events): Exec lifecycle hook ([pilot-agent wait]) for Container "istio-proxy" in Pod "kubernetes-dashboard-b5688559d-6q2h6_neon-system(1f7bbc81-9285-4caa-bc07-678f8489dcb6)" failed - error: command 'pilot-agent wait' exited with 255: Error: timeout waiting for Envoy proxy to become ready. Last error: Get "http://localhost:15021/healthz/ready": dial tcp [::1]:15021: connect: connection refused
, message: "2023-07-22T01:58:43.411362Z\tinfo\tWaiting for Envoy proxy to be ready (timeout: 60 seconds)...\n2023-07-22T01:59:43.530668Z\terror\ttimeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\nError: timeout waiting for Envoy proxy to become ready. Last error: Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused\n"
jefflill (Collaborator, Author) commented:

@marcusbooyah looked at this and it appears to be an intermittent DNS issue.
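
If it shows up again, one quick way to check the DNS angle would be to resolve the istiod address the sidecar is configured with (istiod.neon-ingress.svc:15012 per PROXY_CONFIG above) from a throwaway pod in the same namespace; the busybox image and pod name here are just placeholders:

    # hypothetical one-off pod to test in-cluster DNS resolution of istiod
    kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -n neon-system -- \
      nslookup istiod.neon-ingress.svc.cluster.local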

jefflill (Collaborator, Author) commented:

CLOSING: I just deployed a large Azure cluster and the problem did not reproduce.
