GCE load balancer health check does not match k8s pod health #1656

Closed
scarby opened this issue Jan 21, 2022 · 16 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@scarby

scarby commented Jan 21, 2022

Issue

It would appear that there is zero connection between kubernetes' concept of when a pod is healthy and the GCE load balancer's concept of the same.

As such, when a deployment is updating:

  • Kubernetes spins up new pods,
  • the new pods pass their health checks and Kubernetes considers them ready,
  • at this point the GCE load balancer's health check is not guaranteed to have passed,
  • Kubernetes may then terminate the old pods before the new pods are considered healthy by the GCE load balancer (and the old pods are instantly dropped from the NEG).

The only 'solution' we have found to this is to add a significant initial delay to the Kubernetes health checks. Not only is this hacky, it doesn't guarantee that there are actually pods able to serve traffic when the old pods are removed (we're just hoping).
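
A minimal sketch of that workaround, assuming a hypothetical /healthz endpoint on port 8080; the 90-second delay is an illustrative value, not a recommendation:

```yaml
# Deployment container fragment: delay the Kubernetes readiness probe so the
# GCE load balancer health check has (hopefully) had time to pass first.
readinessProbe:
  httpGet:
    path: /healthz              # hypothetical readiness endpoint
    port: 8080
  initialDelaySeconds: 90       # the "significant initial delay" described above
  periodSeconds: 10
  failureThreshold: 3
```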

Expected behaviour

I would expect k8s not to terminate an old pod until the load balancer has a new pod ready to replace it.

Is there any way to tie these two together so we avoid a situation where there are no pods available?

@freehan
Contributor

freehan commented Jan 25, 2022

When NEG is enabled, LB health checks are fed back into pod readiness: https://cloud.google.com/kubernetes-engine/docs/concepts/container-native-load-balancing#pod_readiness

To configure a custom LB health check, use a BackendConfig.
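
For reference, a minimal sketch of such a BackendConfig; the name is a placeholder, the /healthz path and port 8080 are assumed to be served by the application, and the timing values are illustrative rather than recommendations:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: my-backendconfig        # placeholder name
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz       # assumed application health endpoint
    port: 8080                  # container port the check should target
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
```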

@scarby
Author

scarby commented Jan 26, 2022

OK, I was mistaken about there being no connection; however, I'm not sure this is fit for purpose.

GKE sets the value of cloud.google.com/load-balancer-neg-ready for a Pod to True if any of the following conditions are met:

One or more of the Pod's IP addresses are endpoints in a GCE_VM_IP_PORT NEG managed by the GKE control plane. The NEG is attached to a backend service. The load balancer health check for the backend service times out.

Which is likely what is happening in my case. If my health check times out I clearly don't want my pod to be considered ready?

So going back to my original point there appears to be no way to ensure there is actually a pod ready to serve traffic.
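
For illustration, this is roughly what the injected readiness gate looks like on an affected Pod (abbreviated `kubectl get pod <name> -o yaml` output; shown only to make the mechanism concrete, not something you apply yourself):

```yaml
# Abbreviated Pod as seen on GKE with container-native load balancing.
# GKE injects the readiness gate; the condition flips to True once the NEG
# endpoint is healthy in the load balancer, or once the health check times out.
spec:
  readinessGates:
  - conditionType: cloud.google.com/load-balancer-neg-ready
status:
  conditions:
  - type: cloud.google.com/load-balancer-neg-ready
    status: "True"
```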

@kundan2707
Contributor

/kind support

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label Jan 27, 2022
@dry4ng

dry4ng commented Feb 23, 2022

It appears that the GCP load balancer creates the health check once, when the ingress is created, and then never updates it, at least from what I have observed. From there on, there is no connection between the pod state and the GCP load balancer.
I have different health checks for startup and liveness. I don't want the GCP load balancer to hit the startup health check, as it's quite heavy.

@jmcarp

jmcarp commented Apr 25, 2022

Does this controller intentionally not update backend health checks? Changing readiness probes doesn't seem to change health checks on the backend.

@swetharepakula
Member

The ingress controller waits, on pod startup, for the load balancer to consider the pod healthy before setting the readiness gate on the pod. If, after 15 minutes, the load balancer still does not consider the pod healthy, the readiness gate is set to ready anyway. The idea is to only let Kubernetes consider the pod ready once the load balancer considers the pod ready. If you require a different health check for the load balancer, it can be specified using the BackendConfig CRD.

Are your pods taking longer than 15 minutes to pass the load balancer health check?

@goobysnack

The pods in our deployment can take up to 90s to fully initialize and pass the readiness probe (yay Java!). The load balancer health check is just hitting the Tomcat listener. THIS ALONE passes before pod readiness passes, and marks the NEGs as ready. It seems that the load balancer backend shouldn't forward traffic to a pod unless both the backend health check and the pod readiness are in a good state.

@goobysnack

I opened a Google case; their response was "by design", and they offered to open a feature request. That seems more like a bug than a feature.

My response:

This seems like a bug to fix, not a feature request. Why would you bypass k8s readiness probes just because the ingress check passes? That makes zero sense and undermines the purpose of readiness probes.

@thomas-riccardi

thomas-riccardi commented Jun 6, 2022

@goobysnack same story here, the GCP support ended up opening this feature request: https://issuetracker.google.com/230729446 for us.

After reading the code, issues, and design docs for the readiness gates and this ingress-gce controller, I believe this is a non-trivial issue to fix, because the whole design of the readiness gates relies on transmitting the GCLB programming success to other components via the Pod Ready condition.
We are at a deadlock:

  • for proper rolling update, Deployment & co use the Pod Readiness condition to know when the new pods actually receive traffic from GCLB: gce-ingress-controller marks the Pod Ready (via the readiness gates) after it has successfully added it to the GCLB; that's the whole goal of the Readiness Gate feature.
  • We would like the gce-ingress-controller to ignore Pods that are not Ready (yet?)

Maybe a way forward would be for the gce-ingress-controller to use the Pod's readiness minus its own gclb-readiness-gate; but that information is not exposed in Endpoints/EndpointSlices (we only have ready).

In the meantime, a possible solution would be a sidecar container which computes that value by self-inspection (probably asking the k8s API for its own pod status to get the individual container conditions; that doesn't seem ideal though), and exposes it as an HTTP endpoint to be configured as the GCLB health check for that Pod/Service.
I am not aware of any existing implementation of this idea, though.

For now, we have forced the old Instance Group mode everywhere (vs NEG), where traffic actually goes through Kubernetes Services, which respect the Pod Ready condition, and accepted all the limitations of this old way.
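
For completeness, opting a Service out of NEGs is a per-Service annotation; a sketch under the assumption that your GKE version still honours the opt-out (worth verifying against the GKE docs for your version):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service              # placeholder
  annotations:
    # Assumption: explicitly disabling ingress NEGs keeps this Service on the
    # legacy instance-group data path, which respects pod readiness via kube-proxy.
    cloud.google.com/neg: '{"ingress": false}'
spec:
  selector:
    app: my-app                 # placeholder
  ports:
  - port: 80
    targetPort: 8080
```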

@swetharepakula
Member

Thanks @thomas-riccardi for the great explanation!

Currently the load balancer health checks are the only signal we can provide to the load balancer that the pod is ready to receive traffic. We do not have a solution at this time for making the load balancer Kubernetes aware.

For those affected by this now, my recommendation is to make sure that the health check on the application only passes once the application is ready to accept traffic.

@thomas-riccardi

Thanks @swetharepakula

Are there plans to improve the situation in GKE? Discussions in upstream Kubernetes, like there were for the introduction of the readiness gates?
Because otherwise it seems we will be stuck with the old Ingress+IG, also losing the much-awaited Gateway API with all the new features (plus #33, #109, ...).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 13, 2022
@goobysnack

I learned that the BackendConfig isn't attached via an annotation on the Ingress; it's attached via an annotation on the workload's Service. Once you do that, it all works like magic. That was the fine print in the documentation I missed.
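
A sketch of that attachment, reusing the hypothetical BackendConfig name from the earlier example and placeholder Service/port values:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                  # placeholder
  annotations:
    # The BackendConfig is referenced from the Service, not from the Ingress.
    cloud.google.com/backend-config: '{"default": "my-backendconfig"}'
    # Container-native load balancing (NEGs) for this Service's Ingress backends.
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: my-app                     # placeholder
  ports:
  - port: 80
    targetPort: 8080                # should match the port the LB health check targets
```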

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 13, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 12, 2022