helm:Envoy sidecar shutting down too early causes requests to fail #650

Closed
nflaig opened this issue Mar 17, 2021 · 13 comments

Comments

@nflaig

nflaig commented Mar 17, 2021

Hey guys,

During a rolling update (pod termination) we are getting 502 errors from nginx for upstream requests. This happens because the Kubernetes load balancer still sends requests to the terminating pod, since endpoint deregistration happens asynchronously, see kubernetes/kubernetes#43576.

For nginx itself, I was able to resolve this race condition by adding a simple sleep to the container's preStop hook:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5 && /usr/sbin/nginx -s quit"]

but the problem is not fully resolved, since the envoy sidecar still shuts down too early, which causes requests to fail.

Current behavior

Nginx returns 502 errors because the upstream requests are sent through the envoy sidecar proxy, which is already shutting down due to the SIGTERM sent by k8s.

Expected behavior

The envoy sidecar proxy should handle requests of the proxied service as long as the service is still running, e.g. if the service does some cleanup and still needs to send data when terminating, or, in the case of nginx, if further requests are still being routed to the pod.

Suggestion

A delay could be added to the envoy preStop command so that the SIGTERM signal is not sent to the container immediately. Since a hardcoded sleep duration seems too static, maybe it would make sense to add something like this

while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do sleep 1; done

It would delay the shutdown of envoy until there are no TCP listeners left other than envoy's own, i.e. the proxied service is no longer running and it is safe to also shut down envoy without causing further requests to fail.
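
For illustration, the idea could be wired into the envoy sidecar roughly like this (the container name is hypothetical, and the loop assumes netstat is available in the sidecar image):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # wait until no TCP listeners other than envoy's own remain, i.e. the
          # proxied service has shut down, before SIGTERM reaches envoy
          while [ "$(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs)" -ne 0 ]; do
            sleep 1
          done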

Another option could be to allow the user of the helm chart to customize the envoy preStop hook.

Environment details

  • consul-k8s version: 0.24.0
  • consul-helm version: 0.30.0
  • consul version: 1.9.3
  • envoy version: 1.16.0

Related

@nflaig nflaig changed the title Envoy sidecar shutting down to early causes requests to fail Envoy sidecar shutting down too early causes requests to fail Mar 17, 2021
@thisisnotashwin
Contributor

hey @nflaig !! Thanks so much for bringing this to our attention. Also, the attention to detail as well as the suggestions in the issue are really appreciated!!

The team is stretched a little thin at the moment and we might not be in a position to solve this in the next release but will try and prioritize this for the one after.

Will keep this issue updated with the status of the fix! Thanks again!!

@marcusweinhold

marcusweinhold commented Jul 9, 2021

Having the same problem. Maybe a more flexible solution would be calling something like curl http://127.0.0.1:19000/drain_listeners; sleep $SOME_CONFIGURABLE_VALUE in the container's preStop command?
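
For illustration, that could look roughly like this in the sidecar container spec (127.0.0.1:19000 is the default admin address for the Connect sidecar; recent Envoy versions expect a POST on this endpoint, and DRAIN_SLEEP_SECONDS is just a stand-in for $SOME_CONFIGURABLE_VALUE):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # ask envoy to gracefully drain its listeners via the admin API,
          # then keep the container alive so in-flight requests can finish
          curl -s -X POST "http://127.0.0.1:19000/drain_listeners?graceful" || true
          sleep "${DRAIN_SLEEP_SECONDS:-20}"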

Edit: using the latest helm chart 0.32.1

@t-eckert t-eckert changed the title Envoy sidecar shutting down too early causes requests to fail helm:Envoy sidecar shutting down too early causes requests to fail Aug 24, 2021
@t-eckert t-eckert transferred this issue from hashicorp/consul-helm Aug 24, 2021
@Samjin

Samjin commented Nov 1, 2021

Hi @nflaig, doesn't terminationDrainDuration solve the problem, or am I missing something?

@nflaig
Author

nflaig commented Nov 1, 2021

Hi @Samjin, it does not solve the problem, as it still prevents new connections.

@ryan4yin

@Samjin only preStop can delay sending the SIGTERM signal to Envoy.
Without preStop, Envoy receives the SIGTERM signal immediately and rejects all new connections.

@Samjin

Samjin commented Nov 17, 2021

@ryan4yin Let me know if I understand this correctly. While the pod IP is being removed from the iptables rules, istio-proxy and your service have already received SIGTERM and stopped accepting new requests; this causes a problem because requests are still coming in through the still-available pod IP.
If my understanding is correct, wouldn't adding a preStop hook to both the service and istio-proxy containers work, so they can still accept requests until the pod IP is removed?
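
For illustration, that could look like this, with hypothetical container names and an arbitrary delay chosen to outlast endpoint removal:

containers:
  - name: my-service                # hypothetical application container
    image: my-service:latest
    lifecycle:
      preStop:
        exec:
          # keep serving until the pod IP has been removed from endpoints
          command: ["/bin/sh", "-c", "sleep 30"]
  - name: sidecar-proxy             # hypothetical sidecar container
    image: envoyproxy/envoy:v1.16.0
    lifecycle:
      preStop:
        exec:
          # same delay so the proxy outlives the last application request
          command: ["/bin/sh", "-c", "sleep 30"]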

@ryan4yin

ryan4yin commented Nov 18, 2021

@Samjin Correct, that's what this issue suggested.

@hamishforbes
Contributor

I'm seeing a similar issue but using the AWS ALB Controller
My app receives public HTTP requests via an ALB and forwards them on to other services via Consul Connect.

We also make extensive use of long polling, so we often have HTTP requests that take 27s to return.
I have set a deregistration delay and a preStop that sleeps for 30s to allow these requests to complete.

There's also a small lag between a pod being marked as terminating and the ALB being fully updated to not send new requests to the pod.

At the moment, though, as soon as the pod is marked terminating, the preStop hook fires on my container and the envoy sidecar immediately shuts down.
Some requests sneak through before the ALB updates; my app tries to forward them on, finds the envoy sidecar's listeners are closed, and so returns an error.

Being able to set a preStop on the envoy container as well would resolve this problem entirely.
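
For reference, the pieces I already have look roughly like this (the annotation comes from the AWS Load Balancer Controller; values are illustrative). The missing piece is an equivalent hook on the envoy sidecar:

# Ingress: let the ALB finish draining a deregistering target
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30

# App container: hold the pod open long enough for the 27s long-poll requests
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]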

@bmihaescu

> I'm seeing a similar issue but using the AWS ALB Controller. My app receives public HTTP requests via an ALB and forwards them on to other services via Consul Connect.

@hamishforbes, what did you configure on the ALB ingress controller side to make it work with Consul Connect? I'm not able to send requests to a service that's under Consul Connect by accessing it through the ALB.

@stk0vrfl0w

stk0vrfl0w commented Mar 17, 2022

We're seeing similar issues with release 0.41.1.

In our case, the mesh-gateway container's probes fail while the envoy proxy is still initializing. As a result, we see ENVOY_SIGTERM messages in the logs and the pods end up in a perpetual CrashLoopBackOff state. We suspect the long initialization times may be due to the number of federated clusters we have connected -- just over a dozen and counting. Fortunately, we've been able to work around this by manually tweaking the deployment's probe parameters.

As we're using helmfile and kustomize internally, our deployment pipeline can dynamically patch the charts, though it would be beneficial to everyone if the probe settings were parameterized in values.yaml and a startupProbe stanza were added.
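
For example, something along these lines on the mesh-gateway deployment would cover our case (the port and thresholds are assumptions and would need to match the chart's defaults):

startupProbe:
  tcpSocket:
    port: 8443          # assumed mesh gateway container port
  periodSeconds: 10
  failureThreshold: 30  # allow up to ~5 minutes for envoy to finish initializing

Until the startupProbe succeeds, the kubelet does not run the liveness and readiness probes, so slow initialization would no longer trigger restarts.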

@narendrapatel

narendrapatel commented Aug 5, 2022

We have a similar issue where our app container has draining configured. On termination it starts draining requests, with a maximum time of e.g. 3 minutes. But since the sidecar has no corresponding config, it shuts down immediately. The problem is that when draining requests make a connection to the sidecar during this period, they receive a "connection refused" error because the sidecar has already stopped, resulting in failures and noise in the logs.
Is there a way we can add a preStop hook for the sidecars?
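
For illustration, the kind of override we would want, assuming the sidecar's preStop could be customized (names and durations are hypothetical, matched to our ~3 minute drain):

spec:
  terminationGracePeriodSeconds: 200   # must exceed the app drain (180s) plus the sidecar delay
  containers:
    - name: my-app                     # drains in-flight requests for up to 180s on SIGTERM
      image: my-app:latest
    - name: envoy-sidecar              # hypothetical: the chart would need to expose this hook
      image: envoyproxy/envoy:v1.16.0
      lifecycle:
        preStop:
          exec:
            # keep the proxy alive while the app is still draining
            command: ["/bin/sh", "-c", "sleep 180"]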

@alt-dima

Already being discussed: #536

@david-yu
Contributor

Closing as the pod shutdown use case of sidecar lifecycle should now be addressed by #2233. Please open a new issue if you still have issues.
