helm:Envoy sidecar shutting down too early causes requests to fail #650

Closed
nflaig opened this issue Mar 17, 2021 · 13 comments

Comments

@nflaig

nflaig commented Mar 17, 2021

Hey guys,

During a rolling update (pod termination) we are getting 502 errors from nginx for upstream requests. This happens because the Kubernetes load balancer still sends requests to the terminating pod, since endpoint deregistration happens asynchronously, see kubernetes/kubernetes#43576.

For nginx itself, I was able to resolve this race condition by adding a simple sleep to the container's preStop hook:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5 && /usr/sbin/nginx -s quit"]

but the problem is not fully resolved, since the envoy sidecar still shuts down too early, which causes requests to fail.

Current behavior

Nginx returns 502 errors because the upstream requests are sent through the envoy sidecar proxy, which is already shutting down due to the SIGTERM sent by k8s.

Expected behavior

The envoy sidecar proxy should handle requests of the proxied service as long as the service is still running, e.g. if the service does some cleanup and still needs to send data when terminating, or, in the case of nginx, if further requests are still being routed to the pod.

Suggestion

A delay could be added to the envoy preStop command so that the SIGTERM signal is not sent to the container immediately. Since a hardcoded sleep duration seems too static, maybe it would make sense to add something like this

while [ $(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs) -ne 0 ]; do sleep 1; done

It would delay the shutdown of envoy until there are no TCP listeners left other than envoy's own, i.e. the proxied service is no longer running and it is safe to also shut down envoy without causing further requests to fail.
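
For illustration, the idea could be wired into the envoy sidecar roughly like this (the container name is hypothetical, and the loop assumes netstat is available in the sidecar image):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # wait until no TCP listeners other than envoy's own remain, i.e. the
          # proxied service has shut down, before SIGTERM reaches envoy
          while [ "$(netstat -plunt | grep tcp | grep -v envoy | wc -l | xargs)" -ne 0 ]; do
            sleep 1
          done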

Another option could be to allow the user of the helm chart to customize the envoy preStop hook.

Environment details

  • consul-k8s version: 0.24.0
  • consul-helm version: 0.30.0
  • consul version: 1.9.3
  • envoy version: 1.16.0

Related

@nflaig nflaig changed the title Envoy sidecar shutting down to early causes requests to fail Envoy sidecar shutting down too early causes requests to fail Mar 17, 2021
@thisisnotashwin
Contributor

hey @nflaig !! Thanks so much for bringing this to our attention. Also, the attention to detail as well as the suggestions in the issue are really appreciated!!

The team is stretched a little thin at the moment and we might not be in a position to solve this in the next release but will try and prioritize this for the one after.

Will keep this issue updated with the status of the fix! Thanks again!!

@marcusweinhold

marcusweinhold commented Jul 9, 2021

Having the same problem. Maybe a more flexible solution would be calling something like curl http://127.0.0.1:19000/drain_listeners; sleep $SOME_CONFIGURABLE_VALUE in the container's preStop command?
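
For illustration, that could look roughly like this in the sidecar container spec (127.0.0.1:19000 is the default admin address for the Connect sidecar; recent Envoy versions expect a POST on this endpoint, and DRAIN_SLEEP_SECONDS is just a stand-in for $SOME_CONFIGURABLE_VALUE):

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # ask envoy to gracefully drain its listeners via the admin API,
          # then keep the container alive so in-flight requests can finish
          curl -s -X POST "http://127.0.0.1:19000/drain_listeners?graceful" || true
          sleep "${DRAIN_SLEEP_SECONDS:-20}"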

Edit: using the latest helm chart 0.32.1

@t-eckert t-eckert changed the title Envoy sidecar shutting down too early causes requests to fail helm:Envoy sidecar shutting down too early causes requests to fail Aug 24, 2021
@t-eckert t-eckert transferred this issue from hashicorp/consul-helm Aug 24, 2021
@Samjin

Samjin commented Nov 1, 2021

Hi @nflaig, doesn't terminationDrainDuration solve the problem, or am I missing something?

@nflaig
Author

nflaig commented Nov 1, 2021

Hi @Samjin, it does not solve the problem, as it still prevents new connections.

@ryan4yin

@Samjin only preStop can delay sending the SIGTERM signal to Envoy.
Without preStop, Envoy receives the SIGTERM signal immediately and rejects all new connections.

@Samjin

Samjin commented Nov 17, 2021

@ryan4yin Let me know if I understand this correctly. While the pod IP is being removed from the iptables rules, istio-proxy and your service have already received SIGTERM and stopped accepting new requests; this causes a problem because requests are still coming in through the still-available pod IP.
If my understanding is correct, wouldn't adding a preStop hook to both the service and istio-proxy containers work, so they can still accept requests until the pod IP is removed?
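
For illustration, that could look like this, with hypothetical container names and an arbitrary delay chosen to outlast endpoint removal:

containers:
  - name: my-service                # hypothetical application container
    image: my-service:latest
    lifecycle:
      preStop:
        exec:
          # keep serving until the pod IP has been removed from endpoints
          command: ["/bin/sh", "-c", "sleep 30"]
  - name: sidecar-proxy             # hypothetical sidecar container
    image: envoyproxy/envoy:v1.16.0
    lifecycle:
      preStop:
        exec:
          # same delay so the proxy outlives the last application request
          command: ["/bin/sh", "-c", "sleep 30"]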

@ryan4yin

ryan4yin commented Nov 18, 2021

@Samjin Correct, that's what this issue suggested.

@hamishforbes
Contributor

I'm seeing a similar issue but using the AWS ALB Controller
My app receives public HTTP requests via an ALB and forwards them on to other services via Consul Connect.

We also make extensive use of long polling, so we often have HTTP requests that take 27s to return.
I have set a deregistration delay and a preStop that sleeps for 30s to allow these requests to complete.

There's also a small lag between a pod being marked as terminating and the ALB being fully updated to not send new requests to the pod.

At the moment, though, as soon as the pod is marked terminating, the preStop hook fires on my container and the envoy sidecar immediately shuts down.
Some requests sneak through before the ALB updates; my app tries to forward them on, finds the envoy sidecar's listeners are closed, and so returns an error.

Being able to set a preStop on the envoy container as well would resolve this problem entirely.
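
For reference, the pieces I already have look roughly like this (the annotation comes from the AWS Load Balancer Controller; values are illustrative). The missing piece is an equivalent hook on the envoy sidecar:

# Ingress: let the ALB finish draining a deregistering target
metadata:
  annotations:
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30

# App container: hold the pod open long enough for the 27s long-poll requests
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]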

@bmihaescu

> I'm seeing a similar issue but using the AWS ALB Controller. My app receives public HTTP requests via an ALB and forwards them on to other services via Consul Connect.

@hamishforbes, what did you configure on the ALB ingress controller side to make it work with Consul Connect? I'm not able to send requests to a service that's under Consul Connect by accessing it through the ALB.

@stk0vrfl0w

stk0vrfl0w commented Mar 17, 2022

We're seeing similar issues with release 0.41.1.

In our case, the mesh-gateway container's probes fail while the envoy proxy is still initializing. As a result, we see ENVOY_SIGTERM messages in the logs and the pods end up in a perpetual CrashLoopBackOff state. We suspect the long initialization times may be due to the number of federated clusters we have connected -- just over a dozen and counting. Fortunately, we've been able to work around this by manually tweaking the deployment's probe parameters.

As we're using helmfile and kustomize internally, our deployment pipeline can dynamically patch the charts, though it would be beneficial to everyone if the probe settings were parameterized in values.yaml and a startupProbe stanza were added.
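
For example, something along these lines on the mesh-gateway deployment would cover our case (the port and thresholds are assumptions and would need to match the chart's defaults):

startupProbe:
  tcpSocket:
    port: 8443          # assumed mesh gateway container port
  periodSeconds: 10
  failureThreshold: 30  # allow up to ~5 minutes for envoy to finish initializing

Until the startupProbe succeeds, the kubelet does not run the liveness and readiness probes, so slow initialization would no longer trigger restarts.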

@narendrapatel

narendrapatel commented Aug 5, 2022

We have a similar issue where our app container has draining configured. On termination it starts draining requests, with a maximum time of e.g. 3 minutes. But since the sidecar has no corresponding config, it shuts down immediately. The problem is that when draining requests make a connection to the sidecar during this period, they receive a "connection refused" error because the sidecar has already stopped, resulting in failures and noise in the logs.
Is there a way we can add a preStop hook for the sidecars?
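
For illustration, the kind of override we would want, assuming the sidecar's preStop could be customized (names and durations are hypothetical, matched to our ~3 minute drain):

spec:
  terminationGracePeriodSeconds: 200   # must exceed the app drain (180s) plus the sidecar delay
  containers:
    - name: my-app                     # drains in-flight requests for up to 180s on SIGTERM
      image: my-app:latest
    - name: envoy-sidecar              # hypothetical: the chart would need to expose this hook
      image: envoyproxy/envoy:v1.16.0
      lifecycle:
        preStop:
          exec:
            # keep the proxy alive while the app is still draining
            command: ["/bin/sh", "-c", "sleep 180"]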

@alt-dima

Already being discussed: #536

@david-yu
Contributor

Closing as the pod shutdown use case of sidecar lifecycle should now be addressed by #2233. Please open a new issue if you still have issues.
