Graceful shutdown with injected envoy-sidecar #536

svobol13 · 2021-06-16T16:17:25Z

The Envoy proxy sidecar receives SIGTERM at the exact same moment as my main container. Unlike my main container (which takes 15-30 seconds to shut down), the Envoy sidecar shuts down almost immediately (0.5-3 s). This means that my main container loses its upstream connections and cannot shut down gracefully; even a rolling update means lost requests/data.

There should be some kind of mechanism so that the upstream listeners exit last. My main container is Consul Connect enabled and communicates with upstreams through Connect, but the service itself is accessed through Consul DNS rather than through Connect.

Is there a workaround/hack (some kind of preStop sleep), or do I have to get rid of Consul Connect?

Similar issues:

pedrohdz · 2021-06-18T10:11:18Z

Have you tried adding terminationGracePeriodSeconds to your pod? You might end up with #540 instead then.

ishustava · 2021-07-22T04:10:31Z

Thanks for the issue @svobol13! This is part of a larger issue in Kubernetes itself, in that there are no lifecycle hooks that can help us control container shutdown. We'd need to investigate how to work around this until there's a proper solution in k8s.

kevin-lindsay-1 · 2021-08-09T20:36:42Z

Posting a reference to a comment that I made after going down this rabbit hole, to hopefully save others some time:
istio/istio#18333 (comment)

Joxit · 2021-12-30T10:57:37Z

@pedrohdz I did some tests and terminationGracePeriodSeconds is not enough. As you can see in this image, the grace period starts after the SIGTERM signal; at that point it's already too late.

The problem really comes from Envoy, which should not stop on SIGTERM before it has completed the requests it already received.

My workaround was to add a preStop: sleep 30s on my container plus terminationGracePeriodSeconds, and to override the Envoy Docker image so that it ignores SIGTERM.

This works fine because Consul removes the service from its catalog when the SIGTERM is triggered, so I have 30 seconds to finish the work.
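
For readers who want to try the application-container half of this workaround, here is a minimal sketch of what it might look like in a pod spec. The pod name, image, and the 60-second grace period are hypothetical placeholders, not values from this thread; only the 30-second sleep comes from the description above. It does not cover the custom Envoy image that ignores SIGTERM.

```yaml
# Sketch only: name, image, and the grace period are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-app                          # hypothetical name
spec:
  # Total time Kubernetes waits before sending SIGKILL. The preStop hook
  # runs inside this window, so it should exceed the sleep plus the
  # application's own shutdown time.
  terminationGracePeriodSeconds: 60
  containers:
    - name: my-app
      image: example.com/my-app:latest  # hypothetical image
      lifecycle:
        preStop:
          exec:
            # SIGTERM is sent to this container only after the hook returns,
            # so this delays it by 30 seconds.
            # Requires a `sleep` binary in the image.
            command: ["sleep", "30"]
```

Because the hook is set on the application container only, the injected sidecar still receives SIGTERM right away, which is why the second half of the workaround (an Envoy image that ignores SIGTERM) is needed.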

Unfortunately ingress/terminating gateways suffer from the same problem and we cannot use the same workarounds... 😕 (preStop already set)

lkysow · 2022-01-14T22:22:18Z

I think we need to do something similar to ECS: https://www.consul.io/docs/ecs/architecture#task-shutdown where we deregister the service immediately but keep Envoy running until the application container shuts down.

dschaaff · 2022-01-15T02:06:07Z

As a user I really need to be able to control the shutdown. In my case I have CLI applications that only use Envoy for outbound connections. Some of these take 1-2 minutes to gracefully stop their current work after receiving the SIGTERM. During that period Envoy needs to stay up and available. What happens now is that we get errors because Envoy shuts down very quickly. Being able to add a simple preStop hook with a sleep to Envoy would make it simple for me to do this.

dschaaff · 2022-04-13T23:41:58Z

Any updates on this? We just need an annotation as suggested in https://github.com/hashicorp/consul-k8s/pull/911/files.

Here is how linkerd handles it https://linkerd.io/2.11/tasks/graceful-shutdown/.

I think this issue should be given really high priority, as it is actually impossible to deploy in Kubernetes without connection errors. The problem also arises any time an HPA scales down pods.

dschaaff · 2022-05-18T15:39:48Z

I've resorted to running a custom-built binary with a patch containing the changes from #911. It is the only way to run Consul Connect in production at present without getting 5xx errors during deployments and scale-downs. The product team definitely needs to look at this issue and make it a priority, as this is a basic part of being production ready.

david-yu · 2022-06-22T00:30:49Z

Hi @dschaaff, thanks for the feedback. We are monitoring this issue as well, alongside other items targeted for our next releases that are tied to Consul Core and are more architectural in nature. I can't say definitively when we will address this, but I do want to support a native solution in Consul K8s.

narendrapatel · 2022-08-05T12:28:06Z

We are also facing this issue: our app has some draining configured, but when the pod receives SIGTERM, Envoy shuts down immediately while the app is still draining. Graceful termination would be very important and helpful to us, especially since it's a high-traffic app and any connection issues get quickly noticed and reported. As @dschaaff rightly mentioned, this is important for release to production.

dschaaff · 2022-08-05T17:45:25Z

I continue to build a forked image of the control plane binary for each release in order to add a prestop hook to the envoy sidecar. It's quite disappointing that this feature hasn't been added. This issue has been open for over a year and this remains a blocker to production use of consul connect.

david-yu · 2022-08-06T17:32:59Z

Hi @dschaaff and @narendrapatel, thanks for the feedback. I don't disagree that it is important to address and a blocker to getting to production. Right now we are at a point of competing priorities due to large architecture changes within Consul that we are actively working on.

narendrapatel · 2022-08-11T09:01:01Z

Hi @dschaaff, if possible, can you please share how you are building the image? Here is what I tried:

Forked the repo and checked out the release (v0.46.1 in my case)
Added the annotation changes, ref: https://github.com/narendrapatel/consul-k8s/pull/1/files
Finally built the image with: make control-plane-dev-docker DEV_IMAGE=consul-k8s-control-plane:0.46.0

I tested it in my local setup and can confirm it works as expected, but I'm not sure about the build process.

dschaaff · 2022-08-11T17:07:37Z

I use this Dockerfile to build:

FROM public.ecr.aws/docker/library/golang:1.18.4-alpine3.15 as build
ARG TARGETOS
ARG TARGETARCH

COPY . /go

RUN cd /go/control-plane && \
	set -x; go build -o pkg/bin/consul-k8s-control-plane

# final image
# we are simply copying our custom built binary over the standard binary in the image
FROM hashicorp/consul-k8s-control-plane:0.46.1

ARG TARGETOS
ARG TARGETARCH

COPY --from=build /go/control-plane/pkg/bin/ /bin

alt-dima · 2022-08-25T20:18:53Z

Does anyone use Consul Mesh/Connect in production?
I can't understand how it can be used as is (without a patch like this) while avoiding errors in applications that need time to finish their jobs.
Maybe there is something new/fixed in 1.13.1?

alt-dima · 2022-09-27T13:26:06Z

@dschaaff Thank you for the Dockerfile!
I use it with a combination of custom-image and "dynamic entrypoint" (#1397 (comment))

dschaaff · 2022-12-06T20:03:56Z

The big 1.0 rewrite has been released for a bit. Can anyone from HashiCorp comment on the timeline for fixing this issue? Due to the delay on this and other bugs we are facing we are considering dropping the Consul service mesh.

nrichu-hcp · 2022-12-09T17:44:11Z

@dschaaff We're taking this issue very seriously and have a solid idea of potential fixes to alleviate the problems you're having. The timeline is a little gray, but with a medium to strong probability you will see a fix in the 2023 calendar year.

coconut30 · 2023-02-24T10:03:26Z

Any news on this issue?

oliver-buckley-salmon-db · 2023-04-13T16:29:02Z

Hi @nrichu-hcp is there any update? We have multiple teams looking to go live in the next quarter with Consul Service Mesh, which we have spent 2 years arguing for vs Istio / ASM. This issue could easily force us to have to abandon Consul and migrate everyone to Istio/ASM.
If we could just have a quarter date, that would be great.

dschaaff · 2023-04-13T18:14:42Z

We abandoned the Consul mesh after 2 years in production. We had to run a forked build of the controller to enable adding the pre stop sleep that entire time. In the end we switched to linkerd.

oliver-buckley-salmon-db · 2023-04-14T07:02:38Z

Hi, @nrichu-hcp it looks like this PR fixes the issue, any idea when it will be merged and what release it will be in?

svobol13 added the type/question Question about product, ideally should be pointed to discuss.hashicorp.com label Jun 16, 2021

ishustava added area/connect Related to Connect service mesh, e.g. injection type/enhancement New feature or request and removed type/question Question about product, ideally should be pointed to discuss.hashicorp.com labels Jul 22, 2021

narendrapatel mentioned this issue Sep 9, 2022

Adding support for lifecycle hooks and health probe for sidecars #1482

Closed

2 tasks

alt-dima mentioned this issue Sep 27, 2022

helm:Envoy sidecar shutting down too early causes requests to fail #650

Closed

david-yu mentioned this issue Dec 13, 2022

New liveness probes on consul-dataplane container prevents pods created by Kubernetes jobs to terminate #1791

Closed

Pveasey mentioned this issue Jan 27, 2023

V0.43.0 hmhco hmhco/consul-k8s#1

Merged

2 tasks

This was referenced May 10, 2023

cmd: add CLI flags for proxy shutdown lifecycle management hashicorp/consul-dataplane#100

Merged

acceptance: add connect proxy lifecycle shutdown test #2119

Closed

mikemorris mentioned this issue May 18, 2023

proxy-lifecycle: add HTTP Server with endpoints for proxy lifecycle shutdown hashicorp/consul-dataplane#115

Merged

1 task

major-hmhco mentioned this issue May 24, 2023

Release 0.49.5 supporting [email protected] hmhco/consul-k8s#2

Merged

2 tasks

This was referenced May 31, 2023

proxy-lifecycle: catch SIGTERM and initiate graceful shutdown hashicorp/consul-dataplane#130

Merged

helm: add configuration for proxy lifecycle management #2233

Merged

curtbushko closed this as completed in #2233 Jun 27, 2023

This was referenced Jun 27, 2023

Backport of helm: add configuration for proxy lifecycle management into release/1.0.x #2465

Closed

Backport of helm: add configuration for proxy lifecycle management into release/1.1.x #2466

Closed
