-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graceful shutdown with injected envoy-sidecar #536
Comments
Have you tried adding |
Thanks for the issue @svobol13! This is the problem that is a larger issue in Kubernetes itself in that there is no lifecycle hooks that can help us control container shutdown. We'd need to investigate how to work around this until there's a proper solution in k8s. |
Posting a reference to a comment that I made after going down this rabbit hole, to hopefully save others some time: |
@pedrohdz I did some tests and The problem really comes from Envoy which should not stop without having completed the last received requests on SIGTERM. My workaround was add a This work fine because consul remove the service from its catalog when the SIGTERM is triggered, so I have 30 second to finish the work. Unfortunately ingress/terminating gateways suffer from the same problem and we cannot use the same workarounds... 😕 (preStop already set) |
I think we need to do something similar to ECS: https://www.consul.io/docs/ecs/architecture#task-shutdown where we deregister the service immediately but keep Envoy running until the application container shuts down. |
As a user I really need to be able to control the shutdown. In my case I have cli applications that are only using envoy for outbound connections. Some of these take 1-2 minutes to gracefully stop their current work after receiving the sigterm. During that period envoy needs to stay up and available. What happens now is we get errors because envoy shuts down very quickly. Being able to add a simple prestop hook with a sleep to envoy would make it simple for me to do this. |
Any updates on this? We just an annotation as suggested in https://github.com/hashicorp/consul-k8s/pull/911/files. Here is how linkerd handles it https://linkerd.io/2.11/tasks/graceful-shutdown/. I think this issue should be given really high priority as it is actually impossible to deployment in kubernetes with connection errors. The problem also arises any an HPA scales down pods. |
I've resorted to running a custom built binary with a patch containing the changes here #911. It is the only way to run Consul Connect in production at present without getting 5xx errors during deployments and scale downs. The product definitely needs to look at this issue and make it a priority as this is a basic piece of it being production ready. |
Hi @dschaaff thanks for the feedback. We are monitoring this issue as well, aside from other items we have targeted for our next releases tied with Consul Core that are more architecturally related. I can't definitely say when we will address this but I do want to support a native solution in Consul K8s. |
We are also facing this issue wherein our app has some draining configured. But as the pod receives SIGTERM, envoy immediately shuts down while the app is still draining. Graceful termination would be very important and helpful to us especially since its a high traffic app and any connection issues get quickly noticed and reported. As @dschaaff rightly mentioned this is important for release to production. |
I continue to build a forked image of the control plane binary for each release in order to add a prestop hook to the envoy sidecar. It's quite disappointing that this feature hasn't been added. This issue has been open for over a year and this remains a blocker to production use of consul connect. |
Hi @dschaaff and @narendrapatel thanks for the feedback. I don't disagree that it is important to address and a blocker to getting to production. Right we are at a point of competing priorities due to large architecture changes within Consul that we are actively working on. |
Hi @dschaaff, If possible, can you please share how are you building the image. Here is what I tried:
I tested it in my local setup and confirm it is working as expected but not sure of the build process. |
I use this docker file to build
|
Does anyone use Consul Mesh/Connect in production? |
@dschaaff Thank you for the Dockerfile! |
The big 1.0 rewrite has been released for a bit. Can anyone from HashiCorp comment on the timeline for fixing this issue? Due to the delay on this and other bugs we are facing we are considering dropping the Consul service mesh. |
@dschaaff Were taking this issue very seriously and have a solid idea of potential fixes to alleviate the problems you're having. The timeline is a little gray but with a medium, to strong probability, you will see a fix in the 2023 calendar year. |
Any news on this issue ? |
Hi @nrichu-hcp is there any update? We have multiple teams looking to go live in the next quarter with Consul Service Mesh, which we have spent 2 years arguing for vs Istio / ASM. This issue could easily force us to have to abandon Consul and migrate everyone to Istio/ASM. |
We abandoned the Consul mesh after 2 years in production. We had to run a forked build of the controller to enable adding the pre stop sleep that entire time. In the end we switched to linkerd. |
Hi, @nrichu-hcp it looks like this PR fixes the issue, any idea when it will be merged and what release it will be in? |
Envoy proxy sidecar receives SIGTERM at the exact same moment as my main container. In oppossite to my main container (which shuts down in like 15-30 seconds) envoy sidecar shuts down immediately (0.5 - 3s). This means that my main lost upstream connections and cannot gracefully shutdown - even rolling update means lost requests/data.
There should be some kind of mechanism so the upstream listeners exits as last. My main container is Consul Connect enabled and communicate with upstreams through Connect but the service itself is not being accessed through Connect but through Consul DNS instead.
Is there a workaround/hack (some kind of prestop sleep) or do I have get rid of Consul Connect?
Similar issues:
The text was updated successfully, but these errors were encountered: