Provide a way to shutdown the process from another container #828
Seems that there was a proposal for Kubernetes-native functionality for this, but it will be reworked and there is no timeline for it: kubernetes/enhancements#753. I think an endpoint could be a good workaround in the meantime from the user's perspective.
I'm not entirely convinced an endpoint makes the most sense here. This problem was discussed in #128, and #206 added the -term_timeout flag. You could start the proxy with this flag, sleep for some amount of time for your application to start up, and then send a SIGTERM to the proxy. It'll exit gracefully when the active connections hit 0 (or ungracefully when the term_timeout is reached). Additionally, this method was suggested as a workaround in the original issue. It essentially uses a sentry file to tell a script when to kill the proxy. Both seem like equal (or better?) options compared to an endpoint until k8s adds better support.
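Roughly, the first approach looks like this (a sketch only; the instance name, port, timeout, and sleep duration are illustrative):

#!/bin/sh
# Start the proxy with a generous term_timeout.
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 -term_timeout=300s &
PROXY_PID=$!
# Give the application container time to start up and open its connections.
sleep 30
# Ask the proxy to shut down; it exits once active connections hit 0,
# or ungracefully when term_timeout is reached.
kill -TERM "$PROXY_PID"
wait "$PROXY_PID"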
@villesau Does the -term_timeout approach described above work for your use case?
There's some prior art here with Linkerd: linkerd/linkerd2-proxy#811. The correct answer is definitely Kubernetes handling this natively. In every other case it's a workaround.
@arbourd Can you clarify this? To be clear, -term_timeout is an upper bound: after the proxy receives a SIGTERM, it exits as soon as active connections hit 0, or when the timeout elapses, whichever comes first.
Understood. Some of our jobs may run for many hours, while others (like migrations) complete reasonably quickly; the migration task is far, far more common. It feels strange to set a 5h -term_timeout.
I just ran into this while trying to use the proxy for Postgres migrations. What do you all think about something like an explicit shutdown endpoint?
If we were to support this, I think following Linkerd would be the way to go.
Yes, I would also love to see this implemented. My use case is documented in #1115, but briefly: it is for Apache Airflow deployed on Kubernetes using the official Helm chart, where the sidecar needs to be killed semi-manually after the main app has finished its work with the database.
Another option would be to follow Istio with its /quitquitquit endpoint.
Having an explicit shutdown endpoint (like /quitquitquit) would be very helpful here.
This is not the behavior I am currently seeing with the proxy sidecar integration in the pega-helm-charts installer subchart. After the installer container completes, checking the cloud-sql-proxy logs, there are just as many
To be clear, setting -term_timeout isn't enough on its own; the proxy still needs to receive a SIGTERM before it starts waiting for connections to close. A workaround proposed elsewhere (I'm not finding the comment at the moment) is to start the proxy in the alpine or buster container like so:
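(A minimal sketch of that wrapper, assuming a sentry file at /tmp/pod/terminated on a volume shared with the main container; the instance name and flag values are illustrative.)

#!/bin/sh
# Run the proxy in the background so this wrapper can watch for the sentry file.
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 -term_timeout=30s &
PROXY_PID=$!
# The main container touches this file when its work is done.
until [ -f /tmp/pod/terminated ]; do
  sleep 1
done
# Ask the proxy to shut down gracefully, then wait for it to exit.
kill -TERM "$PROXY_PID"
wait "$PROXY_PID"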
The idea here is that the alpine and buster images include a shell, so the proxy can run under a small wrapper that waits for the main container to finish and then sends the proxy a SIGTERM.
Ah, thanks @enocom. It seems I was missing the part about actually sending the proxy a SIGTERM.
For completeness, here's a blog post that describes how to use a PID file to accomplish the same. In plain bash the approach translates to:

$ /cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306 \
    -credential_file=/secrets/cloudsql/credentials.json &
$ CHILD_PID=$!
$ (while true; do
    if [[ -f "/tmp/pod/terminated" ]]; then
      kill $CHILD_PID
      echo "Killed $CHILD_PID because the main container terminated."
    fi
    sleep 1
  done) &
$ wait $CHILD_PID
$ if [[ -f "/tmp/pod/terminated" ]]; then
    echo "Job completed. Exiting..."
    exit 0
  fi
I think an endpoint like /quitquitquit is the right approach here. Relying on timeouts could result in flaky behavior, because it assumes that past a certain point in time, once connections drop to 0, the proxy can exit. I believe that comes from assuming that the other container only uses one connection, and that it is reasonable to try to predict how long database operations will take to complete or how long the other containers themselves will take to start. IMO those are not reasonable assumptions; therefore, I think it needs to be stated that using timeouts to enter the exiting state is not an acceptable solution for this issue. However, using a timeout to wait for connections to close after the proxy has been explicitly told to shut down seems perfectly fine.
Adding a suggestion from @taylorchu from here:
Random thought: we could add support for SIGQUIT or similar to have the proxy shut down. That might sidestep the problem of securing an HTTP endpoint.
The current plan is to add a localhost-only /quitquitquit endpoint.
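As a rough sketch of how a Job container might use it, assuming the endpoint ends up on a localhost-only admin port (9091 here) and accepts a plain POST; the port and the migration command are illustrative:

#!/bin/sh
# Run the actual work, then ask the proxy sidecar to exit so the Job can complete.
./run-migrations.sh
curl -s -X POST http://localhost:9091/quitquitquit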
I found the quitquitquit endpoint is not very scalable, and it could easily create race conditions. For example,
Now imagine the count is not 2 but >= 3, and you have many apps like this. All this additional, complex code gets added because we run things on k8s; the application is bloated because of the "environment" you run it on. I think this issue is more of a k8s problem and therefore out of scope.
It's definitely a K8s problem. The idea here is to ease the pain for people running jobs or init containers. And we'll make it opt-in with a CLI flag, so by default the behavior will be off. What's a keystone container? I don't know that phrase and don't find much when searching for it.
We'll be releasing this in the next version of v2, currently scheduled for February 21.
@enocom Question about this: will you also provide a way to have the proxy ignore SIGTERM, or is that already possible? Consider the following case:
It doesn't matter if there is an endpoint the app can hit if the proxy is getting a SIGTERM from k8s; there needs to be a way to keep the proxy running. I could use the entrypoint hacks I've seen around GitHub to catch SIGTERM in the entrypoint and ignore it, but it would be nice to support this in the proxy process itself! Maybe another endpoint the app could call to cancel shutdown? Or will the quitquitquit endpoint support setting a hard timeout, so the app could force the proxy to stay alive for X seconds after getting a SIGTERM?
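For reference, the entrypoint hack I mean usually looks something like this (a sketch only; the instance name is illustrative):

#!/bin/sh
# Kubernetes only signals PID 1, so trapping SIGTERM here keeps the proxy alive;
# something else (e.g. the quitquitquit endpoint) then has to stop it.
trap 'echo "Ignoring SIGTERM"' TERM
/cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:5432 &
PROXY_PID=$!
# wait returns early when the trap fires, so loop until the proxy itself exits.
while kill -0 "$PROXY_PID" 2>/dev/null; do
  wait "$PROXY_PID"
done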
Take a look at the -term_timeout flag (--max-sigterm-delay in v2).
So I guess I was misunderstanding how that worked. I thought using it meant the proxy would stay alive only while there are pending connections; I'm actually using that switch currently. Your comment suggests that when the proxy gets a SIGTERM it keeps running for the full duration, so I'm wrong in thinking it exits as soon as connections close? @enocom I'm actually using v1 and -term_timeout.
If you have open connections, the proxy will stay alive until the duration elapses. Otherwise it will shut down early. In either case, let's pull this discussion into a new feature request and we can explore whether there's something that needs to be changed here or elsewhere.
In case someone is struggling with the same problem, I've come up with something that works for me.
The -term_timeout flag doesn't work, and I'm going to have to insist on the fact that it does not work. It works locally on my Mac, yes, but it does not work in the gcr.io/cloudsql-docker/gce-proxy container, where it absolutely should.
Let's talk through the details here: #1742.
@dastanko's answer helped me a lot in setting up a post-install/post-upgrade Helm hook for a Kubernetes Job that runs Django migrations.
@GuillaumeCisco Does that change your opinion on the need for #1640?
No, it is a totally different thing.
The Cloud SQL Proxy and pgbouncer communicate over connections that stay open, so waiting for connections to close never triggers. I'm still hacking around trying to resolve this issue. I did what you suggested by creating a connection pool with pgbouncer, but I don't think it is part of the solution; it just moves the issue somewhere else. In my opinion, the easiest way could be simply to declare a "main" container in a pod, so that only this one gets the SIGTERM signal and shuts down gracefully, and then the other sidecar containers shut down. Another hacky way I thought about is to run the sidecar containers with a
I'm still scratching my head trying to understand how to address this issue.
"sidecar using Kubernetes-style init containers" |
Adding the Cloud SQL Proxy as an init container with restartPolicy=Always did the trick. Thanks.
Feature Description
It is a common practice to run database migrations in an init container and wait for the migration job to finish. However, since the proxy never shuts down, the init script stalls. It would be good to provide a way to shut down the proxy gracefully, for example an endpoint that other containers can call when they are finished.
Alternatives Considered
Not sure, really. If Kubernetes had a native way to shut down the whole pod and deliver the signal to the container, that could work too. But I think an endpoint would be easier and more straightforward.
Additional Context
An example tutorial on how to run migrations with init containers, kubernetes jobs and k8s-wait-for: https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-8-running-database-migrations-using-jobs-and-init-containers/