Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orderly router shutdown and resource cleanup #293

Closed
kgiusti opened this issue Apr 6, 2022 · 6 comments · Fixed by #505
Closed

Orderly router shutdown and resource cleanup #293

kgiusti opened this issue Apr 6, 2022 · 6 comments · Fixed by #505
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@kgiusti
Copy link
Contributor

kgiusti commented Apr 6, 2022

On shutdown the router stops all processes "mid stream" and attempts to manually clean up resources that remain in use. This is a best effort attempt and involves bespoke cleanup code for just about every system in the router. Here is an example where the router core attempts to release resources for all links that were active at the point of shutdown.

This is error prone and triggers many leak events in ASAN. This makes it nearly impossible to determine if a leak occurs during runtime (probably a serious issue) or just at shutdown (sloppy, but not service affecting).

The router's shutdown needs to be refactored to instead perform an orderly shutdown controlled by management. This feature would cause management to:

  • close the $management subscription to prevent further external management access.
  • delete all listeners and connectors to prevent new connections being established.
  • close all active connections
  • stop any periodic services (idle link scrubber, stuck delivery detection, etc).

This will trigger the existing run-time cleanup for all outstanding messages, dispositions, etc. Once all connections finish closing, all in-flight messages are released, and other provisioned entities and services have been released, the subsystems (core, proactor, etc) can clean up and shutdown.

The devil is in the details for sure - this will be no small job.

@kgiusti kgiusti added the enhancement New feature or request label Apr 6, 2022
@kgiusti kgiusti self-assigned this Apr 6, 2022
@kgiusti
Copy link
Contributor Author

kgiusti commented Apr 6, 2022

Needs fixing: #156

@kgiusti
Copy link
Contributor Author

kgiusti commented Apr 22, 2022

Goal: We want avoid runtime leaks in the router

Old way: All leak analysis at process exit (by asan or alloc pool leak detection)

New way: Shutdown all listeners, terminate connections, and OTHER STUFF; and then,
run leak analysis at a time prior to process exit.
The key thing here is to detect leaks only for the desired scope, not for the whole router process.

@jiridanek
Copy link
Contributor

jiridanek commented Apr 23, 2022

I'm sorry but I still don't much understand this.

  • Runtime leaks are quite different problem from orderly shutdown. You can have one and not the other, vice versa. Consider the memory leaks you can have in Java and JavaScript, for example. (You keep reference to something you should not, or somebody else keeps reference on your behalf (how insidious!), and so gc cannot gc it. Or you keep reference longer than necessary, so things get eventually collected, but you might get OOM in the meantime.)
  • I do not see what the new way would translate into, practically. How does it apply to, for example, Mick's asan leak in qd_policy_c_counts_alloc #253 or to Leaked termini and qdr_link_t at router shutdown #335?

@jiridanek
Copy link
Contributor

Even simpler trial case of a leak to be judged by the new rules, #156

@jiridanek
Copy link
Contributor

jiridanek commented Jun 22, 2022

The key thing here is to detect leaks only for the desired scope, not for the whole router process.

I guess this is the main stumbling block for me. How do you do that? How do you tell that "this buffer was 'leaked' because we did not bother to free it (yet). How can you know it is not truly leaked because there is a bug in the code and it would never be cleared up, even if the connection it is related to was closed (while the router continues running)"?

@kgiusti
Copy link
Contributor Author

kgiusti commented Mar 24, 2023

Won't fix.

It has been decided to maintain the current resource cleanup architecture

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants