
Python (shutdown) leaks #412

Closed
jiridanek opened this issue Apr 30, 2022 · 1 comment · Fixed by #448 or #413
jiridanek commented Apr 30, 2022

Brought forward from DISPATCH-1962 Make LSan suppressions more targeted and specific (within reason)

= Python leaks

Leaks in CPython itself are known and there is an ongoing effort to fight them: https://bugs.python.org/issue1635741 (https://bugs.python.org/issue25302) and https://www.python.org/dev/peps/pep-3121

Here's a Valgrind suppression file from somebody who investigated the Python leaks and identified the harmless ones: https://github.com/libgit2/pygit2/blob/master/misc/valgrind-python.supp
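For reference, an entry in such a Valgrind suppression file has the following shape (the name and frames below are illustrative, not copied from the linked file):

```
{
   harmless-python-object-malloc
   Memcheck:Leak
   fun:malloc
   fun:_PyObject_Malloc
}
```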

= Router leaks

One example of a hidden leak in Dispatch, which is revealed by making the Python suppressions more targeted:

```
Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f78a3606e8f in __interceptor_malloc (/nix/store/g40sl3zh3nv52vj0mrl4iki5iphh5ika-gcc-10.2.0-lib/lib/libasan.so.6+0xace8f)
    #1 0x7f78a2d64afb in qd_malloc ../include/qpid/dispatch/ctools.h:229
    #2 0x7f78a2d657da in qdr_core_subscribe ../src/router_core/route_tables.c:149
    #3 0x7f78a2c83072 in IoAdapter_init ../src/python_embedded.c:711
    #4 0x7f78a2353a6c in type_call (/nix/store/r85nxfnwiv45nbmf5yb60jj8ajim4m7w-python3-3.8.5/lib/libpython3.8.so.1.0+0x165a6c)
```

The problem is in

```python
class Agent:
    ...
    def activate(self, address):
        ...
        self.io = IoAdapter(self.receive, address, 'L', '0', TREATMENT_ANYCAST_CLOSEST)
```

IoAdapter refers to Agent (through the bound method reference self.receive) and Agent refers to IoAdapter (through property self.io). Since IoAdapter is implemented in C and does not implement support for Python's cyclic GC, there is no way to break the cycle, collect the garbage, and run destructors.
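One way to break such a cycle without touching the C side is to hand the adapter a `weakref.WeakMethod` instead of the bound method itself. The sketch below uses hypothetical pure-Python stand-ins for `Agent` and `IoAdapter`, not the real classes:

```python
import weakref

class IoAdapter:
    """Pure-Python stand-in for the C IoAdapter (illustrative only)."""
    def __init__(self, handler_ref):
        self.handler_ref = handler_ref  # weak, so no strong ref back to Agent

    def deliver(self, msg):
        handler = self.handler_ref()    # resolve the weak reference
        if handler is not None:
            handler(msg)

class Agent:
    def activate(self):
        # WeakMethod holds only weak references to the instance and function
        self.io = IoAdapter(weakref.WeakMethod(self.receive))

    def receive(self, msg):
        self.last = msg

agent = Agent()
agent.activate()
agent.io.deliver("hello")
probe = weakref.ref(agent)
del agent                  # plain refcounting frees Agent; no cyclic GC needed
assert probe() is None
```

The trade-off is that the adapter must cope with the handler having gone away, which is exactly the shutdown situation the diff at the end of this issue guards against.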

A heap dump is attached to the original Jira. The bound method is at the top of the picture. (Ignore the Mock objects; I was trying to simplify the picture without causing crashes from too much meddling.)

== Random observations

It is possible to build a special debug build of Python, which has tools to detect leaks, asserts to prevent negative refcounts, etc.: https://pythonextensionpatterns.readthedocs.io/en/latest/debugging/debug_python.html#debug-version-of-python-memory-alloc-label

Use tracemalloc to detect Python leaks (instead of Valgrind):
https://docs.python.org/3/library/tracemalloc.html
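A minimal tracemalloc session looks like this (the "leak" is simulated by keeping allocations reachable):

```python
import tracemalloc

tracemalloc.start()

# Simulate a leak: ~100 KiB that stays reachable for the program's lifetime
leaky = [bytearray(1024) for _ in range(100)]

snapshot = tracemalloc.take_snapshot()
stats = snapshot.statistics("lineno")  # aggregate live allocations by source line

# The simulated leak dominates the top entry
assert stats[0].size >= 100 * 1024
tracemalloc.stop()
```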

Use https://pypi.org/project/objgraph (with Graphviz) to view heap object trees. The following renders the picture as a PNG under /tmp and prints the path to stdout:

```c
int ret = PyRun_SimpleString("import objgraph; objgraph.show_backrefs(config.global_agent, max_depth=10)\n\n");
PyErr_PrintEx(0);
assert(ret == 0);
```

== Tools

=== CPyChecker

=== Pungi

jiridanek added a commit that referenced this issue May 26, 2022
* This is revival of #259
* It attempts to fix issue described in #412
* and by the way it seems to also fix #253

## About capsules and men

The Capsule is a container for a pointer, opaque from the Python side and translucent from the C side. The C code can put a pointer in, and Python can treat the capsule as a regular PyObject and pass it to some other C code, but cannot unwrap the pointer (unless it really tries; see example code on Stack Overflow for how to break the encapsulation). The important thing for our purposes is that a Capsule can own the pointer and can have a destructor associated with it. More at https://docs.python.org/3/c-api/capsule.html

The idea of the #253 fix is to use a Capsule to associate a destructor with the pointer, and then rely on Python to eventually run GC and free the memory. Thereby the leak is prevented.
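The mechanism can be demonstrated from pure Python via ctypes (in the real fix the C code calls `PyCapsule_New` directly; everything else in this sketch is illustrative):

```python
import ctypes

# PyCapsule_New(void *pointer, const char *name, PyCapsule_Destructor destructor)
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = (ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p)

freed = []
# The destructor receives the capsule as a PyObject*; take it as a raw pointer.
DESTRUCTOR = ctypes.CFUNCTYPE(None, ctypes.c_void_p)
def destructor(capsule_ptr):
    freed.append(capsule_ptr)  # a real destructor would free() the payload here
dtor = DESTRUCTOR(destructor)  # keep a reference so the callback stays alive

payload = ctypes.create_string_buffer(56)  # stand-in for the leaked allocation
cap = PyCapsule_New(ctypes.cast(payload, ctypes.c_void_p), None,
                    ctypes.cast(dtor, ctypes.c_void_p))
del cap   # refcount hits zero -> CPython runs the destructor
assert len(freed) == 1
```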

## Getting the GC to actually run

The Capsule is owned by `AppStats`, which is owned by `PolicyLocal`, which is contained in `PolicyManager`, which is contained in `Agent`. That is very bad, because `Agent` cannot be GC'd: there is a refcounting cycle with `IoAdapter`, which does not participate in the cyclic GC. The Dispatch PR which tried to fix the GC for IoAdapter ran into weird shutdown crashes, because when Python actually ran the GC fully (which it was not doing before) it hit... issues.
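Whether an object participates in the cyclic GC at all can be checked with `gc.is_tracked`; a C extension type without `tp_traverse`/`tp_clear` is simply never tracked, so any cycle passing through it is invisible to the collector:

```python
import gc

class PurePython:
    pass

# Ordinary Python instances are tracked by the cyclic GC...
assert gc.is_tracked(PurePython())

# ...while objects that cannot be part of a collectable cycle are not.
assert not gc.is_tracked(1234567)
```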

## In conclusion

I considered writing a TODO list here, but I guess it would be best to just discuss this first... Honestly, I am not sure why I am not seeing the horrible problems that plagued the Dispatch PR. And I still feel uneasy about it.

One thing I had to do in Dispatch that does not seem necessary here is this (in python_embedded.c::qd_io_rx_handler)

```diff
diff --git a/src/python_embedded.c b/src/python_embedded.c
--- a/src/python_embedded.c	(revision 922ca0ad7d43f2d838c4bbb282d69e14b8b457c3)
+++ b/src/python_embedded.c	(revision 714e506d601054eb839e5968caf01c81afa1c3f0)
@@ -639,6 +639,11 @@
     IoAdapter *self = (IoAdapter*) context;
     *error = 0;
 
+    if (self->handler == NULL) {
+        *error = qdr_error(QD_AMQP_COND_INTERNAL_ERROR, "Router is shutting down");
+        return PN_REJECTED;
+    }
+
```