Support for free-threaded Python #695
Conversation
Dear @colesbury, following your work on the pybind11 free-threaded support, I was wondering if I could ask two quick questions about getting something similar into nanobind.
Thanks!
A few cents on the last question:
@wjakob maybe you have already seen this page; otherwise, please check it: https://py-free-threading.github.io/running-gil-disabled/ Here is a PR for numpy that enables free-threading tests: https://github.com/numpy/numpy/pull/26463/files HTH
Hi @wjakob,
One more thing: https://docs.python.org/3.13/howto/free-threading-extensions.html#memory-allocation-apis

One annoying thing is that I think the other usages can use
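For readers following the link, a rough paraphrase of that guidance in code form (this sketch is mine, not from the thread): buffers that do not hold Python objects should come from the `PyMem_*` allocators rather than `PyObject_Malloc()` in free-threaded builds.

```cpp
#include <Python.h>
#include <cstring>

void fill_scratch(size_t n) {
    // Plain buffers (no PyObject header) should use the PyMem_* family in
    // free-threaded builds; PyObject_Malloc() is reserved for Python objects.
    void *buf = PyMem_Malloc(n);
    if (!buf)
        return;
    memset(buf, 0, n);   // ... use the buffer ...
    PyMem_Free(buf);
}
```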
Thank you for the clarification @colesbury.
@colesbury: I have two more questions about functions that weren't mentioned in PEP 703.
(These pairs are commonly called in existing pybind11/nanobind extensions to mark boundaries between parallel C++ code and previously GIL-protected Python regions.)

And an unrelated question about PEP 703: I'm asking about critical sections because I would like to provide a built-in nanobind annotation to protect C++ instances from concurrent access (for unsafe/legacy code). Basically, I'm wondering about the right primitive to build on, and whether

Thanks!
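For context, the boundary-marking pattern referred to above looks roughly like this in nanobind (an illustrative sketch; `heavy_compute` is a made-up placeholder):

```cpp
#include <nanobind/nanobind.h>
namespace nb = nanobind;

// Hypothetical compute kernel that never touches Python state.
static double heavy_compute() {
    double s = 0;
    for (int i = 0; i < (1 << 20); ++i)
        s += i;
    return s;
}

double run() {
    nb::gil_scoped_release release;  // leave the (formerly) GIL-protected region
    return heavy_compute();          // parallel C++ code runs here
}                                    // scope exit re-enters the Python region
```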
Hi @wjakob -
I'll reply about
The

```cpp
Py_BEGIN_CRITICAL_SECTION(obj);
obj->x++;             // first atomically executed
obj->y--;             //   critical section
Py_DECREF(other);     /* may call arbitrary code via finalizer */
obj->z *= 2;          // second atomically executed critical section
Py_END_CRITICAL_SECTION();
```

In the above example:
The implicit breakup into multiple smaller atomically executed critical sections occurs at the same places where the GIL would have been released --
If you are dealing mostly with, for example, the C++ STL, then plain mutexes are probably appropriate. If you're dealing mostly with the Python C API, then
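A rough sketch contrasting the two options (the `Registry` type and the attribute mutation are made-up examples; the critical-section macros are from the CPython 3.13+ headers):

```cpp
#include <Python.h>   // Py_BEGIN_CRITICAL_SECTION requires the 3.13+ headers
#include <mutex>
#include <vector>

// Made-up example state: pure C++ data guarded by a plain mutex.
struct Registry {
    std::mutex m;
    std::vector<int> values;
};

void add_value(Registry &r, int v) {
    std::lock_guard<std::mutex> guard(r.m);
    r.values.push_back(v);           // no Python C API calls while holding the lock
}

// Python object state guarded by a critical section: the section is suspended
// wherever the GIL would historically have been released, so finalizers and
// other callbacks cannot deadlock against it.
void mutate(PyObject *obj, PyObject *value) {
    Py_BEGIN_CRITICAL_SECTION(obj);
    int rc = PyObject_SetAttrString(obj, "field", value);   // example C API mutation
    (void) rc;
    Py_END_CRITICAL_SECTION();
}
```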
Thank you for the clarification @colesbury. That makes a lot of sense. So the new critical sections are really how things have always been, with possible code execution through

I think I'm done with all of the tweaks to nanobind and now wanted to test that it actually works as expected. This led me to a surprising result with v3.13rc1. I wrote a test that launches 8 threads that each construct 1M objects and destroy them afterwards. You can run it as follows with this PR:

```
$ python3.13 -m pytest tests/test_thread.py
```

The file contains just a single test:

```python
def test01_object_creation():
    from test_thread_ext import Counter
    # parallelize() and gc are provided by the surrounding test file

    def f():
        n = 1000000
        r = [None]*n
        for i in range(n):
            r[i] = Counter()
        del r

    parallelize(f, n_thread=8)
    gc.collect()
```

I would expect performance to stay mostly constant until the

Interrupting the interpreter via Ctrl-C indicates that this bottleneck seems to be related to the construction of the

```
* frame #0: 0x0000000198dfff88 libsystem_kernel.dylib` swtch_pri + 8
  frame #1: 0x0000000198e3ce34 libsystem_pthread.dylib` cthread_yield + 32
  frame #2: 0x000000010026aaa0 python3.13` _PyMutex_LockTimed [inlined] _Py_yield + 208 at lock.c:46
  frame #3: 0x000000010026aa9c python3.13` _PyMutex_LockTimed(m=0x00000001005f8a88, timeout=-1, flags=_PY_LOCK_DETACH) + 204 at lock.c:88
  frame #4: 0x000000010026b4c0 python3.13` PyMutex_Lock(m=<unavailable>) + 8 at lock.c:582
  frame #5: 0x000000010022e980 python3.13` _PyCriticalSection_BeginSlow [inlined] _PyMutex_Lock(m=0x00000001005f8a88) + 80 at lock.h:49
  frame #6: 0x000000010022e96c python3.13` _PyCriticalSection_BeginSlow(c=0x0000000173e2aac8, m=0x00000001005f8a88) + 60 at critical_section.c:20
  frame #7: 0x000000010013fc5c python3.13` _PyType_LookupRef [inlined] _PyCriticalSection_BeginMutex(c=0x0000000173e2aac8, m=<unavailable>) + 524 at pycore_critical_section.h:119
  frame #8: 0x000000010013fc2c python3.13` _PyType_LookupRef(type=0x0000020000d70010, name=0x00000001005b96f8) + 476 at typeobject.c:5256
  frame #9: 0x0000000100152370 python3.13` lookup_maybe_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 32 at typeobject.c:2521
  frame #10: 0x000000010014d850 python3.13` slot_tp_init [inlined] lookup_method(self='0x2001f611b40', attr=<unavailable>, unbound=0x0000000173e2ab7c) + 28 at typeobject.c:2543
  frame #11: 0x000000010014d834 python3.13` slot_tp_init(self='0x2001f611b40', args=(), kwds='0x00000000') + 56 at typeobject.c:9773
  frame #12: 0x00000001001433b0 python3.13` type_call(self='0x20000d70010', args=(), kwds='0x00000000') + 464 at typeobject.c:1990
  frame #13: 0x0000000100096bcc python3.13` _PyObject_MakeTpCall(tstate=0x0000000101019210, callable='0x20000d70010', args='0x1009fc270', nargs=4305437280, keywords='0x00000000') + 356 at call.c:242
  frame #14: 0x0000000100096838 python3.13` _PyObject_VectorcallTstate(tstate=<unavailable>, callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 228 at pycore_call.h:166
  frame #15: 0x0000000100097704 python3.13` PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) + 48 at call.c:327
  frame #16: 0x00000001001f9b90 python3.13` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, throwflag=0) + 13112 at generated_cases.c.h:813
  frame #17: 0x000000010009b1d4 python3.13` _PyObject_VectorcallTstate(tstate=0x0000000101019210, callable='0x20000756790', args='0x173e2aed8', nargsf=1, kwnames='0x00000000') + 116 at pycore_call.h:168
```

Looking at the CPython code, it seems to me that there is a caching scheme in place that should make it possible to avoid the critical section. However, for some reason, this isn't happening. Do you have any ideas what could be wrong? Thank you!
I think there is a combination of two performance issues:

First, instances of

The second problem is in the implementation of the MRO type cache in CPython. Basically, the fast-path lookup fails, so we fall back to the locking slow path, but we are also not enabling future lookups in the fast path to succeed. This case only happens with entries that don't use deferred reference counting (like
Thank you for looking into this @colesbury! For context: nanobind uses a custom function object to implement an efficient function dispatch mechanism (i.e., faster and more flexible than the built-in C function wrappers). It is central to the design of the library, and this design choice preventing multi-threaded speedups in free-threaded builds would be bad news. Could you let me know if you see any way to force
I tried doing the hack myself. Based on the CPython internals, I added the function

```cpp
void enable_deferred(PyObject *op) {
    assert(PyType_IS_GC(Py_TYPE(op)) && _Py_IsOwnedByCurrentThread(op) &&
           op->ob_ref_shared == 0);
    op->ob_gc_bits |= 64;
    op->ob_ref_local += 1;
    op->ob_ref_shared = _Py_REF_QUEUED;
}
```

(This function is called on a newly created

Calling this on the function objects seems to fix the issue with critical sections. However, I'm still not seeing multithreaded speedups. Examining the call stack shows that most of the time is spent dealing with shared refcount atomics.

Does this mean that nanobind needs to bite the bullet and make immortal types and function objects? Is it correct to assume that this will break nanobind's leak tracking? For context: one feature of the library is that it complains loudly if any of its objects (classes, functions, instances) are still alive after Python has fully shut down and invokes the
I tried the main branch (v3.14.0) as well, immortalizing the functions, and still observed significant performance regressions. Then I thought I'd try this in plain Python, and it turns out that the behavior there is similar. So perhaps it's better to understand the situation there. Here is a reproducer:

```python
import threading

def parallelize(func, n_threads):
    barrier = threading.Barrier(n_threads)

    def wrapper():
        barrier.wait()
        return func()

    workers = []
    for _ in range(n_threads):
        t = threading.Thread(target=wrapper)
        t.start()
        workers.append(t)
    for worker in workers:
        worker.join()

def f():
    def q():
        return 1
    for i in range(1000000):
        q()

parallelize(f, n_threads=8)
```

Performance on an M1 Max laptop (macOS) on AC power:
When I move the
Especially in the former case with a callable declared per thread, I would have expected the runtime cost to be relatively flat until 8 threads (matching the number of performance cores in this machine).

Backtrace: this is where time seems to be spent (for the former case with a local function):

```
* thread #2
  * frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940) + 8 at pyatomic_gcc.h:130
    127
    128  static inline int
    129  _Py_atomic_compare_exchange_ssize(Py_ssize_t *obj, Py_ssize_t *expected, Py_ssize_t desired)
 -> 130  { return __atomic_compare_exchange_n(obj, expected, desired, 0,
    131                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); }
    132
    133  static inline int
warning: python3.14t was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2
  * frame #0: 0x00000001000c4e78 python3.14t` _Py_DecRefShared [inlined] _Py_atomic_compare_exchange_ssize(obj=0x000003c380398970, expected=<unavailable>, desired=4611686018427387940) + 8 at pyatomic_gcc.h:130
    frame #1: 0x00000001000c4e70 python3.14t` _Py_DecRefShared [inlined] _Py_DecRefSharedDebug(o=0x000003c380398960, filename=0x0000000000000000, lineno=0) + 28 at object.c:353
    frame #2: 0x00000001000c4e54 python3.14t` _Py_DecRefShared(o=0x000003c380398960) + 20 at object.c:371
    frame #3: 0x00000001001afed0 python3.14t` _PyEval_FrameClearAndPop + 4 at ceval.c:1684
    frame #4: 0x00000001001afecc python3.14t` _PyEval_FrameClearAndPop(tstate=0x000000010081f200, frame=0x00000001006e0290) + 232 at ceval.c:1710
    frame #5: 0x00000001001ab9d4 python3.14t` _PyEval_EvalFrameDefault(tstate=<unavailable>, frame=0x00000001006e0218, throwflag=<unavailable>) + 58020 at generated_cases.c.h:6499
    frame #6: 0x000000010005e71c python3.14t` method_vectorcall [inlined] _PyObject_VectorcallTstate(tstate=0x000000010081f200, callable=0x000003c3805363c0, args=0x0000000171e12ee8, nargsf=1, kwnames=0x0000000000000000) + 44 at pycore_call.h:167
    frame #7: 0x000000010005e6f0 python3.14t` method_vectorcall(method=<unavailable>, args=0x0000000100492b90, nargsf=<unavailable>, kwnames=0x0000000000000000) + 284 at classobject.c:70
    frame #8: 0x000000010029a284 python3.14t` thread_run(boot_raw=0x0000600000c00e40) + 128
    frame #9: 0x000000010022f2bc python3.14t` pythread_wrapper(arg=<unavailable>) + 28
    frame #10: 0x0000000198e3ffa8 libsystem_pthread.dylib` _pthread_start + 148
```
Yes, deferred reference counting isn't fully implemented in 3.13 and is more of a placeholder, so functions and types need to be immortalized.
I'll take a look at this today or tomorrow.
Hi @wjakob - here's a patch that demonstrates how to make functions and type objects immortal: https://gist.github.com/colesbury/99e15d904d84d239231f465a5aca6c36

If you are doing multi-threaded benchmarking, you won't want to use
The other change was to move the |
Thank you very much @colesbury, it fixes the performance regression. I am now seeing the following performance on the v3.14 (main) branch:
The cell performance caveat you mention sounds tricky for users to figure out; that would indeed be good to fix.
Thank you for all your help so far @colesbury. I was wondering about one more thing: in your pybind11 implementation, you sharded the (very frequently accessed) instance map based on the top 44 bits of the pointer address. The rationale was that this region is related to the thread ID, and that way we can limit lock contention if every thread mostly works with its own objects. This sounded good to me, and I copied your approach in nanobind.

One potential issue is that nanobind extensions talk to each other through shared data structures, which means that such a sharding scheme effectively becomes part of the "ABI" (i.e., difficult to modify after the fact without breaking the ABI contract). The same is true of pybind11. Hence I want to be sure that this is a good long-term solution, and I was wondering where this scheme came from and what can be said about it. A few questions:
Thank you!
I'm not sure that using the top 44 bits does much to help with lock contention compared to using all the bits. We are still applying a hash function (

To the extent it helps, it's because a given thread is likely to re-use the same shard for lookups (fewer cache misses). The improvement here is small compared to reducing lock contention by increasing the number of shards. There's a good deal of flexibility in terms of how many low bits to discard -- I don't think it's particularly sensitive to the choice or to the allocator. The important things are just that:
The first condition is more important. You definitely don't want to have multiple threads forced to use the same shard. But we have a good deal of slack here: glibc's arenas are (mostly) per-thread and 64 MiB by default, and mimalloc's segments are per-thread and 32 MiB.

If you are concerned about the choice of shift, you can put it in the internals struct (and use that value instead of hardcoding the shift). I think that would give more flexibility about changing it without affecting the ABI. In other words, the creator of the

I think more likely paths for improvement are:
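A rough sketch of the sharding scheme being discussed (all names and constants below are placeholders; nanobind's real layout lives in `src/nb_internals.h`):

```cpp
#include <Python.h>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Shard {
    std::mutex mutex;
    std::unordered_map<void *, PyObject *> map;   // C++ instance -> Python object
};

static constexpr size_t kShardCount = 64;          // ideally >= core count, power of two
static Shard shards[kShardCount];

inline Shard &shard_for(const void *ptr) {
    // Discard the low ~20 bits (roughly the allocator's per-thread arena
    // granularity) so that objects allocated by the same thread tend to land
    // in the same shard, then mix the remaining bits so distinct threads
    // spread across shards.
    uint64_t key = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(ptr)) >> 20;
    key ^= key >> 33;
    key *= 0xff51afd7ed558ccdULL;                  // fmix64-style multiply
    key ^= key >> 33;
    return shards[key & (kShardCount - 1)];
}
```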
Hi @colesbury, thank you for the detailed explanation and for clarifying my misunderstanding. The PR is now almost done in the sense that all the internal data structures are protected for free-threading. I wanted to ask about your thoughts regarding one more thing: binding tools like nanobind and pybind11 are often used to bind existing code. However, that code might not be thread-safe and is therefore unsafe to use in free-threaded Python. It's worse than pure Python code because of the possibility of memory-unsafe operations that crash the process.

To ameliorate such issues, I would like to add a way for nanobind to lock the Python objects involved in a function call (without having to modify the function itself). Example: with this feature, the user could change a binding of a function with 3 named arguments
into one guaranteeing that the
In general, this sounds like something that

At first, I thought this is simple: I'll just sort all of the objects by pointer address and then run

Do you think what I am trying to do is possible in some forward-compatible way? Thanks!
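To make the proposal concrete, a sketch of the kind of binding-level annotation described above, using the `nb::arg().lock()` spelling that appears later in this PR (the `Image` type and `blend` function are made up):

```cpp
#include <nanobind/nanobind.h>
namespace nb = nanobind;

// Made-up legacy type and function; only "target" is mutated, and the
// function itself is not thread-safe.
struct Image { /* ... */ };
void blend(Image &target, const Image &a, const Image &b) { /* not thread-safe */ }

NB_MODULE(my_ext, m) {
    nb::class_<Image>(m, "Image")
        .def(nb::init<>());

    // The .lock() annotation asks nanobind to hold a lock / critical section
    // on the Python object bound to "target" for the duration of the call.
    m.def("blend", &blend,
          nb::arg("target").lock(), nb::arg("a"), nb::arg("b"));
}
```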
Hi @wjakob - I don't think there's a way to do what you are proposing. You should avoid using

If possible, I'd recommend limiting the implementation to locking one or two arguments so that you can use the critical section APIs. We do something similar in CPython with Argument Clinic: https://github.com/python/cpython/blob/782217f28f0d67916fc3ff82b03b88573686c0e7/Objects/setobject.c#L1998. In our cases:
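A sketch of the suggested one-or-two-argument approach using CPython 3.13's two-object critical section (illustrative only; the `transfer` function is made up):

```cpp
#include <Python.h>

// Lock two Python objects at once, mirroring the Argument Clinic pattern
// linked above. Py_BEGIN_CRITICAL_SECTION2 is part of the 3.13+ headers and
// orders the two locks internally to avoid deadlock.
static int transfer(PyObject *src, PyObject *dst) {
    int rc = 0;
    Py_BEGIN_CRITICAL_SECTION2(src, dst);
    /* ... move items from src to dst via the C API ... */
    Py_END_CRITICAL_SECTION2();
    return rc;
}
```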
Thank you for the helpful feedback everyone! Two more updates:
This feature has been in the works for a while, and I will merge it for now. If there are any previously undiscovered problems, please let me know here or open a separate issue. My plan is to wait a bit more for things to stabilize and then push out a new release.

In principle, this PR should not affect existing (non-free-threaded) extensions. However, it is a whopping 2K-line diff with a rewrite of the internal data structures. So I am again pinging users who have chimed in on GitHub in the past: @WKarel, @ManifoldFR, @hpkfft, @yosh-matsuda, @garth-wells, @luigifcruz, @awni, @qnzhou, @skallweitNV, @oremanj, @wojdyr, @jmalkin, @Chronum94. It would be great if you could compile your projects against the master branch and let me know if anything breaks between ee23846 (revision prior to merging this PR) and the latest
This commit adds the ``FREE_THREADED`` parameter to the CMake ``nanobind_add_module()`` command. It does not do anything for now besides defining ``NB_FREE_THREADED`` in C++ compilation units.
Nanobind's internal data structures switch to a different layout when free-threading is requested. We must give such extensions a different ABI tag since they are incompatible with non-free-threaded ones.
This commit implements RAII scope guards that are thin wrappers around existing Python APIs to lock Python mutexes and enter critical sections with respect to one or two Python objects. In ordinary (non-free-threaded) Python builds, they don't do anything and can be optimized away.
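Usage might look roughly as follows; the guard names below (`ft_mutex`, `ft_lock_guard`, `ft_object_guard`) are my assumption about this PR's API rather than a quotation from it:

```cpp
#include <nanobind/nanobind.h>
namespace nb = nanobind;

static nb::ft_mutex cache_mutex;      // wraps a Python mutex; no-op on GIL builds
static int cache_hits = 0;            // made-up shared extension state

void bump_cache_hits() {
    nb::ft_lock_guard guard(cache_mutex);   // RAII lock, released at scope exit
    ++cache_hits;
}

void poke(nb::handle obj) {
    nb::ft_object_guard guard(obj);   // critical section w.r.t. one Python object
    /* ... access or mutate obj's internals here ... */
}
```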
The garbage collector in free-threaded Python does not like it when the reference count of an object temporarily increases while traversing the object graph in ``tp_traverse``, and doing so introduces leaks. Unfortunately, the example implementations of ``tp_traverse`` in both the documentation and the test suite fall into this trap and must be adapted.
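The corrected pattern looks roughly like this (a sketch; `Wrapper` is a made-up type holding an `nb::object`):

```cpp
#include <nanobind/nanobind.h>
namespace nb = nanobind;

struct Wrapper { nb::object value; };

int wrapper_tp_traverse(PyObject *self, visitproc visit, void *arg) {
    // Access the C++ instance without creating temporary strong references;
    // bumping refcounts here confuses the free-threaded GC.
    Wrapper *w = nb::inst_ptr<Wrapper>(self);
    Py_VISIT(w->value.ptr());   // visit the borrowed reference directly
    return 0;
}
```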
This commit enables free-threaded extension builds on Python 3.13+, which involves the following changes:

- nanobind must notify Python that an extension supports free-threading.
- All internal data structures must be protected from concurrent modification. The approach taken varies with the specific data structure, and a long comment in ``nb_internals.h`` explains the design decisions behind all of the changes. In general, the implementation avoids centralized locks as much as possible to improve scalability.
- Adopting safe versions of certain operations where needed, e.g. ``PyList_GetItemRef()``.
- Switching non-object allocation from ``PyObject_Malloc()`` to ``PyMem_Malloc()``.
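As an illustration of one such safe operation (a sketch, not code from this PR): ``PyList_GetItemRef()``, new in CPython 3.13, returns a strong reference, so the item cannot be freed by another thread between the lookup and the use.

```cpp
#include <Python.h>

static long sum_list(PyObject *list) {
    long total = 0;
    Py_ssize_t n = PyList_Size(list);
    for (Py_ssize_t i = 0; i < n; ++i) {
        PyObject *item = PyList_GetItemRef(list, i);   // strong reference (or NULL on error)
        if (!item)
            return -1;
        total += PyLong_AsLong(item);
        Py_DECREF(item);
    }
    return total;
}
```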
Global objects that undergo a high rate of reference count changes can become a bottleneck in free-threaded Python extensions, since the associated atomic operations require coordination between processor cores. Function and type objects are a particular concern. This commit immortalizes such objects, which exempts them from reference counting. The downside of this is that they will leak.
Adapting C++ code to handle parallelism due to free-threaded Python can be tricky, especially when the original code must be used as-is. This commit adds a tentative API to retrofit locking onto existing code by locking the arguments of function calls.
This commit documents free-threading in general and in the context of nanobind extensions.
Several parallel tests to check that locking works as expected
This commit changes the ``nb::dict`` iterator so that nanobind can implement the recommendation from https://docs.python.org/3.14/howto/free-threading-extensions.html#pydict-next

The primary goal of ``nb::internal::dict_iterator`` was to be able to write

```cpp
nb::dict my_dict = /* ... */;
for (auto [k, v] : my_dict) {
    // ....
}
```

This is in fact the only associated feature that is explicitly mentioned in the documentation, and it continues to work. However, some undocumented features are lost:

- The dictionary iterator is no longer copyable. This is because it must acquire an exclusive lock to the underlying dictionary.
- The pre-increment operator ``++dict_it`` (which relied on copying) is gone. Post-increment continues to work, and that is enough for the loop structure mentioned above.
This commit refactors the argument locking so that it occurs at compile time without imposing runtime overheads. The change applies to free-threaded extensions. Behavior differences compared to the prior approach:

- It is no longer possible to do ``nb::arg().lock(false)`` or ``.lock(runtime_determined_value)``.
- We no longer prohibit locking ``self`` in ``__init__``; changing this would also require restoring ``cast_flags::lock``, and it's not clear that the benefit outweighs the complexity.
MLX tests are clear on current master. Thanks for the heads up.
My FFT extension is all good with the current master. Thanks!
```
@@ -819,24 +819,29 @@ struct fast_iterator {

 class dict_iterator {
 public:
     NB_NONCOPYABLE(dict_iterator)
```
@wjakob can you please suggest the best way to fix third-party code that copies dict_iterator and now fails to compile due to this change?
Here is the code:
https://github.com/openxla/xla/blob/182111eb601b6491b37625df51f488905d217ae2/xla/python/weakref_lru_cache.cc#L58-L60
Thanks!
This looks complicated (weak references, LRU caches, dictionary iterators, ...). Could I ask you to explain in a bit more detail what the use case is, and why dict iterators need to be copied and hashed?
Maybe that code can be rewritten without needing a dictionary iterator. Or perhaps it can call .items() and then run a normal iterator over that.
The change made in the commit is difficult to roll back or avoid, since it was needed to make PyDict_Next safe in free-threaded builds.
@wjakob thanks! We'll rewrite the openxla code to take this change into account.
> The change made in the commit is difficult to roll back or avoid, since it was needed to make PyDict_Next safe in free-threaded builds.
Yes, let's keep this change as it is for the free-threading.
Thanks! Worked like a charm.
Sorry for the delay in responding. Our unit tests didn't notice the change, so everything is good for us.
@colesbury I've just released v2.2.0 on PyPI with the free-threading support included. Thank you for all of your help with this PR.
This PR adapts nanobind to run on free-threaded Python builds. The changes involve:

- A CMake parameter `nanobind_add_module(... FREE_THREADED ...)` to opt into free-threaded behavior. It's a no-op in GIL-protected Python.
- Adapted internal data structures to protect them from concurrent modification:
  - Protected access to the high-bandwidth C++ → Python hash table (`internals.inst_c2p`) and the keep-alive lists data structure (`internals.keep_alive`). This uses sharding based on pybind11 PR #5148 by @colesbury to reduce lock contention.
  - `internals.type_c2p_fast` turns into a fast thread-local cache around a slower, globally locked `internals.type_c2p_slow`. (These are hash tables that map from C++ to Python types, and they are accessed relatively often.)
  - Ordinary locking for the remaining fields that aren't on the critical path.

  The design avoids centralized locks as much as possible to improve scalability (for example, pybind11 locks the `internals` data structure several times per dispatched call, which is costly and seems somewhat counter to the design of the free-threaded Python implementation with its focus on local locks). A documentation string added to `src/nb_internals.h` provides a detailed explanation of the locking semantics of every individual field.
- Threading awareness for list accessors and sequence type casters.
- RAII wrappers for Python mutexes and critical sections.
- Documentation explaining free-threading in general, and in the context of nanobind.
- Parallel tests that exercise various critical sections.

The plan is to merge this into the upcoming nanobind release v2.2.0. Three related points: