fix: support free-threaded CPython with GIL disabled #5148
Some additional locking is added in the free-threaded build when `Py_GIL_DISABLED` is defined:
- Most accesses to internals are protected by a single mutex
- The registered_instances uses a striped lock to improve concurrency

Pybind11 modules can indicate support for running with the GIL disabled by calling `set_gil_not_used()`.
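The striped lock mentioned in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not pybind11's implementation: the `striped_map` class, its `insert`/`find` methods, the `int` payload, and the use of `std::hash` for shard selection are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical striped-lock map: each shard owns its own mutex, so threads
// touching different shards do not contend with each other.
class striped_map {
public:
    // num_shards must be a power of two so that `& mask_` selects a shard.
    explicit striped_map(std::size_t num_shards)
        : shards_(num_shards), mask_(num_shards - 1) {}

    void insert(const void *key, int value) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex); // locks one shard only
        s.map[key] = value;
    }

    bool find(const void *key, int &out) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex);
        auto it = s.map.find(key);
        if (it == s.map.end()) {
            return false;
        }
        out = it->second;
        return true;
    }

private:
    struct shard {
        std::mutex mutex;
        std::unordered_map<const void *, int> map;
    };

    shard &shard_for(const void *key) {
        auto addr = reinterpret_cast<std::uintptr_t>(key);
        // Placeholder shard selection (assumption): any hash of the address works.
        return shards_[std::hash<std::uintptr_t>{}(addr) & mask_];
    }

    std::vector<shard> shards_;
    std::size_t mask_;
};
```

A single global mutex would also be correct; the per-shard mutexes only reduce contention when many threads register or look up instances at once.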
a165d28 to e5470eb
Signed-off-by: Henry Schreiner <[email protected]>
include/pybind11/detail/internals.h
Outdated
@@ -542,6 +577,14 @@ PYBIND11_NOINLINE internals &get_internals() {
internals_ptr->static_property_type = make_static_property_type();
internals_ptr->default_metaclass = make_default_metaclass();
internals_ptr->instance_base = make_object_base_type(internals_ptr->default_metaclass);
#ifdef Py_GIL_DISABLED
size_t num_shards = (size_t) next_pow2(2 * std::thread::hardware_concurrency());
Maybe `auto num_shards = static_cast<size_t>(...);`?
(The main motivation is to minimize C-style casts. Similarly in a few other places changed in this PR.)
Could it be worth adding a comment to explain why `2 * ...`?
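A sketch of the shard-count computation with the suggested `static_cast` might look like this. The helper names `next_pow2_sketch` and `pick_num_shards` are hypothetical, and the zero check is an added assumption (`std::thread::hardware_concurrency()` is allowed to return 0). The thread does not state why the factor of 2 is used; a plausible reading, and only an assumption, is that extra shards lower the chance of two threads landing on the same one.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>

// Round x up to the nearest power of two. Returns 0 for x == 0.
inline std::uint64_t next_pow2_sketch(std::uint64_t x) {
    x--;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    x |= x >> 32;
    x++;
    return x;
}

// Hypothetical helper: choose a power-of-two shard count from the core count.
inline std::size_t pick_num_shards() {
    auto num_shards = static_cast<std::size_t>(
        next_pow2_sketch(2 * std::uint64_t{std::thread::hardware_concurrency()}));
    if (num_shards == 0) {
        num_shards = 1; // hardware_concurrency() may report 0 (unknown)
    }
    return num_shards;
}
```

Keeping the count a power of two lets the shard be selected with a bit mask (`hash & (num_shards - 1)`) instead of a modulo.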
auto &internals = get_internals();

#ifdef Py_GIL_DISABLED
// Hash address to compute shard, but ignore low bits. We'd like allocations
Suggestion to move this comment to the `splitmix64` function.
This comment is specific to this section of code: it's mostly about why we are ignoring the low bits (`addr >> 20`) when hashing the address.
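To make the effect of `addr >> 20` concrete, here is a small sketch of the shard selection. The function names `splitmix64_sketch` and `shard_index` and the `shard_mask` parameter are illustrative assumptions; the mixing constants are those of the widely published SplitMix64 finalizer, not copied from this PR.

```cpp
#include <cstddef>
#include <cstdint>

// SplitMix64 finalizer: scrambles the input so nearby values spread out.
inline std::uint64_t splitmix64_sketch(std::uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Hypothetical shard selection. Dropping the low 20 bits means every address
// inside the same 1 MiB-aligned region hashes to the same shard.
// shard_mask is assumed to be (number of shards - 1), with a power-of-two count.
inline std::size_t shard_index(const void *ptr, std::size_t shard_mask) {
    auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    auto hash = splitmix64_sketch(static_cast<std::uint64_t>(addr >> 20));
    return static_cast<std::size_t>(hash & shard_mask);
}
```

With this scheme, objects allocated close together (typically by the same thread) share a shard, while allocations from a different region are likely, though not guaranteed, to map elsewhere.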
include/pybind11/detail/internals.h
Outdated
// other threads/cores to map to other shards. Using the high bits is a good
// heuristic because memory allocators often have a per-thread
// arena/superblock/segment from which smaller allocations are served.
auto addr = reinterpret_cast<uintptr_t>(ptr);
`static_cast` seems to work here (with Linux gcc at least). (I see we're already using `reinterpret_cast` in a bunch of similar situations, but maybe that isn't ideal?)
Would using `std::uintptr_t` and `std::uint64_t` be slightly better?
`static_cast` here causes a compiler error for me: https://gcc.godbolt.org/z/bTnosGf61
I've updated the other casts and used the `std` namespace.
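The distinction between the two casts can be illustrated with a short sketch. The function name `address_bits` is hypothetical; the point is only that a pointer-to-integer conversion needs `reinterpret_cast`, while the subsequent integer-to-integer conversion is fine with `static_cast`.

```cpp
#include <cstdint>

// Hypothetical helper showing which cast applies at each step.
inline std::uint64_t address_bits(const void *ptr) {
    // Pointer -> integer: static_cast<std::uintptr_t>(ptr) is ill-formed,
    // so reinterpret_cast is required here.
    auto addr = reinterpret_cast<std::uintptr_t>(ptr);
    // Integer -> integer: an ordinary conversion, so static_cast suffices.
    return static_cast<std::uint64_t>(addr);
}
```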
include/pybind11/pybind11.h
Outdated
@@ -1199,6 +1206,8 @@ struct handle_type_name<cpp_function> {

PYBIND11_NAMESPACE_END(detail)

struct gil_not_used {};
What do you think about the alternative API here: rwgk@a1c0f8e?
I like how much simpler it is. I'm not sure about the `mod`; if we put anything, it should have the full `module`, we don't use `mod` elsewhere for Python, and it isn't clearly module. But it can't be misplaced, and we don't "namespace" anything else like it.
I'm not sure it would actually be any simpler if we added a second option later (like some sort of multiphase init option). But maybe it's better to simplify now rather than worry too much about a future (which would not be more complex by this style either, at least).
`py::free_threaded()` seems like it would match the naming of CPython better, though most of the CPython internals related to the feature seem to refer to the GIL.
> I like how much simpler it is.

It's simpler only here, but my suggested alternative is simpler overall, because it avoids the overloads and the `private` implementation of the `create_extension_module` function. For the user it doesn't make a difference.
My suggestion includes the `mod_` so that it comes out very similar to the existing name: `Py_MOD_GIL_NOT_USED` vs. `py::mod_gil_not_used()`.
Another thought was that it's very easy to experiment, e.g. if
PYBIND11_MODULE(external_module, m, py::mod_gil_not_used())
has some issues that are difficult to resolve in the short term, it can be changed to
PYBIND11_MODULE(external_module, m, py::mod_gil_not_used(false))
and it'll be very obvious that free-threading support was attempted, and how to start trying again. It'll also be very easy to make the free-threading support platform-specific this way.
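The shape of this alternative, a tag object carrying a boolean instead of an empty tag type plus overloads, can be mocked up without pybind11. Everything in the `sketch` namespace below is hypothetical and exists only to show the design; it is not the code from rwgk@a1c0f8e.

```cpp
// Standalone mock of the "tag with a flag" design; does not use pybind11.
namespace sketch {

class mod_gil_not_used {
public:
    // Defaults to true so that `mod_gil_not_used()` opts in, while
    // `mod_gil_not_used(false)` documents an explicit opt-out.
    explicit mod_gil_not_used(bool flag = true) : flag_(flag) {}
    bool flag() const { return flag_; }

private:
    bool flag_;
};

// Hypothetical record of what a module-creation helper would configure.
struct module_info {
    const char *name;
    bool gil_not_used;
};

// One function with a defaulted tag argument: no overload set is needed to
// support both "tag given" and "tag omitted".
inline module_info create_module(const char *name,
                                 mod_gil_not_used tag = mod_gil_not_used(false)) {
    return module_info{name, tag.flag()};
}

} // namespace sketch
```

Because the flag is an ordinary `bool`, it could also be computed from a platform check, which is the platform-specific use mentioned in the comment.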
> but my suggested alternative is simpler overall, because it avoids the overloads and the private implementation

That's what I meant by simpler.
It's also called `PyUnstable_Module_SetGIL`, which does have the `module` in it. But I hadn't noticed how that mimics the slot value name. Personally still like the more readable, less cryptic names, but happy with whatever we go with.
And `py::free_threaded(false)` avoids the double negative. ;)
> That's what I meant by simpler.

Oh, got it. Thanks. (The way it appeared in the email message got me on the wrong track.)

> And `py::free_threaded(false)` avoids the double negative.

That works for me, too. Usually I strive to not introduce new names if there is one already, but the double negative bothered me, too.
Happy to have @colesbury have final say on what name he prefers.
I would lean towards matching the name from the Python C API because I think using the same names helps developers when they switch between the C API and pybind11.
This PR looks great to me, except the question about adopting the simplification under rwgk@a1c0f8e. @colesbury What's your opinion about that?
Thanks for the review @rwgk. I've added some comments and renamed a few functions.
@colesbury If you could let us know if you like @rwgk's variation (I don't mind applying it if you do), otherwise it looks good to me.
Changelog in the description is a little out of date, too.
Yes, please apply it
Thanks @colesbury for transporting pybind11 into the free-threading world :-)
Description
Accesses to the `internals` struct now go through a `with_internals([] { ... });` statement. In the free-threaded build, the internals mutex is locked for the duration of the access.

The `registered_instances` map has its own locking scheme because it may be frequently accessed by multiple threads. The map is sharded by address and a lock per shard is used to increase concurrency.

Other smaller changes:
- `gil_not_used()` tag to indicate that a module supports running with the GIL disabled. See `PyUnstable_Module_SetGIL`.
- `PyMem_MALLOC` for `tp_doc` in 3.13+
- `dict_getitemstringref` to wrap new `PyDict_GetItemStringRef` API

Suggested changelog entry:
* Support free-threaded CPython (3.13t)
* Add ``py::mod_gil_not_used()`` tag to indicate if a module supports running with the GIL disabled.
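The `with_internals` access pattern from the description can be sketched roughly as below. All names ending in `_sketch` and the `registered_count` field are hypothetical, and the sketch locks unconditionally for simplicity, whereas the description says the lock is only taken in the free-threaded build.

```cpp
#include <mutex>

// Hypothetical stand-in for the shared internals struct.
struct internals_sketch {
    std::mutex mutex;
    int registered_count = 0;
};

inline internals_sketch &get_internals_sketch() {
    static internals_sketch instance; // initialized once, thread-safe since C++11
    return instance;
}

// Runs `cb` with a reference to the shared state while holding its mutex,
// and forwards whatever `cb` returns (including void).
template <typename F>
inline auto with_internals_sketch(const F &cb)
    -> decltype(cb(get_internals_sketch())) {
    auto &internals = get_internals_sketch();
    std::lock_guard<std::mutex> guard(internals.mutex);
    return cb(internals);
}
```

Funnelling every access through one callback-taking function keeps the lock scope explicit: the mutex is held exactly for the duration of the lambda, so callers cannot forget to lock or unlock.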