tracing, core: fix deadlock in `register_callsite` #2020

hawkw · 2022-03-24T20:08:58Z

Motivation

Currently, there is a potential deadlock in register_callsite that can
occur if registering a callsite causes another callsite to be hit for
the first time. If this occurs, the thread will try to lock the callsite
registry while it is already holding the lock, resulting in a deadlock.

This is unfortunate, since it can occur if a subscriber's
register_callsite calls any function that contains a tracing event, if
that event has not previously been registered. This has occurred in e.g.
tokio-console due to using a tokio channel in the subscriber.

Solution

This commit fixes the deadlock by using the callsite::try_register
function added to tracing-core, which uses a thread-local to bail if
the current thread is already registering a callsite. In order to do
this, we replace the std::sync::Once in MacroCallsite with a
hand-written atomic once-like loop. This is because, while we only want
to register the callsite once, we may have to try to register it more
than one time before try_register returns true. Now, we set a flag
only if the callsite has been successfully registered.

Signed-off-by: Eliza Weisman [email protected]

Signed-off-by: Eliza Weisman <[email protected]>

## Motivation Currently, the `callsite::register` function will unconditionally attempt to lock the global callsite registry. This is unfortunate, because it can result in a deadlock if a subscriber's `register_callsite` method calls into code that itself contains `tracing` instrumentation with callsites that haven't yet been registered. ## Solution This commit adds a new `callsite::try_register` function that *attempts* to register a callsite, returning `true` if it was successfully registered. If the current thread is registering a callsite (or, on `no_std`, if a callsite is being registered at all), it will return `false`, indicating that the registration must be retried. Signed-off-by: Eliza Weisman <[email protected]>

## Motivation Currently, there is a potential deadlock in `register_callsite` that can occur if registering a callsite causes _another_ callsite to be hit for the first time. If this occurs, the thread will try to lock the callsite registry while it is already holding the lock, resulting in a deadlock. This is unfortunate, since it can occur if a subscriber's `register_callsite` calls any function that contains a tracing event, if that event has not previously been registered. This has occurred in e.g. `tokio-console` due to using a tokio channel in the subscriber. ## Solution This commit fixes the deadlock by using the `callsite::try_register` function added to `tracing-core`, which uses a thread-local to bail if the *current* thread is already registering a callsite. In order to do this, we replace the `std::sync::Once` in `MacroCallsite` with a hand-written atomic once-like loop. This is because, while we only want to register the callsite once, we may have to *try* to register it more than one time before `try_register` returns true. Now, we set a flag only if the callsite has been successfully registered. Signed-off-by: Eliza Weisman <[email protected]>

tokio-rs/tracing#2020

…1.x (#2083) ## Motivation Currently on `v0.1.x`, the global callsite registry is implemented as a `Mutex<Vec<&'static dyn Callsite>>`. This has a few downsides: * Every time a callsite is registered, the `Mutex` must be locked. This can cause a deadlock when a `register_callsite` implementation calls into code that _also_ registers a callsite. See #2020 for details. * Registering callsites is relatively slow and forces the program to synchronize on the callsite registry (and not on *individual* callsites). This means that if two threads are both registering totally different callsites, both threads must wait for the lock. Although this overhead is amortized over the entire rest of the program, it does have an impact in short-running applications where any given callsite may only be hit once or twice. * Memory allocation may occur while the registry is locked. This makes the use of `tracing` inside of memory allocators difficult or impossible. On the `master` branch (v0.2.x), PR #988 rewrote the callsite registry to use an intrusive atomic singly-linked list of `Callsite`s. This removed the need for locking the callsite registry, and made it possible to register callsites without ever allocating memory. Because the callsite registry on v0.2 will *never* allocate, this also makes it possible to compile `tracing-core` for `no_std` platforms without requiring `liballoc`. Unfortunately, however, we cannot use an identical implementation on v0.1.x, because using the intrusive linked-list registry for *all* callsites requires a breaking change to the `Callsite` trait (in order to add a way to get the callsite's linked-list node), which we cannot make on v0.1.x. ## Solution This branch adds a linked-list callsite registry for v0.1.x in a way that allows us to benefit from *some* of the advantages of this approach in a majority of cases. The trick is introducing a new `DefaultCallsite` type in `tracing-core` that implements the `Callsite` trait. This type can contain an intrusive list node, and *when a callsite is a `DefaultCallsite`*, we can register it without having to push it to the `Mutex<Vec<...>>`. The locked vec still _exists_, for `Callsite`s that are *not* `DefaultCallsite`s, so we can't remove the `liballoc` dependency, but in most cases, we can avoid the mutex and allocation. Naturally, `tracing` has been updated to use the `DefaultCallsite` type from `tracing-core`, so the `Vec` will only be used in the following cases: * User code has a custom implementation of the `Callsite` trait, which is [not terribly common][1]. * An older version of the `tracing` crate is in use. This fixes the deadlock described in #2020 when `DefaultCallsite`s are used. Additionally, it should reduce the performance impact and memory usage of the callsite registry. Furthermore, I've changed the subscriber registry so that a `RwLock<Vec<dyn Dispatch>>` is only used when there actually are multiple subscribers in use. When there's only a global default subscriber, we can once again avoid locking for the registry of subscribers. When the `std` feature is disabled, thread-local scoped dispatchers are unavailable, so the single global subscriber will always be used on `no_std`. We can make additional changes, such as the ones from #2020, to _also_ resolve potential deadlocks when non-default callsites are in use, but since this branch rewrites a lot of the callsite registry code, that should probably be done in a follow-up. [1]: https://cs.github.com/?scopeName=All+repos&scope=&q=%28%2Fimpl+.*Callsite%2F+AND+language%3Arust%29+NOT+%28path%3A%2Ftracing%2F**+OR+path%3A%2Ftracing-*%2F**+OR+path%3A%2Ftokio-trace*%2F**%29%29 Signed-off-by: Eliza Weisman <[email protected]>

hawkw · 2022-05-02T18:04:55Z

Whoops, this PR was supposed to be closed by #2083, which kind of obsoleted it. We may still want to bring back a version of this change as well, for non-linked-list callsites.

…1.x (tokio-rs#2083) ## Motivation Currently on `v0.1.x`, the global callsite registry is implemented as a `Mutex<Vec<&'static dyn Callsite>>`. This has a few downsides: * Every time a callsite is registered, the `Mutex` must be locked. This can cause a deadlock when a `register_callsite` implementation calls into code that _also_ registers a callsite. See tokio-rs#2020 for details. * Registering callsites is relatively slow and forces the program to synchronize on the callsite registry (and not on *individual* callsites). This means that if two threads are both registering totally different callsites, both threads must wait for the lock. Although this overhead is amortized over the entire rest of the program, it does have an impact in short-running applications where any given callsite may only be hit once or twice. * Memory allocation may occur while the registry is locked. This makes the use of `tracing` inside of memory allocators difficult or impossible. On the `master` branch (v0.2.x), PR tokio-rs#988 rewrote the callsite registry to use an intrusive atomic singly-linked list of `Callsite`s. This removed the need for locking the callsite registry, and made it possible to register callsites without ever allocating memory. Because the callsite registry on v0.2 will *never* allocate, this also makes it possible to compile `tracing-core` for `no_std` platforms without requiring `liballoc`. Unfortunately, however, we cannot use an identical implementation on v0.1.x, because using the intrusive linked-list registry for *all* callsites requires a breaking change to the `Callsite` trait (in order to add a way to get the callsite's linked-list node), which we cannot make on v0.1.x. ## Solution This branch adds a linked-list callsite registry for v0.1.x in a way that allows us to benefit from *some* of the advantages of this approach in a majority of cases. The trick is introducing a new `DefaultCallsite` type in `tracing-core` that implements the `Callsite` trait. This type can contain an intrusive list node, and *when a callsite is a `DefaultCallsite`*, we can register it without having to push it to the `Mutex<Vec<...>>`. The locked vec still _exists_, for `Callsite`s that are *not* `DefaultCallsite`s, so we can't remove the `liballoc` dependency, but in most cases, we can avoid the mutex and allocation. Naturally, `tracing` has been updated to use the `DefaultCallsite` type from `tracing-core`, so the `Vec` will only be used in the following cases: * User code has a custom implementation of the `Callsite` trait, which is [not terribly common][1]. * An older version of the `tracing` crate is in use. This fixes the deadlock described in tokio-rs#2020 when `DefaultCallsite`s are used. Additionally, it should reduce the performance impact and memory usage of the callsite registry. Furthermore, I've changed the subscriber registry so that a `RwLock<Vec<dyn Dispatch>>` is only used when there actually are multiple subscribers in use. When there's only a global default subscriber, we can once again avoid locking for the registry of subscribers. When the `std` feature is disabled, thread-local scoped dispatchers are unavailable, so the single global subscriber will always be used on `no_std`. We can make additional changes, such as the ones from tokio-rs#2020, to _also_ resolve potential deadlocks when non-default callsites are in use, but since this branch rewrites a lot of the callsite registry code, that should probably be done in a follow-up. [1]: https://cs.github.com/?scopeName=All+repos&scope=&q=%28%2Fimpl+.*Callsite%2F+AND+language%3Arust%29+NOT+%28path%3A%2Ftracing%2F**+OR+path%3A%2Ftracing-*%2F**+OR+path%3A%2Ftokio-trace*%2F**%29%29 Signed-off-by: Eliza Weisman <[email protected]>

hawkw added 4 commits March 24, 2022 11:49

tracing: test reproducing register_callsite deadlock

890acc0

Signed-off-by: Eliza Weisman <[email protected]>

core: remove mutex poisoning

37c664a

Signed-off-by: Eliza Weisman <[email protected]>

hawkw requested a review from a team as a code owner March 24, 2022 20:08

jswrenn added a commit to jswrenn/tracing-allocations that referenced this pull request Mar 24, 2022

try 'eliza/fix-register-deadlock'

8ff1dfe

tokio-rs/tracing#2020

hawkw added 2 commits March 25, 2022 09:09

Merge branch 'v0.1.x' into eliza/fix-register-deadlock

5be3591

Merge branch 'v0.1.x' into eliza/fix-register-deadlock

c6ffc6e

This was referenced Apr 18, 2022

add feature to show mpsc channel metrics tokio-rs/console#278

Open

core: add a limited form of the linked-list callsite registry for v0.1.x #2083

Merged

hawkw closed this May 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tracing, core: fix deadlock in `register_callsite` #2020

tracing, core: fix deadlock in `register_callsite` #2020

hawkw commented Mar 24, 2022

hawkw commented May 2, 2022

tracing, core: fix deadlock in register_callsite #2020

tracing, core: fix deadlock in register_callsite #2020

Conversation

hawkw commented Mar 24, 2022

Motivation

Solution

hawkw commented May 2, 2022

tracing, core: fix deadlock in `register_callsite` #2020

tracing, core: fix deadlock in `register_callsite` #2020