-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing, core: fix deadlock in register_callsite
#2020
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
## Motivation Currently, the `callsite::register` function will unconditionally attempt to lock the global callsite registry. This is unfortunate, because it can result in a deadlock if a subscriber's `register_callsite` method calls into code that itself contains `tracing` instrumentation with callsites that haven't yet been registered. ## Solution This commit adds a new `callsite::try_register` function that *attempts* to register a callsite, returning `true` if it was successfully registered. If the current thread is registering a callsite (or, on `no_std`, if a callsite is being registered at all), it will return `false`, indicating that the registration must be retried. Signed-off-by: Eliza Weisman <[email protected]>
## Motivation Currently, there is a potential deadlock in `register_callsite` that can occur if registering a callsite causes _another_ callsite to be hit for the first time. If this occurs, the thread will try to lock the callsite registry while it is already holding the lock, resulting in a deadlock. This is unfortunate, since it can occur if a subscriber's `register_callsite` calls any function that contains a tracing event, if that event has not previously been registered. This has occurred in e.g. `tokio-console` due to using a tokio channel in the subscriber. ## Solution This commit fixes the deadlock by using the `callsite::try_register` function added to `tracing-core`, which uses a thread-local to bail if the *current* thread is already registering a callsite. In order to do this, we replace the `std::sync::Once` in `MacroCallsite` with a hand-written atomic once-like loop. This is because, while we only want to register the callsite once, we may have to *try* to register it more than one time before `try_register` returns true. Now, we set a flag only if the callsite has been successfully registered. Signed-off-by: Eliza Weisman <[email protected]>
jswrenn
added a commit
to jswrenn/tracing-allocations
that referenced
this pull request
Mar 24, 2022
This was referenced Apr 18, 2022
hawkw
added a commit
that referenced
this pull request
Apr 22, 2022
…1.x (#2083) ## Motivation Currently on `v0.1.x`, the global callsite registry is implemented as a `Mutex<Vec<&'static dyn Callsite>>`. This has a few downsides: * Every time a callsite is registered, the `Mutex` must be locked. This can cause a deadlock when a `register_callsite` implementation calls into code that _also_ registers a callsite. See #2020 for details. * Registering callsites is relatively slow and forces the program to synchronize on the callsite registry (and not on *individual* callsites). This means that if two threads are both registering totally different callsites, both threads must wait for the lock. Although this overhead is amortized over the entire rest of the program, it does have an impact in short-running applications where any given callsite may only be hit once or twice. * Memory allocation may occur while the registry is locked. This makes the use of `tracing` inside of memory allocators difficult or impossible. On the `master` branch (v0.2.x), PR #988 rewrote the callsite registry to use an intrusive atomic singly-linked list of `Callsite`s. This removed the need for locking the callsite registry, and made it possible to register callsites without ever allocating memory. Because the callsite registry on v0.2 will *never* allocate, this also makes it possible to compile `tracing-core` for `no_std` platforms without requiring `liballoc`. Unfortunately, however, we cannot use an identical implementation on v0.1.x, because using the intrusive linked-list registry for *all* callsites requires a breaking change to the `Callsite` trait (in order to add a way to get the callsite's linked-list node), which we cannot make on v0.1.x. ## Solution This branch adds a linked-list callsite registry for v0.1.x in a way that allows us to benefit from *some* of the advantages of this approach in a majority of cases. The trick is introducing a new `DefaultCallsite` type in `tracing-core` that implements the `Callsite` trait. This type can contain an intrusive list node, and *when a callsite is a `DefaultCallsite`*, we can register it without having to push it to the `Mutex<Vec<...>>`. The locked vec still _exists_, for `Callsite`s that are *not* `DefaultCallsite`s, so we can't remove the `liballoc` dependency, but in most cases, we can avoid the mutex and allocation. Naturally, `tracing` has been updated to use the `DefaultCallsite` type from `tracing-core`, so the `Vec` will only be used in the following cases: * User code has a custom implementation of the `Callsite` trait, which is [not terribly common][1]. * An older version of the `tracing` crate is in use. This fixes the deadlock described in #2020 when `DefaultCallsite`s are used. Additionally, it should reduce the performance impact and memory usage of the callsite registry. Furthermore, I've changed the subscriber registry so that a `RwLock<Vec<dyn Dispatch>>` is only used when there actually are multiple subscribers in use. When there's only a global default subscriber, we can once again avoid locking for the registry of subscribers. When the `std` feature is disabled, thread-local scoped dispatchers are unavailable, so the single global subscriber will always be used on `no_std`. We can make additional changes, such as the ones from #2020, to _also_ resolve potential deadlocks when non-default callsites are in use, but since this branch rewrites a lot of the callsite registry code, that should probably be done in a follow-up. [1]: https://cs.github.com/?scopeName=All+repos&scope=&q=%28%2Fimpl+.*Callsite%2F+AND+language%3Arust%29+NOT+%28path%3A%2Ftracing%2F**+OR+path%3A%2Ftracing-*%2F**+OR+path%3A%2Ftokio-trace*%2F**%29%29 Signed-off-by: Eliza Weisman <[email protected]>
Whoops, this PR was supposed to be closed by #2083, which kind of obsoleted it. We may still want to bring back a version of this change as well, for non-linked-list callsites. |
kaffarell
pushed a commit
to kaffarell/tracing
that referenced
this pull request
May 22, 2024
…1.x (tokio-rs#2083) ## Motivation Currently on `v0.1.x`, the global callsite registry is implemented as a `Mutex<Vec<&'static dyn Callsite>>`. This has a few downsides: * Every time a callsite is registered, the `Mutex` must be locked. This can cause a deadlock when a `register_callsite` implementation calls into code that _also_ registers a callsite. See tokio-rs#2020 for details. * Registering callsites is relatively slow and forces the program to synchronize on the callsite registry (and not on *individual* callsites). This means that if two threads are both registering totally different callsites, both threads must wait for the lock. Although this overhead is amortized over the entire rest of the program, it does have an impact in short-running applications where any given callsite may only be hit once or twice. * Memory allocation may occur while the registry is locked. This makes the use of `tracing` inside of memory allocators difficult or impossible. On the `master` branch (v0.2.x), PR tokio-rs#988 rewrote the callsite registry to use an intrusive atomic singly-linked list of `Callsite`s. This removed the need for locking the callsite registry, and made it possible to register callsites without ever allocating memory. Because the callsite registry on v0.2 will *never* allocate, this also makes it possible to compile `tracing-core` for `no_std` platforms without requiring `liballoc`. Unfortunately, however, we cannot use an identical implementation on v0.1.x, because using the intrusive linked-list registry for *all* callsites requires a breaking change to the `Callsite` trait (in order to add a way to get the callsite's linked-list node), which we cannot make on v0.1.x. ## Solution This branch adds a linked-list callsite registry for v0.1.x in a way that allows us to benefit from *some* of the advantages of this approach in a majority of cases. The trick is introducing a new `DefaultCallsite` type in `tracing-core` that implements the `Callsite` trait. This type can contain an intrusive list node, and *when a callsite is a `DefaultCallsite`*, we can register it without having to push it to the `Mutex<Vec<...>>`. The locked vec still _exists_, for `Callsite`s that are *not* `DefaultCallsite`s, so we can't remove the `liballoc` dependency, but in most cases, we can avoid the mutex and allocation. Naturally, `tracing` has been updated to use the `DefaultCallsite` type from `tracing-core`, so the `Vec` will only be used in the following cases: * User code has a custom implementation of the `Callsite` trait, which is [not terribly common][1]. * An older version of the `tracing` crate is in use. This fixes the deadlock described in tokio-rs#2020 when `DefaultCallsite`s are used. Additionally, it should reduce the performance impact and memory usage of the callsite registry. Furthermore, I've changed the subscriber registry so that a `RwLock<Vec<dyn Dispatch>>` is only used when there actually are multiple subscribers in use. When there's only a global default subscriber, we can once again avoid locking for the registry of subscribers. When the `std` feature is disabled, thread-local scoped dispatchers are unavailable, so the single global subscriber will always be used on `no_std`. We can make additional changes, such as the ones from tokio-rs#2020, to _also_ resolve potential deadlocks when non-default callsites are in use, but since this branch rewrites a lot of the callsite registry code, that should probably be done in a follow-up. [1]: https://cs.github.com/?scopeName=All+repos&scope=&q=%28%2Fimpl+.*Callsite%2F+AND+language%3Arust%29+NOT+%28path%3A%2Ftracing%2F**+OR+path%3A%2Ftracing-*%2F**+OR+path%3A%2Ftokio-trace*%2F**%29%29 Signed-off-by: Eliza Weisman <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation
Currently, there is a potential deadlock in
register_callsite
that canoccur if registering a callsite causes another callsite to be hit for
the first time. If this occurs, the thread will try to lock the callsite
registry while it is already holding the lock, resulting in a deadlock.
This is unfortunate, since it can occur if a subscriber's
register_callsite
calls any function that contains a tracing event, ifthat event has not previously been registered. This has occurred in e.g.
tokio-console
due to using a tokio channel in the subscriber.Solution
This commit fixes the deadlock by using the
callsite::try_register
function added to
tracing-core
, which uses a thread-local to bail ifthe current thread is already registering a callsite. In order to do
this, we replace the
std::sync::Once
inMacroCallsite
with ahand-written atomic once-like loop. This is because, while we only want
to register the callsite once, we may have to try to register it more
than one time before
try_register
returns true. Now, we set a flagonly if the callsite has been successfully registered.
Signed-off-by: Eliza Weisman [email protected]