sync: fix notify_waiters notifying sequential awaits #5404
Conversation
tokio/src/sync/notify.rs
Outdated
```rust
// Safety: the waiter is still not inserted
let initial_notify_waiters_calls =
    unsafe { (*waiter.get()).notify_waiters_calls };

// Optimistically check if notify_waiters has been called
// after the future was created.
if get_num_notify_waiters_calls(curr) != initial_notify_waiters_calls {
    *state = Done;
    return Poll::Ready(());
}
```
This additional check is basically free and should help to avoid waiting for the lock.
```
@@ -222,8 +222,11 @@ struct Waiter {
    /// Waiting task's waker.
    waker: Option<Waker>,

    /// `true` if the notification has been assigned to this waiter.
    notified: Option<NotificationType>,
```
Renamed this field because it was confusing to me, given the similarly named future.
It seems like there is a compilation failure with some choice of feature flags:
Yeah, thanks, I fixed it. There is something complicated going on with atomic ordering causing a regression. I will look into this more later.
Turned out to be a logic bug unrelated to atomics. This is ready for a review @Darksonn. Sorry for so many scattered changes, but I found some comments inaccurate and decided to change them as well.
Let's say we have two pending `Notified` futures. Can the assert fail under this change?
If we have only two futures, then this assert cannot fail, because both of them are notified in a single batch. However, this reasoning falls apart if we consider more pending futures. If we have an array of 33 not yet polled `Notified` futures, consider the following order of operations:

```rust
futs[1].enable();
for i in 2..33 {
    futs[i].enable();
}
futs[0].enable();
// Waiters queue: head -> [futs[0], ..., futs[2] | futs[1]]
//                         0 ^             31 ^    32 ^

tokio::spawn(async move {
    notify.notify_waiters();
});

let res1 = futs[0].poll();
let res2 = futs[1].poll();

// this can fail
assert!(res1.is_pending() || res2.is_ready()); // if res1 is ready, then res2 must also be ready
```

It can fail because `futs[1]` lands in the second batch: after the first 32 waiters are woken and the lock is released, `futs[0]` can already be ready while `futs[1]` is still waiting. Under this change the waiters end up in the batches in a different order, so to construct the same scenario the futures have to be enabled like this:

```rust
futs[0].enable();
for i in 2..33 {
    futs[i].enable();
}
futs[1].enable();
// ...
```
Oh, I was thinking of the case where there are also other waiters.
Then it is possible to fail this assertion both under this change and on the master branch. Please let me know if this should be fixed together in this PR; I think it will require more work then.
It seems like an important property to me. I don't care whether it is fixed in this PR or in a separate one.
It should be fixed now. Please take a look @Darksonn.
Did you double-check that the new test fails without the fix?
It does not fail because the code on the current master traverses the queue in reverse order, but it fails under 572db8d, where this test was added without the actual fix.
I've added a test variant which fails under the current master; I double-checked that.
As a side note: the whole reason behind the issue this PR is trying to fix is that we have to wake up waiters in batches to avoid deadlocks. As far as I can tell from previous discussion, a deadlock can happen if a waker has a drop impl which tries to acquire the mutex again, for example by calling one of the `Notify` methods that takes the waiters lock.
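For illustration, here is a minimal sketch of such a waker (my own example, not code from this PR): a waker type whose `Drop` impl calls back into the same `Notify`. If wakers like this were dropped while the waiters lock was still held, the call in `drop` would try to take the lock again.

```rust
use std::sync::Arc;
use std::task::{Wake, Waker};
use tokio::sync::Notify;

// A waker that touches the same `Notify` when it is dropped. If
// `notify_waiters` dropped stale wakers while still holding its
// internal waiters lock, this `Drop` impl would try to take that
// lock again and deadlock.
struct NotifyOnDrop {
    notify: Arc<Notify>,
}

impl Wake for NotifyOnDrop {
    fn wake(self: Arc<Self>) {}
}

impl Drop for NotifyOnDrop {
    fn drop(&mut self) {
        // Re-enters the `Notify` internals (acquires the waiters lock).
        self.notify.notify_one();
    }
}

fn make_waker(notify: Arc<Notify>) -> Waker {
    Waker::from(Arc::new(NotifyOnDrop { notify }))
}
```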
Ok, that sounds good. I don't have time to do a full review right now, but I'm happy with the tests and the claimed behavior. Please file bugs for wakers that are dropped/woken while a mutex is held.
Hmm, something complicated is going on. For example, it seems like one can still observe it not being atomic in certain interleavings. I'm not sure what to do about this. I will need to think about it.
That's a good call. I think one way to fix it would be to make every call to `notify_waiters` move all of the waiters it is going to wake into a separate list first.
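A toy sketch of that idea under simplifying assumptions (the intrusive waiter list is replaced by a `VecDeque` of wakers, and all names are made up), just to show the shape of the approach:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::task::Waker;

// Illustrative only: tokio's real implementation uses an intrusive
// linked list of waiters, not a VecDeque of wakers.
struct ToyNotify {
    waiters: Mutex<VecDeque<Waker>>,
}

impl ToyNotify {
    fn notify_waiters(&self) {
        // Move every currently queued waiter out in one step while the
        // lock is held. Waiters registered after this point land in the
        // fresh, empty queue and are not affected by this call.
        let to_notify = std::mem::take(&mut *self.waiters.lock().unwrap());

        // Wake outside the lock, so a waker that re-enters ToyNotify
        // (e.g. from a Drop impl) cannot deadlock on the mutex.
        for waker in to_notify {
            waker.wake();
        }
    }
}
```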
Yeah, adding even more special cases to handle the thing I mentioned seems too error prone. Moving all of the to-be-notified waiters to another list seems reasonable. We might be able to do it by giving each waiter a raw pointer to the list it is stored in? (It has to be able to remove itself from the list in case the waiter is destroyed during the `notify_waiters` call.)
Hmm, but to keep those raw pointers valid, we would have to either box the list or set each pointer only after the list has reached its final address. I think it can be avoided by creating a pinned version of a linked list in every `notify_waiters` call. A waiter can then determine whether the list it belongs to is such a pinned list or the `Notify`'s own list.
I imagined that we would traverse the entire list, yes. Some sort of solution where the waiters can remove themselves without knowing which list they are in is an interesting idea. I don't think we have support for that in our current linked list implementation, though.
I went ahead with the idea of giving each waiter a raw pointer. The idea of a pinned linked list is definitely a possibility for future improvement, but the amount of unsafe code it would require is a bit too much for me right now. I think there is still some work to do to make the aliasing correct. I will get back to it later and let you know when it's ready for another review.
```
@@ -222,8 +228,9 @@ struct Waiter {
    /// Waiting task's waker.
    waker: Option<Waker>,

    /// `true` if the notification has been assigned to this waiter.
    notified: Option<NotificationType>,
    notify_waiters_queue: Option<NonNull<WaitList>>,
```
As far as I can tell, the invariant of this field is that it is `None` when the waiter is in the `Notify`'s own list, and `Some` when it is in some other list. Please document that.
True, I've added a note about this.
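For illustration, the documented invariant could look roughly like this; the wording and the `WaitList` stand-in are mine, not the actual note added in the PR:

```rust
use std::ptr::NonNull;

// Stand-in for tokio's intrusive wait list type, only so this
// fragment is self-contained.
struct WaitList;

struct Waiter {
    // ... other fields ...

    /// The list created by `notify_waiters` that this waiter has been
    /// moved into, if any.
    ///
    /// Invariant: `None` while the waiter is queued in the `Notify`'s
    /// own wait list, `Some` once it has been moved into a temporary
    /// list owned by a `notify_waiters` call.
    notify_waiters_queue: Option<NonNull<WaitList>>,
}
```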
```rust
// See if the node was notified but not received. In this case, if
// the notification was triggered via `notify_one`, it must be sent
// to the next waiter.
//
// Safety: with the entry removed from the linked list, there can be
// no concurrent access to the entry
if matches!(
    unsafe { (*waiter.get()).notification },
    Some(NotificationType::OneWaiter)
) {
    if let Some(waker) = notify_locked(&mut waiters, &notify.state, notify_state) {
        drop(waiters);
        waker.wake();
    }
}
```
This can also cause weird things ... I know this was already here, and maybe the answer is to just not change it.

Imagine this sequence of actions:

1. Create two `Notified` futures and poll them.
2. Call `notify_one` twice.
3. Create two new `Notified` futures.
4. Drop the original two `Notified` futures.
5. Both of the futures from step 3 complete.

This is weird since the futures that completed were created after the `notify_one` calls, and the fact that there are two means that we can't just pretend that this is due to the permit being stored.
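For concreteness, here is a sketch of that sequence using `tokio_test::task::spawn`; it is my illustration of the described behavior rather than a test from this PR, and the futures from step 3 are polled once so that they sit in the waiter queue:

```rust
use tokio::sync::Notify;
use tokio_test::task::spawn;

fn weird_permit_handoff() {
    let notify = Notify::new();

    // 1. Create two `Notified` futures and poll them.
    let mut fut1 = spawn(notify.notified());
    let mut fut2 = spawn(notify.notified());
    assert!(fut1.poll().is_pending());
    assert!(fut2.poll().is_pending());

    // 2. Call `notify_one` twice; the permits are assigned to the
    //    queued futures, but they have not been polled again yet.
    notify.notify_one();
    notify.notify_one();

    // 3. Create two new `Notified` futures and queue them up.
    let mut fut3 = spawn(notify.notified());
    let mut fut4 = spawn(notify.notified());
    assert!(fut3.poll().is_pending());
    assert!(fut4.poll().is_pending());

    // 4. Drop the original two futures; each one passes its
    //    unconsumed permit on to the next waiter in the queue.
    drop(fut1);
    drop(fut2);

    // 5. Both futures from step 3 now complete, even though they were
    //    created after the `notify_one` calls.
    assert!(fut3.poll().is_ready());
    assert!(fut4.poll().is_ready());
}
```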
I thought about this for a while and I think that's a tough problem. The issue is that we would like to think about cancelling a `Notified` future as simply removing it from the queue, like it's written in the docs. In reality, a future is removed much earlier, and at drop we only simulate what would have happened if the future had been removed before receiving the permit. In this code, it is done by passing the permit to another waiting future.

However, we can pass the permit from a future A to a waiting future B if and only if the state of `Notify` did not reach `NOTIFIED` at any point between A receiving the permit and B being pushed to the queue. We could check for that by enumerating all events of calling `notify_one` and enabling `Notified` futures, and keeping track of the last event number at which the state was `NOTIFIED`. In `Drop` it would suffice to check whether the future being dropped isn't older than this number.
Unfortunately, the problem is deeper than that, and your example demonstrates this. We can drop multiple futures, and simulating that one of them was removed from the queue before receiving the permit will affect how we simulate the same for others.
I like to think about it the following way: enumerate every event of enabling a `Notified` future and every `notify_one` call. Then we can define the number of waiters after each event, and the events at which the state was `NOTIFIED`. To check whether we can transfer the permit from a dropped future, we should track the last event number at which the state was `NOTIFIED`, which is the greatest such number. The issue is that dropping one future changes these numbers for the others, and we would have to account for that in the `Drop` impl.
To sum up, we would need a complicated data structure to effectively track whether we should transfer the permit from a dropped future, for example a deque or some sort of segment tree, which probably is not feasible here. I think the best we can do is the heuristic I've mentioned earlier, but it doesn't even fix your case, so in my opinion we should probably leave this code as it is.
I believe that we ran into a similar problem during a discussion about adding a condition variable to Tokio. To me, it seems that the best option is to just document that dropping a future that has consumed a permit from a `notify_one` call will pass on its permit, as if `notify_one` was called in the destructor.
Sure, I will be happy to open a PR improving the docs.
tokio/src/sync/notify.rs
Outdated
```rust
let waiters_to_notify = UnsafeCell::new(std::mem::take(&mut *waiters));
pin!(waiters_to_notify);
```
The `LinkedList` type is `Unpin`, so pinning it is a no-op.
Yeah, it felt awkward to use `Pin` here...

My intention was only to shadow the original variable binding to prevent accidentally moving it. I am not sure what the best way is to express that this variable won't change its address. Anyway, I replaced the `pin!` macro with manual shadowing.
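As a tiny standalone illustration of the shadowing trick (names made up): rebinding the variable to a reference to itself makes the original owner unnameable, so the value cannot be moved afterwards and its address stays stable.

```rust
use std::cell::UnsafeCell;

fn shadowing_instead_of_pin() {
    let batch = UnsafeCell::new(vec![1, 2, 3]);
    // Shadow `batch` with a shared reference to it. The original
    // binding can no longer be moved out of, so the address observed
    // through raw pointers below remains stable.
    let batch = &batch;

    let ptr: *mut Vec<i32> = batch.get();
    // Safety: nothing else accesses the UnsafeCell contents here.
    unsafe { (*ptr).push(4) };
    assert_eq!(unsafe { (*ptr).len() }, 4);
}
```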
tokio/src/sync/notify.rs
Outdated
```rust
for mut waiter in unsafe { &*waiters_to_notify.get() } {
    // Safety: we hold the `waiters` lock.
    let waiter = unsafe { waiter.as_mut() };

    // Safety: address of `waiters_to_notify` is not null.
    waiter.notify_waiters_queue =
        unsafe { Some(NonNull::new_unchecked(waiters_to_notify.get())) };
}
```
I would prefer to create a raw pointer once and reuse it. This has clearer semantics aliasing-wise.
Refactored this part.
I've noticed a slight degradation in performance when all waiters fit into a single batch. In such a case there is no need to traverse the list and set all the pointers, as the waiters will be removed immediately, so the pointers are now set only when more than one batch is needed.

Just for the record: although there are many new loom tests, almost all of them complete instantaneously; only one of them takes noticeably longer.
Closing in favor of #5458.
Motivation
Closes: #5396
`tokio::sync::Notify` has an internal waiters queue locked behind a mutex. `Notify::notify_waiters` is a function that notifies all waiters from this queue. It first acquires the lock, then removes waiters from the queue and wakes them up. However, the whole process is done in batches of 32 waiters to avoid deadlocks: the function has to release the lock before waking up a batch and re-acquire it before proceeding to the next one. In this short timespan another thread can acquire the lock and modify the queue and internal state. For example, it is possible to insert a new waiter into the queue, which will then be included in one of the later batches. The current implementation of `notify_waiters` does not account for that, which leads to the unexpected behavior described in #5396.

Solution

`Notify` internally tracks the number of calls to `notify_waiters`, and each time a new `Notified` future is created, the current value of this counter is stored inside the future. The solution is to utilize this counter and compare its value with the number of the current `notify_waiters` call inside that function. This way, futures created during the `notify_waiters` execution can be filtered out and kept in the queue.
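To illustrate the idea, here is a minimal, self-contained model of the counter comparison (toy types and names, without the real locking, batching, and wakers):

```rust
// Each queued waiter remembers the value of the `notify_waiters`
// counter at the time its future was created.
struct ToyWaiter {
    created_at_call: u64,
}

struct ToyQueue {
    notify_waiters_calls: u64,
    waiters: Vec<ToyWaiter>,
}

impl ToyQueue {
    fn create_waiter(&mut self) {
        self.waiters.push(ToyWaiter {
            created_at_call: self.notify_waiters_calls,
        });
    }

    fn notify_waiters(&mut self) {
        let this_call = self.notify_waiters_calls;
        self.notify_waiters_calls += 1;

        // Wake (here: simply remove) only the waiters created before
        // this call. A waiter pushed concurrently stores a counter
        // value greater than `this_call` and stays in the queue.
        self.waiters.retain(|w| w.created_at_call > this_call);
    }
}
```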