get_with deadlock #240
I've just spotted #59, which is strongly related.
Thank you for reporting the issue with the repro. If I understand correctly, we cannot work around both cases on the cache side.

As for the second case, maybe you can resolve the issue at the caller side. My current idea is:
I'm not convinced by your statement of:

I don't think there's anything stopping Requests B and C from polling A's future while it's still working. If it panics, then they should submit their own future to be used, but they could poll each other's futures, like `futures::future::Shared` does.
No, Requests B and C cannot poll it. Future A is already dropped here, so it cannot be polled.
Is there any reason that dropping Request A has to drop Future A, though?
Future A is owned by Request A. In Rust, when Request A is dropped, Future A is also dropped.
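To illustrate the point, here is a standalone sketch (not moka code, names are illustrative): in Rust, dropping a future cancels it, so whatever owns the future controls its lifetime. `tokio::time::timeout` owns the inner future and drops it when the deadline fires, the same way dropping Request A drops Future A; the `Canary` type only makes the cancellation observable.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Only here to make the cancellation visible.
struct Canary;
impl Drop for Canary {
    fn drop(&mut self) {
        println!("init future dropped before completion");
    }
}

#[tokio::main]
async fn main() {
    let slow_init = async {
        let _canary = Canary;
        sleep(Duration::from_secs(10)).await;
        "value"
    };

    // `timeout` owns `slow_init`; when the deadline fires, the timed-out future
    // is dropped, which drops (and thereby cancels) `slow_init`.
    let res = timeout(Duration::from_millis(10), slow_init).await;
    assert!(res.is_err()); // timed out; the canary message was printed
}
```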
When I have time (maybe after work today; it is 6 AM now in my time zone), I will create reproducing code for your original problem:

- Request A misses the cache => starts Future A to retrieve the value.
- Request B misses the cache => waits on the completion of Future A.
- Request A times out => Future A is dropped and Future B now starts getting awaited.
- Request C misses the cache => waits on the completion of Future B.
- Request B times out => Future B is dropped and Future C now starts getting awaited.
- ...

Then I will check if there is any way to work around the issue on the cache side.
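A minimal sketch of that scenario, assuming a tokio runtime and the moka `future::Cache` API of the time (where `get` is still synchronous); the key, value, and timings are illustrative. Every caller gives up before the value can be resolved, so the in-flight init future keeps getting dropped and the cache never fills:

```rust
use std::time::Duration;

use moka::future::Cache;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    let cache: Cache<u32, String> = Cache::builder().build();

    // Requests arrive every 40 ms, each with a 50 ms deadline, but resolving the
    // value takes 200 ms. Whenever the current leader times out, its init future
    // is dropped and the next waiter starts over from scratch.
    for i in 0..5u32 {
        let cache = cache.clone();
        tokio::spawn(async move {
            let res = timeout(
                Duration::from_millis(50),
                cache.get_with(1, async {
                    sleep(Duration::from_millis(200)).await;
                    "expensive value".to_string()
                }),
            )
            .await;
            println!("request {i}: {res:?}");
        });
        sleep(Duration::from_millis(40)).await;
    }

    sleep(Duration::from_millis(300)).await;
    assert!(cache.get(&1).is_none()); // the value was never inserted
}
```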
Maybe we can avoid dropping Future A when Request A is dropped. I will try to move the ownership of Future A from Request A to an internal hash table (the waiter map).
But to do so, I will have to remove the workaround for #212, which is an optimization issue.
I've been thinking some more, and while I would like that as a solution, it does have downsides. Specifically, I believe it will be a breaking change because the future passed to `get_with` would have to be `'static`. It's possible the best solution is just documenting everything carefully.
I started some experiments here. It wraps the init future in a `Shared` future:

```rust
pub(crate) type SharedInit<V> =
    Shared<Pin<Box<dyn Future<Output = Result<V, ErrorObject>> + Send + 'static>>>;

type WaiterMap<K, V, S> = crate::cht::SegmentedHashMap<(Arc<K>, TypeId), SharedInit<V>, S>;
```
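For anyone unfamiliar with it, here is a standalone sketch (not moka code) of the property this approach relies on: `futures_util::future::Shared` hands out cloneable handles to one underlying future, and dropping one handle does not cancel the computation as long as another handle keeps polling it.

```rust
use futures::future::{poll_immediate, FutureExt};
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    // Wrap an init future so that it can be polled through any of its clones.
    let shared = async {
        yield_now().await;
        42u32
    }
    .boxed()
    .shared();

    // Handle `a` is polled once and then dropped before the future completes.
    let mut a = shared.clone();
    assert!(poll_immediate(&mut a).await.is_none());
    drop(a);

    // The remaining handle can still drive the same underlying future to completion.
    assert_eq!(shared.await, 42);
}
```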
I believe so. It has to be `'static`. Instead of making that breaking change, I added a new method:

```rust
pub async fn or_insert_with_shared(
    self,
    shared_init: impl Future<Output = V> + Send + 'static,
) -> Entry<K, V>;
```

and I confirmed the deadlocking code no longer causes the deadlock if modified to use it:

```rust
use futures::{future::poll_immediate, pin_mut};
use moka::future::Cache;
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let cache = Cache::builder().build();

    let a = cache.entry(1).or_insert_with_shared(async {
        yield_now().await;
        "A"
    });
    pin_mut!(a);
    assert!(poll_immediate(&mut a).await.is_none());

    let b = cache.entry(1).or_insert_with_shared(async { "B" });
    println!("{}", b.await.into_value()); // Should print "A".
}
```

It is not quite ready. I have the following TODOs on an internal method:

```rust
// TODO:
//
// 1. Currently, all the owners of a shared future will call the `insert` closure
//    and return `Initialized`, which is wrong. Make only one of the owners keep
//    doing so, and the others return `ReadExisting`.
//    - To realize this, we will need to copy the `Shared` future code from
//      `futures_util` and modify it.
// 2. Decide what to do when all the owners of a shared future are dropped
//    without completing it.
//    - Do we want to remove the shared future from the waiter map?
//    - Or, keep it for a configurable timeout and remove it after that?
// 3. Decide what to do when a shared future panics.
//    - We will probably utilize the current `futures_util::future::Shared`
//      implementation, which causes all the owners of it to panic? (No retrying)
```
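Not something the thread settled on, but as a standalone illustration (not moka code) of the outcome TODO 1 asks for: every owner awaits the same shared future, yet exactly one of them reports `Initialized` while the rest report `ReadExisting`. Here a per-key `AtomicBool` picks the winner; the fix described above would instead modify a copy of `Shared` so that the owner whose poll actually completes the future runs the `insert` closure.

```rust
use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

use futures::future::FutureExt;
use tokio::task::yield_now;

#[derive(Debug, PartialEq)]
enum Outcome {
    Initialized,
    ReadExisting,
}

#[tokio::main]
async fn main() {
    // One shared init future; every owner awaits the same computation.
    let shared = async {
        yield_now().await;
        42u32
    }
    .boxed()
    .shared();

    // Per-key flag: the first owner to observe completion claims `Initialized`.
    let claimed = Arc::new(AtomicBool::new(false));

    let mut handles = Vec::new();
    for _ in 0..3 {
        let shared = shared.clone();
        let claimed = Arc::clone(&claimed);
        handles.push(tokio::spawn(async move {
            let value = shared.await;
            let outcome = if claimed.swap(true, Ordering::AcqRel) {
                Outcome::ReadExisting
            } else {
                Outcome::Initialized
            };
            (value, outcome)
        }));
    }

    let mut initialized = 0;
    for handle in handles {
        let (value, outcome) = handle.await.unwrap();
        assert_eq!(value, 42);
        if outcome == Outcome::Initialized {
            initialized += 1;
        }
    }
    assert_eq!(initialized, 1); // exactly one owner reports `Initialized`
}
```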
Woah, that's really cool. Some thoughts on the TODOs:
Hey @tatsuya6502, do you have any updates on this issue?
I think in the absence of a fix, it'd be great if this behaviour could be documented. The short-term solution is, as mentioned above, to spawn a new task which handles running the `get_with` call.
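For concreteness, a minimal sketch of that workaround, assuming a tokio runtime and the moka `future::Cache` API of the time (synchronous `get`); the key, value, and timings are illustrative. The `get_with` call runs inside a spawned task, and the caller applies its timeout to the `JoinHandle` instead, so a caller that gives up drops only the handle, not the in-flight `get_with` future.

```rust
use std::time::Duration;

use moka::future::Cache;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    let cache: Cache<u32, String> = Cache::builder().build();

    // Run `get_with` in its own task so it is driven to completion even if the
    // original caller stops waiting.
    let handle = tokio::spawn({
        let cache = cache.clone();
        async move {
            cache
                .get_with(1, async {
                    sleep(Duration::from_millis(200)).await; // slow value resolution
                    "expensive value".to_string()
                })
                .await
        }
    });

    // The caller times out on the JoinHandle; dropping it does not cancel the task.
    match timeout(Duration::from_millis(50), handle).await {
        Ok(Ok(value)) => println!("got {value} in time"),
        Ok(Err(join_err)) => println!("init task failed: {join_err}"),
        Err(_) => println!("timed out, but the spawned task keeps filling the cache"),
    }

    // Later requests find the value once the background task finishes.
    sleep(Duration::from_millis(300)).await;
    assert_eq!(cache.get(&1).as_deref(), Some("expensive value"));
}
```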
The following code deadlocks. The idea is to simulate one task calling `get_with` but then not being polled for an arbitrarily long period of time (or even just forgotten); another task then comes along and appears to get blocked behind the first being polled to completion.

I was expecting the future returned by `get_with` to act more like `futures::future::Shared`, where polls on any of the shared futures can progress the inner future.

This is the most obvious bad behaviour of this function, but we hit this in the wild, as the futures that wait on our cache get calls have a shorter timeout than it takes to resolve the value in the event of a cache miss. That means the cache never fills, because you get behaviour like the following:
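A minimal sketch of the kind of code described above (one caller polls `get_with` once and then stalls; a second caller on the same key then hangs), assuming a tokio runtime and moka's `future::Cache`; the key and values are illustrative:

```rust
use futures::{future::poll_immediate, pin_mut};
use moka::future::Cache;
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let cache: Cache<u32, &str> = Cache::builder().build();

    // Caller 1 starts `get_with`, is polled exactly once, and then stalls
    // (or is forgotten entirely) while its init future is still pending.
    let a = cache.get_with(1, async {
        yield_now().await;
        "A"
    });
    pin_mut!(a);
    assert!(poll_immediate(&mut a).await.is_none());

    // Caller 2 on the same key waits for caller 1's init future to finish,
    // but nothing is polling it any more, so this await never completes.
    let b = cache.get_with(1, async { "B" });
    println!("{}", b.await); // never printed: deadlock
}
```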