
get_with deadlock #240

Open · Skepfyr opened this issue Mar 8, 2023 · 15 comments
Labels: enhancement (New feature or request)
Skepfyr commented Mar 8, 2023

The following code deadlocks. It simulates one task calling get_with and then not being polled for an arbitrarily long time (or even being forgotten entirely); a second task then comes along and appears to get blocked behind the first being polled to completion.

use futures::{future::poll_immediate, pin_mut};
use moka::future::Cache;
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let cache = Cache::builder().build();
    // Task 1: starts initializing the entry for key 1, but is polled only once.
    let a = cache.get_with(1, async {
        yield_now().await;
        "A"
    });
    pin_mut!(a);
    assert!(poll_immediate(&mut a).await.is_none());
    // Task 2: blocks forever, waiting for `a` to be polled to completion.
    let b = cache.get_with(1, async { "B" });
    println!("{}", b.await);
}

I was expecting the future returned by get_with to act more like futures::future::Shared, where polling any of the shared handles can make progress on the inner future.
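For comparison, here is a minimal sketch (plain futures + tokio code, no moka involved) of the Shared behaviour I mean: polling any clone drives the single underlying future, and dropping one handle does not cancel the work.

use futures::{future::poll_immediate, FutureExt};
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let mut first = async {
        yield_now().await;
        "A"
    }
    .shared();
    let second = first.clone();

    // One poll of the first handle: the inner future yields, so it is not ready yet.
    assert!(poll_immediate(&mut first).await.is_none());
    // Dropping the first handle does not cancel the work; the clone finishes it.
    drop(first);
    assert_eq!(second.await, "A");
}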

This is the most obvious bad behaviour of this function, but we hit it in the wild because the futures that wait on our cache get calls have a timeout shorter than the time it takes to resolve the value on a cache miss. As a result, the cache never fills, because you get behaviour like the following:

Request A misses the cache => starts Future A to retrieve the value
Request B misses the cache => waits on the completion of Future A
Request A times out => Future A is dropped and Future B now starts getting awaited
Request C misses the cache => waits on the completion of Future B
Request B times out => Future B is dropped and Future C now starts getting awaited
...
Skepfyr commented Mar 8, 2023

I've just spotted #59 which is strongly related.

tatsuya6502 commented Mar 8, 2023

Thank you for reporting the issue with the repro.

If I understand correctly, we cannot work around either case on the cache side:

  1. The deadlock case: get_with only resolves when the given init future is resolved.
    • It is the caller's responsibility to ensure the init future is resolved.
    • We cannot do something like futures::future::Shared here. All Shared handles created from a future point to the same underlying future, whereas every get_with call creates a different future.
  2. The repeated re-resolving of init futures: get_with must return a V, so it can only return once a given init future has resolved to a V. If an init future is dropped before that, get_with cannot return and has to move on to resolving the next caller's init future.
    • Again, it is the caller's responsibility to ensure the init future is resolved.

tatsuya6502 (Member) commented:

As for the second case, maybe you can resolve the issue on the caller side by using try_get_with. I will create an example when I get back to my desk.

My current idea is (a rough sketch follows the list):

  1. Right before calling try_get_with, capture the current Instant (time).
  2. Call try_get_with with an init future that holds the Instant from step 1.
  3. Inside the init future, get the current Instant again.
  4. Calculate the remaining time: timeout - (the Instant from step 3 - the Instant from step 1).
  5. If the remaining time is greater than the estimated time to resolve this init future, resolve it.
  6. Otherwise, cancel this init future by returning an Err.
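
A minimal sketch of this idea (the fetch_value function, error type, and timing constants below are illustrative, not part of moka's API):

use std::{
    sync::Arc,
    time::{Duration, Instant},
};
use moka::future::Cache;

// Illustrative budget values; tune them to the actual workload.
const TIMEOUT: Duration = Duration::from_secs(2);
const ESTIMATED_INIT_TIME: Duration = Duration::from_millis(500);

// Hypothetical error type used to cancel the init future.
#[derive(Debug)]
struct BudgetExceeded;

// Hypothetical expensive operation performed on a cache miss.
async fn fetch_value(key: u32) -> String {
    format!("value for {key}")
}

async fn get_with_budget(
    cache: &Cache<u32, String>,
    key: u32,
) -> Result<String, Arc<BudgetExceeded>> {
    // Step 1: capture the current Instant right before calling `try_get_with`.
    let start = Instant::now();
    cache
        .try_get_with(key, async move {
            // Steps 3 and 4: check the clock again and compute the remaining budget.
            let remaining = TIMEOUT.saturating_sub(start.elapsed());
            // Steps 5 and 6: resolve only if the remaining budget looks sufficient;
            // otherwise cancel this init future by returning an Err.
            if remaining > ESTIMATED_INIT_TIME {
                Ok(fetch_value(key).await)
            } else {
                Err(BudgetExceeded)
            }
        })
        .await
}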

Skepfyr commented Mar 8, 2023

I'm not convinced by this statement:

We cannot do something like futures::future::Shared here. All Shared handles created from a future point to the same underlying future, whereas every get_with call creates a different future.

I don't think there's anything stopping Requests B and C from polling A's future while it's still working. If it panics they should submit their own futures to be used, but until then they could poll each other's futures, just like Shared.

tatsuya6502 (Member) commented:

I don't think there's anything stopping Requests B and C from polling A's future while it's still working

No, Requests B and C cannot poll it.

Request A times out => Future A is dropped and Future B now starts getting awaited

Future A is already dropped here, so it cannot be polled.

Skepfyr commented Mar 8, 2023

Is there any reason that dropping Request A has to drop Future A though? The way something like Shared works is that Request B picks up partial ownership of Future A so that even if Request A is dropped, Request B can carry on polling Future A.

tatsuya6502 (Member) commented:

Future A is owned by Request A. In Rust, an async fn or async block creates an anonymous struct that implements Future; Request A's anonymous struct owns Future A.

When Request A is dropped, Future A is dropped along with it.
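
A tiny self-contained illustration of that ownership rule (generic async Rust, nothing moka-specific): once the owner of a suspended future is dropped, its remaining work can never run.

use futures::future::poll_immediate;
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let fut = async {
        yield_now().await;
        println!("never printed");
    };
    let mut fut = Box::pin(fut);
    // Poll once: the future yields and is left suspended at `yield_now`.
    assert!(poll_immediate(&mut fut).await.is_none());
    // Dropping the owner destroys the suspended state; the println! never runs.
    drop(fut);
}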

tatsuya6502 (Member) commented:

When I have time (maybe after work today; it is 6 AM now in my time zone), I will create reproducing code for your original problem:

Request A misses the cache => starts Future A to retrieve the value
Request B misses the cache => waits on the completion of Future A
Request A times out => Future A is dropped and Future B now starts getting awaited
Request C misses the cache => waits on the completion of Future B
Request B times out => Future B is dropped and Future C now starts getting awaited
...

Then I will check whether there is any way to work around the issue on the Cache side. I do not think there is, but I will double-check.

tatsuya6502 (Member) commented:

Future A is owned by Request A.

Maybe we can avoid dropping Future A when Request A is dropped. I will try to move the ownership of Future A from Request A to an internal hash table (called waiters) used by get_with.

tatsuya6502 commented Mar 8, 2023

But to do so, I will have to remove the workaround for #212, an optimization issue in rustc that is not yet resolved (as of 1.70.0-nightly 2023-03-07).

Skepfyr commented Mar 8, 2023

I've been thinking some more, and while I would like that as a solution, it does have downsides. Specifically, I believe it will be a breaking change because the future passed to get_with would have to be 'static. It's possible the best solution is just documenting everything carefully and doing tokio::spawn(async move { cache.get_with(key, future).await }).await (see the sketch below), but that feels like a massive shame, and I'm sure it's possible to do better.
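
For reference, a sketch of that spawn-based workaround (the key/value types and fetch_value below are illustrative): the spawned task owns the get_with future, so it runs to completion even if the original caller times out and is dropped.

use moka::future::Cache;

// Hypothetical init work; stands in for whatever resolves the value.
async fn fetch_value(key: u32) -> String {
    format!("value for {key}")
}

async fn get_detached(cache: &Cache<u32, String>, key: u32) -> String {
    // `Cache` is a cheap-to-clone handle, so it can be moved into the task.
    let cache = cache.clone();
    tokio::spawn(async move { cache.get_with(key, fetch_value(key)).await })
        .await
        .expect("init task panicked")
}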

tatsuya6502 (Member) commented:

I started some experiments here:
https://gitlab.com/moka-labs/moka-gh240-shared-init-future/moka/-/compare/master...shared-init-future

It wraps the init future with futures_util::future::Shared, and stores it in the waiter hash table.

pub(crate) type SharedInit<V> =
    Shared<Pin<Box<dyn Future<Output = Result<V, ErrorObject>> + Send + 'static>>>;

type WaiterMap<K, V, S> = crate::cht::SegmentedHashMap<(Arc<K>, TypeId), SharedInit<V>, S>;

Specifically, I believe it will be a breaking change because the future passed to get_with would have to be 'static.

I believe so. It has to be 'static + Send.

Instead of breaking get_with, I added the following method to the entry API:

pub async fn or_insert_with_shared(
    self,
    shared_init: impl Future<Output = V> + Send + 'static,
) -> Entry<K, V>;

and I confirmed that the deadlocking code no longer deadlocks when modified to use or_insert_with_shared:

use futures::{future::poll_immediate, pin_mut};
use moka::future::Cache;
use tokio::task::yield_now;

#[tokio::main]
async fn main() {
    let cache = Cache::builder().build();
    let a = cache.entry(1).or_insert_with_shared(async {
        yield_now().await;
        "A"
    });
    pin_mut!(a);
    assert!(poll_immediate(&mut a).await.is_none());
    let b = cache.entry(1).or_insert_with_shared(async { "B" });
    println!("{}", b.await.into_value()); // Should print "A".
}

It is not quite ready. I have the following TODOs on an internal method called try_init_or_read:

    // TODO:
    //
    // 1. Currently, all the owners of a shared future will call the `insert`
    //    closure and return `Initialized`, which is wrong. Make only one of the
    //    owners do so, and have the others return `ReadExisting`.
    //    - To realize this, we will need to copy the `Shared` future code from
    //      `futures_util` and modify it.
    // 2. Decide what to do when all the owners of a shared future are dropped
    //    without completing it.
    //    - Do we want to remove the shared future from the waiter map?
    //    - Or keep it for a configurable timeout and remove it after that?
    // 3. Decide what to do when a shared future panics.
    //    - We would probably keep the current `futures_util::future::Shared`
    //      behaviour, which causes all of its owners to panic (no retrying).
    //

Skepfyr commented Mar 12, 2023

Woah, that's really cool. Some thoughts on the TODOs:

  1. Copying the Shared code is a tad sad. The only solution I can think of (without looking at the code) is to try to move the code you only want to execute once inside the Shared.
  2. Removing it from the waiter map is likely the least surprising option, although in some ways just leaving it there forever is more consistent with how futures work (though that is probably kind of annoying).
  3. Everyone panicking is probably fine; the other waiters are likely using the same init future (it'd be slightly weird to have more than one way to initialise the cache), so they likely wouldn't fare any better. Although, to be consistent with get_with, you could just move on to the next waiter's future.

tatsuya6502 self-assigned this Mar 25, 2023
tatsuya6502 added the enhancement (New feature or request) label Mar 25, 2023
tatsuya6502 modified the milestones: v0.10.2, v0.10.3 Mar 25, 2023
tatsuya6502 modified the milestones: v0.10.3, v0.11.1 May 1, 2023
tatsuya6502 modified the milestones: v0.11.1, Backlog May 27, 2023
xjewer commented May 1, 2024

Hey @tatsuya6502, do you have any updates on this issue?

gautamg795 commented:

I think in the absence of a fix, it'd be great to document that the init future isn't internally polled or otherwise owned by the cache (which makes sense after reading this thread). I unfortunately spent quite a while debugging a deadlock caused by requestors timing out before the init future could complete, which then kept happening repeatedly with no forward progress made.

The short-term solution is, as mentioned above, to spawn a new task that runs the init future to completion.
