-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The tokio::time::timeout()
does not return after timeout.
#6682
Comments
There is not only one reason why runtime get stuck, refer to #6315. |
Can we get the stack of worker threads when the runtime get stuck? |
Attached is the backtrace of all threads. This one is from M2 MacBook Air with tokio We've also observed that on Linux suspending the program with |
I've been hunting this issue with @jtt. Today I ran a test between two consecutive commits of tokio: 1st: e62c3e9 and 2nd: 1914e1e:
The hang happens really fast on the 2nd commit. But doesn't appear on 1st commit (but it could of course need more time running). The 2nd commit is the merge of #6534 implementing the sharded locks on Timers. We keep running with 1st commit to see if we can reproduce the hang on it also. |
Can you change the tokio dependency of |
If I understand it correctly, this hang should not be reproduced on this commit. |
Thanks! I'll do a test run with this branch.
|
We've now been running with this branch for ~20 hours on Windows, Linux and Mac OS and so far we have not observed any issues and things have been running smoothly. We'll keep testing for few days at least and let you know of any issues, but looking good so far. |
Thank you for your feedback! It seems that there is some data race here. I will try to fix it. |
Bumps tokio from 1.38.0 to 1.38.1. Release notes Sourced from tokio's releases. Tokio v1.38.1 1.38.1 (July 16th, 2024) This release fixes the bug identified as (#6682), which caused timers not to fire when they should. Fixed time: update wake_up while holding all the locks of sharded time wheels (#6683) #6682: tokio-rs/tokio#6682 #6683: tokio-rs/tokio#6683 Commits 14b9f71 chore: release Tokio v1.38.1 (#6688) 24344df time: fix race condition leading to lost timers (#6683) See full diff in compare view Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase. Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: @dependabot rebase will rebase this PR @dependabot recreate will recreate this PR, overwriting any edits that have been made to it @dependabot merge will merge this PR after your CI passes on it @dependabot squash and merge will squash and merge this PR after your CI passes on it @dependabot cancel merge will cancel a previously requested merge and block automerging @dependabot reopen will reopen this PR if it is closed @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Is 1.37.0 affected? Which version introduced this bug? |
@viciousstar My understanding is that 1.37.0 is not affected. The bug was introduced in 1.38.0 (with #6534) and then fixed in 1.38.1. |
Version
We have been able to reproduce this issue with tokio 1.38.0 and
main
as of c8f3539.Tokio 1.37.0 does not seem to be affected (but since this is really sensitive to timing, can not be sure).
Platform
Linux 6.1.0-22-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux
23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112 arm64
Description
Tokio 1.38.0 seems to have introduced a bug which affects at least
tokio::time::timeout()
, but is likely related totokio::time::sleep::Sleep
.We have a program which calls
tokio::time::timeout()
in a loop:where
rx
is atokio::sync::mpsc::Receiver
. After updating to tokio 1.38.0 we observed that sometimes the program would get stuck ontimeout()
, that is, thetimeout().await
never returned.The bug seems to trigger when
a) There has been data on the channel and
rx.recv()
has returned a value and thus the underlyingSleep
has been droppedb) We have called the
timeout()
in just the right time and therx
channel is empty. TheDuration
we calltimeout()
with varies from few hundred milliseconds to few seconds.It looks like a some kind on race where we call
timeout()
just right after the previous call has returnedOk()
. Reproducing this sometimes happens after running for few minutes and sometimes the program needs to run for hours before getting stuck. Trying to debug this by addingprintln!()
etc. makes the bug harder to trigger. Same applies to adding debugs into the internals oftokio::runtime::time
What I have observed is that
tokio::runtime::time::Driver::park_internal()
calculates theexpiration_time
asNone
just after we calltimeout()
and for some reason the newly addedSleep
does not affect that and thus thetimeout().await
never returns.Unfortunately I have not been able to minimise the program into something that would reliably (or at all) trigger the bug, but please let me know if any additional information is needed and I'll try to provide that.
The text was updated successfully, but these errors were encountered: