tokio_threadpool::blocking locking up tasks via NotReady #726

Closed
sfackler opened this issue Oct 29, 2018 · 16 comments
Labels
C-bug Category: This is a bug.

Comments

@sfackler
Contributor

Version

tokio: 0.1.11
tokio-threadpool: 0.1.4
tokio-codec: 0.1.1
tokio-current-thread: 0.1.3
tokio-io: 0.1.9
tokio-fs: 0.1.3
tokio-timer: 0.2.7
tokio-executor: 0.1.5
tokio-uds: 0.2.2
tokio-udp: 0.1.2
tokio-reactor: 0.1.6
tokio-signal: 0.2.5
tokio-tcp: 0.1.2

Platform

Linux 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Subcrates

tokio-threadpool

Description

This is a continuation/revival of #525. At the time we thought it was introduced by upgrading from tokio-threadpool 0.1.4 to 0.1.5, but we didn't know much more than that. The issue has popped back up for us, and I believe we've narrowed it down to the use of tokio_threadpool::blocking rather than the version of tokio-threadpool. Specifically, we switched from a futures_cpupool to tokio_threadpool::blocking for DNS resolution in our HTTP client, and we still saw the issue after upgrading tokio-threadpool all the way to the newest release.

At some point under high load, the runtime gets into a bad state where blocking calls return WouldBlock while there isn't actually any blocking work being performed. Trace-level logging just shows this:

{
  "message": "found work while draining; signal_work",
  "thread": "chatter-8",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560345825Z"
}
{
  "message": "signal_work -- spawn; idx=15",
  "thread": "chatter-8",
  "target": "tokio_threadpool::pool",
  "time": "2018-10-25T14:31:29.560385811Z"
}
{
  "message": "Task::run; state=Running",
  "thread": "chatter-8",
  "target": "tokio_threadpool::task",
  "time": "2018-10-25T14:31:29.560428187Z"
}
{
  "message": "checkout waiting for idle connection: (\"https://<REDACTED>:8643\", Http1)",
  "thread": "chatter-8",
  "target": "hyper::client::pool",
  "time": "2018-10-25T14:31:29.560468989Z"
}
{
  "message": "    -> not ready",
  "thread": "chatter-8",
  "target": "tokio_threadpool::task",
  "time": "2018-10-25T14:31:29.560521350Z"
}
{
  "message": "Worker::sleep; worker=WorkerId(15)",
  "thread": "chatter-7",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560530974Z"
}
{
  "message": "Worker::sleep; worker=WorkerId(7)",
  "thread": "chatter-8",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560566537Z"
}
{
  "message": "  sleeping -- push to stack; idx=7",
  "thread": "chatter-8",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560572269Z"
}
{
  "message": "  sleeping -- push to stack; idx=15",
  "thread": "chatter-7",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560566713Z"
}
{
  "message": "    -> starting to sleep; idx=15",
  "thread": "chatter-7",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560581947Z"
}
{
  "message": "    -> starting to sleep; idx=7",
  "thread": "chatter-8",
  "target": "tokio_threadpool::worker",
  "time": "2018-10-25T14:31:29.560584743Z"
}

We use a custom Hyper Connect implementation, and the first async operation it performs is this:

        let mut addrs = try_ready!(tokio_threadpool::blocking(|| {
            debug!("resolving addresses", safe: { host: start.host, port: start.port });
            (&*start.host, start.port).to_socket_addrs()
        }))?;

Since we never see that debug log message, it seems like the WouldBlock logged by tokio_threadpool is coming from the blocking call itself. Thread dumps indicate that there aren't any threads blocked in getaddrinfo, so it seems like the threadpool is somehow losing track of how many outstanding blocking calls are happening.
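
For reference, the pattern boils down to something like this standalone sketch (made-up host and port, and a plain poll_fn instead of our Connect machinery): blocking() returns Ok(Async::NotReady) whenever all blocking slots are claimed, which is exactly what we're seeing even though nothing is inside a blocking closure.

    extern crate futures;
    extern crate tokio;
    extern crate tokio_threadpool;

    use std::net::ToSocketAddrs;

    use futures::future::poll_fn;
    use futures::Future;

    fn main() {
        // blocking() yields Ok(Async::NotReady) until a blocking slot is
        // acquired; the closure only runs once a slot is held.
        let resolve = poll_fn(|| {
            tokio_threadpool::blocking(|| ("localhost", 8080).to_socket_addrs())
        });

        // tokio::run drives the threadpool runtime, so blocking() is legal here.
        tokio::run(
            resolve
                .map(|addrs| println!("resolved: {:?}", addrs.map(|a| a.collect::<Vec<_>>())))
                .map_err(|e| eprintln!("blocking error: {}", e)),
        );
    }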

Unfortunately, I've been completely unable to reproduce this anywhere other than production :(

@carllerche carllerche added the C-bug Category: This is a bug. label Oct 29, 2018
@carllerche carllerche assigned ghost Oct 29, 2018
@carllerche
Member

@stjepang Interested in your thoughts.

@carllerche
Member

@sfackler you mean Async::NotReady instead of WouldBlock?

@sfackler
Contributor Author

Er, yeah.

@sfackler sfackler changed the title tokio_threadpool::blocking locking up tasks via WouldBlock tokio_threadpool::blocking locking up tasks via NotReady Nov 21, 2018
@carllerche
Member

I believe I have discovered a bug.

Background

The threadpool allows for up to N tasks to be concurrently in a blocking state. This is implemented using a semaphore. The semaphore implementation was done directly in the guts of the threadpool. As part of an effort to ensure the correctness of Tokio's various components, I have been working on a (brute force) concurrency checker. It is still in an early state, but can be found here.

I extracted the semaphore component and have been testing it in isolation. The longer-term goal is to break up the threadpool implementation to make it more manageable.

Doing so has exposed (I believe) the bug.

Semaphore

The futures-aware semaphore works roughly by having a queue of permits. When a task releases the permit, it is assigned to any pending task. Each task struct has a flag tracking whether or not it holds the permit.
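
Schematically, the shape is something like this (illustrative names, not the actual tokio-threadpool code):

    use std::collections::VecDeque;

    // Stand-in for a handle to a pooled task.
    struct TaskRef {
        // Flag tracking whether this task has been handed the permit.
        has_permit: bool,
    }

    struct Semaphore {
        // Permits not currently held by any task.
        available: usize,
        // Tasks waiting on a permit.
        waiters: VecDeque<TaskRef>,
    }

    impl Semaphore {
        fn release(&mut self) {
            if let Some(mut task) = self.waiters.pop_front() {
                // Hand the permit directly to a pending task and (in the real
                // code) notify it so it gets polled again.
                task.has_permit = true;
            } else {
                self.available += 1;
            }
        }
    }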

The bug

The bug arises from a combination of a few factors.

  • The permit is "consumed" here.
  • The permit polling fn doesn't check if the permit has already been assigned.

The reasoning for not checking whether the permit was assigned is that it is already checked once before entering the task. If the permit is assigned while the task is running, the task should run again and consume the permit the second time around.

The problem is that consuming the permit happens before the task is transitioned to a running state. This is where the race arises:

  1. Task tries to consume, but does not obtain permit.
  2. Permit is assigned to the task and the task is notified.
  3. Task enters "run" state, consuming the notification.
  4. poll_blocking_capacity does not see the permit because it does not check the flag on the task... (timeline sketched below)
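
As a timeline (two threads, illustrative step names):

    task's thread                        releasing thread
    -------------                        ----------------
    tries to consume the permit
      -> none available
                                         assigns the permit to the task
                                         (sets its flag) and notifies it
    enters the "run" state, which
    consumes the notification
    poll_blocking_capacity checks the
    semaphore only, never the task's
    flag -> returns Async::NotReady

At that point the permit is parked in the task's flag, the notification has already been consumed, and nothing is left to wake the task, so the blocking slot is effectively leaked. That would match the symptom above: blocking keeps returning NotReady with no threads actually blocked.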

Feels like a 🐛

@carllerche
Member

cc @stjepang

@carllerche
Member

Ideally, the way to solve this is to rip out the semaphore impl from threadpool in favor of the one being worked on in syncbox, which should have a saner API.

@carllerche
Member

@sfackler Have you been able to confirm that this is fixed?

@sfackler
Contributor Author

Oh, thanks for the reminder. I'll cut a new release of our internal library that switches back to blocking and see how it goes.

@ghost

ghost commented Apr 15, 2019

@sfackler Just a ping in case there is any news on this issue since the last time.

@sfackler
Contributor Author

I just got around to switching our implementation back over to using blocking - it should roll out over the next week or so.

@sfackler
Contributor Author

This has been running without issue for a couple of days now, so it seems fixed!

@carllerche
Member

Yay!

@sfackler
Contributor Author

Sad update: we saw what looks like this issue happen again 😢

@sfackler sfackler reopened this Apr 29, 2019
@carllerche
Member

Oh no :-(. Are we sure it is blocking-related? If you run with the threadpool but don't use the blocking fn, do things work fine?

@sfackler
Contributor Author

Yeah - we've been using futures-cpupool for DNS for the last couple of months, and switched back to blocking a week or so ago. Same symptoms as before I think, though I didn't go in and turn on trace logging/etc.

Rather than just rolling back off of blocking entirely, I think I'll add some instrumentation (e.g. a gauge tracking how many threads are actually in the blocking closure) and some logic to fall back to a futures-cpupool if we wait more than N seconds for a blocking thread slot.
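
Something along these lines for the gauge, as a sketch (instrumented_blocking and IN_BLOCKING are made-up names; not panic-safe, but fine for a rough signal):

    extern crate futures;
    extern crate tokio_threadpool;

    use std::sync::atomic::{AtomicUsize, Ordering};

    use futures::Poll;
    use tokio_threadpool::BlockingError;

    // Gauge: how many threads are currently inside a blocking closure.
    static IN_BLOCKING: AtomicUsize = AtomicUsize::new(0);

    fn instrumented_blocking<F, T>(f: F) -> Poll<T, BlockingError>
    where
        F: FnOnce() -> T,
    {
        tokio_threadpool::blocking(|| {
            IN_BLOCKING.fetch_add(1, Ordering::SeqCst);
            let result = f();
            IN_BLOCKING.fetch_sub(1, Ordering::SeqCst);
            result
        })
    }

If the gauge reads zero while callers keep getting NotReady, that's strong evidence the pool has lost track of its blocking slots.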

@ghost ghost removed their assignment Jun 30, 2019
@carllerche
Member

This has been rewritten.
