
Investigate improving scheduler work-stealing efficiency #11730

Closed
alexcrichton opened this issue Jan 22, 2014 · 32 comments
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows I-slow Issue: Problems and improvements with respect to performance of generated code. P-medium Medium priority

Comments

@alexcrichton
Member

Right now the libgreen scheduler implements a work-stealing algorithm in tandem with a global sleeper list in order to fan out work among all schedulers when there is work to do.

Sadly, this scheme has pathological performance in simple situations like lots of tasks contending on a mutex. The reason is that every time a task gets off a mutex, a remote scheduler is woken up to run it. This remote scheduler is woken up with a syscall. The remote scheduler will run the task, but immediately go back to sleep as the task goes back to sleep on the mutex.

This process of waking up a remote scheduler is very expensive and shows up painfully in benchmarks. The problem essentially boils down to all task enqueues requiring a syscall, because on some workloads there's almost always a scheduler that is asleep since the system isn't fully loaded. If the system is loaded with work, none of this is a problem because no remote scheduler needs to be woken up.

Right now, there are two ideas for mitigating the cost of this syscall.

  1. Exponential backoff when sleeping. Right now, whenever a scheduler goes to sleep, it falls back into epoll and gets woken up via a write syscall. Additionally, it puts a handle on a global sleeper list for other schedulers to grab and then wake up. Instead, a scheduler could have some sort of exponential backoff before sleeping. Only after the backoff completed would a handle be placed on the sleeper list. The goal would be to avoid write syscalls, because the handles wouldn't actually be on the list, yet other schedulers would still successfully steal work. This exponential backoff would need to be tunable, and it probably shouldn't happen if there are active I/O handles.
  2. Schedulers should only wake up two other schedulers. Right now there is a global sleeper list where any scheduler can wake up any other scheduler. In addition to being a point of contention, this means that if any scheduler is sleeping, a task enqueue will be a syscall. On systems with lots of cores, I would imagine it's fairly likely that small programs don't fill up all the cores. To mitigate this, we would arrange the pool of schedulers as a ring. Each scheduler would be responsible for waking up its left and right neighbors, but that's it. With this scheme, if one scheduler had lots of serial work, it would occasionally wake up its neighbors, but that's it. Additionally, its neighbors would likely be in the exponential backoff stage before they went to sleep, so there would be no syscalls.

I'm guessing that combining these two strategies will allow us to still distribute work to all cores quickly if there is a truly parallel workload, and also work well on many-task workloads that are mostly sequential (due to a mutex or some such). This is just a guess though, and definitely needs investigation.
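For concreteness, here is a minimal sketch of the ring arrangement from idea 2. It uses today's std rather than the 2014 runtime's types, and `SchedulerRing`, `WakeupHandle`, and the method names are hypothetical stand-ins, not the actual libgreen API:

```rust
/// Hypothetical ring of per-scheduler wake-up handles; scheduler `i` may
/// only wake its immediate neighbors, never an arbitrary sleeper.
struct SchedulerRing {
    wakeup_handles: Vec<WakeupHandle>,
}

/// Stand-in for a per-scheduler wake-up mechanism. In libgreen this would
/// wrap the expensive path: a write() that unblocks the remote scheduler's
/// epoll/libuv loop.
struct WakeupHandle;

impl WakeupHandle {
    fn wake(&self) {
        // syscall-heavy wake-up elided in this sketch
    }
}

impl SchedulerRing {
    /// Indices of the left and right neighbors of scheduler `idx`
    /// (assumes at least one scheduler in the ring).
    fn neighbors(&self, idx: usize) -> (usize, usize) {
        let n = self.wakeup_handles.len();
        ((idx + n - 1) % n, (idx + 1) % n)
    }

    /// Called after scheduler `idx` enqueues stealable work: wake at most
    /// its two neighbors instead of consulting a global sleeper list.
    fn notify_neighbors(&self, idx: usize) {
        let (left, right) = self.neighbors(idx);
        self.wakeup_handles[left].wake();
        self.wakeup_handles[right].wake();
    }
}
```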

@alexcrichton
Member Author

Nominating.

I recently investigated the rust-http benchmarks, which show that the RPS for libnative is over 2x that of libgreen. I found that with RUST_THREADS=1 libgreen was 7% faster than libnative (and libnative had the benefit of being concurrent).

In my opinion, this is a serious obstacle to widespread acceptance of libgreen. Until this issue is fixed, I feel that common workloads will be much slower than they need to be. Another example of a workload showing this pathological performance is a highly contended mutex.

@pnkfelix
Member

Assigning P-high since it seems promising, but this should not block 1.0.

@derekchiang
Contributor

This sounds like a very interesting project to work on and I'd like to look into this. @alexcrichton, are there any benchmarks I can use to test green's performance under the kind of workload you described?

@alexcrichton
Member Author

One example of the pathological code is the example in this gist: https://gist.github.com/8813400. On my OSX laptop, this code runs in about 7s on libgreen and in 3s on libnative. I would expect libgreen to be much faster in this case.

Another example would be the benchmarks inside of https://github.com/chris-morgan/rust-http/. I found that libgreen was by default 2x slower than libnative, but with RUST_THREADS=1 libgreen was about 10% faster than libnative (and libnative was parallelized while libgreen wasn't). I never got the benchmarks to run reliably on any platform other than Linux.

@alexcrichton
Member Author

Ah, I should also mention that the gist I gave takes a command-line parameter for the number of threads contending with each other, and the numbers I gave were for 4 threads of contention. The numbers are pretty much the same for 1 thread, and it gets worse from there as more threads are added.
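The gist itself isn't reproduced in this thread; as a rough stand-in, a contention benchmark of the kind described might look like the following (the structure, iteration count, and default thread count are assumptions, not the original code):

```rust
use std::env;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Number of contending threads, passed on the command line (default 4).
    let n: usize = env::args().nth(1).and_then(|s| s.parse().ok()).unwrap_or(4);
    let counter = Arc::new(Mutex::new(0u64));
    let iterations = 1_000_000 / n;

    let handles: Vec<_> = (0..n)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iterations {
                    // Every unlock can unblock a waiter, which in the libgreen
                    // scheme meant waking a remote scheduler via a syscall.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("total = {}", *counter.lock().unwrap());
}
```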

@derekchiang
Contributor

@alexcrichton do you have any suggestions regarding implementing the ring, ideally with a lock-free structure?

One approach would be to use the same SleepList structure (which uses an mpmc queue), except that instead of simply popping a scheduler and waking it up, we keep popping and re-pushing schedulers until we find a scheduler that is our neighbour, in which case we wake it up. A neighbour can just be defined as a scheduler whose id is our id +1 or -1.

The downside would be that waking up a scheduler now takes O(n). But since the total number of schedulers is probably small, this might be fine.
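A sketch of that pop-and-re-push search, using a Mutex-guarded deque as a stand-in for the lock-free MPMC queue backing the real SleepList; `SleeperHandle` and its `sched_id` field are hypothetical names:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Stand-in for the SleepList: the real one is a lock-free MPMC queue; a
/// Mutex-guarded deque is used here only to keep the sketch self-contained.
struct SleepList {
    handles: Mutex<VecDeque<SleeperHandle>>,
}

/// Hypothetical sleeper entry carrying the proposed scheduler id.
struct SleeperHandle {
    sched_id: usize,
}

impl SleepList {
    /// Pop and re-push entries until a neighbour (id ± 1) of `my_id` is
    /// found; O(n) in the number of sleeping schedulers.
    fn pop_neighbor(&self, my_id: usize) -> Option<SleeperHandle> {
        let mut queue = self.handles.lock().unwrap();
        for _ in 0..queue.len() {
            let handle = queue.pop_front()?;
            if handle.sched_id + 1 == my_id || handle.sched_id == my_id + 1 {
                return Some(handle);
            }
            queue.push_back(handle);
        }
        None
    }
}
```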

@derekchiang
Contributor

Well, I just realized schedulers don't have ids. So if the approach above were to be used, we would need to give schedulers ids as well. The implementation would be similar to pool_id.

@derekchiang
Contributor

Well this is very exciting indeed. I did this and, using the benchmark @alexcrichton provided, the time needed to run with 8 threads dropped from ~18 seconds to ~4 seconds on my machine :)

The code can be found here. I don't want to submit a PR just yet since there are obviously still many design issues to be sorted out.

@alexcrichton
Member Author

Those are some very promising results!

My idea for this would be to have some sort of handle to your left and right neighbors. This handle might be a SchedHandle, but it may also be some other handle. One of the great parts about only caring about the left and right neighbors is that you avoid any global state, so that's something I think we'd definitely want to move towards (not having a global sleeper list).

I don't think that the handles will be so much of a lock-free structure as a wakeup mechanism for schedulers. They'll probably have a uv async handle along with a flag as to whether the scheduler is sleeping or not. Note that the crucial aspect is that the flag is false while the scheduler is performing exponential backoff before sleeping, and it's true only when the scheduler has completely blocked.

Those are just some scattered thoughts, though.
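To make the handle idea concrete, here is a sketch of what such a neighbor handle might hold. The names are hypothetical and the uv async signalling is elided; in libgreen the wake-up would go through the async handle rather than the empty branch shown here:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical handle held by each scheduler for a left/right neighbor.
struct NeighborHandle {
    /// True only once the neighbor has fully blocked in epoll/libuv.
    /// While the neighbor is in its exponential-backoff phase this stays
    /// false, so no wake-up syscall is issued.
    asleep: Arc<AtomicBool>,
    // In libgreen this would also carry a uv async handle used to
    // actually interrupt the blocked event loop.
}

impl NeighborHandle {
    /// Wake the neighbor only if it is truly blocked; otherwise it is
    /// either running or spinning in backoff and will find the work itself.
    fn wake_if_sleeping(&self) {
        if self.asleep.swap(false, Ordering::SeqCst) {
            // expensive path: signal the async handle (elided)
        }
    }
}
```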

@derekchiang
Contributor

@alexcrichton Could you explain what you mean by exponential backoff of sleeping? Is it that, rather than directly going to sleep, a scheduler would first wait for, say, 1ms and try to run its current task again? If still blocked, it waits 2ms and tries again, then 4ms, 8ms, until it hits a threshold, at which point it actually goes to sleep?

Also, I imagine the sleep flag will need to be an atomic boolean?

@alexcrichton
Member Author

Right now, when a scheduler goes to sleep (executes this code), it will go back to libuv, and the OS thread will be blocked in epoll. Waking up a scheduler from epoll is pretty expensive, so we were thinking that a scheduler should exponentially back off in sleeping before hitting epoll.

The idea is that while doing this exponential backoff, the scheduler won't receive messages to wake up (it doesn't look like it's asleep). In doing so, when we have a high volume of serial work across many tasks, one scheduler will be doing all the work and fanning out the unblocked tasks to its neighbors. These neighbors would finish the work quickly, and then block the task again. These neighbor schedulers would never hit epoll and the main scheduler would never wake them up. The "I'm asleep" flag would indeed probably be an atomic, and it would be false during this backoff.

A complication is that I don't think that exponential backoff should happen while there are active I/O handles. Otherwise, this exponential backoff will starve any tasks which have an I/O completion callback to run.

The exact tuning of parameters is something I'm not quite sure about. The idea is something along the lines of wait for N units of time, then wait for 2N units of time, all the way up to 2^M * N units of time at which point you hit epoll. The values of N and M will likely need lots of tuning to get right.
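A sketch of that schedule, with `std::thread::sleep` standing in for the `libc::usleep` call mentioned later in the thread, and placeholder values for N and M:

```rust
use std::thread;
use std::time::Duration;

/// Tunable parameters: base wait N and number of doublings M. The values
/// here are placeholders, not tuned numbers.
const BASE_WAIT_US: u64 = 10; // N
const MAX_DOUBLINGS: u32 = 8; // M, so the longest single wait is 2^M * N

/// Returns true if work was found during backoff, false if the caller
/// should fall back to blocking in epoll/libuv and advertise itself on
/// the sleeper list.
fn backoff_then_sleep(has_work: impl Fn() -> bool) -> bool {
    for step in 0..=MAX_DOUBLINGS {
        if has_work() {
            return true; // picked up stolen work with no wake-up syscalls
        }
        // Plain sleep (libc::usleep in the original discussion); during
        // this phase the scheduler's "asleep" flag stays false.
        thread::sleep(Duration::from_micros(BASE_WAIT_US << step));
    }
    has_work()
}
```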

@derekchiang
Contributor

@alexcrichton I might be missing something obvious, but how do you tell a scheduler to "wait" for N units of time, without making it sleep?

Also, any thoughts on how we might detect whether there are active I/O handles?

@alexcrichton
Member Author

Ah sorry, I should have clarified. When in backoff, the scheduler just plain old sleeps. It doesn't hit epoll, it just calls the equivalent of libc::usleep (or something like that).

I wouldn't worry too much about the I/O handle use case right now. I had a patch for it a while back but it got pretty hairy. I'd worry first about getting the work stealing under control, and then we can worry about the I/O.

@derekchiang
Contributor

@alexcrichton could you tell me exactly how to make it sleep like you described? I can't find any sleep code in Rust that does not rely on the runtime itself.

@alexcrichton
Member Author

I would start out with libc::usleep.

@derekchiang
Contributor

Isn't that platform-dependent? Anyway, I tried using it to implement the exponential backoff and so far the results seem promising. The green benchmarks are beating the native benchmarks consistently by 2x or more when running 4 OS threads on my 4-core machine. Will experiment more.

@alexcrichton
Member Author

I think you'll have to find something different on windows (take a look at timer_win32.rs inside of libnative), but those numbers certainly sound promising!

@brson
Contributor

brson commented Feb 11, 2014

I have some concern about blocking sleeps for the backoff strategy since that prevents the reception of messages from other schedulers. This may not be a huge problem since the primary impact is adding a small delay to I/O requests that need to be homed.

@derekchiang
Contributor

@brson this actually turned out to be a real problem. I was getting a bug where some schedulers just wouldn't shut down; it turned out they didn't receive the SHUTDOWN message because they were sleeping. Once I disabled exponential backoff, everything worked again. @alexcrichton, any thoughts on how to deal with this?

@alexcrichton
Member Author

It depends on how you implement the exponential backoff. Schedulers should never infinitely sleep, they should stop sleeping and fall back to libuv after a "brief moment of time". When falling back to libuv, the async callback should trigger and the scheduler should get the shutdown message.
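One way to make that concrete is to keep each individual sleep short and poll the scheduler's message queue between sleeps, so a SHUTDOWN sent mid-backoff is still observed before falling back to libuv. A sketch under that assumption, where `MessageQueue` and `SchedMessage` are hypothetical stand-ins for the scheduler's real message plumbing:

```rust
use std::thread;
use std::time::Duration;

enum SchedMessage {
    Wake,
    Shutdown,
}

/// Hypothetical view of the scheduler's incoming message queue.
trait MessageQueue {
    fn try_recv(&self) -> Option<SchedMessage>;
}

/// Backoff loop that never sleeps unboundedly: between short sleeps it
/// drains pending messages, so a Shutdown sent mid-backoff is still seen.
fn backoff_with_messages<Q: MessageQueue>(queue: &Q) -> Option<SchedMessage> {
    let mut wait_us = 10u64;
    for _ in 0..8 {
        if let Some(msg) = queue.try_recv() {
            return Some(msg);
        }
        thread::sleep(Duration::from_micros(wait_us));
        wait_us *= 2;
    }
    // No message arrived during backoff: return to libuv/epoll, where the
    // async callback will deliver any later message.
    queue.try_recv()
}
```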

@derekchiang
Contributor

Yeah, that's what's confusing me a bit. I thought that when a scheduler wakes up, it would still get all the messages sent to it while it was sleeping, since the messages were simply put in a queue. But apparently that's not the case. Maybe I'm doing something wrong. I'm just going to go ahead and open a PR so that people can criticize my code.

bors added a commit that referenced this issue Feb 14, 2014
These commits pick off some low-hanging fruit which were slowing down spawning green threads. The major speedup comes from fixing a bug in stack caching where we never used any cached stacks!

The program I used to benchmark is at the end. It was compiled with `rustc --opt-level=3 bench.rs --test` and run as `RUST_THREADS=1 ./bench --bench`. I chose to use `RUST_THREADS=1` due to #11730 as the profiles I was getting interfered too much when all the schedulers were in play (and shouldn't be after #11730 is fixed). All of the units below are in ns/iter as reported by `--bench` (lower is better).

|               | green | native | raw    |
| ------------- | ----- | ------ | ------ |
| osx before    | 12699 | 24030  | 19734  |
| linux before  | 10223 | 125983 | 122647 |
| osx after     |  3847 | 25771  | 20835  |
| linux after   |  2631 | 135398 | 122765 |

Note that this is *not* a benchmark of spawning green tasks vs native tasks. I put in the native numbers just to get a ballpark of where green tasks are. This benchmark is *clearly* benefiting from stack caching. Also, OSX is clearly not 5x faster than linux; I think my VM is just much slower.

All in all, this ended up being a nice 4x speedup for spawning a green task when you're using a cached stack.

```rust
extern mod extra;
extern mod native;
use std::rt::thread::Thread;

#[bench]
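// Spawn a green task each iteration and wait for it to signal completion.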
fn green(bh: &mut extra::test::BenchHarness) {
    let (p, c) = SharedChan::new();
    bh.iter(|| {
        let c = c.clone();
        spawn(proc() {
            c.send(());
        });
        p.recv();
    });
}

#[bench]
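// Same shape, but spawning a 1:1 native task instead of a green one.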
fn native(bh: &mut extra::test::BenchHarness) {
    let (p, c) = SharedChan::new();
    bh.iter(|| {
        let c = c.clone();
        native::task::spawn(proc() {
            c.send(());
        });
        p.recv();
    });
}

#[bench]
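// Baseline: spawn and join a raw OS thread with no task runtime involved.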
fn raw(bh: &mut extra::test::BenchHarness) {
    bh.iter(|| {
        Thread::start(proc() {}).join()
    });
}
```
@toddaaro
Contributor

toddaaro commented Jul 6, 2014

What is the status of this issue? I like the fanout idea. It synergizes well with the "friend scheduler" concept already in there. When a scheduler has a task it isn't supposed to execute, the task is sent to the friend. Right now there is only one friend, but if we added a second and made an explicit ring, that could replace the sleeper list and also add fanout to the pinned-task use case. It probably still has the pathological case of a scheduler that is only allowed to run pinned tasks spawning many unpinned tasks, which are then all sent to the same friend.

@alexcrichton
Member Author

There has currently been no progress on this issue beyond brainstorming.

@toddaaro
Contributor

toddaaro commented Jul 6, 2014

I took a stab at getting rid of the sleeper list, and this code appears to pass the tests. Based very heavily on @derekchiang's work.

The shutdown strategy is really fishy though, and I'm fairly certain there is a bad race condition where shutting down a subset of the schedulers will leave the ring in a broken state. When a scheduler shuts down it notifies its neighbors that they have new neighbors, but if one of those neighbors is shutting down too, the receiver discards the new-neighbor request instead of forwarding it; it can't forward it because it has already given away its own neighbor handles. The still-extant neighbors then also get handles to the shutting-down schedulers as their neighbors, keeping them alive. It's a little more complicated still because I have the right neighbor responsible for updating the left neighbor, and I haven't worked out all the details of how this plays out.

Is the "shut down some of your schedulers and continue" usecase supported by libgreen?

The only dynamic-scheduler-set functionality I could find was for spawning schedulers that run a single task and exit. So I left spawn_sched schedulers out of the ring, as they don't need to be woken up that way. (I think?) Should spawn_sched schedulers join the ring to allow for dynamically scaling up the thread pool? I'm currently ignoring that use case because I haven't seen it used, which is questionable.

toddaaro@23a8057

(And warning, the code is messy/uncleaned)

@alexcrichton
Member Author

The current SchedPool interface allows insertion of a new scheduler, but it doesn't allow for removal. The reason for this is that there is no set point in time at which all I/O handles to a scheduler have gone away, and the scheduler needs to stay alive as long as there are I/O handles.

It would be a little sad to add another reason that schedulers couldn't be shut down, but that's not necessarily the end of the world. I do think, however, that adding a new scheduler should insert it into the ring of schedulers rather than putting it on the side.

@toddaaro
Contributor

toddaaro commented Jul 7, 2014

@alexcrichton seems sensible, I'll think about how to get it into the ring. That is actually fairly tricky to get right due to the concurrency issues with additions; if anyone has protocol ideas, that would be useful.

@toddaaro
Contributor

toddaaro commented Jul 7, 2014

Storing the neighbor handles in a concurrent doubly-linked list would be nice; then adding a scheduler is just an insert. It eliminates the need to store neighbors in the scheduler struct directly, and makes shutdown easy since it would just be a remove.

Adding schedulers is already such a heavy operation that implementing such a structure by just putting it behind a lock might not be bad at all.

@alexcrichton
Member Author

Yes, I expect modifications to the scheduler ring to be quite rare. Additionally, scheduler management only ever happens through the SchedPool type, which takes all of its methods as &mut self, so you're guaranteed that only one modification is happening at a time.
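Under that guarantee, the pool-side bookkeeping needs no extra locking. A sketch of what that might look like, assuming (unlike the handle-passing design above) that neighbor relationships are derived from adjacency in a vector owned by the pool; all names are hypothetical:

```rust
/// Hypothetical wake-up handle for one scheduler in the ring.
struct RingHandle {
    sched_id: usize,
}

/// Sketch of the pool-side view of the ring: because all scheduler
/// management goes through &mut self, plain Vec operations are enough.
struct PoolRing {
    handles: Vec<RingHandle>,
}

impl PoolRing {
    /// Insert a newly spawned scheduler into the ring; its neighbors are
    /// simply the adjacent entries, so no handle re-wiring protocol is
    /// needed beyond updating this vector.
    fn insert(&mut self, handle: RingHandle) {
        self.handles.push(handle);
    }

    /// Remove a scheduler from the ring on shutdown; the schedulers on
    /// either side become each other's neighbors automatically.
    fn remove(&mut self, sched_id: usize) {
        self.handles.retain(|h| h.sched_id != sched_id);
    }
}
```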

@toddaaro
Contributor

toddaaro commented Jul 7, 2014

Ah, that actually makes it fairly easy, I hadn't caught that detail yet.

@toddaaro
Contributor

I've thought about the scheduler ring some more, and making it work well is tricky. There seem to be many situations in which a workload won't actually fan out properly, and I don't think it is reasonable for a user to think about how to avoid these cases.

A simple one is if there is one task spawning all the work, for example a listening socket that spawns a worker for each incoming connection. As work comes in and tasks are spawned, this scheduler would wake up both its neighbors so they could process work. It could be the case, though, that these tasks don't ever need to spawn additional tasks, so they would never wake up their neighbors. As a result there would only ever be 3 schedulers working in this system.

I think making the ring work requires a fallback of some sort to ensure that proper fanout does happen. I'm not sure what exactly would be best for that.

@alexcrichton
Member Author

I vaguely remember also having a concern here and ending up coming up with a fix for it, but I don't remember what that was (if it happened) at this point. We could possibly tweak the number of tasks that are stolen to help alleviate it, or possibly just implement the exponential backoff now to prevent hitting epoll.

@thestinger
Contributor

#17325 means this is no longer relevant to the standard libraries.
