Scheduling using stateless co-groups algorithm #7394

gjoseph92 · 2022-12-13T02:12:03Z

This is a rough, minimal implementation that uses the stateless co-assignment algorithm from dask/dask#9755 for scheduling while queuing is active, in order to evaluate #7298 (see that issue for more general takeaways and discussion).

One nice thing here is a (nearly) static definition of is_rootish. At least, it no longer depends on cluster size or task group size.

To preserve co-assignment, we submit all root-ish tasks in a co-group to a worker at once, even if that oversaturates the worker. Unfortunately, the algorithm tends to make groups that are too large, so we can end up assigning far too many tasks at once, causing root task oversaturation. To avoid this, we try to ignore cogroups that look "too big" by a very rough heuristic. Of course, that also means we lose co-assignment for those groups.
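As a rough illustration of that trade-off (not the code in this PR), the sketch below assigns a whole co-group to the least-loaded worker and gives up on co-assignment when the group looks too big. The `Worker` class, `assign_cogroup`, and the `max_per_thread` threshold are all hypothetical names chosen for the example; the actual size check in the PR is only described as a very rough heuristic.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical stand-in for a scheduler-side worker record.
    name: str
    nthreads: int
    processing: set = field(default_factory=set)


def assign_cogroup(cogroup, workers, max_per_thread=4):
    """Assign every task in `cogroup` to one worker, or return None.

    `max_per_thread` is an assumed threshold: a co-group with more than
    `max_per_thread * nthreads` tasks is treated as "too big" and skipped.
    """
    # Pick the worker with the fewest tasks per thread.
    target = min(workers, key=lambda w: len(w.processing) / w.nthreads)
    if len(cogroup) > max_per_thread * target.nthreads:
        # Too big: the caller would schedule these tasks individually,
        # which loses co-assignment for this group.
        return None
    # Submit the whole group at once, even if that exceeds nthreads.
    target.processing.update(cogroup)
    return {task: target.name for task in cogroup}


workers = [Worker("a", nthreads=2), Worker("b", nthreads=2, processing={"x"})]
small = assign_cogroup({"t1", "t2", "t3"}, workers)
big = assign_cogroup({f"u{i}" for i in range(20)}, workers)
```

Here the three-task group lands entirely on worker `a` even though it only has two threads, while the twenty-task group is rejected and would fall back to ordinary scheduling.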

Another issue is that the algorithm likes to co-group tasks like split-shuffle or rechunk-split. That's the opposite of what we'd want, since it's critical to run those on the same worker as the input task rather than transfer the large input.

(Figure: grouping of a task-based shuffle)

We work around that with a little hack in is_rootish that skips tasks with dependencies that aren't tiny.
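A minimal sketch of what such a check could look like, assuming each task records an estimated output size for its dependencies. The `Task` class, the `in_cogroup` flag, and the `TINY_BYTES` threshold are hypothetical and only illustrate the idea; they are not the scheduler's actual data model.

```python
from dataclasses import dataclass, field

TINY_BYTES = 1_000  # assumed cutoff for a "tiny" dependency


@dataclass
class Task:
    # Hypothetical stand-in for a scheduler task record.
    key: str
    nbytes: int = 0  # estimated size of this task's output
    dependencies: list = field(default_factory=list)
    in_cogroup: bool = False


def is_rootish(task):
    """Treat a task as root-ish only if it is in a co-group and
    none of its dependencies are large."""
    if not task.in_cogroup:
        return False
    # A task with a large input should run where that input lives,
    # so it is not root-ish even if the algorithm co-grouped it.
    return all(dep.nbytes <= TINY_BYTES for dep in task.dependencies)


read = Task("read-0", nbytes=100_000_000, in_cogroup=True)
split = Task("split-shuffle-0", dependencies=[read], in_cogroup=True)
arg = Task("arg", nbytes=8)
load = Task("load-0", dependencies=[arg], in_cogroup=True)
```

With these inputs, `read` and `load` count as root-ish, while `split` does not because its input is large.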

Overall

If the cogroup algorithm were more predictable and could guarantee it wouldn't make groups that are too large, this would be pretty reasonable and not too invasive to add.

Opening this as a PR just for posterity and future reference; not planning to merge (and please don't review).

obviously profoundly slow; want to figure out what logic works before we think about what to pre-compute. note that I doubt this works at all without being able to oversaturate workers with a family. otherwise, with single-threaded workers, we'll just keep jumping along to a new worker for every task.

need to figure out good strategies/helpers for assertions. currently not trying to abstract too much before we know what we need. would like tests to be relatively flexible / insensitive to changes in cogrouping behavior. just assert no transfers and even load. (even load is harder; haven't figured that out yet.)

gjoseph92 added 20 commits December 5, 2022 13:29

pass dependencies into dask order

f97826a

store cogroups on scheduler

5c3d81d

static is_rootish based on cogroups

ba846ad

Closes dask#7274

hack: allow oversaturation

796eaf5

this probably isn't quite the right way to do it; only works if root tasks really are in priority order (which I guess they are??)?

linear chains off a priority jump aren't rootish

a05ee75

hoping this makes `getitem`s in `test_anom_mean` not be queued anymore? (not actually sure if it matters that they were queued though, fwiw)

less busy worker as tiebreaker

25e1f1b

Revert "hack: allow oversaturation"

b59df80

This reverts commit 796eaf5.

allow oversaturation only within cogroup

c851491

plugin didn't work; new strategy

99e8a27

adding replicas doesn't transition state, of course.

switch cogroups to set for fast removal

285a116

`_remove_key` could take a ton of time. list wasn't necessary; we weren't using ordering for anything.
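For context, removing an element from a Python list is a linear scan, while removing from a set is constant time on average. The sketch below shows the set version; `remove_key` here is a hypothetical stand-in, not the `_remove_key` from this commit.

```python
def remove_key(cogroup: set, key: str) -> None:
    # Average O(1), versus O(n) for list.remove; discard() also
    # does nothing if the key is already gone.
    cogroup.discard(key)


cogroup = {"x-0", "x-1", "x-2"}
remove_key(cogroup, "x-1")
remove_key(cogroup, "missing")  # no error
```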

WIP

dd43c21

WIP eagerly release tasks

7b0abf9

Revert "WIP eagerly release tasks"

9647c9e

This reverts commit 7b0abf9.

Add transfer_outgoing_bytes_total metric

928f91f

Closes dask#7123

add test_basic_sum

6d7544b

HACK rootish fix

3d24655

test_double_diff_store asserts something (fails)

d4f3241

big mess

rough unreasonable cogroup size check

b6c76b3

gjoseph92 closed this Dec 13, 2022

gjoseph92 mentioned this pull request Dec 14, 2022

Validate stateless co-assignment algorithm #7298

Closed
