
Refactor occupancy #7030

Closed · fjetter wants to merge 14 commits
Conversation

fjetter (Member) commented Sep 12, 2022

This is an implementation of the suggestion in #7027

Pros

  • Occupancy is updated in real time; there is no longer a delay caused by the reevaluate_occupancy callback (a sketch of the idea follows after this description).
  • We no longer need to perform the very expensive reevaluate_occupancy sweep.
  • This has an important impact, particularly for work stealing, tasks with unknown durations, and rapid upscaling scenarios.

Cons

  • Our ability to detect tasks with unusual runtimes is slightly degraded. Previously, such outliers were detected during a reevaluation cycle based on the executing-time durations submitted by the heartbeat, and we dealt with each outlier individually. Since everything is now based on prefixes, detecting such an outlier affects the entire prefix. I believe we could make this logic smarter, but I don't know how common this case actually is.
  • A slight stealing regression for extremely fast keys, see Refactor occupancy #7030 (comment). This is technically also a problem for less extreme cases, but since steal_time_ratio is essentially a performance-optimized sort, for most keys only the "steal priority" is affected. Extremely fast keys, however, may not be stolen at all, even if we later detect that they are not that fast after all.

Benchmarks: pending. Early results do not show a negative impact on scheduler performance.
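To make the core idea concrete, here is a minimal sketch of prefix-based, incrementally maintained occupancy. The names (PrefixStats, WorkerOccupancy, observe) are invented for illustration; this shows the approach described above, not the PR's actual implementation.

```python
from collections import defaultdict


class PrefixStats:
    """Running per-task duration estimate for one task prefix (illustrative only)."""

    def __init__(self, default_duration: float = 0.5):
        self.duration = default_duration

    def observe(self, measured: float) -> None:
        # Exponential moving average; the real scheduler uses its own heuristics.
        self.duration = 0.5 * self.duration + 0.5 * measured


class WorkerOccupancy:
    """Counts how many processing tasks of each prefix a worker holds, so
    occupancy can be derived cheaply on demand instead of being refreshed
    by a periodic reevaluate_occupancy sweep."""

    def __init__(self, prefixes: dict[str, PrefixStats]):
        self.prefixes = prefixes
        self.task_counts: defaultdict[str, int] = defaultdict(int)

    def add_task(self, prefix: str) -> None:
        self.task_counts[prefix] += 1

    def remove_task(self, prefix: str) -> None:
        self.task_counts[prefix] -= 1
        if not self.task_counts[prefix]:
            del self.task_counts[prefix]

    @property
    def occupancy(self) -> float:
        # A drift in a prefix's duration estimate is reflected immediately,
        # with no reevaluation delay.
        return sum(
            count * self.prefixes[prefix].duration
            for prefix, count in self.task_counts.items()
        )
```

The trade-off noted in the cons list falls out of this structure: an outlier measurement feeds into the prefix-wide estimate rather than being handled per task.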

github-actions bot (Contributor) commented Sep 12, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files   ±0        15 suites   ±0    6h 30m 45s ⏱️ +41m 46s
  3 093 tests  −14     2 992 ✔️  −17       88 💤  −9      13 ❌  +12
 22 885 runs  −126    21 884 ✔️ −105      929 💤 −92      72 ❌  +71

For more details on these failures, see this check.

Results for commit a98049c. ± Comparison against base commit e892d0b.

♻️ This comment has been updated with latest results.

fjetter (Member Author) commented Sep 13, 2022

Very early preliminary results:

  • As already suspected, redefining total_occupancy to sum over all workers is too expensive; it is called too often, mostly in check_idle_saturated. This has to be fixed, e.g. by maintaining the total incrementally (see the sketch after this list).
  • Otherwise, occupancy does not show up in a (dask) server profile.
  • Stealing is completely out of control; that is already indicated by tests. If I disable stealing, performance looks good.
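Regarding the first bullet, here is a hedged sketch of one way to avoid re-summing over all workers: keep a cached scheduler-wide total and apply only deltas when a worker's occupancy changes. The class and method names are assumptions for illustration, not the PR's API.

```python
class TotalOccupancyCache:
    """Maintains a running scheduler-wide total so hot paths such as
    idle/saturated checks never have to iterate over all workers."""

    def __init__(self) -> None:
        self._per_worker: dict[str, float] = {}
        self._total: float = 0.0

    def update_worker(self, addr: str, new_occupancy: float) -> None:
        # Apply only the delta instead of recomputing the full sum.
        old = self._per_worker.get(addr, 0.0)
        self._per_worker[addr] = new_occupancy
        self._total += new_occupancy - old

    def remove_worker(self, addr: str) -> None:
        self._total -= self._per_worker.pop(addr, 0.0)

    @property
    def total_occupancy(self) -> float:
        return self._total
```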

Comment on lines 477 to 479
duration = self.scheduler.get_task_duration(
    ts
) + self.scheduler.get_comm_cost(ts, ts.processing_on)

fjetter (Member Author):
All the stealing fixes here are preliminary. I suspect we want to get #7026 done first

Comment on lines +285 to +286
# TODO: occupancy no longer concats linearily so we can't easily
# assume that the network cost would go down by that much

fjetter (Member Author):
In different terms: "occupancy by task" is no longer constant, so we would need to recompute it whenever it should feed into a decision (see the sketch below).
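For illustration, such a decision-time recomputation could be built on the scheduler methods shown in the excerpt above; current_task_cost is a hypothetical helper, not an existing API.

```python
def current_task_cost(scheduler, ts) -> float:
    """Recompute a task's cost estimate at decision time instead of reusing a
    stored per-task occupancy, since the prefix-based duration estimate may
    have drifted since the task was scheduled. Hypothetical helper."""
    worker = ts.processing_on
    if worker is None:
        return 0.0
    return scheduler.get_task_duration(ts) + scheduler.get_comm_cost(ts, worker)
```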

Comment on lines +1295 to 1297
@pytest.mark.skip("executing heartbeats not considered yet")
@gen_cluster(client=True, nthreads=[("127.0.0.1", 1)] * 3)
async def test_correct_bad_time_estimate(c, s, *workers):

fjetter (Member Author):
This is the one piece of functionality I haven't been able to restore so far. The problem is that on every reevaluate_occupancy we did not merely reevaluate the occupancy: if we detected a significant shift in occupancy, we also recalculated the steal time ratio for all tasks in processing.
With this PR there is no place left to reevaluate occupancy, so this is no longer possible. A more natural approach would be to recalculate tasks whenever a task group/prefix duration drifts, but we would need to track the tasks of a task group to make this work (see the sketch after this comment).
I'm currently not fully convinced that this is worth doing, particularly since it only affects tasks with large network transfers and small occupancy. As it stands right now, this would only affect tasks with a transfer-time-to-occupancy ratio of more than 257, which is typically only possible for lightning-fast tasks anyway.

Before engaging on this I would like to get #7026 or a version of it done
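A sketch of the drift-based recalculation idea described above, assuming we tracked which processing tasks belong to each prefix. All names, including the recalculate callback, are illustrative rather than the stealing extension's actual API.

```python
class DriftWatcher:
    """When a prefix's duration estimate drifts by more than `threshold`x,
    refresh the steal-time ratio of every processing task in that prefix."""

    def __init__(self, recalculate, threshold: float = 2.0):
        self.recalculate = recalculate          # callback: task -> None
        self.threshold = threshold
        self._last_seen: dict[str, float] = {}  # prefix -> duration used last time
        self._processing: dict[str, set] = {}   # prefix -> tasks currently processing

    def track(self, prefix: str, task) -> None:
        self._processing.setdefault(prefix, set()).add(task)

    def untrack(self, prefix: str, task) -> None:
        self._processing.get(prefix, set()).discard(task)

    def on_duration_update(self, prefix: str, new_duration: float) -> None:
        old = self._last_seen.get(prefix)
        if old and new_duration and max(new_duration / old, old / new_duration) > self.threshold:
            for task in self._processing.get(prefix, ()):
                self.recalculate(task)  # e.g. recompute its steal-time ratio
        self._last_seen[prefix] = new_duration
```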

@fjetter fjetter marked this pull request as ready for review September 15, 2022 13:21
hendrikmakait (Member) left a comment:

I like this general change quite a bit! It should give us more useful occupancy estimates, and the way reevaluate_occupancy worked was messy. I have some nits and a concern regarding the handling of adding/removing replicas. Apart from that, I'd love to see an A/B test for this, since it's hard to judge whether it has a negative impact on runtimes. The regression in https://github.com/dask/distributed/pull/7030/files#r971979808 feels fine, and we should be able to find good ways of tackling it should the need arise.

@@ -497,6 +492,12 @@ class WorkerState:
# The unique server ID this WorkerState is referencing
server_id: str

# Reference to scheduler task_groups
scheduler_ref: weakref.ref[SchedulerState] | None
task_groups_count: dict[str, int]

hendrikmakait (Member):
Suggested change
task_groups_count: dict[str, int]
task_groups_count: defaultdict[str, int]

@@ -464,6 +460,7 @@ def maybe_move_task(
if (
ts not in self.key_stealable
or ts.processing_on is not victim
or not ts.processing_on

hendrikmakait (Member):
Isn't this condition implicit in ts.processing_on is not victim?

assert self.scheduler_ref and (scheduler := self.scheduler_ref())
nbytes = ts.get_nbytes()
if ts in self.needs_what:
del self.needs_what[ts]

hendrikmakait (Member):
I think there might be an issue with the removal of self.needs_what[ts] here and only incrementing it by one on remove_replica.


fjetter (Member Author):
needs_what is a counter of how many tasks assigned to this worker require a particular key. As soon as we call add_replica, this counter drops to zero immediately.

The keys of needs_what are disjoint from those of has_what.
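A toy model of the semantics described above, with invented names: needs_what counts pending requirements for keys the worker does not yet hold, so its keys stay disjoint from has_what. The real WorkerState tracks TaskState objects and byte counts and does more bookkeeping around replica removal; this sketch only illustrates the disjointness invariant.

```python
class ReplicaBookkeeping:
    """Illustrative model of needs_what/has_what bookkeeping (not the real API)."""

    def __init__(self) -> None:
        self.has_what: set[str] = set()
        self.needs_what: dict[str, int] = {}

    def assign_task_needing(self, key: str) -> None:
        # A task scheduled on this worker needs `key`, which is not held yet.
        if key not in self.has_what:
            self.needs_what[key] = self.needs_what.get(key, 0) + 1

    def add_replica(self, key: str) -> None:
        # Once the worker holds the key, nothing "needs" it any more:
        # the counter disappears entirely rather than being decremented.
        self.needs_what.pop(key, None)
        self.has_what.add(key)

    def remove_replica(self, key: str) -> None:
        self.has_what.discard(key)
```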

@hendrikmakait hendrikmakait self-assigned this Sep 27, 2022
@hendrikmakait hendrikmakait mentioned this pull request Sep 27, 2022
@fjetter fjetter closed this Oct 7, 2022