
Refactor occupancy #7030

Closed · fjetter wants to merge 14 commits
Conversation

fjetter (Member) commented Sep 12, 2022

This is an implementation of the suggestion in #7027

Pros

  • Occupancy is updated in real time; there is no longer a delay caused by the reevaluate_occupancy callback (a sketch of the idea follows after this description).
  • We no longer need to perform the very expensive reevaluate_occupancy sweep.
  • This has an important impact, particularly for work stealing, tasks with unknown durations, and rapid upscaling scenarios.

Cons

  • Our ability to detect tasks with unusual runtimes is slightly degraded. Previously, such outliers were detected during a reevaluation cycle based on the executing-time durations submitted by the heartbeat, and we dealt with each outlier individually. Since everything is now based on prefixes, detecting such an outlier affects the entire prefix. I believe we could make this logic smarter, but I don't know how common this case actually is.
  • A slight stealing regression for extremely fast keys, see Refactor occupancy #7030 (comment). This is technically also a problem for less extreme cases, but since steal_time_ratio is essentially a performance-optimized sort, for most keys only the "steal priority" is affected. Extremely fast keys, however, may not be stolen at all, even if we later detect that they are not that fast after all.

Benchmarks: pending. Early results do not show a negative impact on scheduler performance.
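To make the core idea concrete, here is a minimal sketch of prefix-based, incrementally maintained occupancy. The names (PrefixStats, WorkerOccupancy, observe) are invented for illustration; this shows the approach described above, not the PR's actual implementation.

```python
from collections import defaultdict


class PrefixStats:
    """Running per-task duration estimate for one task prefix (illustrative only)."""

    def __init__(self, default_duration: float = 0.5):
        self.duration = default_duration

    def observe(self, measured: float) -> None:
        # Exponential moving average; the real scheduler uses its own heuristics.
        self.duration = 0.5 * self.duration + 0.5 * measured


class WorkerOccupancy:
    """Counts how many processing tasks of each prefix a worker holds, so
    occupancy can be derived cheaply on demand instead of being refreshed
    by a periodic reevaluate_occupancy sweep."""

    def __init__(self, prefixes: dict[str, PrefixStats]):
        self.prefixes = prefixes
        self.task_counts: defaultdict[str, int] = defaultdict(int)

    def add_task(self, prefix: str) -> None:
        self.task_counts[prefix] += 1

    def remove_task(self, prefix: str) -> None:
        self.task_counts[prefix] -= 1
        if not self.task_counts[prefix]:
            del self.task_counts[prefix]

    @property
    def occupancy(self) -> float:
        # A drift in a prefix's duration estimate is reflected immediately,
        # with no reevaluation delay.
        return sum(
            count * self.prefixes[prefix].duration
            for prefix, count in self.task_counts.items()
        )
```

The trade-off noted in the cons list falls out of this structure: an outlier measurement feeds into the prefix-wide estimate rather than being handled per task.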

github-actions bot (Contributor) commented Sep 12, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files   ±0        15 suites   ±0    6h 30m 45s ⏱️ +41m 46s
  3 093 tests  −14     2 992 ✔️  −17       88 💤  −9      13 ❌  +12
 22 885 runs  −126    21 884 ✔️ −105      929 💤 −92      72 ❌  +71

For more details on these failures, see this check.

Results for commit a98049c. ± Comparison against base commit e892d0b.

♻️ This comment has been updated with latest results.

fjetter (Member Author) commented Sep 13, 2022

Very early preliminary results:

  • As already suspected, redefining total_occupancy to sum over all workers is too expensive; it is called too often, mostly in check_idle_saturated. This has to be fixed, e.g. by maintaining the total incrementally (see the sketch after this list).
  • Otherwise, occupancy does not show up in a (dask) server profile.
  • Stealing is completely out of control; that is already indicated by tests. If I disable stealing, performance looks good.
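Regarding the first bullet, here is a hedged sketch of one way to avoid re-summing over all workers: keep a cached scheduler-wide total and apply only deltas when a worker's occupancy changes. The class and method names are assumptions for illustration, not the PR's API.

```python
class TotalOccupancyCache:
    """Maintains a running scheduler-wide total so hot paths such as
    idle/saturated checks never have to iterate over all workers."""

    def __init__(self) -> None:
        self._per_worker: dict[str, float] = {}
        self._total: float = 0.0

    def update_worker(self, addr: str, new_occupancy: float) -> None:
        # Apply only the delta instead of recomputing the full sum.
        old = self._per_worker.get(addr, 0.0)
        self._per_worker[addr] = new_occupancy
        self._total += new_occupancy - old

    def remove_worker(self, addr: str) -> None:
        self._total -= self._per_worker.pop(addr, 0.0)

    @property
    def total_occupancy(self) -> float:
        return self._total
```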

Comment on lines 477 to 479
duration = self.scheduler.get_task_duration(
    ts
) + self.scheduler.get_comm_cost(ts, ts.processing_on)

fjetter (Member Author):
All the stealing fixes here are preliminary. I suspect we want to get #7026 done first

Comment on lines +285 to +286
# TODO: occupancy no longer concats linearily so we can't easily
# assume that the network cost would go down by that much

fjetter (Member Author):
In different terms: "occupancy by task" is no longer constant, so we would need to recompute it whenever it should feed into a decision (see the sketch below).
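For illustration, such a decision-time recomputation could be built on the scheduler methods shown in the excerpt above; current_task_cost is a hypothetical helper, not an existing API.

```python
def current_task_cost(scheduler, ts) -> float:
    """Recompute a task's cost estimate at decision time instead of reusing a
    stored per-task occupancy, since the prefix-based duration estimate may
    have drifted since the task was scheduled. Hypothetical helper."""
    worker = ts.processing_on
    if worker is None:
        return 0.0
    return scheduler.get_task_duration(ts) + scheduler.get_comm_cost(ts, worker)
```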

Comment on lines +1295 to 1297
@pytest.mark.skip("executing heartbeats not considered yet")
@gen_cluster(client=True, nthreads=[("127.0.0.1", 1)] * 3)
async def test_correct_bad_time_estimate(c, s, *workers):

fjetter (Member Author):
This is the one piece of functionality I haven't been able to restore so far. The problem is that on every reevaluate_occupancy we did not merely reevaluate the occupancy: if we detected a significant shift in occupancy, we also recalculated the steal time ratio for all tasks in processing.
With this PR there is no place left to reevaluate occupancy, so this is no longer possible. A more natural approach would be to recalculate tasks whenever a task group/prefix duration drifts, but we would need to track the tasks of a task group to make this work (see the sketch after this comment).
I'm currently not fully convinced that this is worth doing, particularly since it only affects tasks with large network transfers and small occupancy. As it stands right now, this would only affect tasks with a transfer-time-to-occupancy ratio of more than 257, which is typically only possible for lightning-fast tasks anyway.

Before engaging on this I would like to get #7026 or a version of it done
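A sketch of the drift-based recalculation idea described above, assuming we tracked which processing tasks belong to each prefix. All names, including the recalculate callback, are illustrative rather than the stealing extension's actual API.

```python
class DriftWatcher:
    """When a prefix's duration estimate drifts by more than `threshold`x,
    refresh the steal-time ratio of every processing task in that prefix."""

    def __init__(self, recalculate, threshold: float = 2.0):
        self.recalculate = recalculate          # callback: task -> None
        self.threshold = threshold
        self._last_seen: dict[str, float] = {}  # prefix -> duration used last time
        self._processing: dict[str, set] = {}   # prefix -> tasks currently processing

    def track(self, prefix: str, task) -> None:
        self._processing.setdefault(prefix, set()).add(task)

    def untrack(self, prefix: str, task) -> None:
        self._processing.get(prefix, set()).discard(task)

    def on_duration_update(self, prefix: str, new_duration: float) -> None:
        old = self._last_seen.get(prefix)
        if old and new_duration and max(new_duration / old, old / new_duration) > self.threshold:
            for task in self._processing.get(prefix, ()):
                self.recalculate(task)  # e.g. recompute its steal-time ratio
        self._last_seen[prefix] = new_duration
```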

@fjetter fjetter marked this pull request as ready for review September 15, 2022 13:21
hendrikmakait (Member) left a comment:

I like this general change quite a bit! It should give us more useful occupancy estimates, and the way reevaluate_occupancy worked was messy. I have some nits and a concern regarding the handling of adding/removing replicas. Apart from that, I'd love to see an A/B test for this, since it's hard to judge whether it has a negative impact on runtimes. The regression in https://github.com/dask/distributed/pull/7030/files#r971979808 feels fine, and we should be able to find good ways of tackling it should the need arise.

@@ -497,6 +492,12 @@ class WorkerState:
# The unique server ID this WorkerState is referencing
server_id: str

# Reference to scheduler task_groups
scheduler_ref: weakref.ref[SchedulerState] | None
task_groups_count: dict[str, int]

hendrikmakait (Member):
Suggested change
task_groups_count: dict[str, int]
task_groups_count: defaultdict[str, int]

@@ -464,6 +460,7 @@ def maybe_move_task(
if (
ts not in self.key_stealable
or ts.processing_on is not victim
or not ts.processing_on

hendrikmakait (Member):
Isn't this condition implicit in ts.processing_on is not victim?

assert self.scheduler_ref and (scheduler := self.scheduler_ref())
nbytes = ts.get_nbytes()
if ts in self.needs_what:
del self.needs_what[ts]

hendrikmakait (Member):
I think there might be an issue with the removal of self.needs_what[ts] here and only incrementing it by one on remove_replica.


fjetter (Member Author):
needs_what is a counter of how many tasks assigned to this worker require a particular key. As soon as we call add_replica, this counter drops to zero immediately.

The keys of needs_what are disjoint from those of has_what.
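A toy model of the semantics described above, with invented names: needs_what counts pending requirements for keys the worker does not yet hold, so its keys stay disjoint from has_what. The real WorkerState tracks TaskState objects and byte counts and does more bookkeeping around replica removal; this sketch only illustrates the disjointness invariant.

```python
class ReplicaBookkeeping:
    """Illustrative model of needs_what/has_what bookkeeping (not the real API)."""

    def __init__(self) -> None:
        self.has_what: set[str] = set()
        self.needs_what: dict[str, int] = {}

    def assign_task_needing(self, key: str) -> None:
        # A task scheduled on this worker needs `key`, which is not held yet.
        if key not in self.has_what:
            self.needs_what[key] = self.needs_what.get(key, 0) + 1

    def add_replica(self, key: str) -> None:
        # Once the worker holds the key, nothing "needs" it any more:
        # the counter disappears entirely rather than being decremented.
        self.needs_what.pop(key, None)
        self.has_what.add(key)

    def remove_replica(self, key: str) -> None:
        self.has_what.discard(key)
```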

@hendrikmakait hendrikmakait self-assigned this Sep 27, 2022
@hendrikmakait hendrikmakait mentioned this pull request Sep 27, 2022
@fjetter fjetter closed this Oct 7, 2022