
Track CPU and network occupancy separately #7020

Closed · gjoseph92 wants to merge 12 commits

Conversation

gjoseph92 (Collaborator) commented:

Closes #7004.

Step towards #7003. This notes, but does not fix, the places where we're currently double-counting occupancy. First we should get this PR in and verify it doesn't change any behavior. Then, we can benchmark how a few lines of changes fixing the double-counting affects performance. At the same time, we can update the dashboard to display both measures separately.

  • Tests added / passed
  • Passes pre-commit run --all-files

We may end up wanting to move away from the dataclass if we don't like the breaking API change, or if the many places that would have to change become too onerous
Circular imports from work-stealing
github-actions bot (Contributor) commented Sep 8, 2022:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0        15 suites  ±0    6h 19m 55s ⏱️ +9m 12s
  3 102 tests +1     3 016 ✔️ +2     85 💤 ±0    1 ❌ −1
 22 961 runs  +7    22 052 ✔️ +6    906 💤 ±0    3 ❌ +1

For more details on these failures, see this check.

Results for commit f3de3b9. ± Comparison against base commit 1fd07f0.

♻️ This comment has been updated with latest results.

gjoseph92 mentioned this pull request Sep 8, 2022
old = ws.processing.get(ts, Occupancy(0, 0))
ws.processing[ts] = total_duration
self.total_occupancy += total_duration - old
ws.occupancy += total_duration - old
Member commented:

nit: We're calling this method a lot. This will always initialize the empty class instance

gjoseph92 (Collaborator, Author) replied:

Yeah, maybe worth a separate branch for the case that the key isn't there? Or allowing Occupancy to be incremented and decremented by plain ints?
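For concreteness, a dataclass along these lines could support both the `.get(ts, Occupancy(0, 0))` default and in-place increments. This is a hedged sketch: the field names (`cpu`, `network`) and the methods shown are assumptions for illustration, not the PR's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Occupancy:
    """Sketch: expected runtime split into CPU time and network-transfer time."""

    cpu: float
    network: float

    @property
    def total(self) -> float:
        return self.cpu + self.network

    def __iadd__(self, other: "Occupancy") -> "Occupancy":
        # Mutate in place rather than allocating a new instance each time.
        self.cpu += other.cpu
        self.network += other.network
        return self

    def __sub__(self, other: "Occupancy") -> "Occupancy":
        return Occupancy(self.cpu - other.cpu, self.network - other.network)
```

A separate branch for the missing-key case (`old = ws.processing.get(ts)` followed by an `if old is None` check) would avoid allocating the `Occupancy(0, 0)` default on every call, at the cost of slightly more code.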

- queued_occupancy = len(self.queued) * self.UNKNOWN_TASK_DURATION
+ queued_occupancy: float = len(self.queued) * self.UNKNOWN_TASK_DURATION
  # TODO: threads per worker
  # TODO: don't include network occupancy?
Member commented:

Interesting question (way out of scope): Should high network occupancy maybe even act as a suppressing factor?

fjetter (Member) left a review:

Love it. Conceptually I like the approach of using a dataclass, but I'm mildly concerned about performance since we're creating/destroying instances of this a lot.
If that's a problem, a named tuple might be better suited. The update mechanics would obviously be messier.

Comment on lines -357 to +361

- victim.occupancy = 0
+ victim.occupancy.clear()
Member commented:

Coverage complains but on main it is covered. This is very likely an untested path and we're just hitting it implicitly.

gjoseph92 (Collaborator, Author) commented:

> I'm mildly concerned about performance since we're creating/destroying instances of this a lot.
> If that's a problem a named tuple might be better suited

Also a little concerned about this. I started with a namedtuple, but then every time we update self.total_occupancy or ws.occupancy (happens very often), we're creating a new object (since tuples are immutable) instead of mutating the existing one.

I haven't actually microbenchmarked the performance of any of this though.

Another option could be a 2-element NumPy record array. That would also save the pointer chasing to the integers.
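One way the record-array idea could look, purely as an illustration (the dtype layout and field names here are assumptions, not anything from the PR): both floats live in a single contiguous buffer, so field updates write directly into it instead of chasing pointers to separate Python float objects.

```python
import numpy as np

# Hypothetical layout: a 0-d structured array holding both components.
occ_dtype = np.dtype([("cpu", "f8"), ("network", "f8")])

occupancy = np.array((0.0, 0.0), dtype=occ_dtype)
occupancy["cpu"] += 1.5       # writes into the shared buffer in place
occupancy["network"] += 0.5
total = float(occupancy["cpu"]) + float(occupancy["network"])
```

Whether this beats a plain dataclass would need microbenchmarking too, since each field access still goes through NumPy's indexing machinery.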

gjoseph92 (Collaborator, Author) commented:

In [5]: %%timeit
   ...: o = 0
   ...: for _ in range(10_000):
   ...:     o += 1
   ...: 
403 µs ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [4]: %%timeit
   ...: o = Occupancy(0, 0)
   ...: for _ in range(10_000):
   ...:     o += Occupancy(1, 1)
   ...: 
5.53 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %%timeit
   ...: o = Occupancy(0, 0)
   ...: a = Occupancy(1, 1)
   ...: for _ in range(10_000):
   ...:     o += a
   ...: 
3.07 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using the dataclass seems to be nearly a 10x slowdown compared to plain numbers.
Being a little more clever and avoiding unnecessary allocations does reduce that a little.

Same for a more realistic example. (I think typically, for a given task, processing[ts] would be updated ~3 times?)

In [9]: %%timeit
   ...: p = {i: 0 for i in range(1_000)}
   ...: t = 0
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         old = p[i]
   ...:         p[i] = new = 1
   ...:         delta = new - old
   ...:         t += delta
   ...: 
372 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %%timeit
   ...: p = {i: Occupancy(0, 0) for i in range(1_000)}
   ...: t = Occupancy(0, 0)
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         old = p[i]
   ...:         p[i] = new = Occupancy(1, 1)
   ...:         delta = new - old
   ...:         t += delta
   ...: 
3.45 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Added an `update` method which takes new values, and returns the delta.
# This saves having to construct a new object every time and throw out the old one,
# though it does still mean constructing the delta.
# This helps but not enormously.

In [2]: %%timeit
   ...: p = {i: Occupancy(0, 0) for i in range(1_000)}
   ...: t = Occupancy(0, 0)
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         delta = p[i].update(1, 1)
   ...:         t += delta
   ...: 
2.43 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
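The `update` method benchmarked above could look roughly like this. A sketch only: the field names and signature are inferred from the discussion, not copied from the PR.

```python
from dataclasses import dataclass


@dataclass
class Occupancy:
    cpu: float
    network: float

    def update(self, cpu: float, network: float) -> "Occupancy":
        # Overwrite the stored values in place and return the change as a
        # new Occupancy, so callers can apply the delta to running totals
        # (e.g. ws.occupancy, self.total_occupancy) without replacing the
        # instance stored in `processing`.
        delta = Occupancy(cpu - self.cpu, network - self.network)
        self.cpu = cpu
        self.network = network
        return delta
```

This saves one allocation per call versus the `new - old` pattern (the replacement value), while still allocating the delta.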

Does this actually matter though? I'm not sure. The main issue is going to be _set_duration_estimate, which as I said is called ~3x per task I think.

That takes us from 372 nanoseconds per task using plain numbers, to 2.43 µs per task.

@fjetter depending on where we want to go with #7030, we may or may not want to get this in.

fjetter closed this Oct 18, 2022
Linked issue: Differentiate between compute and network based occupancy