
Track CPU and network occupancy separately #7020

Closed · gjoseph92 wants to merge 12 commits

Conversation

gjoseph92 (Collaborator) commented:

Closes #7004.

Step towards #7003. This notes, but does not fix, the places where we're currently double-counting occupancy. First we should get this PR in and verify it doesn't change any behavior. Then, we can benchmark how a few lines of changes fixing the double-counting affects performance. At the same time, we can update the dashboard to display both measures separately.

  • Tests added / passed
  • Passes pre-commit run --all-files

We may end up wanting to move away from the dataclass if we don't like the breaking API change, or if the many places that would have to change become too onerous
Circular imports from work-stealing
github-actions bot (Contributor) commented Sep 8, 2022:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0        15 suites  ±0    6h 19m 55s ⏱️ +9m 12s
  3 102 tests +1     3 016 ✔️ +2     85 💤 ±0    1 ❌ −1
 22 961 runs  +7    22 052 ✔️ +6    906 💤 ±0    3 ❌ +1

For more details on these failures, see this check.

Results for commit f3de3b9. ± Comparison against base commit 1fd07f0.

♻️ This comment has been updated with latest results.

gjoseph92 mentioned this pull request Sep 8, 2022
old = ws.processing.get(ts, Occupancy(0, 0))
ws.processing[ts] = total_duration
self.total_occupancy += total_duration - old
ws.occupancy += total_duration - old
Member commented:

nit: We're calling this method a lot. This will always initialize the empty class instance

gjoseph92 (Collaborator, Author) replied:

Yeah, maybe worth a separate branch for the case that the key isn't there? Or allowing Occupancy to be incremented and decremented by plain ints?
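For concreteness, a dataclass along these lines could support both the `.get(ts, Occupancy(0, 0))` default and in-place increments. This is a hedged sketch: the field names (`cpu`, `network`) and the methods shown are assumptions for illustration, not the PR's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Occupancy:
    """Sketch: expected runtime split into CPU time and network-transfer time."""

    cpu: float
    network: float

    @property
    def total(self) -> float:
        return self.cpu + self.network

    def __iadd__(self, other: "Occupancy") -> "Occupancy":
        # Mutate in place rather than allocating a new instance each time.
        self.cpu += other.cpu
        self.network += other.network
        return self

    def __sub__(self, other: "Occupancy") -> "Occupancy":
        return Occupancy(self.cpu - other.cpu, self.network - other.network)
```

A separate branch for the missing-key case (`old = ws.processing.get(ts)` followed by an `if old is None` check) would avoid allocating the `Occupancy(0, 0)` default on every call, at the cost of slightly more code.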

- queued_occupancy = len(self.queued) * self.UNKNOWN_TASK_DURATION
+ queued_occupancy: float = len(self.queued) * self.UNKNOWN_TASK_DURATION
  # TODO: threads per worker
  # TODO: don't include network occupancy?
Member commented:

Interesting question (way out of scope): Should high network occupancy maybe even act as a suppressing factor?

fjetter (Member) left a review:

Love it. Conceptually I like the approach of using a dataclass, but I'm mildly concerned about performance since we're creating/destroying instances of this a lot.
If that's a problem, a named tuple might be better suited. The update mechanics would obviously be messier.

Comment on lines -357 to +361

- victim.occupancy = 0
+ victim.occupancy.clear()
Member commented:

Coverage complains but on main it is covered. This is very likely an untested path and we're just hitting it implicitly.

gjoseph92 (Collaborator, Author) commented:

> I'm mildly concerned about performance since we're creating/destroying instances of this a lot.
> If that's a problem a named tuple might be better suited

Also a little concerned about this. I started with a namedtuple, but then every time we update self.total_occupancy or ws.occupancy (happens very often), we're creating a new object (since tuples are immutable) instead of mutating the existing one.

I haven't actually microbenchmarked the performance of any of this though.

Another option could be a 2-element NumPy record array. That would also save the pointer chasing to the integers.
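One way the record-array idea could look, purely as an illustration (the dtype layout and field names here are assumptions, not anything from the PR): both floats live in a single contiguous buffer, so field updates write directly into it instead of chasing pointers to separate Python float objects.

```python
import numpy as np

# Hypothetical layout: a 0-d structured array holding both components.
occ_dtype = np.dtype([("cpu", "f8"), ("network", "f8")])

occupancy = np.array((0.0, 0.0), dtype=occ_dtype)
occupancy["cpu"] += 1.5       # writes into the shared buffer in place
occupancy["network"] += 0.5
total = float(occupancy["cpu"]) + float(occupancy["network"])
```

Whether this beats a plain dataclass would need microbenchmarking too, since each field access still goes through NumPy's indexing machinery.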

gjoseph92 (Collaborator, Author) commented:

In [5]: %%timeit
   ...: o = 0
   ...: for _ in range(10_000):
   ...:     o += 1
   ...: 
403 µs ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [4]: %%timeit
   ...: o = Occupancy(0, 0)
   ...: for _ in range(10_000):
   ...:     o += Occupancy(1, 1)
   ...: 
5.53 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: %%timeit
   ...: o = Occupancy(0, 0)
   ...: a = Occupancy(1, 1)
   ...: for _ in range(10_000):
   ...:     o += a
   ...: 
3.07 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using the dataclass seems to be nearly a 10x slowdown compared to plain numbers.
Being a little more clever and avoiding unnecessary allocations does reduce that a little.

Same for a more realistic example. (I think typically, for a given task, processing[ts] would be updated ~3 times?)

In [9]: %%timeit
   ...: p = {i: 0 for i in range(1_000)}
   ...: t = 0
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         old = p[i]
   ...:         p[i] = new = 1
   ...:         delta = new - old
   ...:         t += delta
   ...: 
372 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %%timeit
   ...: p = {i: Occupancy(0, 0) for i in range(1_000)}
   ...: t = Occupancy(0, 0)
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         old = p[i]
   ...:         p[i] = new = Occupancy(1, 1)
   ...:         delta = new - old
   ...:         t += delta
   ...: 
3.45 ms ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Added an `update` method which takes new values, and returns the delta.
# This saves having to construct a new object every time and throw out the old one,
# though it does still mean constructing the delta.
# This helps but not enormously.

In [2]: %%timeit
   ...: p = {i: Occupancy(0, 0) for i in range(1_000)}
   ...: t = Occupancy(0, 0)
   ...: for _ in range(3):
   ...:     for i in range(len(p)):
   ...:         delta = p[i].update(1, 1)
   ...:         t += delta
   ...: 
2.43 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
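The `update` method benchmarked above could look roughly like this. A sketch only: the field names and signature are inferred from the discussion, not copied from the PR.

```python
from dataclasses import dataclass


@dataclass
class Occupancy:
    cpu: float
    network: float

    def update(self, cpu: float, network: float) -> "Occupancy":
        # Overwrite the stored values in place and return the change as a
        # new Occupancy, so callers can apply the delta to running totals
        # (e.g. ws.occupancy, self.total_occupancy) without replacing the
        # instance stored in `processing`.
        delta = Occupancy(cpu - self.cpu, network - self.network)
        self.cpu = cpu
        self.network = network
        return delta
```

This saves one allocation per call versus the `new - old` pattern (the replacement value), while still allocating the delta.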

Does this actually matter though? I'm not sure. The main issue is going to be _set_duration_estimate, which as I said is called ~3x per task I think.

That takes us from 372 nanoseconds per task using plain numbers, to 2.43 µs per task.

@fjetter depending on where we want to go with #7030, we may or may not want to get this in.

fjetter closed this Oct 18, 2022
Linked issue: Differentiate between compute and network based occupancy