
Failing test_climatic_mean #253

gjoseph92 opened this issue Aug 17, 2022 · 8 comments · Status: Open

gjoseph92 (Contributor) commented Aug 17, 2022

Copied from #243 (comment). test_climatic_mean is currently skipped since it fails in every CI run, but this should be investigated and the underlying cause fixed.

When I run this test on its own against coiled-runtime 0.0.4, it does fine. The dashboard looks as it should, tasks go quickly. There's a little bit of spillage but not much.

However, it always fails in the full CI job. With a lot of pain, watching GitHub CI logs and clicking at random on clusters in the Coiled dashboard, I managed to find the cluster that was running the test and watch it. The dashboard looked a bit worse: more data was spilled to disk, and workers kept running out of memory and restarting. Progress was extremely slow and kept rolling back every time a worker died.

[Screenshot: dashboard during the failing full-CI run, 2022-08-16]

Theories for why it's failing:

  • On distributed==2022.6.0, MALLOC_TRIM_THRESHOLD_ isn't set by default yet. That might make the difference (a sketch of setting it manually follows this list). Note, though, that the test passes even without it being set if it's run on a fresh cluster, so that's clearly not the only problem. Plus, we're client.restart()-ing the workers before every test, so the workers should be in the same brand-new state regardless of whether the test is run on its own or after others. However, client.restart() doesn't do that much to the scheduler, so maybe that's where the problem is.

  • We know that each subsequent time you submit a workload to the scheduler, it runs slower and slower, and scheduler memory grows and grows: see "Are reference cycles a performance problem?" dask/distributed#4987 (comment). (There's no reason to think things have changed since that research last year.)

    As the scheduler gets sluggish, it will be slower both to tell workers about data-consuming downstream tasks to run (instead of the data-producing root tasks they've already been told to run) and to let them release keys that are completed and aren't needed anymore. Note that just because a worker runs a downstream task (like writing a chunk to zarr) doesn't mean the worker gets to release the upstream data immediately; it must be explicitly told by the scheduler to do so. If the scheduler is slow, the worker will go load even more data into memory while keeping around the chunks that have already been written to zarr and should have been released.

    Thus we see the double whammy of root task overproduction: as soon as the delicate balance of scheduler latency is thrown off, workers simultaneously produce memory faster than they should and release memory slower than they should.

    Basically, I think this will only be fixed by "Withhold root tasks [no co assignment]" (dask/distributed#6614), or by understanding and fixing whatever's causing the scheduler to slow down (which is further out; see "Are reference cycles a performance problem?" dask/distributed#4987).
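
Regarding the first theory, here is a minimal sketch of how the trim threshold could be pinned manually on distributed==2022.6.0. The distributed.nanny.environ config key passes environment variables to worker processes, and 65536 mirrors the default that later releases adopted; how the test suite actually starts its Coiled clusters isn't shown here.

```python
import dask
from distributed import Client

# Sketch only: pin the glibc trim threshold for workers when running against
# distributed==2022.6.0, which does not yet set it by default. The
# "distributed.nanny.environ" config key passes environment variables to the
# worker processes spawned by the nanny; 65536 mirrors the default that later
# distributed releases ship.
with dask.config.set(
    {"distributed.nanny.environ": {"MALLOC_TRIM_THRESHOLD_": "65536"}}
):
    client = Client()  # placeholder; the real tests connect to a Coiled cluster
```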

ian-r-rose (Contributor) commented:

I took a look at this, and a couple of notes:

  1. I am able to reproduce this behavior running the test suite locally. Running the test by itself succeeds in ~100 seconds; running it as part of the whole module times out after 15 minutes.
  2. I tried this on distributed main to get the new MALLOC_TRIM_THRESHOLD_ behavior, and saw the same failure. It seems unlikely to me that this is related.
  3. The fact that we are doing a client.restart() does suggest to me that the issue is the scheduler. But I looked at the scheduler system page while the test was performing badly (lots of spilling, workers restarting, lots of white on the task stream), and nothing was really very concerning:
    [Screenshot: scheduler system page during the slow run]
    The event loop also seemed normal. So if the scheduler was getting more and more sluggish, it wasn't highly apparent to me from the dashboard.

fjetter (Member) commented Aug 18, 2022

There are also some changes coming up on Coiled that will help us with memory management (coiled/feedback#185). We now set the memory limits properly both at the OS level (cgroups) and at the dask level (Worker.memory_limit), which affects how/when spill-to-disk is triggered, when workers are paused, etc.
That may already be sufficient to make the job pass.

I'm not implying that this is all there is, but I strongly suspect it will help. We've seen this be a game changer for spill-to-disk-specific integration tests; see #229
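
For context, here is a minimal sketch of the dask-side memory thresholds that Worker.memory_limit feeds into. The fractions are distributed's stock defaults, shown only for orientation; the values Coiled actually applies may differ.

```python
import dask

# Each threshold is a fraction of Worker.memory_limit; the numbers below are
# distributed's stock defaults, not necessarily what Coiled configures.
dask.config.set(
    {
        "distributed.worker.memory.target": 0.60,     # start spilling data to disk
        "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
        "distributed.worker.memory.pause": 0.80,      # stop running new tasks on this worker
        "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
    }
)
```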

fjetter (Member) commented Aug 26, 2022

I noticed that the runtime "regressions" we've seen lately were not resolved, so I took another look at this test. Thanks to @ian-r-rose I could indeed reproduce this by running the entire module (which makes sense, since clusters have module-wide lifetime). In fact, it is sufficient to run test_anom_mean beforehand. The reason is that these workloads share a couple of task prefixes, i.e. measurements from anom_mean leak into climatic_mean and bias the scheduling decisions.

Specifically, work stealing is again responsible for this poor scheduling. This is an interesting example where work stealing basically drives root task overproduction by destroying co-assignment.

If I disable work stealing, scheduling looks mostly as expected. I think perfect scheduling would still use much less network transfer, but this is a good start.
[Screenshot: task stream with work stealing disabled]

Note how this task stream shows many more green tasks compared to the task stream of the OP. Green tasks are mean_chunk tasks, i.e. reducers / memory sinks.
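
For reference, a minimal sketch of how work stealing can be turned off; this shows the standard config key, not necessarily exactly how it was disabled for the run above.

```python
import dask

# Sketch: one way to turn work stealing off. The setting must be in effect in
# the process where the scheduler starts; for a remote (e.g. Coiled) scheduler
# the usual route is the equivalent environment variable
#   DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING="False"
dask.config.set({"distributed.scheduler.work-stealing": False})
```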

Some of the more fundamental problems that are leading up to work stealing being a problem are described in dask/distributed#6573 (comment) and dask/distributed#6573 (comment)

I still need to verify, but I have a very strong suspicion that this is due to TaskPrefix.duration_average and Scheduler.unknown_durations being pre-populated for the mean_chunk and getitem tasks.
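
One way to check that suspicion, assuming a client connected to the running cluster, would be to dump the relevant scheduler state between test_anom_mean and test_climatic_mean; a rough sketch:

```python
def dump_timing_state(dask_scheduler):
    # Runs inside the scheduler process; shows which task prefixes already
    # carry timing estimates (e.g. leftovers from test_anom_mean).
    return {
        "duration_average": {
            name: tp.duration_average
            for name, tp in dask_scheduler.task_prefixes.items()
        },
        "unknown_durations": {
            name: len(tasks)
            for name, tasks in dask_scheduler.unknown_durations.items()
        },
    }

# e.g. right before test_climatic_mean starts (client = connected Client):
print(client.run_on_scheduler(dump_timing_state))
```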

gjoseph92 (Contributor, Author) commented:

> work stealing basically drives root task overproduction by destroying co-assignment

I'm not sure if root task overproduction is related here. This example was one of the success stories of co-assignment (it was basically unrunnable before, "gave up after 30min and 1.25TiB spilled to disk"): dask/distributed#2602 (comment).

I'd guess it's just that stealing undoes all the good co-assignment work and reassigns tasks to workers in ways that cause lots of data transfer, driving up memory and slowing down work?

> I think perfect scheduling would still use much less network transfer

This is a groupby-mean, so transfer is to be expected with the mean. My guess is that what you see here is pretty good (though I'm surprised to see any spilling).
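
For readers without the benchmark open, here is a rough, hypothetical stand-in for the shape of this workload; the sizes, chunking, and variable names are invented, and the real test differs (it also writes its output, e.g. to zarr).

```python
import dask.array as da
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the climatic-mean workload (not the actual test):
# ~10 years of daily gridded data, chunked along time, sized for a small
# cluster rather than a laptop.
time = pd.date_range("1990-01-01", "1999-12-31", freq="D")
data = da.random.random((len(time), 360, 720), chunks=(30, 360, 720))
ds = xr.Dataset({"t2m": (("time", "lat", "lon"), data)}, coords={"time": time})

# Day-of-year climatology: the groupby-mean reduction discussed above.
climatology = ds.t2m.groupby("time.dayofyear").mean("time")
result = climatology.compute()
```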

fjetter (Member) commented Aug 26, 2022

> I'm not sure if root task overproduction is related here.

Well, I think there are a couple of interpretations/causes for root task overproduction. Strictly speaking, I'm talking about "having far more root tasks in memory than should be necessary". Put differently, what I see is that the ratio between root tasks and reducers is unhealthy, i.e. we have many more root tasks in memory than we should.

I suspect that the driving force behind this "root task overproduction" scenario is actually not overly slow scheduler assignment. The tasks are likely scheduled onto the workers fast enough, but since they still need to fetch data, the worker decides to run more stuff first until the dependencies have arrived. I guess this is not the "classic" overproduction we typically talk about, but at the same time I believe it is much more severe.
That POV opens the door to interesting conversations for worker-side optimizations, of course.

co-assignment basically allows an incoming high-priority task to be scheduled asap without needing to fetch any data.

> This is a groupby-mean, so transfer is to be expected with the mean. My guess is that what you see here is pretty good (though I'm surprised to see any spilling).

Well, groupby-mean is basically a tree reduction, i.e. it should have little network transfer (depending on the cardinality of the output, of course). It doesn't really matter though; I agree that what I'm seeing there is perfectly fine / pretty good.

gjoseph92 (Contributor, Author) commented:

I did a quick re-run of some of my benchmark scripts from dask/distributed#6560 (comment). I used the latest from dask/distributed#6614 with main merged in, so it includes dask/distributed#6944. I ran all 4 cases on the same cluster, restarting in between, with a separate cluster for each workload.

First I ran without work stealing, queued and un-queued. Then I enabled work stealing and ran queued and un-queued again.
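
A sketch of how those four configurations might be expressed; the worker-saturation key is assumed to be the queuing knob on the dask/distributed#6614 branch ("inf" meaning no queuing), and how each setting gets applied to the running cluster is left out.

```python
import dask

# Hypothetical parameterization of the four benchmark runs. The
# "worker-saturation" key is assumed to be the queuing knob from the
# dask/distributed#6614 branch ("inf" = no queuing); "work-stealing" is the
# standard scheduler config key.
cases = {
    "un-queued, no stealing": {
        "distributed.scheduler.worker-saturation": "inf",
        "distributed.scheduler.work-stealing": False,
    },
    "queued, no stealing": {
        "distributed.scheduler.worker-saturation": 1.0,
        "distributed.scheduler.work-stealing": False,
    },
    "un-queued, stealing": {
        "distributed.scheduler.worker-saturation": "inf",
        "distributed.scheduler.work-stealing": True,
    },
    "queued, stealing": {
        "distributed.scheduler.worker-saturation": 1.0,
        "distributed.scheduler.work-stealing": True,
    },
}

for name, cfg in cases.items():
    with dask.config.set(cfg):
        pass  # start/restart the cluster and run the workload here (not shown)
```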

Stealing didn't seem to have a significant effect in either of these. (Ignore the varying start times, that's my home internet connection being slow.) In the anom-mean case, it's clearly queuing that decreases memory use by 8x; turning off stealing doesn't help. In the climatic-mean case, losing co-assignment (due to queuing) makes runtime 2x longer. But without queuing, stealing on vs off didn't seem to matter.

[Charts: anom-mean and climatic-mean benchmark runs on 2022.7.1+164.g51dca31c]

Since this included dask/distributed#6944, I'm confused why we're seeing different behavior in these tests vs. the coiled-runtime benchmarks (besides the fact that I'm using different data and cluster sizes, since I'm just re-using my old script). I thought that clearing out the task groups, etc. would clear whatever timing data stealing uses to make its decisions? Are you saying there's state inside the WorkStealing extension that isn't cleared between restarts and is affecting this?

> groupby-mean is basically a tree reduction, i.e. it should have little network transfer

FWIW, watching the dashboard for climatic-mean with co-assignment on my cluster, there was a little less red than in your screenshot here.

gjoseph92 (Contributor, Author) commented:

> The tasks are likely scheduled onto the workers fast enough, but since they still need to fetch data, the worker decides to run more stuff first until the dependencies have arrived

That's a good point. That's probably part of why co-assignment made such a big difference.

It's also probably why not having co-assignment in the queuing PR doesn't hurt as much as we expected: the root task overproduction can't happen while workers are waiting for dependencies.

It was always interesting to me that the co-assignment PR reduced the memory usage of some of these workloads a ton. But when we removed co-assignment and added queuing, the memory usage didn't go back up by the same amount—memory stayed about the same; runtime just increased.

I think that points to queuing being a more comprehensive solution to root task overproduction than co-assignment and disabling stealing.

fjetter (Member) commented Aug 26, 2022

> I thought that clearing out the task groups, etc. would clear whatever timing data stealing uses to make its decisions? Are you saying there's state inside the WorkStealing extension that isn't cleared between restarts and is affecting this?

I don't know yet; I need to dig a bit more and might've missed something. There are also other parts factoring into stealing decisions, e.g. total_occupancy to classify workers as saturated (I noticed that I had a negative total_occupancy after running anom_mean / before climatic_mean 🤦), then there is our bandwidth measurement, and yes, there may also be leftover state in the extension as well.
I'll likely poke at this again on Monday for a couple of minutes.
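
For whoever picks this up, a sketch of pulling those candidates out of a live scheduler between the two tests, assuming a connected client; the attribute names are from the WorkStealing extension of that era and should be treated as illustrative.

```python
def stealing_inputs(dask_scheduler):
    # Runs in the scheduler process: grab quantities that feed stealing
    # decisions and that may survive a client.restart().
    stealing = dask_scheduler.extensions["stealing"]
    return {
        "total_occupancy": dask_scheduler.total_occupancy,
        "bandwidth": dask_scheduler.bandwidth,
        "in_flight_steals": len(stealing.in_flight),
    }

# e.g. after test_anom_mean, before test_climatic_mean (client = connected Client):
print(client.run_on_scheduler(stealing_inputs))
```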
