
Wall Clock (s) spikes for benchmarks/test_array.py::test_double_diff without CI failure #316

Closed
hendrikmakait opened this issue Sep 9, 2022 · 17 comments

@hendrikmakait (Member)

We had a significant spike in Wall Clock (s) for benchmarks/test_array.py::test_double_diff but did not receive an alert.

Wall Clock (s) spiked

At the same time, Average Memory (GiB) dropped, which might suggest that the recently introduced data transfer limiting is the cause.
Average Memory (GiB) dropped

https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html

I am investigating further.

@hendrikmakait hendrikmakait self-assigned this Sep 9, 2022
@hendrikmakait hendrikmakait changed the title Wall clock spikes for benchmarks/test_array.py::test_double_diff without CI failure Wall Clock (s) spikes for benchmarks/test_array.py::test_double_diff without CI failure Sep 9, 2022
@hendrikmakait (Member Author)

The suspicious run uses dask/dask@5f11ba9 and dask/distributed@f02d2f9.
The run before it uses dask/dask@e74888a and dask/distributed@c7d5ba7.

After benchmarking test_double_diff several times from my local machine, comparing 2022.9.0 against the latest main, I have not been able to find anything suspicious. The numbers for both versions are comparable, at around 120s.

I suggest revisiting the benchmarking results on Monday to check whether this issue remains. cc @fjetter

@hendrikmakait (Member Author)

Update

Over the weekend, the wall time has been highly variable:
Wall Clock (s) has been highly variable

I'm not sure what to make of this. We have seen a somewhat similar spike, though less pronounced, around 08/31.

@fjetter (Member) commented Sep 12, 2022

Just ran this on 2022.6.0 and got a runtime of ~110s.

This run already looks a bit suspicious, with a couple of idling workers in the middle. I bet that this looks drastically different on one of the spikes.

For reference

image

It's interesting that this workload already stresses out the scheduler.
Edit: Maybe not too interesting, since there are 100k very fast tasks.

image

perf-report-compressed.html.tar.gz
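
(In case anyone wants to reproduce this: a report like the one attached can be captured with distributed's performance_report context manager. A minimal sketch, with the cluster setup and workload as placeholders rather than the actual benchmark code:)

```python
from dask.distributed import Client, performance_report
import dask.array as da

# Placeholder cluster; the benchmarks run against a Coiled cluster instead.
client = Client()

# Placeholder workload standing in for the actual array computation in test_double_diff.
a = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))
b = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))

with performance_report(filename="perf-report.html"):
    ((a - a.mean()) * (b - b.mean())).sum().compute()
```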

@ncclementi (Contributor)

Cross-referencing this here: #299

@hendrikmakait the reason we didn't get an alert is that on main we check whether the last 3 runs are all greater than (mean + 2 * std) of the latest 10 values, not including the 3 we are looking at.
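
For reference, a rough sketch of that heuristic (illustrative only, not the actual CI code):

```python
import numpy as np

def should_alert(history, last_n=3, window=10, n_std=2.0):
    """Alert only if the last `last_n` runs all exceed
    mean + n_std * std of the `window` values preceding them."""
    recent = history[-last_n:]
    baseline = history[-(window + last_n):-last_n]
    if len(baseline) < window or len(recent) < last_n:
        return False  # not enough history to judge
    threshold = np.mean(baseline) + n_std * np.std(baseline)
    return all(value > threshold for value in recent)
```

A single spiky run surrounded by normal ones won't trip this check, which is consistent with what happened here.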

Since this test is super noisy, I'm expecting this kind of behavior.

That being said, I reported seeing this in #299, but there was no follow-up.

@hendrikmakait (Member Author)

Update

There seem to be a few issues at the same time:

A bug with scaled_array_shape causes the number of tasks to be blown up by a factor of 4x for a particularly bad run. Suddenly we have 400k tasks instead of 100k, so it's understandable that the runs take longer (see the sketch at the end of this comment). @gjoseph92 is debugging this.

The runtime of test_double_diff still differs quite a bit even with a consistent number of tasks. To debug this, @ntabris is working on making high-resolution CloudWatch metrics available so that we can correlate them with performance reports and logs. In the logs, we also see that the scheduler is unresponsive for more than 10 seconds:
distributed.core - INFO - Event loop was unresponsive in Scheduler for 11.31s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
I am working on getting visibility into what's keeping the scheduler busy.
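
For the first issue, a quick illustration of how a sizing bug blows up the task count (the sizes here are hypothetical and not taken from scaled_array_shape; the point is only the scaling):

```python
import math
import dask.array as da

# Intended sizing: 61 x 61 = 3721 chunks per input array.
intended = da.random.random((61_000, 61_000), chunks=(1_000, 1_000))
# If a bug doubles the shape along both axes while the chunk size stays fixed,
# the chunk count (and hence the task count) grows by roughly 4x.
blown_up = da.random.random((122_000, 122_000), chunks=(1_000, 1_000))

print(math.prod(intended.numblocks))  # 3721
print(math.prod(blown_up.numblocks))  # 14884
```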

@ntabris (Member) commented Sep 13, 2022

FYI, you should be able to get 1-second-interval CPU/memory metrics on staging now. I'm starting to look at these myself in Grafana and would be happy to do this with someone on a call.

@ntabris (Member) commented Sep 13, 2022

I did 10 runs of double_diff on the same cluster with client.restart between each test; the scheduler was a t3.xlarge, with 10 t3.large workers.

Some runs were ~130s, some were >200s.

2022-09-13 10:20:47,186 - double_diff - INFO - Starting test run 0 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:23:35,371 - double_diff - INFO - Test run 0 took 168.18530666595325 seconds

2022-09-13 10:23:48,736 - double_diff - INFO - Starting test run 1 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:26:03,423 - double_diff - INFO - Test run 1 took 134.6871834998019 seconds

2022-09-13 10:26:16,248 - double_diff - INFO - Starting test run 2 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:28:35,412 - double_diff - INFO - Test run 2 took 139.16386704239994 seconds

2022-09-13 10:28:48,304 - double_diff - INFO - Starting test run 3 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:31:07,336 - double_diff - INFO - Test run 3 took 139.03207729198039 seconds

2022-09-13 10:31:19,935 - double_diff - INFO - Starting test run 4 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:35:37,806 - double_diff - INFO - Test run 4 took 257.87163662491366 seconds

2022-09-13 10:35:50,653 - double_diff - INFO - Starting test run 5 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:37:49,975 - double_diff - INFO - Test run 5 took 119.32307670917362 seconds

2022-09-13 10:38:02,997 - double_diff - INFO - Starting test run 6 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:41:40,731 - double_diff - INFO - Test run 6 took 217.73480700002983 seconds

2022-09-13 10:41:53,458 - double_diff - INFO - Starting test run 7 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:45:47,499 - double_diff - INFO - Test run 7 took 234.04223141726106 seconds

2022-09-13 10:46:00,465 - double_diff - INFO - Starting test run 8 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:48:26,480 - double_diff - INFO - Test run 8 took 146.01530083315447 seconds

2022-09-13 10:48:39,064 - double_diff - INFO - Starting test run 9 on cluster double-diff-4bf298ef
Cluster memory: 71.46 GiB, target data size: 71.46 GiB
Input 1: 71.46 GiB - 3721 20.00 MiB chunks
Input 2: 71.46 GiB - 3721 20.00 MiB chunks
2022-09-13 10:51:26,785 - double_diff - INFO - Test run 9 took 167.72131154080853 seconds

image
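
For reference, a rough sketch of the loop used for these runs (run_double_diff_workload is a hypothetical stand-in for the body of test_double_diff):

```python
import logging
import time

from dask.distributed import Client

logger = logging.getLogger("double_diff")

def run_batch(scheduler_address: str, n_runs: int = 10) -> None:
    client = Client(scheduler_address)
    for i in range(n_runs):
        logger.info("Starting test run %d", i)
        start = time.perf_counter()
        run_double_diff_workload(client)  # hypothetical helper running the actual workload
        logger.info("Test run %d took %s seconds", i, time.perf_counter() - start)
        client.restart()  # clear all tasks and worker state between runs
```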

@hendrikmakait (Member Author)

After diving into the logs, it looks like the worker event loop gets unresponsive while spilling. For example, take a look at the last annotated run on the Grafana dashboard between 16:39:00 and 16:42:00 UTC:
Worker logs:

Sep 13 16:39:53 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:39:53,499 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.72s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:39:56 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:39:56,601 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.10s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:40:02 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:40:02,809 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.94s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:40:24 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:40:24,838 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.45s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:40:29 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:40:29,285 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.61s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:40:52 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:40:52,322 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.65s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:40:56 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:40:56,971 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.65s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:41:04 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:41:04,049 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.17s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 13 16:41:09 ip-10-6-55-39 cloud-init[1050]: 2022-09-13 16:41:09,489 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.96s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
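
For context: spilling is driven by the worker memory thresholds in the distributed config (the values below are the documented defaults, expressed as fractions of the worker's memory limit). Note that these need to be in effect where the workers start, not just on the client; this is only a sketch for inspecting or tuning them:

```python
import dask

# Defaults: spilling to disk starts at "spill", the worker pauses executing
# new tasks at "pause", and the nanny restarts the worker at "terminate".
print(dask.config.get("distributed.worker.memory.target"))     # 0.6
print(dask.config.get("distributed.worker.memory.spill"))      # 0.7
print(dask.config.get("distributed.worker.memory.pause"))      # 0.8
print(dask.config.get("distributed.worker.memory.terminate"))  # 0.95

# Example of overriding a threshold for an experiment; the config must be set
# in the environment that launches the workers for it to take effect there.
dask.config.set({"distributed.worker.memory.spill": 0.8})
```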

@gjoseph92 (Contributor)

it looks like the worker event loop gets unresponsive while spilling

This is currently expected, but known to be problematic: dask/distributed#4424

@hendrikmakait (Member Author) commented Sep 13, 2022

This is currently expected, but known to be problematic: dask/distributed#4424

Yeah, I think dask/distributed#4424 and the general fact that we're spilling so much might be the problem we're seeing here. Disabling work-stealing might be worth a shot; I wouldn't be surprised if it were overly aggressive on some runs and caused the spilling.
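
For the record, work stealing is controlled by a scheduler config flag and has to be in effect when the scheduler process starts; a sketch of the two usual ways to flip it (local config and environment variable):

```python
import dask

# Either set it in the config before the scheduler is created...
dask.config.set({"distributed.scheduler.work-stealing": False})

# ...or export it as an environment variable in the scheduler's environment:
#   DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False
```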

@fjetter (Member) commented Sep 14, 2022

If a worker is blocked by IO while spilling/unspilling an important task that would free up parallelism, that would also explain why we're seeing gaps in the task stream. In other words, the scheduler event loop blocking may be yet another red herring (but also something to investigate, since it definitely could cause a similar problem; at the very least, task overproduction).

@ntabris (Member) commented Sep 14, 2022

I'm pretty convinced from the hardware metrics that IO blocking consistently explains the difference in test times. For example, I just ran another batch of double_diff tests this morning; here's a comparison between a faster and a slower run:

image

Here are a couple of other individual test runs where I annotated the start of the test (single line) and the scheduler event loop being blocked (range). The annotation points aren't exact since I have 5s granularity, but it's enough to see that the scheduler event loop is blocked near the beginning of the test, right around the time that worker CPU jumps up and the scheduler starts receiving messages:

image

(For reference, I'm using a t3.xlarge scheduler, which has 2 physical cores / 4 "vCPUs".)

The scheduler event loop pretty consistently becomes unresponsive at the start of each double_diff test, but it's also pretty consistently ~10s and doesn't explain the variance in test times.

Scheduler logs showing event loop unresponsive:

Sep 14 12:29:08 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:29:08,015 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 9.27s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 14 12:31:03 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:31:03,446 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 9.46s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 14 12:33:02 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:33:02,116 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 9.63s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 14 12:34:56 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:34:56,724 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 10.74s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 14 12:37:11 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:37:11,629 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 9.94s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Sep 14 12:39:09 ip-10-6-63-134 cloud-init[1107]: 2022-09-14 12:39:09,151 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 10.78s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

Local client logs showing test run times:

2022-09-14 06:28:27,429 - Starting test run 0 on cluster double-diff-9a7d0972
2022-09-14 06:30:10,391 - Test run 0 took 102.96056583384052 seconds

2022-09-14 06:30:23,117 - Starting test run 1 on cluster double-diff-9a7d0972
2022-09-14 06:32:09,539 - Test run 1 took 106.42107912525535 seconds

2022-09-14 06:32:22,013 - Starting test run 2 on cluster double-diff-9a7d0972
2022-09-14 06:34:02,671 - Test run 2 took 100.65662850020453 seconds

2022-09-14 06:34:15,045 - Starting test run 3 on cluster double-diff-9a7d0972
2022-09-14 06:36:18,574 - Test run 3 took 123.52084666723385 seconds

2022-09-14 06:36:31,033 - Starting test run 4 on cluster double-diff-9a7d0972
2022-09-14 06:38:15,393 - Test run 4 took 104.35821366589516 seconds

2022-09-14 06:38:27,569 - Starting test run 5 on cluster double-diff-9a7d0972
2022-09-14 06:40:49,635 - Test run 5 took 142.0649689170532 seconds

@gjoseph92 (Contributor)

The scheduler event loop pretty consistently becomes unresponsive at the start of each double_diff test, but it's also pretty consistently ~10s

I'd guess that's primarily dask.order, which blocks the event loop, plus deserialization and the creation and scheduling of the initial tasks.

When profiling the scheduler, it's common to see receiving the graph take a nontrivial fraction of the total runtime, nearly all of it with the event loop blocked.
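
If we want a rough number for the ordering part, one way is to time dask.order locally on a comparably-sized graph (a sketch; the array expression below is only a stand-in, not the actual test_double_diff computation):

```python
import time

import dask.array as da
from dask.base import collections_to_dsk
from dask.order import order

# Stand-in workload; adjust the sizes to approximate the real graph
# (~100k tasks in the benchmark).
a = da.random.random((61_000, 61_000), chunks=(1_000, 1_000))
result = (a - a.mean(axis=0)).sum()

dsk = dict(collections_to_dsk([result], optimize_graph=True))
start = time.perf_counter()
order(dsk)
print(f"ordering {len(dsk)} tasks took {time.perf_counter() - start:.2f}s")
```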

@hendrikmakait (Member Author) commented Sep 14, 2022

After the changes in #328, I have not seen any major differences in the runtime of test_double_diff, except for an initial delay before the cluster becomes busy, which is likely due to the client-scheduler connection, which spans the Atlantic Ocean in my case. It is also worth noting that the major discrepancies observed by @ntabris in #316 (comment) were measured with coiled-runtime==0.1.0, which already shows pretty wild behavior on CI. With the latest upstream, I was only able to observe minor influence of spilling on the runtime of the tests. I suggest we monitor CI now that #328 has been merged and close this issue if the runtimes stay stable.

@ntabris (Member) commented Sep 14, 2022

I also just ran the test 5 times on dask/distributed 2022.9.0 (t3.xlarge scheduler, 10 t3.large workers).

Test times were lower and consistent (103–105s), and spilling was significantly lower:

image

@hendrikmakait (Member Author)

The latest two runs look pretty good; closing this for now.

Improved wall clock time for test_double_diff
