
test_basic_sum occasionally takes 340% time and 160% memory to complete #315

Closed
crusaderky opened this issue Sep 8, 2022 · 23 comments

@crusaderky
Contributor

crusaderky commented Sep 8, 2022

benchmarks/test_array.py::test_basic_sum usually runs in ~80s wall clock time, with ~21 GiB average memory and 27-33 GiB peak memory.
Once in a while, however, it takes ~270s wall clock time, 32-35 GiB average memory, and ~46 GiB peak memory.
Both sets of measurements are internally very consistent; it is almost always exactly one or the other.

I can't imagine what could possibly happen to trigger a "bad" run.
Both the test and the algorithm being tested are extremely simple.
Time measurement starts when all workers are up and running and stops before they are shut down.
There should not be any spilling involved, and network transfers should be very mild.
Even in the event of a CPU and/or network slowdown, there should not be an increase in memory usage.
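
For context, the test is roughly equivalent to the following sketch (the sizes, chunking, and cluster setup are illustrative placeholders, not the actual coiled-runtime code, which sizes the array via a scaled_array_shape() helper):

```python
import dask.array as da
from distributed import Client

# Placeholder: the benchmark connects to a Coiled cluster rather than a local one.
client = Client()

# Illustrative sizes only; ~20 GB of float64 ones, chunked into ~200 MB blocks.
data = da.ones((50_000, 50_000), chunks=(5_000, 5_000))
total = data.sum().compute()  # a simple tree reduction across all chunks
```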

Screenshots from coiled 0.1.0 (dask 2022.6.0), but I've observed the same behaviour on 2022.8.1 as well:

[Screenshot: 2022-09-08 11-01-02]
[Screenshot: 2022-09-08 11-04-50]
[Screenshot: 2022-09-08 11-01-15]

@fjetter
Member

fjetter commented Sep 8, 2022

Very interesting.
As with most things lately, my first question is whether this is reproducible with work stealing disabled.

Have you tried reproducing it? Just by counting, it looks like ~5% of all runs are affected (assuming there is no infrastructure issue).
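
For reference, one way to test the stealing hypothesis would be to switch work stealing off in the dask config before the cluster is created (a sketch; how this gets propagated to the benchmark's Coiled clusters is a separate question):

```python
import dask

# Disable work stealing on the scheduler; this must be set before the scheduler starts.
dask.config.set({"distributed.scheduler.work-stealing": False})
```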

@crusaderky
Contributor Author

> Have you tried reproducing it?

Not yet

@fjetter
Member

fjetter commented Sep 8, 2022

If stealing is not the culprit, my next guess would be reevaluate_occupancy, which recomputes occupancies on a round-robin basis but may be skipped if CPU load on the scheduler is too high. See also dask/distributed#6573 (comment).

At this point, this is merely guesswork about the non-deterministic parts of our scheduling logic.

@ntabris
Member

ntabris commented Sep 8, 2022

Slow test_basic_sum: https://github.com/coiled/coiled-runtime/runs/8245244815?check_suite_focus=true#step:6:179

Fast test_basic_sum: https://github.com/coiled/coiled-runtime/runs/8218741702?check_suite_focus=true#step:6:190

Do we have a way to find the clusters these ran on? Or other data about these runs beyond just the wall-clock times?

@gjoseph92
Contributor

> Do we have a way to find the clusters these ran on?

> Or other data about these runs beyond just the wall-clock times?

So short answer to both is no. (We have peak and avg memory use as well as wall-clock time, but in general nothing with more granularity.)

@ian-r-rose
Contributor

> So short answer to both is no. (We have peak and avg memory use as well as wall-clock time, but in general nothing with more granularity.)

This is mostly true, but note that we are also tracking compute, transfer, and disk-spill time; it's just not visualized at the moment. So if the compute time stayed roughly constant while the wall-clock time spiked, I would suspect something went wrong with scheduling.
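
As an aside, similar per-action timings can also be pulled out of the task stream after a run; a hedged sketch (the record layout follows distributed's task-stream plugin and may differ slightly between versions):

```python
from collections import defaultdict

from distributed import Client, get_task_stream

client = Client()  # placeholder; the benchmarks connect to a Coiled cluster instead

with get_task_stream() as ts:
    client.submit(sum, [1, 2, 3]).result()  # stand-in for the actual workload

# Sum up time per action ("compute", "transfer", "disk-read", ...) from the
# task-stream records collected above.
totals = defaultdict(float)
for record in ts.data:
    for ss in record.get("startstops", ()):
        totals[ss["action"]] += ss["stop"] - ss["start"]

print(dict(totals))
```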

@fjetter
Member

fjetter commented Sep 13, 2022

FYI @hendrikmakait got his hands on a performance report of a slow run.

The task stream shows wide white gaps, and we can see that the scheduler event loop is stuck for a while (at one point for up to ~46s). There are no GC warnings. I'm not certain, but I strongly suspect this aligns with the white gaps.

There are a couple of "Connection from tls://77.20.250.112:30608 closed before handshake completed" messages following this 46s tick. I suspect this is a heartbeat? I can't find any corresponding logs on any of the workers.
The IP range is a bit odd, since all registered workers are using a 10.X.X.X IP.

It happened on https://cloud.coiled.io/dask-engineering/clusters/68284/details

cloudwatch logs

@ntabris do you know why we're seeing different IP addresses here? Should this concern us?

@ntabris
Member

ntabris commented Sep 13, 2022

77.20.250.112 is Vodafone Germany, so presumably the client IP.

I'll take a look at the logs later this morning and see if I can make anything of them.

@fjetter
Member

fjetter commented Sep 13, 2022

> I'll take a look at the logs later this morning and see if I can make anything of them.

Thanks, that already helps clear things up. I don't think you'll find anything useful in the logs, though; I think this is our problem ;)

@fjetter
Member

fjetter commented Sep 13, 2022

If the above are client-side connection attempts, this may be related to us trying to fetch performance reports, etc. If nothing failed client-side, I suspect something like Client._update_scheduler_info to be causing this.

@crusaderky
Contributor Author

> The task stream shows wide white gaps, and we can see that the scheduler event loop is stuck for a while (at one point for up to ~46s). There are no GC warnings. I'm not certain, but I strongly suspect this aligns with the white gaps.

Do we have a measure of CPU seconds of the scheduler process?

  • The whole VM could be temporarily frozen; if that were the case, you'd see e.g. 46s of wall time but 0.1s of CPU time.
  • Another thing to investigate is the user/sys CPU split. A very high sys CPU time could again point at something wrong at the VM level (see the sketch below).
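
A minimal sketch of how one could grab that split from the scheduler process (assuming psutil is importable there, e.g. shipped via client.run_on_scheduler):

```python
import psutil

def cpu_split():
    # Compare against the wall-clock gap: ~0s of CPU over a ~46s gap would point
    # at the VM being frozen; a large `system` share would also point at the VM level.
    t = psutil.Process().cpu_times()
    return {"user": t.user, "system": t.system}

# e.g. sample it before and after the stuck period:
#     client.run_on_scheduler(cpu_split)
```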

@hendrikmakait
Member

While root-causing #316, we discovered a bug in scaled_array_shape that likely affects test_basic_sum as well.

@gjoseph92
Contributor

Looks like it's not a bug in scaled_array_shape, but rather in dask.array.core.normalize_chunks. For certain shape arguments, it will return chunk sizes that are wildly different from the requested chunk size, but if the shape is 1 larger or smaller, it will do as requested.

I'm looking into it and will open an issue over there.
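
A quick way to probe that behaviour directly (the shapes below are placeholders for illustration, not the actual failing case):

```python
from dask.array.core import normalize_chunks

# Compare the chunking that "auto" produces for neighbouring shapes against the
# requested limit; the bug shows up as wildly different chunk sizes for shapes
# that differ by 1.
for n in (9_999, 10_000, 10_001):
    chunks = normalize_chunks("auto", shape=(n, n), dtype="float64", limit="128 MiB")
    print(n, chunks[0][0], chunks[1][0])
```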

@gjoseph92
Contributor

Not clear if this is the only thing causing the variation, but it's certainly not helping.

@crusaderky
Contributor Author

[Screenshot: 2022-10-20 01-49-36]

From the raw DB dump, I read that the "bad" runs have many more network transfers (transfer_time) than the "good" ones, as well as substantially more memory duplication, which in turn causes spilling (spill_time).

It looks like co-assignment is occasionally and randomly falling apart for some reason?
I'm continuing the investigation.

@crusaderky
Contributor Author

crusaderky commented Oct 20, 2022

Healthy run:

dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c2d95249/benchmarks.test_array.py.test_basic_sum.msgpack.gz
logs: https://cloud.coiled.io/dask-engineering/clusters/95408/details
grafana: http://35.86.202.18:3000/d/eU1bT-nVz/cluster-metrics-prometheus?from=1665880543885&to=1665880635687&var-cluster=test_array-c2d95249
[screenshot]

Bad run:

dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c6668c2c/benchmarks.test_array.py.test_basic_sum.msgpack.gz
logs: https://cloud.coiled.io/dask-engineering/clusters/95114/details
grafana: http://35.86.202.18:3000/d/eU1bT-nVz/cluster-metrics-prometheus?from=1665794041518&to=1665794299449&var-cluster=test_array-c6668c2c
[screenshot]
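
For anyone else digging in: the dumps above are gzipped msgpack blobs, so they can be loaded locally with something like this (the filename is a placeholder for a downloaded copy; the exact top-level layout is whatever dump_cluster_state wrote):

```python
import gzip

import msgpack

# Download the .msgpack.gz artifact from the S3 location above first.
with gzip.open("benchmarks.test_array.py.test_basic_sum.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, strict_map_key=False)

print(list(state))  # expect scheduler- and worker-level sections
```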

@gjoseph92
Contributor

gjoseph92 commented Oct 20, 2022 via email

@crusaderky
Contributor Author

Holy work stealing, Batman!
There are 643 ready->released worker transitions in the good run vs. 6419 in the bad one.
For scale, there are 4921 tasks in total.

In the good run, 13% of the tasks end up stolen (which already feels quite high).
In the bad run, each task is stolen 1.3 times on average!

| start | recs | end | bad | good | delta |
| --- | --- | --- | --- | --- | --- |
| cancelled | memory | released | 19 | 3 | 16 |
| executing | memory | memory | 4922 | 4921 | 1 |
| fetch | flight | flight | 2087 | 205 | 1882 |
| fetch | released | released | 15 | 0 | 15 |
| flight | memory | memory | 2068 | 202 | 1866 |
| flight | released | cancelled | 19 | 3 | 16 |
| memory | released | released | 6990 | 5123 | 1867 |
| ready | executing | executing | 4922 | 4921 | 1 |
| ready | released | released | 6419 | 643 | 5776 |
| released | fetch | fetch | 2102 | 205 | 1897 |
| released | forgotten | forgotten | 13466 | 5772 | 7694 |
| released | waiting | waiting | 11364 | 5567 | 5797 |
| waiting | ready | ready | 11341 | 5564 | 5777 |
| waiting | released | released | 23 | 3 | 20 |
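
Roughly, a tally like the one above can be produced by counting (start, recommended, end) triplets from the transition logs in the dumps; a minimal sketch of that counting step (the tuple layout assumed below may not match the dump format exactly):

```python
from collections import Counter

def tally_transitions(log):
    """Count (start, recommended, end) triplets from an iterable of transition-log
    entries assumed to be shaped like (key, start, recommended, end, ...)."""
    counts = Counter()
    for _key, start, recommended, end, *_rest in log:
        counts[start, recommended, end] += 1
    return counts

# e.g. compare tally_transitions(good_log) against tally_transitions(bad_log)
```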

> Can we see if a worker has left at any point?

No workers left.

> is it possible the initial task assignment happens before all workers have arrived?

No, I see that the first task transition is 2 seconds after the last worker joined the cluster.

@crusaderky
Contributor Author

crusaderky commented Oct 21, 2022

It looks like work stealing stole equally from all workers, thus achieving a net-zero rebalancing effect!

Number of tasks lost to work stealing:
[chart]

Note how, at every work stealing iteration, the beneficiaries of stealing (workers without a circle) become victims (workers with a circle) at the next iteration. Tasks are being needlessly ping-ponged around.
[visualization]

@crusaderky
Contributor Author

Task durations.
The first, fast one on the top left is the most important one, as it is the first aggregation layer (3688 tasks).
The second layer is 923 tasks, the third is roughly a quarter of the second, and so on.

[visualization]

It's interesting to see here how the bad run, which is spilling a lot more, has much longer "flares" of outliers. Those are all moments where the worker's event loop was busy spilling; this inflated the measured durations, which in turn might have caused improper stealing choices.
However, the poor stealing decisions started well before the workers started spilling, so I don't think this is the root cause.

@crusaderky
Contributor Author

I'm looking now at coiled-upstream (dask 2022.08.0 ~ 2022.10.0) vs. coiled-latest (2022.6.0) and coiled-0.1.0 (2022.6.0) and the noise is completely gone in the newer releases.
So I'm inclined to close this issue without further investigation.

upstream
[chart]

latest
[chart]

0.1.0
[chart]

@fjetter
Member

fjetter commented Oct 21, 2022

If work stealing is under investigation, it's worth looking at the worker idleness detection as well. Work stealing should only affect workers that are flagged as idle; if that doesn't work properly, work stealing can cause weird things. This should be more reliable in the latest releases, but I still wouldn't be surprised to see bad things happening.

In later versions, stealing uses the worker_objective to determine a good thief, but this still breaks co-assignment (we'd need something like dask/distributed#7141 to avoid breaking co-assignment).

@hendrikmakait
Member

> I'm looking now at coiled-upstream (dask 2022.08.0 ~ 2022.10.0) vs. coiled-latest (2022.6.0) and coiled-0.1.0 (2022.6.0) and the noise is completely gone in the newer releases. So I'm inclined to close this issue without further investigation.

Good to see that the recent changes to work stealing seem to have removed the erratic behavior. Some of the issues that were fixed (including work stealing going overboard and stealing way too much) could explain the behavior observed here.
