
test_basic_sum occasionally takes 340% time and 160% memory to complete #315

Closed
crusaderky opened this issue Sep 8, 2022 · 23 comments

@crusaderky
Contributor

crusaderky commented Sep 8, 2022

benchmarks/test_array.py::test_basic_sum usually runs in ~80s wall clock time, with ~21 GiB average memory and 27-33 GiB peak memory.
Once in a while, however, it takes ~270s wall clock time, 32-35 GiB average memory, and ~46 GiB peak memory.
Both sets of measurements are internally very consistent; it is almost always exactly one or the other.

I can't imagine what could possibly happen to trigger a "bad" run.
Both the test and the algorithm being tested are extremely simple.
Time measurement starts when all workers are up and running and stops before they are shut down.
There should not be any spilling involved, and network transfers should be very mild.
Even in the event of a CPU and/or network slowdown, there should not be an increase in memory usage.
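
For context, the test is roughly equivalent to the following sketch (the sizes, chunking, and cluster setup are illustrative placeholders, not the actual coiled-runtime code, which sizes the array via a scaled_array_shape() helper):

```python
import dask.array as da
from distributed import Client

# Placeholder: the benchmark connects to a Coiled cluster rather than a local one.
client = Client()

# Illustrative sizes only; ~20 GB of float64 ones, chunked into ~200 MB blocks.
data = da.ones((50_000, 50_000), chunks=(5_000, 5_000))
total = data.sum().compute()  # a simple tree reduction across all chunks
```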

Screenshots from coiled 0.1.0 (dask 2022.6.0), but I've observed the same behaviour on 2022.8.1 as well:

[Screenshot: 2022-09-08 11-01-02]
[Screenshot: 2022-09-08 11-04-50]
[Screenshot: 2022-09-08 11-01-15]

@fjetter
Member

fjetter commented Sep 8, 2022

Very interesting.
As with most things lately, my first question is whether this is reproducible with work stealing disabled.

Have you tried reproducing it? Just by counting, it looks like ~5% of all runs are affected (assuming there is no infrastructure issue).
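
For reference, one way to test the stealing hypothesis would be to switch work stealing off in the dask config before the cluster is created (a sketch; how this gets propagated to the benchmark's Coiled clusters is a separate question):

```python
import dask

# Disable work stealing on the scheduler; this must be set before the scheduler starts.
dask.config.set({"distributed.scheduler.work-stealing": False})
```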

@crusaderky
Contributor Author

> Have you tried reproducing it?

Not yet

@fjetter
Member

fjetter commented Sep 8, 2022

If stealing is not the culprit, my next guess would be reevaluate_occupancy, which recomputes occupancies on a round-robin basis but may be skipped if CPU load on the scheduler is too high. See also dask/distributed#6573 (comment).

At this point, this is merely guesswork about the non-deterministic parts of our scheduling logic.

@ntabris
Member

ntabris commented Sep 8, 2022

Slow test_basic_sum: https://github.com/coiled/coiled-runtime/runs/8245244815?check_suite_focus=true#step:6:179

Fast test_basic_sum: https://github.com/coiled/coiled-runtime/runs/8218741702?check_suite_focus=true#step:6:190

Do we have a way to find the clusters these ran on? Or other data about these runs beyond just the wall-clock times?

@gjoseph92
Contributor

> Do we have a way to find the clusters these ran on?

> Or other data about these runs beyond just the wall-clock times?

So short answer to both is no. (We have peak and avg memory use as well as wall-clock time, but in general nothing with more granularity.)

@ian-r-rose
Contributor

> So short answer to both is no. (We have peak and avg memory use as well as wall-clock time, but in general nothing with more granularity.)

This is mostly true, but note that we are also tracking compute, transfer, and disk-spill time; it's just not visualized at the moment. So if the compute time stayed roughly constant while the wall-clock time spiked, I would suspect something went wrong with scheduling.
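
As an aside, similar per-action timings can also be pulled out of the task stream after a run; a hedged sketch (the record layout follows distributed's task-stream plugin and may differ slightly between versions):

```python
from collections import defaultdict

from distributed import Client, get_task_stream

client = Client()  # placeholder; the benchmarks connect to a Coiled cluster instead

with get_task_stream() as ts:
    client.submit(sum, [1, 2, 3]).result()  # stand-in for the actual workload

# Sum up time per action ("compute", "transfer", "disk-read", ...) from the
# task-stream records collected above.
totals = defaultdict(float)
for record in ts.data:
    for ss in record.get("startstops", ()):
        totals[ss["action"]] += ss["stop"] - ss["start"]

print(dict(totals))
```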

@fjetter
Member

fjetter commented Sep 13, 2022

FYI @hendrikmakait got his hands on a performance report of a slow run.

The task stream shows wide white gaps, and we can see that the scheduler event loop is stuck for a while (at one point for up to ~46s). There are no GC warnings. I'm not certain, but I strongly suspect this aligns with the white gaps.

There are a couple of "Connection from tls://77.20.250.112:30608 closed before handshake completed" messages following this 46s tick. I suspect this is a heartbeat? I can't find any corresponding logs on any of the workers.
The IP range is a bit odd, since all registered workers are using a 10.X.X.X IP.

It happened on https://cloud.coiled.io/dask-engineering/clusters/68284/details

cloudwatch logs

@ntabris do you know why we're seeing different IP addresses here? Should this concern us?

@ntabris
Member

ntabris commented Sep 13, 2022

77.20.250.112 is Vodafone Germany, so presumably the client IP.

I'll take a look at the logs later this morning and see if I can make anything of them.

@fjetter
Member

fjetter commented Sep 13, 2022

> I'll take a look at the logs later this morning and see if I can make anything of them.

Thanks, that already helps clear things up. I don't think you'll find anything useful in the logs, though; I think this is our problem ;)

@fjetter
Member

fjetter commented Sep 13, 2022

If the above are client-side connection attempts, this may be related to us trying to fetch performance reports, etc. If nothing failed client-side, I suspect something like Client._update_scheduler_info to be causing this.

@crusaderky
Contributor Author

> The task stream shows wide white gaps, and we can see that the scheduler event loop is stuck for a while (at one point for up to ~46s). There are no GC warnings. I'm not certain, but I strongly suspect this aligns with the white gaps.

Do we have a measure of CPU seconds of the scheduler process?

  • The whole VM could be temporarily frozen; if that were the case, you'd see e.g. 46s of wall time but 0.1s of CPU time.
  • Another thing to investigate is the user/sys CPU split. A very high sys CPU time could again point at something wrong at the VM level (see the sketch below).
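
A minimal sketch of how one could grab that split from the scheduler process (assuming psutil is importable there, e.g. shipped via client.run_on_scheduler):

```python
import psutil

def cpu_split():
    # Compare against the wall-clock gap: ~0s of CPU over a ~46s gap would point
    # at the VM being frozen; a large `system` share would also point at the VM level.
    t = psutil.Process().cpu_times()
    return {"user": t.user, "system": t.system}

# e.g. sample it before and after the stuck period:
#     client.run_on_scheduler(cpu_split)
```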

@hendrikmakait
Member

While root-causing #316, we discovered a bug in scaled_array_shape that likely affects test_basic_sum as well.

@gjoseph92
Contributor

Looks like it's not a bug in scaled_array_shape, but rather in dask.array.core.normalize_chunks. For certain shape arguments, it will return chunk sizes that are wildly different from the requested chunk size, but if the shape is 1 larger or smaller, it will do as requested.

I'm looking into it and will open an issue over there.
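
A quick way to probe that behaviour directly (the shapes below are placeholders for illustration, not the actual failing case):

```python
from dask.array.core import normalize_chunks

# Compare the chunking that "auto" produces for neighbouring shapes against the
# requested limit; the bug shows up as wildly different chunk sizes for shapes
# that differ by 1.
for n in (9_999, 10_000, 10_001):
    chunks = normalize_chunks("auto", shape=(n, n), dtype="float64", limit="128 MiB")
    print(n, chunks[0][0], chunks[1][0])
```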

@gjoseph92
Contributor

Not clear if this is the only thing causing the variation, but it's certainly not helping.

@crusaderky
Contributor Author

[Screenshot: 2022-10-20 01-49-36]

From the raw DB dump, I read that the "bad" runs have many more network transfers (transfer_time) than the "good" ones, as well as substantially more memory duplication, which in turn causes spilling (spill_time).

It looks like co-assignment is occasionally and randomly falling apart for some reason?
I'm continuing the investigation.

@crusaderky
Contributor Author

crusaderky commented Oct 20, 2022

Healthy run:

dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c2d95249/benchmarks.test_array.py.test_basic_sum.msgpack.gz
logs: https://cloud.coiled.io/dask-engineering/clusters/95408/details
grafana: http://35.86.202.18:3000/d/eU1bT-nVz/cluster-metrics-prometheus?from=1665880543885&to=1665880635687&var-cluster=test_array-c2d95249
[screenshot]

Bad run:

dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c6668c2c/benchmarks.test_array.py.test_basic_sum.msgpack.gz
logs: https://cloud.coiled.io/dask-engineering/clusters/95114/details
grafana: http://35.86.202.18:3000/d/eU1bT-nVz/cluster-metrics-prometheus?from=1665794041518&to=1665794299449&var-cluster=test_array-c6668c2c
[screenshot]
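
For anyone else digging in: the dumps above are gzipped msgpack blobs, so they can be loaded locally with something like this (the filename is a placeholder for a downloaded copy; the exact top-level layout is whatever dump_cluster_state wrote):

```python
import gzip

import msgpack

# Download the .msgpack.gz artifact from the S3 location above first.
with gzip.open("benchmarks.test_array.py.test_basic_sum.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f, strict_map_key=False)

print(list(state))  # expect scheduler- and worker-level sections
```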

@gjoseph92
Contributor

gjoseph92 commented Oct 20, 2022 via email

@crusaderky
Contributor Author

Holy work stealing, Batman!
There are 643 ready->released worker transitions in the good run vs. 6419 in the bad one.
For scale, there are 4921 tasks in total.

In the good run, 13% of the tasks end up stolen (which already feels quite high).
In the bad run, each task is stolen 1.3 times on average!

| start | recs | end | bad | good | delta |
| --- | --- | --- | --- | --- | --- |
| cancelled | memory | released | 19 | 3 | 16 |
| executing | memory | memory | 4922 | 4921 | 1 |
| fetch | flight | flight | 2087 | 205 | 1882 |
| fetch | released | released | 15 | 0 | 15 |
| flight | memory | memory | 2068 | 202 | 1866 |
| flight | released | cancelled | 19 | 3 | 16 |
| memory | released | released | 6990 | 5123 | 1867 |
| ready | executing | executing | 4922 | 4921 | 1 |
| ready | released | released | 6419 | 643 | 5776 |
| released | fetch | fetch | 2102 | 205 | 1897 |
| released | forgotten | forgotten | 13466 | 5772 | 7694 |
| released | waiting | waiting | 11364 | 5567 | 5797 |
| waiting | ready | ready | 11341 | 5564 | 5777 |
| waiting | released | released | 23 | 3 | 20 |
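
Roughly, a tally like the one above can be produced by counting (start, recommended, end) triplets from the transition logs in the dumps; a minimal sketch of that counting step (the tuple layout assumed below may not match the dump format exactly):

```python
from collections import Counter

def tally_transitions(log):
    """Count (start, recommended, end) triplets from an iterable of transition-log
    entries assumed to be shaped like (key, start, recommended, end, ...)."""
    counts = Counter()
    for _key, start, recommended, end, *_rest in log:
        counts[start, recommended, end] += 1
    return counts

# e.g. compare tally_transitions(good_log) against tally_transitions(bad_log)
```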

> Can we see if a worker has left at any point?

No workers left.

> is it possible the initial task assignment happens before all workers have arrived?

No, I see that the first task transition is 2 seconds after the last worker joined the cluster.

@crusaderky
Contributor Author

crusaderky commented Oct 21, 2022

It looks like work stealing stole equally from all workers, thus achieving a net-zero rebalancing effect!

Number of tasks lost to work stealing:
[chart]

Note how, at every work stealing iteration, the beneficiaries of stealing (workers without a circle) become victims (workers with a circle) at the next iteration. Tasks are being needlessly ping-ponged around.
[visualization]

@crusaderky
Contributor Author

Task durations.
The first, fast one on the top left is the most important one, as it is the first aggregation layer (3688 tasks).
The second layer is 923 tasks, the third is roughly a quarter of the second, and so on.

[visualization]

It's interesting to see here how the bad run, which is spilling a lot more, has much longer "flares" of outliers. Those are all moments where the worker's event loop was busy spilling; this inflated the measured durations, which in turn might have caused improper stealing choices.
However, the poor stealing decisions started well before the workers started spilling, so I don't think this is the root cause.

@crusaderky
Contributor Author

I'm looking now at coiled-upstream (dask 2022.08.0 ~ 2022.10.0) vs. coiled-latest (2022.6.0) and coiled-0.1.0 (2022.6.0) and the noise is completely gone in the newer releases.
So I'm inclined to close this issue without further investigation.

upstream
[chart]

latest
[chart]

0.1.0
[chart]

@fjetter
Member

fjetter commented Oct 21, 2022

If work stealing is under investigation, it's worth looking at the worker idleness detection as well. Work stealing should only affect workers that are flagged as idle; if that doesn't work properly, work stealing can cause weird things. This should be more reliable in the latest releases, but I still wouldn't be surprised to see bad things happening.

In later versions, stealing uses the worker_objective to determine a good thief, but this still breaks co-assignment (we'd need something like dask/distributed#7141 to avoid breaking co-assignment).

@hendrikmakait
Member

> I'm looking now at coiled-upstream (dask 2022.08.0 ~ 2022.10.0) vs. coiled-latest (2022.6.0) and coiled-0.1.0 (2022.6.0) and the noise is completely gone in the newer releases. So I'm inclined to close this issue without further investigation.

Good to see that the recent changes to work stealing seem to have removed the erratic behavior. Some of the issues that were fixed (including work stealing going overboard and stealing way too much) could explain the behavior observed here.
