
Workers in a LocalCluster appear to stall at indeterminate time through computation #3878

KrishanBhasin opened this issue Jun 10, 2020 · 4 comments

@KrishanBhasin (Contributor)

I am still trying to narrow down the source of my issue, but I am raising it now in case someone is able to jump in with some insight that helps me either solve it or find an MRE.

What happened:

I create a LocalCluster with between 32 and 90 workers (depending on the size of the server I am running on).

I create and submit a rather large task graph (~3 million tasks) composed entirely of Dask DataFrame operations.
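
For concreteness, a minimal sketch of the kind of workflow described - not the actual code from this report; the worker count, file paths and index column are illustrative assumptions:

```python
import dask.dataframe as dd
from distributed import Client, LocalCluster

# Cluster and client are created up front, as in the original workflow.
cluster = LocalCluster(n_workers=64)   # real runs use 32-90 workers
client = Client(cluster)

# Build a large graph of pure DataFrame operations...
ddf = dd.read_csv("input-*.csv")       # illustrative path
ddf = ddf.set_index("id")              # illustrative column

# ...and submit it; with enough partitions this yields a graph on the
# order of millions of tasks.
ddf.to_csv("output-*.csv")             # illustrative path
```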

At indeterminate points during the computation, some of my workers abruptly drop
to 0-4% CPU usage. They still hold keys in memory and still have tasks listed
in their processing queue, but they make no further progress.

If I do not intervene, all other workers exhaust their processing queues (taking
new tasks from the scheduler & completing them) until
the only tasks remaining are those that depend on tasks currently in the queue
of the stalled worker(s). At this point, all workers in the cluster sit idle.

If I manually SIGTERM the offending worker(s), the computation is able to finish successfully.

I originally thought I might be seeing an instance of #3761, but now I think my problem is different.

Often (but not always), the stalled workers have a status of closing. More
often (almost always, but not 100% of the time), the workers emit a log line along the lines of Stopping worker at <Worker 127.0.0.1:34334>, which tells me something is invoking the Worker.close() method.
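
As a point of reference (not from the original report), one way to poll each worker's status from the client while the computation runs is via Client.run; a minimal sketch, with an assumed scheduler address:

```python
from distributed import Client

# Connect to the already-running cluster; the address is illustrative.
client = Client("tcp://127.0.0.1:8786")

def worker_state(dask_worker):
    # Client.run injects the local Worker object as `dask_worker`.
    # .status reports states such as "running" or "closing", and
    # .data holds the keys currently kept in memory on that worker.
    return {"status": str(dask_worker.status),
            "keys_in_memory": len(dask_worker.data)}

# Returns a dict mapping worker address -> the summary above.
print(client.run(worker_state))
```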

I've noticed that in general for every worker that hits this state, there are
~3-4 other workers that close successfully.

After increasing the connection timeouts
(distributed.comm.timeouts.tcp and distributed.comm.timeouts.connect) to 90s each, I have yet to encounter
this problem again.
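
For anyone wanting to try the same workaround, a minimal sketch of one way to apply it in code before the cluster is created; the keys are the standard distributed config options and the 90s values mirror those mentioned above:

```python
import dask

# Raise both comm timeouts before constructing the LocalCluster/Client
# so the scheduler and workers pick up the new values.
dask.config.set({
    "distributed.comm.timeouts.connect": "90s",
    "distributed.comm.timeouts.tcp": "90s",
})
```

The same keys can also be set in a Dask configuration YAML file or via environment variables such as DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT.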

I have two problems:

  1. I am unable to determine why the workers are being closed. I suspect it is
    related to them timing out
  2. The workers hang on to tasks, meaning that my computation stalls and cannot
    recover without human intervention.

I can live with 1 as long as 2 doesn't happen, but ideally I'd like to understand why
1 is occurring so I can fix the source of the problem.

What you expected to happen:

The workers to close successfully, or not be closed at all.

Minimal Complete Verifiable Example:

I have yet to create an MRE for this, and am losing hope that I will succeed in doing so.

Anything else we need to know?:

I realise that this bug report might not contain enough detail to work with - I am sharing it in the hope that someone may point me in a direction to dig further. I will post updates as I uncover more.

Environment:

  • Dask/Distributed version: 2.18.0 (also encountered on 2.16.0, 2.17.0)
  • Python version: 3.7.7
  • Operating System: Linux
  • Install method (conda, pip, source): Conda
@quasiben (Member)

If you suspect #3761, have you tested with work stealing turned off?
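
For anyone following along, a minimal sketch of one way to turn work stealing off for a LocalCluster; the config key is the standard distributed setting, and the worker count is illustrative:

```python
import dask
from distributed import Client, LocalCluster

# Disable work stealing before the scheduler starts, so it never
# reassigns queued tasks between workers.
dask.config.set({"distributed.scheduler.work-stealing": False})

cluster = LocalCluster(n_workers=64)
client = Client(cluster)
```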

@KrishanBhasin (Contributor, Author)

Just a little update on this: I'm still seeing it occur with work stealing disabled.

I have a suspicion that it may be related to the fact that I create a client/LocalCluster long before I actually trigger a compute. If I break up the workflow so that a new client/LocalCluster is created right before I call ddf.to_csv(), this issue no longer seems to occur (see the sketch below).
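
A minimal sketch of that reworked ordering - paths, the index column and worker count are illustrative assumptions, not the actual code:

```python
import dask.dataframe as dd
from distributed import Client, LocalCluster

# Build the DataFrame graph first...
ddf = dd.read_csv("input-*.csv")
ddf = ddf.set_index("id")    # in this sketch, runs on the default local scheduler

# ...and only create the LocalCluster/Client immediately before the
# final compute, so the workers are not sitting idle beforehand.
cluster = LocalCluster(n_workers=64)
client = Client(cluster)

ddf.to_csv("output-*.csv")   # the only step executed on the fresh cluster
```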

I am not sure how the workers being mostly idle for long periods before the compute (apart from scanning input CSVs during read_csv() and the bits of work triggered by set_index() while creating the very large ddf) could cause this.

@GFleishman

Hi Krishan - you said this no longer occurs when you increase distributed.comm.timeouts.connect/tcp? Is that still the case, or have you observed it even with very large values for those config settings? I have a similar issue: #4724. I have increased those comm values and will experiment with increasing them further, but your experience here would be helpful to know about. Thanks!

@KrishanBhasin (Contributor, Author)

Hey @GFleishman,
Unfortunately our application went through several major changes which, as a side effect, solved this problem for us, so I'm unable to provide any useful advice; I think we ended up reverting to the default timeouts after a while, as the problem stopped cropping up.
Apologies I can't be more helpful!
