
Workers in a LocalCluster appear to stall at indeterminate time through computation #3878

KrishanBhasin opened this issue Jun 10, 2020 · 4 comments

@KrishanBhasin (Contributor)

I am still trying to narrow down the source of my issue, but I am raising it now in case someone is able to jump in with some insight that helps me either solve it or find an MRE.

What happened:

I create a LocalCluster with between 32 and 90 workers (depending on the size of the server I am running on).

I create and submit a rather large task graph (~3 million tasks) composed entirely of Dask DataFrame operations.
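
For concreteness, a minimal sketch of the kind of workflow described - not the actual code from this report; the worker count, file paths and index column are illustrative assumptions:

```python
import dask.dataframe as dd
from distributed import Client, LocalCluster

# Cluster and client are created up front, as in the original workflow.
cluster = LocalCluster(n_workers=64)   # real runs use 32-90 workers
client = Client(cluster)

# Build a large graph of pure DataFrame operations...
ddf = dd.read_csv("input-*.csv")       # illustrative path
ddf = ddf.set_index("id")              # illustrative column

# ...and submit it; with enough partitions this yields a graph on the
# order of millions of tasks.
ddf.to_csv("output-*.csv")             # illustrative path
```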

At indeterminate points during the computation, some of my workers abruptly drop
to 0-4% CPU usage. They still hold keys in memory and still have tasks listed
in their processing queue, but they make no further progress.

If I do not intervene, all other workers exhaust their processing queues (taking
new tasks from the scheduler & completing them) until
the only tasks remaining are those that depend on tasks currently in the queue
of the stalled worker(s). At this point, all workers in the cluster sit idle.

If I manually SIGTERM the offending worker(s), the computation is able to finish successfully.

I originally thought I might be seeing an instance of #3761, but now I think my problem is different.

Often (but not always), the stalled workers have a status of closing. More
often (almost always, but not 100% of the time), the workers emit a log line along the lines of Stopping worker at <Worker 127.0.0.1:34334>, which tells me something is invoking the Worker.close() method.
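
As a point of reference (not from the original report), one way to poll each worker's status from the client while the computation runs is via Client.run; a minimal sketch, with an assumed scheduler address:

```python
from distributed import Client

# Connect to the already-running cluster; the address is illustrative.
client = Client("tcp://127.0.0.1:8786")

def worker_state(dask_worker):
    # Client.run injects the local Worker object as `dask_worker`.
    # .status reports states such as "running" or "closing", and
    # .data holds the keys currently kept in memory on that worker.
    return {"status": str(dask_worker.status),
            "keys_in_memory": len(dask_worker.data)}

# Returns a dict mapping worker address -> the summary above.
print(client.run(worker_state))
```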

I've noticed that in general for every worker that hits this state, there are
~3-4 other workers that close successfully.

After increasing the connection timeouts
(distributed.comm.timeouts.tcp and distributed.comm.timeouts.connect) to 90s each, I have yet to encounter
this problem again.
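
For anyone wanting to try the same workaround, a minimal sketch of one way to apply it in code before the cluster is created; the keys are the standard distributed config options and the 90s values mirror those mentioned above:

```python
import dask

# Raise both comm timeouts before constructing the LocalCluster/Client
# so the scheduler and workers pick up the new values.
dask.config.set({
    "distributed.comm.timeouts.connect": "90s",
    "distributed.comm.timeouts.tcp": "90s",
})
```

The same keys can also be set in a Dask configuration YAML file or via environment variables such as DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT.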

I have two problems:

  1. I am unable to determine why the workers are being closed. I suspect it is
    related to them timing out
  2. The workers hang on to tasks, meaning that my computation stalls and cannot
    recover without human intervention.

I can live with 1 as long as 2 doesn't happen, but ideally I'd like to understand why
1 is occurring so I can fix the source of the problem.

What you expected to happen:

The workers to close successfully, or not be closed at all.

Minimal Complete Verifiable Example:

I have yet to create an MRE for this, and am losing hope that I will succeed in doing so.

Anything else we need to know?:

I realise that this bug report might not contain enough detail to work with - I am sharing it in the hope that someone may point me in a direction to dig further. I will post updates as I uncover more.

Environment:

  • Dask/Distributed version: 2.18.0 (also encountered on 2.16.0, 2.17.0)
  • Python version: 3.7.7
  • Operating System: Linux
  • Install method (conda, pip, source): Conda
@quasiben (Member)

If you suspect #3761, have you tested with work stealing turned off?
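
For anyone following along, a minimal sketch of one way to turn work stealing off for a LocalCluster; the config key is the standard distributed setting, and the worker count is illustrative:

```python
import dask
from distributed import Client, LocalCluster

# Disable work stealing before the scheduler starts, so it never
# reassigns queued tasks between workers.
dask.config.set({"distributed.scheduler.work-stealing": False})

cluster = LocalCluster(n_workers=64)
client = Client(cluster)
```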

@KrishanBhasin (Contributor, Author)

Just a little update on this: I'm still seeing it occur with work stealing disabled.

I have a suspicion that it may be related to the fact that I create a client/LocalCluster long before I actually trigger a compute. If I break up the workflow so that a new client/LocalCluster is created right before I call ddf.to_csv(), this issue no longer seems to occur (see the sketch below).
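
A minimal sketch of that reworked ordering - paths, the index column and worker count are illustrative assumptions, not the actual code:

```python
import dask.dataframe as dd
from distributed import Client, LocalCluster

# Build the DataFrame graph first...
ddf = dd.read_csv("input-*.csv")
ddf = ddf.set_index("id")    # in this sketch, runs on the default local scheduler

# ...and only create the LocalCluster/Client immediately before the
# final compute, so the workers are not sitting idle beforehand.
cluster = LocalCluster(n_workers=64)
client = Client(cluster)

ddf.to_csv("output-*.csv")   # the only step executed on the fresh cluster
```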

I am not sure how the workers being mostly idle for long periods before the compute (apart from scanning input CSVs during read_csv() and the bits of work triggered by set_index() while creating the very large ddf) could cause this.

@GFleishman

Hi Krishan - you said this no longer occurs when you increase distributed.comm.timeouts.connect/tcp? Is that still the case, or have you observed it even with very large values for those config settings? I have a similar issue: #4724. I have increased those comm values and will experiment with increasing them further, but your experience here would be helpful to know about. Thanks!

@KrishanBhasin (Contributor, Author)

Hey @GFleishman,
Unfortunately our application went through several major changes which, as a side effect, solved this problem for us, so I'm unable to provide any useful advice; I think we ended up reverting to the default timeouts after a while, as the problem stopped cropping up.
Apologies I can't be more helpful!
