Hardcoded timeouts lead to complete teardown of communication layer #4118

Open
jglaser opened this issue Sep 21, 2020 · 5 comments

@jglaser

jglaser commented Sep 21, 2020

What happened:

A long-running calculation (query 2 of the TPCx-BB benchmark with dask/RAPIDS) using UCX consistently crashes in the distributed communication layer. I am using 162 workers. The error reproducibly shows up as

Task exception was never retrieved
future: <Task finished coro=<connect.<locals>._() done, defined at /gpfs/alpine/world-shared/bif128/rapids-env/lib/python3.7/site-packages/distributed-2.25.0+6.g73fa9bd-py3.7.egg/distributed/comm/core.py:288> exception=CommClosedError()>
Traceback (most recent call last):
  File "/gpfs/alpine/world-shared/bif128/rapids-env/lib/python3.7/site-packages/distributed-2.25.0+6.g73fa9bd-py3.7.egg/distributed/comm/core.py", line 297, in _
    handshake = await asyncio.wait_for(comm.read(), 1)
  File "/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/python-3.7.0-ei3mpdncii74xsn55t5kxpuc46i3oezn/lib/python3.7/asyncio/tasks.py", line 405, in wait_for
    await waiter
concurrent.futures._base.CancelledError

in the log files, followed by many pages of other errors.

What you expected to happen:

Successful completion of the calculation.

Minimal Complete Verifiable Example:

N/A

Anything else we need to know?:

After investigating, it turns out that distributed sets some hard-coded, ad-hoc timeouts in the code. When these expire, asyncio raises a CancelledError, which in turn takes down the UCX endpoint and, by a chain reaction, the entire cluster, including the scheduler.
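
To illustrate the failure mode, here is a minimal standalone sketch (it is not the distributed code path; slow_read stands in for comm.read()): asyncio.wait_for cancels the awaited coroutine when its timeout expires, so a hard-coded timeout surfaces as a CancelledError inside the read.

    import asyncio

    async def slow_read():
        # Stand-in for comm.read() on a busy cluster: takes longer than
        # the hard-coded timeout allows.
        try:
            await asyncio.sleep(10)
        except asyncio.CancelledError:
            # wait_for cancels the awaited coroutine when its timeout expires;
            # in distributed this cancellation propagates into the comm/UCX layer.
            print("read was cancelled")
            raise

    async def main():
        try:
            await asyncio.wait_for(slow_read(), timeout=1)  # hard-coded 1 s
        except asyncio.TimeoutError:
            print("handshake timed out after 1 s")

    asyncio.run(main())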

Example 1: (actually looks like a bug)

_(), timeout=min(deadline - time(), retry_timeout_backoff)

Context:

                with suppress(TimeoutError):
                    comm = await asyncio.wait_for(
                        _(), timeout=min(deadline - time(), retry_timeout_backoff)
                    )

Here, the timeout is taken as the minimum of the remaining time and a randomly generated backoff between 1.4 and 1.6 s (retry_timeout_backoff). I assume the maximum is what was intended here, i.e.

                        _(), timeout=max(deadline - time(), retry_timeout_backoff)

The current implementation therefore caps each wait at less than two seconds instead of honoring the configured distributed.comm.timeouts.connect value.
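
A small sketch of the effect (values taken from the description above, not the actual distributed source):

    import random
    from time import time

    connect_timeout = 30.0                            # e.g. distributed.comm.timeouts.connect
    deadline = time() + connect_timeout
    retry_timeout_backoff = random.uniform(1.4, 1.6)  # backoff range described above

    # Current behaviour: min() caps every attempt at the backoff (< 2 s),
    # no matter how large the configured connect timeout is.
    print(min(deadline - time(), retry_timeout_backoff))  # ~1.4-1.6

    # Proposed behaviour: max() lets an attempt use the full remaining time.
    print(max(deadline - time(), retry_timeout_backoff))  # ~30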

Example 2

In the communication handshake, here and here, the timeouts are hardcoded to one second. This causes the same problem, and they should be made configurable. To work around my issue, I multiplied these timeouts by a factor of 100, or set them to deadline - time(), respectively.
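
One way the handshake read could be made configurable (an illustrative sketch, not the actual patch; it reuses the existing distributed.comm.timeouts.connect key, and read_handshake is a hypothetical helper):

    import asyncio
    import dask
    from dask.utils import parse_timedelta

    async def read_handshake(comm):
        # Instead of the hardcoded: handshake = await asyncio.wait_for(comm.read(), 1)
        # take the timeout from the configured connect timeout.
        timeout = parse_timedelta(
            dask.config.get("distributed.comm.timeouts.connect"), default="seconds"
        )
        return await asyncio.wait_for(comm.read(), timeout)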

Environment:

  • Dask version: 2.25.0
  • Python version: 3.7.0
  • Operating System: Linux
  • Install method (conda, pip, source): source
@jglaser
Author

jglaser commented Sep 21, 2020

See also a similar issue #4103

@mrocklin
Member

cc @quasiben

@quasiben
Member

I don't think we've seen this particular timeout in the past when running TPC benchmarks. We have run with the following settings:

export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="100s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="600s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN="1s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MAX="60s"
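
(For reference, a sketch of the programmatic equivalent; the environment variables above map mechanically to these dask config keys, with double underscores becoming dots:)

    import dask

    # Equivalent of the environment variables above.
    dask.config.set({
        "distributed.comm.timeouts.connect": "100s",
        "distributed.comm.timeouts.tcp": "600s",
        "distributed.comm.retry.delay.min": "1s",
        "distributed.comm.retry.delay.max": "60s",
    })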

Do you know if this timeout occurred in the worker or the scheduler?

cc @beckernick in case he has other thoughts on timeouts

@jglaser
Author

jglaser commented Sep 21, 2020

@quasiben the timeout first appears in the workers

@fjetter
Member

fjetter commented Oct 21, 2020

I've seen similar errors for TCP/TLS comms and came up with #4176. See also #4167.
