Hardcoded timeouts lead to complete teardown of communication layer #4118
Comments
See also a similar issue #4103

cc @quasiben

I don't think we've seen this particular timeout in the past when running TPC benchmarks. We have run with the following settings:

Do you know if this timeout occurred in the worker or the scheduler? cc @beckernick in case he has other thoughts on timeouts

@quasiben the timeout first appears in the workers
What happened:
A long-running calculation (TPC-X BB benchmark with dask/RAPIDS, query 2) with UCX always crashes in the `distributed` communication layer. I am using 162 workers. The error reproducibly shows up in the log files, followed by many pages of other errors.
What you expected to happen:
Successful completion of the calculation.
Minimal Complete Verifiable Example:
N/A
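Though no minimal reproducer is available, the underlying mechanism described below — a hardcoded timeout whose expiry cancels the coroutine it wraps — can be sketched in isolation. All names here (`handshake`, `connect`) are illustrative, not `distributed`'s actual code:

```python
import asyncio

async def handshake(duration):
    """Stand-in for a comm handshake that takes `duration` seconds."""
    await asyncio.sleep(duration)
    return "connected"

async def connect(duration, timeout):
    """Wrap the handshake in a fixed timeout, as the comm layer does.

    When the timeout expires, asyncio cancels the inner coroutine
    (it receives a CancelledError) and wait_for raises TimeoutError.
    """
    try:
        return await asyncio.wait_for(handshake(duration), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"

# A slow handshake loses against a short hardcoded timeout:
print(asyncio.run(connect(duration=0.2, timeout=0.05)))  # timed out
# The same handshake succeeds when the timeout is generous:
print(asyncio.run(connect(duration=0.05, timeout=1.0)))  # connected
```

In the failure reported here, that cancellation is not absorbed this cleanly: per the description below, the `CancelledError` propagates into the UCX endpoint and tears down the cluster.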
Anything else we need to know?:
After investigating, it turns out that `distributed` sets some hard-coded ad-hoc timeouts in the code. When these expire, `asyncio` generates a `CancelledError` exception, which in turn takes down the UCX endpoint and, by a chain reaction, the entire cluster including the scheduler.

Example 1 (actually looks like a bug):
`distributed/distributed/comm/core.py`, line 318 in ecaf140

Context:
Here, the timeout is taken as the minimum of the remaining time and an ad-hoc generated random time between 1.4 and 1.6 s (`retry_timeout_backoff`). I assume the maximum is what was intended here. The current implementation limits the maximum waiting time to always less than two seconds, instead of the time configured in `distributed.comm.timeouts.connect`.

Example 2
In the communication handshake, here and here, the timeouts are hardcoded to one second. This causes the same problem, and they should be made configurable. To work around my issue, I multiplied these timeouts by a factor of 100, or set them to `deadline - time()`, respectively.

Environment:
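The min-versus-max point from Example 1 can be sketched as follows. The names `retry_timeout_backoff`, `deadline`, and the config key follow the report above; the function wrappers are hypothetical, not `distributed`'s actual code:

```python
import random
from time import time

def current_timeout(deadline):
    """Current behavior around comm/core.py line 318, as described above:
    min() caps every wait at the 1.4-1.6 s random backoff, so the budget
    configured in distributed.comm.timeouts.connect is never honored."""
    retry_timeout_backoff = random.uniform(1.4, 1.6)
    return min(deadline - time(), retry_timeout_backoff)

def proposed_timeout(deadline):
    """Suggested fix: max() keeps the full remaining budget
    (deadline - time()) whenever it exceeds the backoff."""
    retry_timeout_backoff = random.uniform(1.4, 1.6)
    return max(deadline - time(), retry_timeout_backoff)

# With ~30 s of connect budget remaining, the current code still waits
# at most ~1.6 s per attempt, while the proposed fix uses the budget:
deadline = time() + 30
assert current_timeout(deadline) <= 1.6
assert proposed_timeout(deadline) > 2
```

The Example 2 workaround follows the same pattern: replacing a hardcoded 1 s handshake timeout with `deadline - time()` ties the wait to the configured connection budget instead of an arbitrary constant.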