I ran into a case today where repeatedly restarting my cluster ends up deadlocking (or at least taking long enough that things time out). My actual use case is running regularly scheduled Prefect tasks which, in part, restart a `LocalCluster` (this is to avoid buildup from a memory leak in another library). Here's a reproducer:
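A minimal sketch (the `LocalCluster` setup here is an assumption; the loop matches `Cell In[3]` in the traceback, and the 4 workers match the error message below):

```python
from dask.distributed import Client, LocalCluster

# Assumed setup: 4 workers, to match "Waited for 4 worker(s)" in the error below.
cluster = LocalCluster(n_workers=4)
c = Client(cluster)

# The loop from Cell In[3] in the traceback: restart repeatedly until it hangs.
for i in range(20):
    print(f"{i=}")
    c.restart()
```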
which consistently raises the following error for me locally:
```
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
Cell In[3], line 3
      1 for i in range(20):
      2     print(f"{i=}")
----> 3     c.restart()

File ~/projects/dask/distributed/distributed/client.py:3648, in Client.restart(self, timeout, wait_for_workers)
   3618 def restart(self, timeout=no_default, wait_for_workers=True):
   3619     """
   3620     Restart all workers. Reset local state. Optionally wait for workers to return.
   3621
   (...)
   3646     Client.restart_workers
   3647     """
-> 3648     return self.sync(
   3649         self._restart, timeout=timeout, wait_for_workers=wait_for_workers
   3650     )

File ~/projects/dask/distributed/distributed/utils.py:358, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    356     return future
    357 else:
--> 358     return sync(
    359         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    360     )

File ~/projects/dask/distributed/distributed/utils.py:434, in sync(loop, func, callback_timeout, *args, **kwargs)
    431     wait(10)
    433 if error is not None:
--> 434     raise error
    435 else:
    436     return result

File ~/projects/dask/distributed/distributed/utils.py:408, in sync.<locals>.f()
    406     awaitable = wait_for(awaitable, timeout)
    407     future = asyncio.ensure_future(awaitable)
--> 408     result = yield future
    409 except Exception as exception:
    410     error = exception

File ~/mambaforge/envs/distributed/lib/python3.11/site-packages/tornado/gen.py:767, in Runner.run(self)
    765 try:
    766     try:
--> 767         value = future.result()
    768     except Exception as e:
    769         # Save the exception for later. It's important that
    770         # gen.throw() not be called inside this try/except block
    771         # because that makes sys.exc_info behave unexpectedly.
    772         exc: Optional[Exception] = e

File ~/projects/dask/distributed/distributed/client.py:3615, in Client._restart(self, timeout, wait_for_workers)
   3612 if timeout is not None:
   3613     timeout = parse_timedelta(timeout, "s")
-> 3615 await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
   3616 return self

File ~/projects/dask/distributed/distributed/core.py:1395, in PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc(**kwargs)
   1393 prev_name, comm.name = comm.name, "ConnectionPool." + key
   1394 try:
-> 1395     return await send_recv(comm=comm, op=key, **kwargs)
   1396 finally:
   1397     self.pool.reuse(self.addr, comm)

File ~/projects/dask/distributed/distributed/core.py:1179, in send_recv(comm, reply, serializers, deserializers, **kwargs)
   1177     _, exc, tb = clean_exception(**response)
   1178     assert exc
-> 1179     raise exc.with_traceback(tb)
   1180 else:
   1181     raise Exception(response["exception_text"])

File ~/projects/dask/distributed/distributed/core.py:970, in _handle_comm()
    968 result = handler(**msg)
    969 if inspect.iscoroutine(result):
--> 970     result = await result
    971 elif inspect.isawaitable(result):
    972     raise RuntimeError(
    973         f"Comm handler returned unknown awaitable. Expected coroutine, instead got {type(result)}"
    974     )

File ~/projects/dask/distributed/distributed/utils.py:832, in wrapper()
    830 async def wrapper(*args, **kwargs):
    831     with self:
--> 832         return await func(*args, **kwargs)

File ~/projects/dask/distributed/distributed/scheduler.py:6292, in restart()
   6284 if (n_nanny := len(nanny_workers)) < n_workers:
   6285     msg += (
   6286         f" The {n_workers - n_nanny} worker(s) not using Nannies were just shut "
   6287         "down instead of restarted (restart is only possible with Nannies). If "
   (...)
   6290         "will always time out. Do not use `Client.restart` in that case."
   6291     )
-> 6292 raise TimeoutError(msg) from None
   6293 logger.info("Restarting finished.")

TimeoutError: Waited for 4 worker(s) to reconnect after restarting, but after 120s, only 0 have returned. Consider a longer timeout, or `wait_for_workers=False`.
```
A few additional things to note:
- Things don't hang consistently on the same `for`-loop iteration in the reproducer, but they do consistently hang within 20 iterations (at least for me locally).
- I don't see the same behavior when using a `coiled.Cluster` (so far I've only tried a `LocalCluster` and a `coiled.Cluster`).
- I tried specifying a larger `timeout="4 minutes"`, which also eventually timed out.
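For reference, the `wait_for_workers=False` escape hatch that the error message suggests would look roughly like the sketch below; I haven't verified that it avoids the hang.

```python
# Sketch of the workaround suggested by the error message: return from
# restart() without blocking on workers, then wait for them explicitly
# with a timeout we control (Client.wait_for_workers is existing API).
c.restart(wait_for_workers=False)
c.wait_for_workers(n_workers=4, timeout=120)
```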