Identify lack of scalability in gwas_linear_regression #390

eric-czech · 2020-11-17T18:47:54Z

It appears that this function does not scale well when run on a cluster.

Notes from my most recent attempt:

The code I ran is here: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/4f862e31b8093d25fdaa8da7f841b9be8583cda4/scripts/gwas.py#L268
This works on a dataset with fewer variants (chr XY, which has ~8k variants compared to ~141k for chr 21)
The operation utilizes only one or two workers in a cluster of 20 n1-highmem-8 instances:

CPU utilization across worker VMs

Status Page

Drilling in on one of the workers that is running all the tasks, I see that the only not obviously parallelizable task it seems to be running is "solve-triangular":

The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".

Several of the workers show log messages like this:

distributed.worker - INFO - Dependent not found: ('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler

Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.

I will run this again on a smaller dataset that didn't fail to get a performance report and task graph screenshot (which doesn't work on this data because the UI won't render so many nodes).

eric-czech · 2020-11-17T19:13:41Z

Notes from a more detailed performance report resulting from running this for a smaller dataset (that succeeds):

(renamed to .txt to avoid github attachment filter)
gwas-height-performance-report.html.txt

This zoomed-out view of the task stream in the report doesn't strike me as very healthy:

Task graph:

@mrocklin (cc: @ravwojdyla) do you have any suggestions on how to identify why the work isn't being distributed well on larger datasets for this workflow?

mrocklin · 2020-11-17T19:58:50Z

Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ .

I've only looked very briefly at it, but the thing that stands out the most is the 760s transfer times leading up to sum tasks. At expected bandwidths, these would be 150GB payloads, which I'm assuming is higher than you're expecting. It's also odd in how synchronized these transfers are, they all end within a few seconds of each other.

mrocklin · 2020-11-17T20:03:57Z

ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs

I'm curious, do worker logs report anything strange?

Also I'm curious, which version of distributed are you running? (adding this to the performance report here: dask/distributed#4249)

cc'ing @quasiben , who cares a bit about this space and has a lot of experience tracking down similar performance problems.

mrocklin · 2020-11-17T20:06:11Z

Also it looks like some of the workers are under memory pressure. I wonder if maybe they're struggling to accept a large-ish payload.

…

On Tue, Nov 17, 2020 at 11:13 AM Eric Czech ***@***.***> wrote: Notes from a more detailed performance report resulting from running this for a smaller dataset (that succeeds): (renamed to .txt to avoid github attachment filter) gwas-height-performance-report.html.txt <https://github.com/pystatgen/sgkit/files/5555640/gwas-height-performance-report.html.txt> This zoomed-out view of the task stream in the report doesn't strike me as very healthy: [image: Screen Shot 2020-11-17 at 2 07 57 PM] <https://user-images.githubusercontent.com/6130352/99435626-62c72080-28de-11eb-9627-560ead8131e8.png> Task graph: [image: Screen Shot 2020-11-17 at 1 54 52 PM] <https://user-images.githubusercontent.com/6130352/99435949-d406d380-28de-11eb-8097-2d3ba8e9bdaf.png> @mrocklin <https://github.com/mrocklin> (cc: @ravwojdyla <https://github.com/ravwojdyla>) do you have any suggestions on how to identify why the work isn't being distributed well on larger datasets for this workflow? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/pystatgen/sgkit/issues/390#issuecomment-729142413>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTEPOAG7V7AMMB4M3GTSQLDPHANCNFSM4TY6UJFA> .

quasiben · 2020-11-17T20:38:54Z

When I looked I saw some fairly lengthy disk-read/writes which I assume to be dask spilling. Which might correspond to the workers under memory pressure

mrocklin · 2020-11-17T20:43:16Z

Two minutes is a long time for a disk read/write. My initial guess is that there is some other external factor causing contention. That's just a wild guess though. Looking at the worker/thread ratio, I wonder if it would make more sense to have far more workers with fewer threads each. Perhaps try four threads per worker?

…

On Tue, Nov 17, 2020 at 12:39 PM Benjamin Zaitlen ***@***.***> wrote: When I looked I saw some fairly lengthy disk-read/writes which I assume to be dask spilling. Which might correspond to the workers under memory pressure [image: Screen Shot 2020-11-17 at 3 32 36 PM] <https://user-images.githubusercontent.com/1403768/99444378-20a3dc00-28ea-11eb-8890-fe2cf70240fd.png> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/pystatgen/sgkit/issues/390#issuecomment-729193609>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTGNBX6YOZINCQP2AZDSQLNO3ANCNFSM4TY6UJFA> .

eric-czech · 2020-11-17T20:44:16Z

Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ .

👍

I'm curious, do worker logs report anything strange?

I see messages like "Worker stream died during communication" so I'm sure a couple workers in the cluster had died. At the end I was down to 18 instead of 20. Full log from one worker is here jic.

When I looked I saw some fairly lengthy disk-read/writes which I assume to be dask spilling. Which might correspond to the workers under memory pressure

Thanks Ben, what do you make of the "transfer-sub" tasks (the long red bars)? Do you have any intuition for what's happening in those?

Memory pressure does seem to be part of the problem -- I rechunked my input to 1/16th of the original chunk size and the whole job has progressed further. Parallel utilization is still pretty disappointing across the whole cluster:

CPU utilization across workers after rechunking input to 1/16th of original

eric-czech · 2020-11-17T21:04:54Z

Looking at the worker/thread ratio, I wonder if it would make more sense to
have far more workers with fewer threads each. Perhaps try four threads
per worker?

Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

mrocklin · 2020-11-17T21:22:56Z

Maybe you can choose smaller VMs?

…

On Tue, Nov 17, 2020, 1:05 PM Eric Czech ***@***.***> wrote: Looking at the worker/thread ratio, I wonder if it would make more sense to have far more workers with fewer threads each. Perhaps try four threads per worker? Alright, I can try that. Hey @quasiben <https://github.com/quasiben>, how do you set the number of workers per VM in Cloud Provider? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/pystatgen/sgkit/issues/390#issuecomment-729213000>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTDB4HWUPL3D2RSO7C3SQLQQJANCNFSM4TY6UJFA> .

quasiben · 2020-11-17T21:53:51Z

Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

That's a good question. I don't think this is supported but shouldn't be too hard. You can control the number of threads with worker_options={"nthreads": 2 }. As @mrocklin suggest, this often is tuned with smaller VMs. Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM ?

eric-czech · 2020-11-17T22:00:26Z

Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM ?

You bet! dask/dask-cloudprovider#173

eric-czech · 2020-11-18T12:18:36Z

The larger dataset with smaller chunks did ultimately finish with no errors. Here are a couple readouts:

Performance report (26M): https://drive.google.com/file/d/1feWLKNrjQkslKDIZ7T39fPCNQQDBrQFs/view?usp=sharing

It doesn't seem like any of network, disk, or cpu are even close to being saturated so I assume there is some room for improvement.

As a very rough estimate, this task takes about 5 hrs on a single 64 vCPU VM and 3 hours on a cluster of 20 8 vCPU VMs (160 vCPUs). That would imply ~2 hours with perfect scaling so this should be an approximate ceiling for improvements.

Log: dask_gwas_chr21_log.txt

I will try again with smaller VMs and see if there are any major differences.

eric-czech · 2020-11-18T12:27:36Z

FYI @ravwojdyla and I have been talking a bit about some similar observations on a much simpler workflow in https://github.com/related-sciences/data-team/issues/38 (private). One conclusion there was that the individual objects on GCS are so small that API requests aren't efficient yet bigger chunk sizes would start to make the workloads fail. The GCS objects for a chunk are roughly 2MiB on disk but >100MB in memory. Another potential explanation for this behavior is that the GCS objects aren't being loaded asynchronously. We're still investigating both.

mrocklin · 2020-11-18T15:51:35Z

I don't know if you're aware of the work done by @martindurant on nicer async IO for remote storage and zarr, but he might be good to talk to.

It looks like you're still primarily blocked by a few oddly long transfers. They're less prominent now, which is good, but probably worth investigating further.

martindurant · 2020-11-18T16:14:53Z

(sorry, wrong thread)

mrocklin · 2020-11-18T16:15:35Z

@martindurant I suspect that this is not the issue for which you intended this comment :)

…

On Wed, Nov 18, 2020 at 8:14 AM Martin Durant ***@***.***> wrote: I had tracebacks off for bisect, but now it luckily failed on the third run :) (py37) /Users/mdurant/code/distributed ((dcb46d07...)|BISECTING)> pytest --count=20 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation -x ========================================================================================== test session starts =========================================================================================== platform darwin -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 rootdir: /Users/mdurant/code/distributed, configfile: setup.cfg plugins: repeat-0.8.0, flaky-3.7.0, hypothesis-5.26.0, cov-2.10.1, profiling-1.7.0 collected 20 items distributed/tests/test_failed_workers.py ..F ================================================================================================ FAILURES ================================================================================================ ______________________________________________________________________________ test_broken_worker_during_computation[3-20] _______________________________________________________________________________ def test_func(): result = None workers = [] with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop: async def coro(): with dask.config.set(config): s = False for i in range(5): try: s, ws = await start_cluster( nthreads, scheduler, loop, security=security, Worker=Worker, scheduler_kwargs=scheduler_kwargs, worker_kwargs=worker_kwargs, ) except Exception as e: logger.error( "Failed to start gen_cluster, retrying", exc_info=True, ) await asyncio.sleep(1) else: workers[:] = ws args = [s] + workers break if s is False: raise Exception("Could not start cluster") if client: c = await Client( s.address, loop=loop, security=security, asynchronous=True, **client_kwargs, ) args = [c] + args try: future = func(*args) if timeout: future = asyncio.wait_for(future, timeout) result = await future if s.validate: s.validate_state() finally: if client and c.status not in ("closing", "closed"): await c._close(fast=s.status == Status.closed) await end_cluster(s, workers) await asyncio.wait_for(cleanup_global_workers(), 1) try: c = await default_client() except ValueError: pass else: await c._close(fast=True) def get_unclosed(): return [c for c in Comm._instances if not c.closed()] + [ c for c in _global_clients.values() if c.status != "closed" ] try: start = time() while time() < start + 5: gc.collect() if not get_unclosed(): break await asyncio.sleep(0.05) else: if allow_unclosed: print(f"Unclosed Comms: {get_unclosed()}") else: raise RuntimeError("Unclosed Comms", get_unclosed()) finally: Comm._instances.clear() _global_clients.clear() return result result = loop.run_sync( > coro, timeout=timeout * 2 if timeout else timeout ) distributed/utils_test.py:954: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ../../conda/envs/py37/lib/python3.7/site-packages/tornado/ioloop.py:532: in run_sync return future_cell[0].result() distributed/utils_test.py:912: in coro result = await future _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ fut = <Task cancelled coro=<test_broken_worker_during_computation() done, defined at /Users/mdurant/code/distributed/distributed/tests/test_failed_workers.py:315>>, timeout = 60 async def wait_for(fut, timeout, *, loop=None): """Wait for the single Future or coroutine to complete, with timeout. Coroutine will be wrapped in Task. Returns result of the Future or coroutine. When a timeout occurs, it cancels the task and raises TimeoutError. To avoid the task cancellation, wrap it in shield(). If the wait is cancelled, the task is also cancelled. This function is a coroutine. """ if loop is None: loop = events.get_event_loop() if timeout is None: return await fut if timeout <= 0: fut = ensure_future(fut, loop=loop) if fut.done(): return fut.result() fut.cancel() raise futures.TimeoutError() waiter = loop.create_future() timeout_handle = loop.call_later(timeout, _release_waiter, waiter) cb = functools.partial(_release_waiter, waiter) fut = ensure_future(fut, loop=loop) fut.add_done_callback(cb) try: # wait until the future completes or the timeout try: await waiter except futures.CancelledError: fut.remove_done_callback(cb) fut.cancel() raise if fut.done(): return fut.result() else: fut.remove_done_callback(cb) # We must ensure that the task is not running # after wait_for() returns. # See https://bugs.python.org/issue32751 await _cancel_and_wait(fut, loop=loop) > raise futures.TimeoutError() E concurrent.futures._base.TimeoutError ../../conda/envs/py37/lib/python3.7/asyncio/tasks.py:449: TimeoutError ------------------------------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------------------------------ distributed.scheduler - INFO - Clear task state distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:52676 distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787 distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52677 distributed.worker - INFO - Listening to: tcp://127.0.0.1:52677 distributed.worker - INFO - dashboard at: 127.0.0.1:52678 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 1 distributed.worker - INFO - Memory: 17.18 GB distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/worker-_1zsf66b distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52679 distributed.worker - INFO - Listening to: tcp://127.0.0.1:52679 distributed.worker - INFO - dashboard at: 127.0.0.1:52680 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 2 distributed.worker - INFO - Memory: 17.18 GB distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/worker-cuz87qov distributed.worker - INFO - ------------------------------------------------- distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52677', name: 0, memory: 0, processing: 0> distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52677 distributed.core - INFO - Starting established connection distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52679', name: 1, memory: 0, processing: 0> distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52679 distributed.core - INFO - Starting established connection distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.core - INFO - Starting established connection distributed.scheduler - INFO - Receive client connection: Client-b80cc33a-29b8-11eb-9578-acde48001122 distributed.core - INFO - Starting established connection distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:52685' distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52686 distributed.worker - INFO - Listening to: tcp://127.0.0.1:52686 distributed.worker - INFO - dashboard at: 127.0.0.1:52687 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 2 distributed.worker - INFO - Memory: 2.86 GB distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/dask-worker-space/worker-322ii8r5 distributed.worker - INFO - ------------------------------------------------- distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52686', name: tcp://127.0.0.1:52686, memory: 0, processing: 0> distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52686 distributed.core - INFO - Starting established connection distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.worker - INFO - Run out-of-band function '_exit' distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52686', name: tcp://127.0.0.1:52686, memory: 10, processing: 80> distributed.core - INFO - Removing comms to tcp://127.0.0.1:52686 distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:52684': in <closed TCP>: Stream is closed distributed.nanny - INFO - Worker process 71091 exited with status 1 distributed.nanny - WARNING - Restarting worker distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52714 distributed.worker - INFO - Listening to: tcp://127.0.0.1:52714 distributed.worker - INFO - dashboard at: 127.0.0.1:52716 distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 2 distributed.worker - INFO - Memory: 2.86 GB distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/dask-worker-space/worker-nlpop5co distributed.worker - INFO - ------------------------------------------------- distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52714', name: tcp://127.0.0.1:52714, memory: 0, processing: 0> distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52714 distributed.core - INFO - Starting established connection distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676 distributed.worker - INFO - ------------------------------------------------- distributed.core - INFO - Starting established connection distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52686 Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b03d250>: ConnectionRefusedError: [Errno 61] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/worker.py", line 1993, in gather_dep self.rpc, deps, worker, who=self.address File "/Users/mdurant/code/distributed/distributed/worker.py", line 3196, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 390, in retry_operation operation=operation, File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 370, in retry return await coro() File "/Users/mdurant/code/distributed/distributed/worker.py", line 3173, in _get_data comm = await rpc.connect(worker) File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect **self.connection_args, File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b03d250>: ConnectionRefusedError: [Errno 61] Connection refused distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52686 Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b36e210>: ConnectionRefusedError: [Errno 61] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/worker.py", line 1993, in gather_dep self.rpc, deps, worker, who=self.address File "/Users/mdurant/code/distributed/distributed/worker.py", line 3196, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 390, in retry_operation operation=operation, File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 370, in retry return await coro() File "/Users/mdurant/code/distributed/distributed/worker.py", line 3173, in _get_data comm = await rpc.connect(worker) File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect **self.connection_args, File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b36e210>: ConnectionRefusedError: [Errno 61] Connection refused distributed.core - ERROR - Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/Users/mdurant/code/distributed/distributed/core.py", line 528, in handle_comm result = await result File "/Users/mdurant/code/distributed/distributed/scheduler.py", line 3040, in broadcast [send_message(address) for address in addresses if address is not None] File "/Users/mdurant/code/distributed/distributed/utils.py", line 230, in All result = await tasks.next() File "/Users/mdurant/code/distributed/distributed/scheduler.py", line 3031, in send_message comm = await self.rpc.connect(addr) File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect **self.connection_args, File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect _raise(error) File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused distributed.scheduler - INFO - Remove client Client-b80cc33a-29b8-11eb-9578-acde48001122 distributed.scheduler - INFO - Remove client Client-b80cc33a-29b8-11eb-9578-acde48001122 distributed.scheduler - INFO - Close client connection: Client-b80cc33a-29b8-11eb-9578-acde48001122 distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52677 distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52679 distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52677', name: 0, memory: 0, processing: 0> distributed.core - INFO - Removing comms to tcp://127.0.0.1:52677 distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52679', name: 1, memory: 0, processing: 0> distributed.core - INFO - Removing comms to tcp://127.0.0.1:52679 distributed.scheduler - INFO - Scheduler closing... distributed.scheduler - INFO - Scheduler closing all comms distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52714', name: tcp://127.0.0.1:52714, memory: 0, processing: 0> distributed.core - INFO - Removing comms to tcp://127.0.0.1:52714 distributed.scheduler - INFO - Lost all workers distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52714 distributed.nanny - INFO - Worker closed distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:52685' — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/pystatgen/sgkit/issues/390#issuecomment-729785585>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACKZTD2GWBR2GGYIOVSBRTSQPXF3ANCNFSM4TY6UJFA> .

martindurant · 2020-11-18T16:18:35Z

Indeed - but I did want to comment here too.

Yes, gcsfs allows fetching of many objects from the store with a single call, and they will be processed concurrently, returning back the set of bytes objects. In the case that these are to be expanded in memory, you would of course have to deal with the blocks of bytes one at a time, to not exceed RAM. For zarr, this does mean transiently higher memory usage during decompression, depending on the number of storage chunks per dask partition - the best tradeoff would be workload-dependent.

eric-czech · 2020-11-18T17:33:57Z

In trying this on a cluster of 40 4 vCPU machines (instead of 20 8 vCPUs), the workflow failed with some similar errors to my first attempt on this issue. It doesn't look like memory pressure was an issue this time but I'm not certain. I didn't see log messages about it anyhow, and this was with chunks 1/16th the size of the original, or ~6MB in memory. The client-side error I hit was again:

ValueError: Could not find dependent ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 130, 105). Check worker logs

Here are some worker logs:

Worker 1

distributed.worker - INFO - Start worker at: tcp://10.142.15.198:37407

distributed.worker - INFO - Listening to: tcp://10.142.15.198:37407

distributed.worker - INFO - dashboard at: 10.142.15.198:38769

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-jst4aey2

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.198:37407 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 0, 13)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 0, 13) 0 . Asking scheduler

Worker 2

distributed.worker - INFO - Start worker at: tcp://10.142.15.201:43081

distributed.worker - INFO - Listening to: tcp://10.142.15.201:43081

distributed.worker - INFO - dashboard at: 10.142.15.201:43701

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-9plktoms

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 33, 40)

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 988, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1169, in write_to_fd return self.socket.send(data) # type: ignore ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1281, in get_data compressed = await comm.write(msg, serializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 256, in write convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 32, 40)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 34, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 32, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 33, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 35, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 34, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 35, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 137, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 136, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 136, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 172, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 138, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 173, 22)

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 173, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 175, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 148, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 149, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 225, 32)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 174, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 225, 32) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 174, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 150, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 244, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 151, 94)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 151, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 244, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Stopping worker at tcp://10.142.15.201:43081