
version 1.25.3 in docker env stuck in "Waiting to connect to: ..." #2519

Closed
haraldschilly opened this issue Feb 9, 2019 · 16 comments

Comments

@haraldschilly

I have a situation where I'm basically running the scheduler and workers in a Docker environment (it's a project at https://cocalc.com). With version 1.25.2 it works fine, but starting with 1.25.3 I only get

distributed.worker - INFO - Waiting to connect to:       tcp://localhost:8786

and, well, it never does. There are only a few changes between those versions, though… so maybe this helps narrow down the regression.

The scheduler is started via dask-scheduler and the worker via dask-worker tcp://localhost:8786 --local-directory ~/dask-worker --nthreads 1 --nprocs 1 --memory-limit 256M

@mrocklin
Member

mrocklin commented Feb 9, 2019

Interesting. I just went through all of those changes (thanks for the convenient link) and there really wasn't much that was changed in how things connect. I don't immediately see anything that would explain the change in behavior that you're seeing.

The next thing I recommend trying would be using git bisect to better identify the cause of the change in behavior.

@mrocklin
Member

mrocklin commented Feb 9, 2019

See also #2506

Also cc'ing @danpf in case you have any suggestions here; it looks like the previous work is affecting a lot of people. Do you have any thoughts on how to address this quickly?

@haraldschilly
Author

haraldschilly commented Feb 9, 2019

Ha! Thank you for pointing out git bisect, I haven't used it in at least a year. The first commit after 1.25.2 where it no longer connects for me is 70c5129.

edit → this points to #2403

In case it's relevant for the CPython thread-pooling stuff: python3 --version reports Python 3.6.7 here, i.e. the one that comes with Ubuntu 18.04.

@haraldschilly
Author

OK, after reading #2506 I think we can close my ticket here; I'm just confirming the same problem independently.

@danpf
Contributor

danpf commented Feb 9, 2019

I know you're using Docker, so it might be a little hard to get at, but would you be able to try:

upping this number to something like 3, 6, or 10?
70c5129#diff-c853b5c26508bfb36840c2e32e303886R326

@haraldschilly
Author

No, that's not a problem; I work interactively inside of it. I changed it to 10 and then to 100 (based on 1.25.3), and it still doesn't connect. So just upping the thread pool size doesn't help.
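
For readers reproducing this, a purely illustrative way to double-check whether a worker process is actually piling up threads, assuming you can run a few lines of Python inside the container (none of this is distributed's own tooling):

import threading

# List every live thread in this process; a long tail of resolver/executor
# threads would point at thread exhaustion rather than a plain connect bug.
for t in threading.enumerate():
    print(t.name, "daemon" if t.daemon else "non-daemon")
print("total threads:", threading.active_count())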

@haraldschilly
Author

OK, I played around a bit, and I have no idea whether this defeats the purpose of PR #2403 or not, but with the following change on top of 1.25.3 it connects again:

diff --git a/distributed/comm/tcp.py b/distributed/comm/tcp.py
index 5af63b16..30652cd6 100644
--- a/distributed/comm/tcp.py
+++ b/distributed/comm/tcp.py
@@ -327,16 +327,16 @@ class BaseTCPConnector(Connector, RequireEncryptionMixin):
                                              executor=_executor)
     else:
         _resolver = None
-    client = TCPClient(resolver=_resolver)
 
     @gen.coroutine
     def connect(self, address, deserialize=True, **connection_args):
         self._check_encryption(address, connection_args)
         ip, port = parse_host_port(address)
         kwargs = self._get_connect_args(**connection_args)
+        client = TCPClient(resolver=BaseTCPConnector._resolver)
 
         try:
-            stream = yield BaseTCPConnector.client.connect(ip, port,
+            stream = yield client.connect(ip, port,
                                           max_buffer_size=MAX_BUFFER_SIZE,
                                           **kwargs)

@haraldschilly
Author

haraldschilly commented Feb 9, 2019

Screenshot from my project at CoCalc, as proof of the above:

[screenshot from 2019-02-09 19-01-03]

@danpf
Contributor

danpf commented Feb 9, 2019

So the problem I was trying to solve was that every time you make a TCPClient, you spawn a thread.

The outer loop of this function (the code that calls this) is a while True loop that iterates over and over, so if it has trouble connecting you eventually hit the maximum number of threads on your machine. If you move the client back out of the class-level definition, that will still happen.

I'm not really sure what the best course of action is. I can't reproduce this locally, so I can't try any other implementations.
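
For readers following along, here is a minimal, hypothetical sketch of the pattern described above. It is not distributed's actual code (which uses tornado's TCPClient and a resolver); the connect_once helper and the bare ThreadPoolExecutor are stand-ins, but they show why a retry loop that builds a thread-backed helper per attempt accumulates threads while a single shared instance does not:

import threading
from concurrent.futures import ThreadPoolExecutor

def connect_once(executor):
    # Stand-in for one connection attempt that does its DNS resolution
    # (or similar blocking work) on the executor's worker thread.
    executor.submit(lambda: None).result()

# Pattern A: a new thread-backed helper per attempt. References are kept,
# as a long-lived client object would keep them, so worker threads pile up.
helpers = []
for _ in range(5):
    ex = ThreadPoolExecutor(max_workers=1)
    helpers.append(ex)
    connect_once(ex)
    print("per-attempt pattern, live threads:", threading.active_count())

# Pattern B: one shared helper for every attempt; the thread count stays flat.
shared = ThreadPoolExecutor(max_workers=1)
for _ in range(5):
    connect_once(shared)
    print("shared pattern, live threads:", threading.active_count())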

@danpf
Contributor

danpf commented Feb 9, 2019

One fix seems to be requiring tornado 5.0.0, but I'm going to try to find a solution that isn't that.
Big thanks to @haraldschilly for helping me with this.

@danpf
Contributor

danpf commented Feb 9, 2019

@mrocklin
Maybe we should revert my commit until I can figure this out. It seems like this is going to take more time than I thought.

My PR is incompatible with tornado 4.5.1.

When I'm recreating these issues, pip/conda just installs tornado 5+ and everything works fine, but if I specify 4.5.1 - 4.5.3 it doesn't work and I can reproduce what these people are seeing.

Would you like me to make a PR for reverting?
Also, really sorry; I didn't intend for this to cause so many issues :(

@mrocklin
Member

mrocklin commented Feb 9, 2019 via email

@danpf
Contributor

danpf commented Feb 9, 2019

So this is confusing, but I think this is the best I can do to recreate this, @mrocklin.

I made a Docker image with distributed from just before my commit, plus tornado 4.5.1.
You can get it via
docker run -it danpfuw/mypubs:torn4.5.1_dist_pre_thread /bin/bash
/home/distributed is where the git checkout of distributed is
/tornado is where the git checkout of tornado is
both are installed with pip -e

I go into that Docker image (which has tornado 4.5.1) and use tmux to open two terminals:
in one I run dask-scheduler, in the other I run dask-worker tcp://localhost:8786

'1.25.1+10.ga0d0ed21' is commit a0d0ed21a87622ea35bfe1e686b52a8f221c5bca,
which is right before my commit (this is the version of distributed that is installed)

Everything appears to work fine, but after ~10-20 seconds the heartbeats break and I get the traceback at the bottom printed to my terminal.

I discovered this while trying to add code that checks the tornado version, like you suggested... basically I started getting this error and was pretty confused. I can't seem to explain why these projects haven't seen errors like this before, though.

distributed.core - INFO - Starting established connection
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f7c28342d08>, <tornado.concurrent.Future object at 0x7f7c283b0c18>)
Traceback (most recent call last):
  File "/tornado/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/tornado/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/tornado/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/worker.py", line 640, in heartbeat
    metrics=self.get_metrics()
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/core.py", line 642, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/core.py", line 455, in send_recv
    yield comm.write(msg, serializers=serializers, on_error='raise')
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/distributed/distributed/comm/tcp.py", line 232, in write
    stream.write(b)
  File "/tornado/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/tornado/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized
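
As an aside, here is a hedged sketch of the kind of tornado version check mentioned above. It is hypothetical (get_tcp_client and the module-level flag are made-up names, and this is not the fix that eventually landed in distributed): reuse a single TCPClient per process only when tornado is new enough, and fall back to a per-call client on 4.x.

import tornado
from distutils.version import LooseVersion  # fine for 2019-era Python; packaging.version also works
from tornado.tcpclient import TCPClient

# Hypothetical gate, for illustration only.
_SHARE_CLIENT = LooseVersion(tornado.version) >= LooseVersion("5.0.0")
_shared_client = None

def get_tcp_client():
    """Reuse one TCPClient on tornado >= 5; build a fresh one per call on 4.x."""
    global _shared_client
    if not _SHARE_CLIENT:
        return TCPClient()
    if _shared_client is None:
        _shared_client = TCPClient()
    return _shared_client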

@mrocklin
Member

@haraldschilly can you update to Tornado 5.0 and see if the problems go away? I'm also curious to know whether updating to 5.0 is a concern for you in some way.

@haraldschilly
Author

@mrocklin I already updated to 5.1.1, and that works well on CoCalc. Here is yet another screenshot as a follow-up to the above. The root issue was that updating Ubuntu caused its python3-tornado distribution package to overwrite files from an already existing (newer) tornado installation. So updating again to >= 5 was all I had to do.

[screenshot from 2019-02-20 18-25-09]
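
For anyone else hit by a distro upgrade clobbering a pip-installed tornado, a quick, purely illustrative check of which tornado the worker actually imports:

# Print the version and file path of the tornado module that Python picks up;
# a path under /usr/lib/python3/dist-packages means the distro package won.
import tornado
print(tornado.version, tornado.__file__)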

@GenevieveBuckley
Contributor

Glad to hear the update fixed things for you. Thanks for looking into this, @haraldschilly.
