
version 1.25.3 in docker env stuck in "Waiting to connect to: ..." #2519

Closed
haraldschilly opened this issue Feb 9, 2019 · 16 comments

Comments

@haraldschilly

I have a situation where I'm basically running the scheduler and workers in a Docker environment (it's a project at https://cocalc.com). With version 1.25.2 it works fine, but starting with 1.25.3 I only get

distributed.worker - INFO - Waiting to connect to:       tcp://localhost:8786

and, well, it never does. There are only a few changes between those versions, though… so maybe this helps narrow down the regression.

The scheduler is started via dask-scheduler and the worker via dask-worker tcp://localhost:8786 --local-directory ~/dask-worker --nthreads 1 --nprocs 1 --memory-limit 256M

@mrocklin
Member

mrocklin commented Feb 9, 2019

Interesting. I just went through all of those changes (thanks for the convenient link) and there really wasn't much that was changed in how things connect. I don't immediately see anything that would explain the change in behavior that you're seeing.

The next thing I recommend trying would be using git bisect to better identify the cause of the change in behavior.

@mrocklin
Member

mrocklin commented Feb 9, 2019

See also #2506

Also cc'ing @danpf in case you have any suggestions here; it looks like the previous work is affecting a lot of people. Do you have any thoughts on how to address this quickly?

@haraldschilly
Author

haraldschilly commented Feb 9, 2019

Ha! Thank you for pointing out git bisect, I haven't used it in at least a year. The first commit after 1.25.2 where it no longer connects for me is 70c5129.

edit → this points to #2403

In case it's relevant for the CPython thread-pooling stuff: python3 --version reports Python 3.6.7 here, i.e. the one that comes with Ubuntu 18.04.

@haraldschilly
Author

OK, after reading #2506 I think we can close my ticket here; I'm just confirming the same problem independently.

@danpf
Contributor

danpf commented Feb 9, 2019

I know you're using Docker, so it might be a little hard to get at, but would you be able to try:

upping this number to something like 3, 6, or 10?
70c5129#diff-c853b5c26508bfb36840c2e32e303886R326

@haraldschilly
Author

No, that's not a problem; I work interactively inside of it. I changed it to 10 and then to 100 (based on 1.25.3), and it still doesn't connect. So just upping the thread pool size doesn't help.
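
For readers reproducing this, a purely illustrative way to double-check whether a worker process is actually piling up threads, assuming you can run a few lines of Python inside the container (none of this is distributed's own tooling):

import threading

# List every live thread in this process; a long tail of resolver/executor
# threads would point at thread exhaustion rather than a plain connect bug.
for t in threading.enumerate():
    print(t.name, "daemon" if t.daemon else "non-daemon")
print("total threads:", threading.active_count())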

@haraldschilly
Author

OK, I played around a bit, and I have no idea whether this defeats the purpose of PR #2403 or not, but with the following change on top of 1.25.3 it connects again:

diff --git a/distributed/comm/tcp.py b/distributed/comm/tcp.py
index 5af63b16..30652cd6 100644
--- a/distributed/comm/tcp.py
+++ b/distributed/comm/tcp.py
@@ -327,16 +327,16 @@ class BaseTCPConnector(Connector, RequireEncryptionMixin):
                                              executor=_executor)
     else:
         _resolver = None
-    client = TCPClient(resolver=_resolver)
 
     @gen.coroutine
     def connect(self, address, deserialize=True, **connection_args):
         self._check_encryption(address, connection_args)
         ip, port = parse_host_port(address)
         kwargs = self._get_connect_args(**connection_args)
+        client = TCPClient(resolver=BaseTCPConnector._resolver)
 
         try:
-            stream = yield BaseTCPConnector.client.connect(ip, port,
+            stream = yield client.connect(ip, port,
                                           max_buffer_size=MAX_BUFFER_SIZE,
                                           **kwargs)

@haraldschilly
Author

haraldschilly commented Feb 9, 2019

Screenshot from my project at CoCalc, as proof of the above:

[screenshot from 2019-02-09 19-01-03]

@danpf
Contributor

danpf commented Feb 9, 2019

So the problem I was trying to solve was that every time you make a TCPClient, you spawn a thread.

The outer loop of this function (the code that calls this) is a while True loop that iterates over and over, so if it has trouble connecting you eventually hit the maximum number of threads on your machine. If you move the client back out of the class-level definition, that will still happen.

I'm not really sure what the best course of action is. I can't reproduce this locally, so I can't try any other implementations.
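
For readers following along, here is a minimal, hypothetical sketch of the pattern described above. It is not distributed's actual code (which uses tornado's TCPClient and a resolver); the connect_once helper and the bare ThreadPoolExecutor are stand-ins, but they show why a retry loop that builds a thread-backed helper per attempt accumulates threads while a single shared instance does not:

import threading
from concurrent.futures import ThreadPoolExecutor

def connect_once(executor):
    # Stand-in for one connection attempt that does its DNS resolution
    # (or similar blocking work) on the executor's worker thread.
    executor.submit(lambda: None).result()

# Pattern A: a new thread-backed helper per attempt. References are kept,
# as a long-lived client object would keep them, so worker threads pile up.
helpers = []
for _ in range(5):
    ex = ThreadPoolExecutor(max_workers=1)
    helpers.append(ex)
    connect_once(ex)
    print("per-attempt pattern, live threads:", threading.active_count())

# Pattern B: one shared helper for every attempt; the thread count stays flat.
shared = ThreadPoolExecutor(max_workers=1)
for _ in range(5):
    connect_once(shared)
    print("shared pattern, live threads:", threading.active_count())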

@danpf
Contributor

danpf commented Feb 9, 2019

One fix seems to be requiring tornado 5.0.0, but I'm going to try to find a solution that isn't that.
Big thanks to @haraldschilly for helping me with this.

@danpf
Contributor

danpf commented Feb 9, 2019

@mrocklin
Maybe we should revert my commit until I can figure this out. It seems like this is going to take more time than I thought.

My PR is incompatible with tornado 4.5.1.

When I'm recreating these issues, pip/conda just installs tornado 5+ and everything works fine, but if I specify 4.5.1 - 4.5.3 it doesn't work and I can reproduce what these people are seeing.

Would you like me to make a PR for reverting?
Also, really sorry; I didn't intend for this to cause so many issues :(

@mrocklin
Member

mrocklin commented Feb 9, 2019 via email

@danpf
Contributor

danpf commented Feb 9, 2019

So this is confusing, but I think this is the best I can do to recreate this, @mrocklin.

I made a Docker image with distributed from just before my commit, plus tornado 4.5.1.
You can get it via
docker run -it danpfuw/mypubs:torn4.5.1_dist_pre_thread /bin/bash
/home/distributed is where the git checkout of distributed is
/tornado is where the git checkout of tornado is
both are installed with pip -e

I go into that Docker image (which has tornado 4.5.1) and use tmux to open two terminals:
in one I run dask-scheduler, in the other I run dask-worker tcp://localhost:8786

'1.25.1+10.ga0d0ed21' is commit a0d0ed21a87622ea35bfe1e686b52a8f221c5bca,
which is right before my commit (this is the version of distributed that is installed)

Everything appears to work fine, but after ~10-20 seconds the heartbeats break and I get the traceback at the bottom printed to my terminal.

I discovered this while trying to add code that checks the tornado version, like you suggested... basically I started getting this error and was pretty confused. I can't seem to explain why these projects haven't seen errors like this before, though.

distributed.core - INFO - Starting established connection
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f7c28342d08>, <tornado.concurrent.Future object at 0x7f7c283b0c18>)
Traceback (most recent call last):
  File "/tornado/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/tornado/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/tornado/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/worker.py", line 640, in heartbeat
    metrics=self.get_metrics()
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/core.py", line 642, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/core.py", line 455, in send_recv
    yield comm.write(msg, serializers=serializers, on_error='raise')
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/distributed/distributed/comm/tcp.py", line 232, in write
    stream.write(b)
  File "/tornado/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/tornado/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized
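
As an aside, here is a hedged sketch of the kind of tornado version check mentioned above. It is hypothetical (get_tcp_client and the module-level flag are made-up names, and this is not the fix that eventually landed in distributed): reuse a single TCPClient per process only when tornado is new enough, and fall back to a per-call client on 4.x.

import tornado
from distutils.version import LooseVersion  # fine for 2019-era Python; packaging.version also works
from tornado.tcpclient import TCPClient

# Hypothetical gate, for illustration only.
_SHARE_CLIENT = LooseVersion(tornado.version) >= LooseVersion("5.0.0")
_shared_client = None

def get_tcp_client():
    """Reuse one TCPClient on tornado >= 5; build a fresh one per call on 4.x."""
    global _shared_client
    if not _SHARE_CLIENT:
        return TCPClient()
    if _shared_client is None:
        _shared_client = TCPClient()
    return _shared_client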

@mrocklin
Member

@haraldschilly can you update to Tornado 5.0 and see if the problems go away? I'm also curious to know whether updating to 5.0 is a concern for you in some way.

@haraldschilly
Author

@mrocklin I already updated to 5.1.1, and that works well on CoCalc. Here is yet another screenshot as a follow-up to the above. The root issue was that updating Ubuntu caused its python3-tornado distribution package to overwrite files from an already existing (newer) tornado installation. So updating again to >= 5 was all I had to do.

[screenshot from 2019-02-20 18-25-09]
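
For anyone else hit by a distro upgrade clobbering a pip-installed tornado, a quick, purely illustrative check of which tornado the worker actually imports:

# Print the version and file path of the tornado module that Python picks up;
# a path under /usr/lib/python3/dist-packages means the distro package won.
import tornado
print(tornado.version, tornado.__file__)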

@GenevieveBuckley
Contributor

Glad to hear the update fixed things for you. Thanks for looking into this, @haraldschilly.
