
Issues starting workers #2506

Closed · bpmweel opened this issue Feb 5, 2019 · 16 comments

Labels: needs info (Needs further information from the user)

Comments

bpmweel commented Feb 5, 2019

I'm trying to start a dask cluster on my local machine. I can use the LocalCluster just fine, but when I start the scheduler and worker from the command line I run into issues:

$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:  tcp://145.90.225.10:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.scheduler - INFO - Local Directory: /var/folders/h6/ck24x_854wd94jzwy9r7gvl40000gn/T/scheduler-gjtgzvz2
distributed.scheduler - INFO - -----------------------------------------------
$ dask-worker 145.90.225.10:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://145.90.225.10:51273'
distributed.worker - INFO -       Start worker at:  tcp://145.90.225.10:51274
distributed.worker - INFO -          Listening to:  tcp://145.90.225.10:51274
distributed.worker - INFO -              nanny at:        145.90.225.10:51273
distributed.worker - INFO -              bokeh at:        145.90.225.10:51275
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/bweel/worker-73osop8n
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
...etc

I have the feeling this has to do with my machine having multiple network interfaces and with the nanny process. The following combination works on the command line:

$ dask-scheduler --interface en0
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://145.100.116.139:8786
distributed.scheduler - INFO -       bokeh at:      145.100.116.139:8787
distributed.scheduler - INFO - Local Directory: /var/folders/h6/ck24x_854wd94jzwy9r7gvl40000gn/T/scheduler-3evaz473
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://145.100.116.139:62873
distributed.scheduler - INFO - Starting worker compute stream, tcp://145.100.116.139:62873
distributed.core - INFO - Starting established connection
$ dask-worker 145.100.116.139:8786 --interface en0 --no-nanny
distributed.diskutils - INFO - Found stale lock file and directory '/Users/bweel/worker-o4mon_8t', purging
distributed.worker - INFO -       Start worker at: tcp://145.100.116.139:62873
distributed.worker - INFO -          Listening to: tcp://145.100.116.139:62873
distributed.worker - INFO -              bokeh at:      145.100.116.139:62874
distributed.worker - INFO - Waiting to connect to: tcp://145.100.116.139:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/bweel/worker-br866x7c
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://145.100.116.139:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

However, now I cannot connect to the scheduler from IPython:

from dask.distributed import Client
client = Client('tcp://145.100.116.139:8786')

---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, connection_args)
    203                                           future,
--> 204                                           quiet_exceptions=EnvironmentError)
    205         except FatalCommClosedError:

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

TimeoutError: Timeout

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-7-2c6e48226add> in <module>()
----> 1 client = Client('tcp://145.100.116.139:8786')

/usr/local/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    636             ext(self)
    637
--> 638         self.start(timeout=timeout)
    639
    640         from distributed.recreate_exceptions import ReplayExceptionClient

/usr/local/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
    759             self._started = self._start(**kwargs)
    760         else:
--> 761             sync(self.loop, self._start, **kwargs)
    762
    763     def __await__(self):

/usr/local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

/usr/local/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    684         if value.__traceback__ is not tb:
    685             raise value.with_traceback(tb)
--> 686         raise value
    687
    688 else:

/usr/local/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    847         self.scheduler_comm = None
    848
--> 849         yield self._ensure_connected(timeout=timeout)
    850
    851         for pc in self._periodic_callbacks.values():

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
    885         try:
    886             comm = yield connect(self.scheduler.address, timeout=timeout,
--> 887                                  connection_args=self.connection_args)
    888             if timeout is not None:
    889                 yield gen.with_timeout(timedelta(seconds=timeout),

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, connection_args)
    213                 _raise(error)
    214         except gen.TimeoutError:
--> 215             _raise(error)
    216         else:
    217             break

/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
    193         msg = ("Timed out trying to connect to %r after %s s: %s"
    194                % (addr, timeout, error))
--> 195         raise IOError(msg)
    196
    197     # This starts a thread

OSError: Timed out trying to connect to 'tcp://145.100.116.139:8786' after 10 s: connect() didn't finish in time
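
(To rule out plain network reachability, independent of dask, one can try a raw TCP connection to the scheduler port. A minimal sketch using only the standard library, reusing the address from above:)

import socket

# Raw TCP connection to the scheduler port; if this also hangs or fails,
# the problem is basic network reachability rather than dask/tornado.
sock = socket.create_connection(("145.100.116.139", 8786), timeout=5)
sock.close()
print("TCP connection succeeded")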
bpmweel (Author) commented Feb 5, 2019

It appears the issue is with distributed 1.25.3 as mentioned in #2504

mrocklin (Member) commented Feb 5, 2019

Interesting. Thanks for reporting this @bpmweel .

Unfortunately nothing jumps out at me as an obvious cause of this. From the issue that you refer to it sounds like this might have been a recent change. Do you have any interest in using git bisect to help identify the commit where this might have been introduced? That would go a long way towards finding and resolving the problem.
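
A typical session would look something like this (a sketch; the known-good tag 1.25.2 is a guess, and each step needs a reinstall plus a manual re-test of the failing dask-worker startup):

$ git bisect start
$ git bisect bad                  # current checkout reproduces the hang
$ git bisect good 1.25.2          # assumed last known-good release tag
$ pip install -e . && dask-worker 145.90.225.10:8786   # re-test this checkout
$ git bisect good                 # or `git bisect bad`, depending on the result
                                  # ...repeat until git names the first bad commit
$ git bisect reset                # return to the original checkout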

bpmweel (Author) commented Feb 5, 2019

I did a git bisect, and it seems the problem already originates in this commit: 70c5129

mrocklin (Member) commented Feb 5, 2019

cc @danpf

mrocklin (Member) commented Feb 5, 2019

Thanks for going through that process @bpmweel

danpf (Contributor) commented Feb 5, 2019

darn... your examples appear to work on my local machine.

I'm looking into this!

danpf (Contributor) commented Feb 5, 2019

I've tried Python 3.5, 3.6, and 3.7, with both pip and conda, and I'm unable to recreate this :/

Could you try upping the ThreadPoolExecutor count at this line to something like 3, 6, or 10?
70c5129#diff-c853b5c26508bfb36840c2e32e303886R326
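
That line builds the thread pool that tornado's TCPClient uses to resolve hostnames; the idea is just to raise the worker count. A sketch of the relevant snippet from distributed/comm/tcp.py with only the number changed:

from concurrent.futures import ThreadPoolExecutor
from tornado import netutil
from tornado.tcpclient import TCPClient

_executor = ThreadPoolExecutor(10)  # was ThreadPoolExecutor(2)
_resolver = netutil.ExecutorResolver(close_executor=False,
                                     executor=_executor)
client = TCPClient(resolver=_resolver)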

mrocklin (Member) commented Feb 9, 2019

@bpmweel any response to @danpf's request above?

bpmweel (Author) commented Feb 9, 2019 via email

bpmweel (Author) commented Feb 10, 2019

I tried upping the ThreadPoolExecutor count, but that did not fix anything. I then tried wrapping the client creation in an __init__ method, and that does help:

324,330c324,332
<     if PY3:  # see github PR #2403 discussion for more info
<         _executor = ThreadPoolExecutor(2)
<         _resolver = netutil.ExecutorResolver(close_executor=False,
<                                              executor=_executor)
<     else:
<         _resolver = None
<     client = TCPClient(resolver=_resolver)
---
>     def __init__(self):
>         if not hasattr(BaseTCPConnector, 'client'):
>             if PY3:  # see github PR #2403 discussion for more info
>                 _executor = ThreadPoolExecutor(2)
>                 _resolver = netutil.ExecutorResolver(close_executor=False,
>                                                     executor=_executor)
>             else:
>                 _resolver = None
>             BaseTCPConnector.client = TCPClient(resolver=_resolver)

I'm not savvy enough in Python to know whether this is fundamentally different from creating the class attribute in the previous manner.
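
(As far as I can tell, the main difference is when the object gets created: a class-body attribute is evaluated once at import time, while the __init__ guard defers creation to the first instantiation. A minimal standalone illustration, not dask code:)

class Eager:
    # evaluated once, at class-definition (import) time,
    # possibly before any event loop exists
    client = object()

class Lazy:
    def __init__(self):
        # evaluated on first instantiation; the hasattr guard keeps the
        # attribute on the class, shared by every instance
        if not hasattr(Lazy, 'client'):
            Lazy.client = object()

a, b = Lazy(), Lazy()
assert a.client is b.client  # still one shared object, just created later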

mrocklin (Member) commented

@danpf any thoughts on the above?

danpf (Contributor) commented Feb 11, 2019

This would overwrite the class attribute each time we make a new object, which probably isn't a good idea.
I tried it using self.client instead, and in a Docker container with tornado 4.5.1 I'm still seeing heartbeats fail (similar to the effects in the other thread I mentioned).

This is what happens with the suggested self.client approach on tornado 4.5.1:

distributed.core - INFO - Starting established connection
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fa9e6bb7320> exception was never retrieved: Traceback (most recent call last):
  File "/tornado/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/distributed/distributed/core.py", line 357, in handle_comm
    yield comm.write(result, serializers=serializers)
  File "/tornado/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/tornado/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/tornado/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/home/distributed/distributed/comm/tcp.py", line 232, in write
    stream.write(b)
  File "/tornado/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/tornado/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized

... this goes on for a while

This is in tornado 4.5.1 with the original (before my commit) method:

distributed.core - INFO - Starting established connection
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f2b28b2cae8>, <tornado.concurrent.Future object at 0x7f2b28af0f98>)
Traceback (most recent call last):
  File "/tornado/tornado/ioloop.py", line 605, in _run_callback

I don't see these errors in tornado 5, though.

@bpmweel do you have any dependency that requires tornado <5? Could you try explicitly installing a higher version of tornado? I don't see any problems when I use tornado 5.0.0. You can check your version by doing:

import tornado
print(tornado.version)

@mrocklin something seems fishy with lower versions of tornado. Do you see these kinds of errors as well?

bpmweel (Author) commented Feb 11, 2019

Shouldn't the if not hasattr(BaseTCPConnector, 'client'): check prevent overwriting?

That being said, I don't have any particular tornado dependencies, so I will try version 5 and report back.

danpf (Contributor) commented Feb 11, 2019

Oh, you're right, I missed that line when I copied it :p sorry!

Your fix seems to work for IPython, but not for the CLI version of distributed, and I have no idea why; the client should be resolved when the class is compiled/parsed. (This is tornado 4.5.1.)

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fab1c13a0d0>, <tornado.concurrent.Future object at 0x7fab1c3e6f60>)
Traceback (most recent call last):
  File "/tornado/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/tornado/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/tornado/tornado/ioloop.py", line 626, in _discard_future_result

mrocklin (Member) commented

@bpmweel can you try updating to Tornado 5.0 or above and see if that resolves the problem?
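
For example, assuming a pip-based environment:

$ pip install --upgrade "tornado>=5"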

GenevieveBuckley added the needs info label (Needs further information from the user) on Oct 19, 2021
jrbourbeau (Member) commented

Today Tornado 5 is the minimum version supported in distributed. Closing due to "I don't see these errors in tornado 5, though" in this comment: #2506 (comment). We can re-open as needed.
