BufferError: - ERROR - Existing exports of data: object cannot be re-sized #1704

Closed
TheCodeCache opened this issue Jan 23, 2018 · 8 comments

@TheCodeCache

I am running a dask-scheduler on one node and a dask-worker on another node, and I submit a task to the dask-scheduler from a third node.

It sometimes throws the error below. I am using Python 2.7, tornado 4.5.2, and TensorFlow 1.3.0.

The following is a minimal script that can be used to reproduce the mentioned error, which appears more often than not.

import os, sys
import subprocess

from dask.distributed import Variable, Client
#import psutil

import time, json, shlex

## The following function/task will be executed on the dask-worker, which runs on a separate node in the cluster.
def my_task(stop, is_alive):
  proc = None
  proc_started = False
  try:
    while True:
      if stop.get():
        proc.terminate()
        return
      else:
        if not proc_started:
          ### The train_image_classifier.py script does the training on the set of images for classification;
          ### it runs smoothly when executed as a standalone script.

          ### Start a child process in which the train_image_classifier.py training will run.
          proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))
          proc_started = True
      ### proc.poll() is None while the child is still running, and its return code once it exits.
      is_alive.set(proc.poll())
  except Exception:
    is_alive.set(proc.poll())
  finally:
    is_alive.set(proc.poll())
    return

### This dask-client script launches the task on the dask-worker through the dask-scheduler:
### it submits the task, listens for the running status of the task, and sends a stop signal to the dask-worker to stop the live task.
if __name__ == '__main__':

  client = Client("198.152.1.2:8786")  # creating a dask client connected to the scheduler

  ### These two distributed variables are used for two-way communication between the dask client and the dask-worker.
  stop = Variable("stop_", client=client)
  is_alive = Variable("is_alive_", client=client)

  ### Initialise the stop flag so the worker's first stop.get() does not block forever.
  stop.set(False)

  future = client.submit(my_task, stop, is_alive)

  ### Poll whether the running task is alive or not;
  ### once it is no longer alive, send the stop signal and come out of the execution.
  while True:
    if is_alive.get() is not None:  # poll() returned an exit code, i.e. the child process has finished
      stop.set(True)
      break
  print("Execution over! Returning to the caller.")

And here is the error description from the trace. It appears during the execution of the training process, but only sometimes; other times the training completes normally.

distributed.utils - ERROR - Existing exports of data: object cannot be re-sized
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
result[0] = yield make_coro()

File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/variable.py", line 179, in _get
client=self.client.id)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/core.py", line 464, in send_recv_from_rpc
result = yield send_recv(comm=comm, op=key, **kwargs)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/core.py", line 348, in send_recv
yield comm.write(msg)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/usr/lib/python2.7/site-packages/distributed/comm/tcp.py", line 218, in write
future = stream.write(frame)
File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 406, in write
self._handle_write()
File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 872, in _handle_write
del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized

distributed.worker - WARNING - Compute Failed
Function: my_task
args: ({'upper': '1.4', 'trainable_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'checkpoint_path': '/home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt', 'log_every_n_steps': '1', 'dataset_split_name': 'train', 'learning_rate': '0.01', 'train_dir': '/home/mapr/mano/slim_data/flowers/train_dir/train_outs_19', 'clone_on_cpu': 'True', 'batch_size': '32', 'resize_method': '3', 'hue_max_delta': '0.3', 'lower': '0.6', 'trace_every_n_steps': '1', 'script_name': 'train_image_classifier.py', 'checkpoint_exclude_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'dataset_dir': '/home/mapr/mano/slim_data/flowers/slim_data_dir', 'max_number_of_steps': '4', 'model_name': 'inception_v3', 'dataset_name': 'flowers'})
kwargs: {}
Exception: BufferError('Existing exports of data: object cannot be re-sized',)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/mapr/mano/slim_data/flowers/train_dir/train_outs_19/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 2.6281 (19.799 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = nan (7.406 sec/step)
INFO:tensorflow:global step 3: loss = nan (6.953 sec/step)
INFO:tensorflow:global step 4: loss = nan (6.840 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

@TheCodeCache
Author

After several runs, what I've noticed is that the error starts appearing at least after this line in the code posted above:

proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))

I am still not able to pinpoint the root cause of this issue. :(

@TheCodeCache
Author

Moreover, I noticed that even after trying a different sample script in place of train_image_classifier.py, I still face the same issue.

@TheCodeCache TheCodeCache changed the title distributed.utils - ERROR - Existing exports of data: object cannot be re-sized BufferError: - ERROR - Existing exports of data: object cannot be re-sized Jan 23, 2018
@TheCodeCache
Author

TheCodeCache commented Jan 24, 2018

The solution to this issue was to upgrade the dask.distributed version, which in turn upgraded the tornado version, and that resolved the issue. :)

pip install distributed --upgrade

Please see the link below:
http://www.tornadoweb.org/en/stable/releases/v4.5.3.html#tornado-iostream
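
A quick way to confirm that the upgrade actually pulled in a new enough tornado (the release notes linked above are for 4.5.3, which is where the iostream fix referenced there landed) is a small version check along these lines:

import distributed
import tornado

# Print the installed versions and fail loudly if tornado is still older than 4.5.3.
print("distributed:", distributed.__version__)
print("tornado:", tornado.version)
assert tornado.version_info >= (4, 5, 3), "tornado is still older than 4.5.3"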

@TheCodeCache
Author

Closing the issue..

@mrocklin
Member

That's an interesting error. I'm not sure what was going on but am happy to hear that it is resolved for you. Thank you both for reporting and for posting your status once you resolved the problem. Hopefully this is helpful to others in the future.

@jackeown

I'm having this problem now. Upgrading did not fix it.
When I google it, I get results for tornado issues.
I'll post more if I get a solution that works for me.

@jackeown

Upgrading dask distributed didn't upgrade tornado, but when I manually upgraded tornado, it got rid of the error, so everything is good now!
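
For anyone else hitting this: upgrading distributed alone may leave an old tornado in place (as happened here), so tornado may need to be upgraded explicitly as well, e.g.:

pip install tornado --upgrade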

@mrocklin
Member

mrocklin commented Jan 17, 2019 via email
