BufferError: - ERROR - Existing exports of data: object cannot be re-sized #1704

Closed
TheCodeCache opened this issue Jan 23, 2018 · 8 comments

@TheCodeCache

I am running a dask-scheduler on one node and a dask-worker on another node, and I submit a task to the dask-scheduler from a third node.

It sometimes throws the error below. I am using Python 2.7, tornado 4.5.2, and TensorFlow 1.3.0.

The following is a minimal script that can be used to reproduce the mentioned error, which appears more often than not.

import os, sys
import subprocess

from dask.distributed import Variable, Client
#import psutil

import time, json, shlex

## The following function/task will be executed on the dask-worker, which runs on a separate node in the cluster.
def my_task(stop, is_alive):
  proc = None
  proc_started = False
  try:
    while True:
      if stop.get():
        proc.terminate()
        return
      else:
        if not proc_started:
          ### The train_image_classifier.py script does the training on the set of images for classification;
          ### it runs smoothly when executed as a standalone script.

          ### Start a child process in which the train_image_classifier.py training will run.
          proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))
          proc_started = True
      ### proc.poll() is None while the child is still running, and its return code once it exits.
      is_alive.set(proc.poll())
  except Exception:
    is_alive.set(proc.poll())
  finally:
    is_alive.set(proc.poll())
    return

### This dask-client script launches the task on the dask-worker through the dask-scheduler:
### it submits the task, listens for the running status of the task, and sends a stop signal to the dask-worker to stop the live task.
if __name__ == '__main__':

  client = Client("198.152.1.2:8786")  # creating a dask client connected to the scheduler

  ### These two distributed variables are used for two-way communication between the dask client and the dask-worker.
  stop = Variable("stop_", client=client)
  is_alive = Variable("is_alive_", client=client)

  ### Initialise the stop flag so the worker's first stop.get() does not block forever.
  stop.set(False)

  future = client.submit(my_task, stop, is_alive)

  ### Poll whether the running task is alive or not;
  ### once it is no longer alive, send the stop signal and come out of the execution.
  while True:
    if is_alive.get() is not None:  # poll() returned an exit code, i.e. the child process has finished
      stop.set(True)
      break
  print("Execution over! Returning to the caller.")

And here is the error description from the trace. It appears during the execution of the training process, but only sometimes; other times the training completes normally.

distributed.utils - ERROR - Existing exports of data: object cannot be re-sized
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
result[0] = yield make_coro()

File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/variable.py", line 179, in _get
client=self.client.id)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/core.py", line 464, in send_recv_from_rpc
result = yield send_recv(comm=comm, op=key, **kwargs)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/distributed/core.py", line 348, in send_recv
yield comm.write(msg)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1069, in run
yielded = self.gen.send(value)
File "/usr/lib/python2.7/site-packages/distributed/comm/tcp.py", line 218, in write
future = stream.write(frame)
File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 406, in write
self._handle_write()
File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 872, in _handle_write
del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized

distributed.worker - WARNING - Compute Failed
Function: my_task
args: ({'upper': '1.4', 'trainable_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'checkpoint_path': '/home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt', 'log_every_n_steps': '1', 'dataset_split_name': 'train', 'learning_rate': '0.01', 'train_dir': '/home/mapr/mano/slim_data/flowers/train_dir/train_outs_19', 'clone_on_cpu': 'True', 'batch_size': '32', 'resize_method': '3', 'hue_max_delta': '0.3', 'lower': '0.6', 'trace_every_n_steps': '1', 'script_name': 'train_image_classifier.py', 'checkpoint_exclude_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'dataset_dir': '/home/mapr/mano/slim_data/flowers/slim_data_dir', 'max_number_of_steps': '4', 'model_name': 'inception_v3', 'dataset_name': 'flowers'})
kwargs: {}
Exception: BufferError('Existing exports of data: object cannot be re-sized',)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/mapr/mano/slim_data/flowers/train_dir/train_outs_19/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 2.6281 (19.799 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = nan (7.406 sec/step)
INFO:tensorflow:global step 3: loss = nan (6.953 sec/step)
INFO:tensorflow:global step 4: loss = nan (6.840 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.

@TheCodeCache
Author

After several runs, what I've noticed is that the error starts appearing at least after this line in the code posted above:

proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))

I am still not able to pinpoint the root cause of this issue. :(

@TheCodeCache
Author

Moreover, I noticed that even after trying a different sample script in place of train_image_classifier.py, I still face the same issue.

@TheCodeCache TheCodeCache changed the title distributed.utils - ERROR - Existing exports of data: object cannot be re-sized BufferError: - ERROR - Existing exports of data: object cannot be re-sized Jan 23, 2018
@TheCodeCache
Author

TheCodeCache commented Jan 24, 2018

The solution to this issue was to upgrade the dask.distributed version, which in turn upgraded the tornado version, and that resolved the issue. :)

pip install distributed --upgrade

Please see the link below:
http://www.tornadoweb.org/en/stable/releases/v4.5.3.html#tornado-iostream
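
A quick way to confirm that the upgrade actually pulled in a new enough tornado (the release notes linked above are for 4.5.3, which is where the iostream fix referenced there landed) is a small version check along these lines:

import distributed
import tornado

# Print the installed versions and fail loudly if tornado is still older than 4.5.3.
print("distributed:", distributed.__version__)
print("tornado:", tornado.version)
assert tornado.version_info >= (4, 5, 3), "tornado is still older than 4.5.3"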

@TheCodeCache
Author

Closing the issue..

@mrocklin
Member

That's an interesting error. I'm not sure what was going on but am happy to hear that it is resolved for you. Thank you both for reporting and for posting your status once you resolved the problem. Hopefully this is helpful to others in the future.

@jackeown

I'm having this problem now. Upgrading did not fix it.
When I google it, I get results for tornado issues.
I'll post more if I get a solution that works for me.

@jackeown

Upgrading dask distributed didn't upgrade tornado, but when I manually upgraded tornado, it got rid of the error, so everything is good now!
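
For anyone else hitting this: upgrading distributed alone may leave an old tornado in place (as happened here), so tornado may need to be upgraded explicitly as well, e.g.:

pip install tornado --upgrade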

@mrocklin
Member

mrocklin commented Jan 17, 2019 via email
