ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=1305, ip=172.31.89.172, repr=<ray.tune.utils.file_transfer._PackActor object at 0x7f531d725210>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_get_recursive_files_and_stats() (pid=1909, ip=172.31.82.95)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 218, in _get_recursive_files_and_stats
    stat = os.lstat(os.path.join(path, key))
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_results/HorovodTrainer_2022-07-28_02-16-46/HorovodTrainer_00dc0_00002_2_lr=0.3000_2022-07-28_02-17-00/checkpoint_-00001/.tune_metadata'
This happens because the checkpoint is deleted (in a remote task) while we are in this loop:
    files_stats = {}
    for root, dirs, files in os.walk(path, topdown=False):
        rel_root = os.path.relpath(root, path)
        for file in files:
            key = os.path.join(rel_root, file)
            stat = os.lstat(os.path.join(path, key))
            files_stats[key] = stat.st_mtime, stat.st_size
We should gracefully handle file not found errors here.
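One possible way to do that, as a minimal sketch rather than the actual fix: wrap the `os.lstat` call in a `try`/`except FileNotFoundError` and skip files that disappear between being listed by `os.walk` and being stat-ed. The function name, type hints, and return value below are assumptions for illustration, not the real signature in `file_transfer.py`.

```python
import os
from typing import Dict, Tuple


def get_recursive_files_and_stats(path: str) -> Dict[str, Tuple[float, int]]:
    """Map each file under `path` to (mtime, size), skipping files that vanish.

    Hypothetical standalone version of the loop above; not Ray's implementation.
    """
    files_stats = {}
    for root, dirs, files in os.walk(path, topdown=False):
        rel_root = os.path.relpath(root, path)
        for file in files:
            key = os.path.join(rel_root, file)
            try:
                stat = os.lstat(os.path.join(path, key))
            except FileNotFoundError:
                # The file was deleted (e.g. by a concurrent checkpoint
                # cleanup) after os.walk listed it; ignore it and move on.
                continue
            files_stats[key] = stat.st_mtime, stat.st_size
    return files_stats
```

Skipping the missing file keeps the walk going, but the caller may still end up with a partial view of a checkpoint that is being deleted, so whether to skip, retry, or abort the sync is a design decision for the fix.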
See #27165