[tune] long_running_horovod_tune_test errors with file not found error #27165

Closed
3 of 4 tasks
krfricke opened this issue Jul 28, 2022 · 5 comments
Assignees
Labels
bug (Something that is supposed to be working; but isn't), release-blocker (P0 Issue that blocks the release)

Comments

@krfricke
Contributor

krfricke commented Jul 28, 2022

What happened + What you expected to happen

This test fails with the error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_results/HorovodTrainer_2022-07-28_02-16-46/HorovodTrainer_00dc0_00002_2_lr=0.3000_2022-07-28_02-17-00/checkpoint_-00001/.tune_metadata'

This happens during trial-to-driver syncing, so it was not solved by #26725.

I think there are three things here:

https://buildkite.com/ray-project/release-tests-branch/builds/832#0182422c-f590-4f0e-b05c-a64e2c2341ec

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 838, in _wait_and_handle_event
    trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 962, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1046, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1105, in _process_trial_result
    result=result.copy(),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 70, in sync_dir_between_nodes
    return_futures=return_futures,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 174, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2247, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=1909, ip=172.31.82.95)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 385, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 346, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=1305, ip=172.31.89.172, repr=<ray.tune.utils.file_transfer._PackActor object at 0x7f531d725210>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_get_recursive_files_and_stats() (pid=1909, ip=172.31.82.95)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 218, in _get_recursive_files_and_stats
    stat = os.lstat(os.path.join(path, key))
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_results/HorovodTrainer_2022-07-28_02-16-46/HorovodTrainer_00dc0_00002_2_lr=0.3000_2022-07-28_02-17-00/checkpoint_-00001/.tune_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tuner.py", line 234, in fit
    return self._local_tuner.fit()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 283, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/impl/tuner_internal.py", line 381, in _fit_internal
    **args,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/tune.py", line 724, in run
    runner.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 870, in step
    self._wait_and_handle_event(next_trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 849, in _wait_and_handle_event
    raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 838, in _wait_and_handle_event
    trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 962, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1046, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/execution/trial_runner.py", line 1105, in _process_trial_result
    result=result.copy(),
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/callback.py", line 329, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 529, in on_trial_result
    self._sync_trial_dir(trial, force=False, wait=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 494, in _sync_trial_dir
    sync_process.wait()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 127, in wait
    raise exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/syncer.py", line 108, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 70, in sync_dir_between_nodes
    return_futures=return_futures,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 174, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2247, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::_unpack_from_actor() (pid=1909, ip=172.31.82.95)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 385, in _unpack_from_actor
    for buffer in _iter_remote(pack_actor):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 346, in _iter_remote
    buffer = ray.get(actor.next.remote())
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_PackActor.__init__() (pid=1305, ip=172.31.89.172, repr=<ray.tune.utils.file_transfer._PackActor object at 0x7f531d725210>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_get_recursive_files_and_stats() (pid=1909, ip=172.31.82.95)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/file_transfer.py", line 218, in _get_recursive_files_and_stats
    stat = os.lstat(os.path.join(path, key))
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/ray_results/HorovodTrainer_2022-07-28_02-16-46/HorovodTrainer_00dc0_00002_2_lr=0.3000_2022-07-28_02-17-00/checkpoint_-00001/.tune_metadata'
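
The innermost frame fails in `os.lstat` on a path that was listed but no longer exists, which looks like a file disappearing between the directory listing and the stat call. As a rough illustration only (this is not the code in `file_transfer.py`, and not necessarily what #27168 does), a recursive stat walk can tolerate that race by skipping files that vanish mid-scan. The function name and the `(mtime, size)` return shape below are assumptions made for this sketch:

```python
import os
from typing import Dict, Tuple


def get_recursive_files_and_stats(path: str) -> Dict[str, Tuple[float, int]]:
    """Map each file under ``path`` (as a relative path) to (mtime, size).

    Hypothetical helper for illustration: files that disappear between
    the directory listing and the stat call are skipped, not raised.
    """
    files_stats = {}
    for root, _dirs, files in os.walk(path):
        rel_root = os.path.relpath(root, path)
        for name in files:
            # normpath turns "./a.txt" into "a.txt" for top-level files.
            key = os.path.normpath(os.path.join(rel_root, name))
            try:
                stat = os.lstat(os.path.join(path, key))
            except FileNotFoundError:
                # The file was removed after os.walk listed it; skip it.
                continue
            files_stats[key] = (stat.st_mtime, stat.st_size)
    return files_stats
```

Skipping a vanished file means the sync transfers whatever is still present at scan time; whether that is acceptable for a half-deleted checkpoint directory is a separate question.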

Versions / Dependencies

Latest master / release 2.0.0 branch

Reproduction script

Run release test

Issue Severity

High: It blocks me from completing my task.

@krfricke added the bug, triage, and release-blocker labels and removed the triage label Jul 28, 2022
@krfricke assigned krfricke and scv119 and unassigned krfricke Jul 28, 2022
@krfricke
Contributor Author

Please note that I believe only #27168 has to be picked to resolve the release blocker.

#27174 is a nice polish, but it should be enough to pick it up in 2.1. cc @richardliaw

@xwjiang2010
Contributor

xwjiang2010 commented Jul 28, 2022

FYI, there is already a ticket here: #26724

By the way, at one point it was decided that this is not a release blocker.

@krfricke
Contributor Author

Ah, thanks! I believe the fix in #27168 is small enough to be worth picking onto 2.0.0. cc @richardliaw

@zhe-thoughts
Collaborator

@matthewdeng is checking whether this is still failing.

@matthewdeng
Contributor
