[Bug] [tune] Tune hangs (Tune event loop has been backlogged processing new results) #18903
Which version of Ray are you using? On my laptop (16 vCPU) the script above runs about 5 samples/second. You can speed this up drastically if you disable logging (~50 samples/second on my laptop):
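(The snippet from this reply is not reproduced above; the following is a minimal sketch of what disabling Tune's logging might look like on Ray ~1.6. The TUNE_DISABLE_AUTO_CALLBACK_LOGGERS environment variable and the easy_objective stub are assumptions made for illustration, not the original code.)

```python
import os

# Assumed env var: disables Tune's default CSV/JSON logger callbacks.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

from ray import tune

def easy_objective(config):
    # Trivially cheap objective, so logging overhead dominates runtime.
    tune.report(mean_loss=(config["x"] - 2) ** 2)

tune.run(
    easy_objective,
    config={"x": tune.uniform(0, 10)},
    num_samples=1000,
    verbose=0,  # silence per-result console output
)
```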
On your cluster you might want to disable syncing as well:
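(Again a sketch rather than the original snippet; tune.SyncConfig(syncer=None) is the assumed way to turn off trial syncing here.)

```python
from ray import tune

tune.run(
    easy_objective,  # same trainable as in the sketch above
    config={"x": tune.uniform(0, 10)},
    num_samples=1000,
    verbose=0,
    # Assumed API: syncer=None disables syncing of trial directories between nodes.
    sync_config=tune.SyncConfig(syncer=None),
)
```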
ray 1.6.0. Thanks, will try that out.
The warning is still there. ray monitor still shows 52/52 CPUs in use, whereas only about 8 CPUs are actually busy on the whole cluster (it started out using all of them, then after ~30 s fewer and fewer).
@krfricke After running with only 4-8 CPUs in use on the cluster for about 20 h, my script exits with:
2021-09-27 19:54:40,066 ERROR trial_runner.py:773 -- Trial trainable_c64c9714: Error processing event.
Traceback (most recent call last):
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 747, in _process_trial
decision = self._process_trial_result(trial, result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 800, in _process_trial_result
trial.trial_id, result=flat_result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
self._storage.set_trial_values(trial_id, values)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
Traceback (most recent call last):
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 747, in _process_trial
decision = self._process_trial_result(trial, result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 800, in _process_trial_result
trial.trial_id, result=flat_result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
self._storage.set_trial_values(trial_id, values)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "notebooks/factor/price_prediction.py", line 163, in <module>
reuse_actors=True
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/tune.py", line 532, in run
runner.step()
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 554, in step
self._process_events(timeout=timeout) # blocking
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 712, in _process_events
self._process_trial(trial)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 774, in _process_trial
self._process_trial_failure(trial, traceback.format_exc())
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 985, in _process_trial_failure
self._search_alg.on_trial_complete(trial.trial_id, error=True)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 664, in tell
self._storage.set_trial_state(trial_id, state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 223, in set_trial_state
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
I assume this is connected to this issue rather than a separate one.
This is still the case on ray 1.7.1 and nightly. Some additional data: the number of trials reported per status update drops from 150+ to about 10, which makes Tune unusable (on a 2-node cluster):
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 16.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 7a3e1fea with tp=288254 and parameters={'window_diff_mean': 51, 'entry_roll_mean_threshold': -0.1, 'time_period': 13}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 418/infinite (1 PENDING, 16 RUNNING, 401 TERMINATED)
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 16.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 7f4de2c2 with tp=350092 and parameters={'window_diff_mean': 19, 'entry_roll_mean_threshold': -0.1, 'time_period': 10}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 599/infinite (1 PENDING, 16 RUNNING, 582 TERMINATED)
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 80bc5a94 with tp=380841 and parameters={'window_diff_mean': 7, 'entry_roll_mean_threshold': -0.10099999999999998, 'time_period': 6}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 729/infinite (17 RUNNING, 712 TERMINATED)
...
== Status ==
Memory usage on this node: 15.6/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 43.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 92491f4a with tp=390800 and parameters={'window_diff_mean': 5, 'entry_roll_mean_threshold': -0.1, 'time_period': 5}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 11663/infinite (1 PENDING, 43 RUNNING, 11619 TERMINATED)
== Status ==
Memory usage on this node: 15.8/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 43.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 92491f4a with tp=390800 and parameters={'window_diff_mean': 5, 'entry_roll_mean_threshold': -0.1, 'time_period': 5}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 11674/infinite (1 PENDING, 43 RUNNING, 11630 TERMINATED)
This is the same issue as in #12352. It looks like Tune is busy suggesting parameters while the workers in the cluster are idling:
This is due to Optuna itself being backlogged by the number of results; unfortunately, this is not something we can influence. With such a large number of trials, the underlying model requires a non-trivial amount of time to suggest new points. I'd consider using simple random search through Tune's default search algorithm instead. Let me know if this helps! Feel free to reopen the issue in case of more questions.
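(For illustration, a hedged sketch of what switching from OptunaSearch to Tune's built-in random sampling could look like; the objective, metric wiring, and parameter ranges below are placeholders, not the reporter's script.)

```python
from ray import tune

def objective(config):
    # Placeholder for the actual trainable; the metric name "tp" mirrors the status output above.
    tune.report(tp=config["window_diff_mean"] * config["time_period"])

tune.run(
    objective,
    # With no search_alg given, Tune falls back to its built-in random/grid sampler,
    # which keeps suggestion time roughly constant as the trial count grows.
    config={
        "window_diff_mean": tune.randint(1, 60),
        "entry_roll_mean_threshold": tune.uniform(-0.2, 0.0),
        "time_period": tune.randint(2, 20),
    },
    num_samples=10000,
    metric="tp",
    mode="max",
)
```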
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Tune hangs (only about 2 CPUs are utilized on a cluster of 52 CPUs), while ray monitor reports that all CPUs are in use.
Reproduction script
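(The actual reproduction script is not included here. The sketch below is a hypothetical approximation, modeled on Tune's easy_objective example and the description in this report: a near-instant objective, OptunaSearch, and an unlimited number of samples.)

```python
# Hypothetical sketch only -- not the reporter's original script.
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

def easy_objective(config):
    # Finishes almost instantly, so the Tune driver / searcher becomes the bottleneck.
    tune.report(mean_loss=(config["height"] - 50) ** 2 + config["width"])

tune.run(
    easy_objective,
    search_alg=OptunaSearch(metric="mean_loss", mode="min"),
    num_samples=-1,  # run indefinitely ("infinite" trials)
    config={
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
    },
)
```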
Anything else
The script above runs really slow considering how fast easy_objective is calculated: about 20 trials per 20 s on a 52 CPU cluster.
Are you willing to submit a PR?