[Bug] [tune] Tune hangs (Tune event loop has been backlogged processing new results) #18903
Which version of Ray are you using? On my laptop (16 vCPU) the script above runs about 5 samples/second. You can speed this up drastically if you disable logging (~50 samples/second on my laptop):
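(The snippet from this reply is not reproduced above; the following is a minimal sketch of what disabling Tune's logging might look like on Ray ~1.6. The TUNE_DISABLE_AUTO_CALLBACK_LOGGERS environment variable and the easy_objective stub are assumptions made for illustration, not the original code.)

```python
import os

# Assumed env var: disables Tune's default CSV/JSON logger callbacks.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

from ray import tune

def easy_objective(config):
    # Trivially cheap objective, so logging overhead dominates runtime.
    tune.report(mean_loss=(config["x"] - 2) ** 2)

tune.run(
    easy_objective,
    config={"x": tune.uniform(0, 10)},
    num_samples=1000,
    verbose=0,  # silence per-result console output
)
```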
On your cluster you might want to disable syncing as well:
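(Again a sketch rather than the original snippet; tune.SyncConfig(syncer=None) is the assumed way to turn off trial syncing here.)

```python
from ray import tune

tune.run(
    easy_objective,  # same trainable as in the sketch above
    config={"x": tune.uniform(0, 10)},
    num_samples=1000,
    verbose=0,
    # Assumed API: syncer=None disables syncing of trial directories between nodes.
    sync_config=tune.SyncConfig(syncer=None),
)
```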
ray 1.6.0. Thanks, will try that out.
The warning is still there. ray monitor still shows 52/52 CPUs in use, whereas only about 8 CPUs are actually busy on the whole cluster (it started out using all of them, then after ~30 s fewer and fewer).
@krfricke After running with only 4-8 CPUs in use on the cluster for about 20 h, my script exits with:
2021-09-27 19:54:40,066 ERROR trial_runner.py:773 -- Trial trainable_c64c9714: Error processing event.
Traceback (most recent call last):
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 747, in _process_trial
decision = self._process_trial_result(trial, result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 800, in _process_trial_result
trial.trial_id, result=flat_result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
self._storage.set_trial_values(trial_id, values)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
Traceback (most recent call last):
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 747, in _process_trial
decision = self._process_trial_result(trial, result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 800, in _process_trial_result
trial.trial_id, result=flat_result)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 662, in tell
self._storage.set_trial_values(trial_id, values)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 330, in set_trial_values
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "notebooks/factor/price_prediction.py", line 163, in <module>
reuse_actors=True
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/tune.py", line 532, in run
runner.step()
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 554, in step
self._process_events(timeout=timeout) # blocking
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 712, in _process_events
self._process_trial(trial)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 774, in _process_trial
self._process_trial_failure(trial, traceback.format_exc())
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 985, in _process_trial_failure
self._search_alg.on_trial_complete(trial.trial_id, error=True)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/search_generator.py", line 130, in on_trial_complete
trial_id=trial_id, result=result, error=error)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/ray/tune/suggest/optuna.py", line 385, in on_trial_complete
self._ot_study.tell(ot_trial, val, state=ot_trial_state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/study/study.py", line 664, in tell
self._storage.set_trial_state(trial_id, state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_in_memory.py", line 223, in set_trial_state
self.check_trial_is_updatable(trial_id, trial.state)
File "/home/toaster/PROGS/miniconda3/envs/puma-lab/lib/python3.7/site-packages/optuna/storages/_base.py", line 723, in check_trial_is_updatable
"Trial#{} has already finished and can not be updated.".format(trial.number)
RuntimeError: Trial#23202 has already finished and can not be updated.
I assume this is connected to this issue rather than a separate one.
This is still the case on ray 1.7.1 and nightly. Some additional data: the number of trials reported per status update drops from 150+ to about 10, which makes Tune unusable (on a 2-node cluster):
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 16.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 7a3e1fea with tp=288254 and parameters={'window_diff_mean': 51, 'entry_roll_mean_threshold': -0.1, 'time_period': 13}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 418/infinite (1 PENDING, 16 RUNNING, 401 TERMINATED)
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 16.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 7f4de2c2 with tp=350092 and parameters={'window_diff_mean': 19, 'entry_roll_mean_threshold': -0.1, 'time_period': 10}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 599/infinite (1 PENDING, 16 RUNNING, 582 TERMINATED)
== Status ==
Memory usage on this node: 10.4/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 80bc5a94 with tp=380841 and parameters={'window_diff_mean': 7, 'entry_roll_mean_threshold': -0.10099999999999998, 'time_period': 6}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 729/infinite (17 RUNNING, 712 TERMINATED)
...
== Status ==
Memory usage on this node: 15.6/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 43.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 92491f4a with tp=390800 and parameters={'window_diff_mean': 5, 'entry_roll_mean_threshold': -0.1, 'time_period': 5}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 11663/infinite (1 PENDING, 43 RUNNING, 11619 TERMINATED)
== Status ==
Memory usage on this node: 15.8/31.3 GiB
Using FIFO scheduling algorithm.
Resources requested: 43.0/44 CPUs, 0/1 GPUs, 0.0/92.23 GiB heap, 0.0/43.52 GiB objects (0.0/1.0 accelerator_type:G)
Current best trial: 92491f4a with tp=390800 and parameters={'window_diff_mean': 5, 'entry_roll_mean_threshold': -0.1, 'time_period': 5}
Result logdir: /home/toaster/ray_results/calibrate_feature_EURUSD_DEMADiff
Number of trials: 11674/infinite (1 PENDING, 43 RUNNING, 11630 TERMINATED)
This is the same issue as in #12352. It looks like Tune is busy suggesting parameters while the workers in the cluster are idling:
This is due to Optuna itself being backlogged by the number of results; unfortunately, this is not something we can influence. With such a large number of trials, the underlying model requires a non-trivial amount of time to suggest new points. I'd consider using simple random search through Tune's default search algorithm instead. Let me know if this helps! Feel free to reopen the issue in case of more questions.
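(For illustration, a hedged sketch of what switching from OptunaSearch to Tune's built-in random sampling could look like; the objective, metric wiring, and parameter ranges below are placeholders, not the reporter's script.)

```python
from ray import tune

def objective(config):
    # Placeholder for the actual trainable; the metric name "tp" mirrors the status output above.
    tune.report(tp=config["window_diff_mean"] * config["time_period"])

tune.run(
    objective,
    # With no search_alg given, Tune falls back to its built-in random/grid sampler,
    # which keeps suggestion time roughly constant as the trial count grows.
    config={
        "window_diff_mean": tune.randint(1, 60),
        "entry_roll_mean_threshold": tune.uniform(-0.2, 0.0),
        "time_period": tune.randint(2, 20),
    },
    num_samples=10000,
    metric="tp",
    mode="max",
)
```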
Search before asking
Ray Component
Ray Tune
What happened + What you expected to happen
Tune hangs (only about 2 CPUs are utilized on a cluster of 52 CPUs), while ray monitor reports that all CPUs are in use.
Reproduction script
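(The actual reproduction script is not included here. The sketch below is a hypothetical approximation, modeled on Tune's easy_objective example and the description in this report: a near-instant objective, OptunaSearch, and an unlimited number of samples.)

```python
# Hypothetical sketch only -- not the reporter's original script.
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

def easy_objective(config):
    # Finishes almost instantly, so the Tune driver / searcher becomes the bottleneck.
    tune.report(mean_loss=(config["height"] - 50) ** 2 + config["width"])

tune.run(
    easy_objective,
    search_alg=OptunaSearch(metric="mean_loss", mode="min"),
    num_samples=-1,  # run indefinitely ("infinite" trials)
    config={
        "width": tune.uniform(0, 20),
        "height": tune.uniform(-100, 100),
    },
)
```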
Anything else
The script above runs really slow considering how fast easy_objective is calculated: about 20 trials per 20 s on a 52 CPU cluster.
Are you willing to submit a PR?