What happened + What you expected to happen
I tried running a simple RL training with RLlib and set checkpoint_score_attr="evaluation/episode_reward_mean" in tune.run(). The training ran properly except for the checkpoint saving, which showed this error message:
2022-08-09 07:20:15,138 ERROR checkpoint_manager.py:320 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['evaluation', 'custom_metrics', 'episode_media', 'num_recreated_workers', 'info', 'sampler_results', 'episode_reward_max', 'episode_reward_min', 'episode_reward_mean', 'episode_len_mean', 'episodes_this_iter', 'policy_reward_min', 'policy_reward_max', 'policy_reward_mean', 'hist_stats', 'sampler_perf', 'num_faulty_episodes', 'num_healthy_workers', 'num_agent_steps_sampled', 'num_agent_steps_trained', 'num_env_steps_sampled', 'num_env_steps_trained', 'num_env_steps_sampled_this_iter', 'num_env_steps_trained_this_iter', 'timesteps_total', 'num_steps_trained_this_iter', 'agent_timesteps_total', 'timers', 'counters', 'done', 'episodes_total', 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp', 'time_this_iter_s', 'time_total_s', 'pid', 'hostname', 'node_ip', 'config', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time', 'perf', 'experiment_tag']
I saw that this behavior had been previously reported (#14374, #14377) and resolved (#14375, #14379), but it has reoccurred. Apparently, this line somehow doesn't reflect the mentioned pull requests; see the sketch below for what the fixed lookup presumably does.
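For context, the error suggests the checkpoint manager treats the score attribute as a literal top-level key instead of resolving the "/"-delimited path. A minimal sketch of the nested lookup the referenced fix presumably performs (my own illustration, not Ray's actual implementation; resolve_nested_key is a hypothetical helper name):

def resolve_nested_key(result: dict, attr: str, delimiter: str = "/"):
    # Walk a "/"-delimited path such as "evaluation/episode_reward_mean"
    # through the nested result dict, i.e. result["evaluation"]["episode_reward_mean"].
    value = result
    for part in attr.split(delimiter):
        if not isinstance(value, dict) or part not in value:
            raise KeyError(f"Result dict has no key: {attr}")
        value = value[part]
    return value

# e.g. resolve_nested_key(result, "evaluation/episode_reward_mean")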
Versions / Dependencies
ray[rllib]
ray==2.0.0rc0
Python 3.8.10
Tested on headless server and used virtual display (probably irrelevant)
Reproduction script
I think the script in the old issue is still valid, but I tested with this similar script:
import ray
from ray import tune
from pyvirtualdisplay import Display

if __name__ == "__main__":
    ray.init()
    config = {
        "env": "CartPole-v1",
        "framework": "torch",
        "timesteps_per_iteration": 10,
        "evaluation_interval": 1,
        "evaluation_num_episodes": 1,
    }
    # Virtual display for the headless server (probably irrelevant to the bug).
    with Display(visible=False, size=(1400, 900)) as disp:
        analysis = tune.run(
            "DQN",
            stop={"num_env_steps_trained": 2000},
            config=config,
            num_samples=1,
            checkpoint_freq=1,
            keep_checkpoints_num=1,
            checkpoint_score_attr="evaluation/episode_reward_mean",
        )
    ray.shutdown()
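As a temporary workaround (my assumption, not a confirmed fix), any of the top-level keys listed in the error message appears usable, e.g. scoring checkpoints on the training-time reward instead of the evaluation reward:

# Stopgap sketch: "episode_reward_mean" is a top-level key in the result dict
# (per the valid-keys list in the error above), so the checkpoint manager can
# resolve it; checkpoints are then scored on training rather than evaluation rewards.
analysis = tune.run(
    "DQN",
    stop={"num_env_steps_trained": 2000},
    config=config,
    num_samples=1,
    checkpoint_freq=1,
    keep_checkpoints_num=1,
    checkpoint_score_attr="episode_reward_mean",
)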
Issue Severity
High: It blocks me from completing my task.
Juno-T added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Aug 9, 2022.
xwjiang2010 added the tune (Tune-related issues) and P2 (Important issue, but not time-critical) labels and removed the triage label on Aug 9, 2022.