[tune] Error saving checkpoint based on nested metric score #27701

Closed
Juno-T opened this issue Aug 9, 2022 · 2 comments · Fixed by #27715
Labels
bug (Something that is supposed to be working; but isn't) · P2 (Important issue, but not time-critical) · tune (Tune-related issues)

Comments

Juno-T commented Aug 9, 2022

What happened + What you expected to happen

I tried running a simple RL training with RLlib and set checkpoint_score_attr="evaluation/episode_reward_mean" in tune.run(). The training ran properly except for checkpoint saving, which produced this error message:

2022-08-09 07:20:15,138 ERROR checkpoint_manager.py:320 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['evaluation', 'custom_metrics', 'episode_media', 'num_recreated_workers', 'info', 'sampler_results', 'episode_reward_max', 'episode_reward_min', 'episode_reward_mean', 'episode_len_mean', 'episodes_this_iter', 'policy_reward_min', 'policy_reward_max', 'policy_reward_mean', 'hist_stats', 'sampler_perf', 'num_faulty_episodes', 'num_healthy_workers', 'num_agent_steps_sampled', 'num_agent_steps_trained', 'num_env_steps_sampled', 'num_env_steps_trained', 'num_env_steps_sampled_this_iter', 'num_env_steps_trained_this_iter', 'timesteps_total', 'num_steps_trained_this_iter', 'agent_timesteps_total', 'timers', 'counters', 'done', 'episodes_total', 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp', 'time_this_iter_s', 'time_total_s', 'pid', 'hostname', 'node_ip', 'config', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time', 'perf', 'experiment_tag']

I saw that this behavior had previously been reported (#14374, #14377) and resolved (#14375, #14379), but it has reoccurred. Apparently, this line no longer reflects the mentioned pull requests.
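
For illustration, here is a minimal sketch (not Tune's actual implementation) of how a nested metric key such as "evaluation/episode_reward_mean" can be resolved by flattening the result dict with "/" as the delimiter, which is the behavior the earlier fixes restored:

    # Minimal sketch, not Tune's actual code: resolve a nested metric key such as
    # "evaluation/episode_reward_mean" by flattening the result dict with "/"
    # as the delimiter.
    def flatten(d, delimiter="/", prefix=""):
        flat = {}
        for key, value in d.items():
            full_key = f"{prefix}{delimiter}{key}" if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, delimiter, full_key))
            else:
                flat[full_key] = value
        return flat

    result = {
        "episode_reward_mean": 98.7,
        "evaluation": {"episode_reward_mean": 123.4},
    }
    assert flatten(result)["evaluation/episode_reward_mean"] == 123.4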

Versions / Dependencies

ray[rllib]
ray==2.0.0rc0
Python 3.8.10

Tested on a headless server with a virtual display (probably irrelevant)

Reproduction script

I think the script in the old issue is still valid, but I tested with this similar script:

import ray
from ray import tune

from pyvirtualdisplay import Display

if __name__ == "__main__":
    ray.init()

    config = {
        "env": "CartPole-v1",
        "framework": "torch",

        "timesteps_per_iteration": 10,
        "evaluation_interval": 1,
        "evaluation_num_episodes": 1,
    }
    with Display(visible=False, size=(1400, 900)) as disp:
        analysis = tune.run(
            "DQN",
            stop={"num_env_steps_trained": 2000},
            config=config,
            num_samples=1,
            checkpoint_freq=1,
            keep_checkpoints_num=1,
            # Nested metric under "evaluation", using "/" as the delimiter
            checkpoint_score_attr="evaluation/episode_reward_mean"
        )

    ray.shutdown()

Issue Severity

High: It blocks me from completing my task.

@Juno-T Juno-T added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 9, 2022
@xwjiang2010 xwjiang2010 added tune Tune-related issues P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 9, 2022
@xwjiang2010
Contributor

Thanks for reporting this @Juno-T !
This is indeed a regression introduced by a PR. Putting up a fix and a test now.
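
For reference, the error message suggests the attribute is only being checked against the top-level result keys rather than the flattened result; a rough, hypothetical sketch of the difference (not the actual checkpoint_manager.py code):

    # Hypothetical illustration, not the actual Ray code in checkpoint_manager.py.
    result = {"episode_reward_mean": 98.7,
              "evaluation": {"episode_reward_mean": 123.4}}
    attr = "evaluation/episode_reward_mean"

    # Regressed behavior: the lookup only sees the top-level keys, so a nested
    # metric is missing and produces the "Result dict has no key" error.
    assert attr not in result

    # Intended behavior (as in the earlier fixes): flatten the nested result
    # with "/" as the delimiter before looking up the checkpoint score.
    flat_result = {"episode_reward_mean": 98.7,
                   "evaluation/episode_reward_mean": 123.4}  # e.g. via a flatten helper
    assert flat_result[attr] == 123.4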

@sven1977
Contributor

Thanks for your help on this, @xwjiang2010 ! :)
