[tune] Error saving checkpoint based on nested metric score #27701

Closed
Juno-T opened this issue Aug 9, 2022 · 2 comments · Fixed by #27715
Labels
bug (Something that is supposed to be working; but isn't) · P2 (Important issue, but not time-critical) · tune (Tune-related issues)

Comments

Juno-T commented Aug 9, 2022

What happened + What you expected to happen

I tried running a simple RL training with RLlib and set checkpoint_score_attr="evaluation/episode_reward_mean" in tune.run(). The training ran properly except for checkpoint saving, which produced this error message:

2022-08-09 07:20:15,138 ERROR checkpoint_manager.py:320 -- Result dict has no key: evaluation/episode_reward_mean. checkpoint_score_attr must be set to a key in the result dict. Valid keys are: ['evaluation', 'custom_metrics', 'episode_media', 'num_recreated_workers', 'info', 'sampler_results', 'episode_reward_max', 'episode_reward_min', 'episode_reward_mean', 'episode_len_mean', 'episodes_this_iter', 'policy_reward_min', 'policy_reward_max', 'policy_reward_mean', 'hist_stats', 'sampler_perf', 'num_faulty_episodes', 'num_healthy_workers', 'num_agent_steps_sampled', 'num_agent_steps_trained', 'num_env_steps_sampled', 'num_env_steps_trained', 'num_env_steps_sampled_this_iter', 'num_env_steps_trained_this_iter', 'timesteps_total', 'num_steps_trained_this_iter', 'agent_timesteps_total', 'timers', 'counters', 'done', 'episodes_total', 'training_iteration', 'trial_id', 'experiment_id', 'date', 'timestamp', 'time_this_iter_s', 'time_total_s', 'pid', 'hostname', 'node_ip', 'config', 'time_since_restore', 'timesteps_since_restore', 'iterations_since_restore', 'warmup_time', 'perf', 'experiment_tag']

I saw that this behavior had previously been reported (#14374, #14377) and resolved (#14375, #14379), but it has reoccurred. Apparently, this line no longer reflects the mentioned pull requests.
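
For illustration, here is a minimal sketch (not Tune's actual implementation) of how a nested metric key such as "evaluation/episode_reward_mean" can be resolved by flattening the result dict with "/" as the delimiter, which is the behavior the earlier fixes restored:

    # Minimal sketch, not Tune's actual code: resolve a nested metric key such as
    # "evaluation/episode_reward_mean" by flattening the result dict with "/"
    # as the delimiter.
    def flatten(d, delimiter="/", prefix=""):
        flat = {}
        for key, value in d.items():
            full_key = f"{prefix}{delimiter}{key}" if prefix else key
            if isinstance(value, dict):
                flat.update(flatten(value, delimiter, full_key))
            else:
                flat[full_key] = value
        return flat

    result = {
        "episode_reward_mean": 98.7,
        "evaluation": {"episode_reward_mean": 123.4},
    }
    assert flatten(result)["evaluation/episode_reward_mean"] == 123.4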

Versions / Dependencies

ray[rllib]
ray==2.0.0rc0
Python 3.8.10

Tested on a headless server with a virtual display (probably irrelevant)

Reproduction script

I think the script in the old issue is still valid, but I tested with this similar script:

import ray
from ray import tune

from pyvirtualdisplay import Display

if __name__ == "__main__":
    ray.init()

    config = {
        "env": "CartPole-v1",
        "framework": "torch",

        "timesteps_per_iteration": 10,
        "evaluation_interval": 1,
        "evaluation_num_episodes": 1,
    }
    with Display(visible=False, size=(1400, 900)) as disp:
        analysis = tune.run(
            "DQN",
            stop={"num_env_steps_trained": 2000},
            config=config,
            num_samples=1,
            checkpoint_freq=1,
            keep_checkpoints_num=1,
            # Nested metric under "evaluation", using "/" as the delimiter
            checkpoint_score_attr="evaluation/episode_reward_mean"
        )

    ray.shutdown()

Issue Severity

High: It blocks me from completing my task.

@Juno-T Juno-T added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 9, 2022
@xwjiang2010 xwjiang2010 added tune Tune-related issues P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 9, 2022
@xwjiang2010
Contributor

Thanks for reporting this @Juno-T !
This is indeed a regression introduced by a PR. Putting up a fix and a test now.
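
For reference, the error message suggests the attribute is only being checked against the top-level result keys rather than the flattened result; a rough, hypothetical sketch of the difference (not the actual checkpoint_manager.py code):

    # Hypothetical illustration, not the actual Ray code in checkpoint_manager.py.
    result = {"episode_reward_mean": 98.7,
              "evaluation": {"episode_reward_mean": 123.4}}
    attr = "evaluation/episode_reward_mean"

    # Regressed behavior: the lookup only sees the top-level keys, so a nested
    # metric is missing and produces the "Result dict has no key" error.
    assert attr not in result

    # Intended behavior (as in the earlier fixes): flatten the nested result
    # with "/" as the delimiter before looking up the checkpoint score.
    flat_result = {"episode_reward_mean": 98.7,
                   "evaluation/episode_reward_mean": 123.4}  # e.g. via a flatten helper
    assert flat_result[attr] == 123.4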

@sven1977
Contributor

Thanks for your help on this, @xwjiang2010 ! :)
