[rllib] "TypeError: Object of type 'Box' is not JSON serializable" happen when checkpointing at Ray 0.7 #4613

pengzhenghao · 2019-04-12T14:07:20Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
Ray installed from (source or binary): pip install -U wheel/path
Ray version: 0.7.0.dev2
Python version: 3.6
Exact command to reproduce:

I have setup a policy_graphs for multi-agent senario:

    policy_graphs = {
        "policy": (None,
                   obs_space,
                   act_space,
                   {"model": {"custom_model": "CnnPolicy"},
                    "vf_share_layers": True})
    }

wherein obs_space and act_space are both gym.space.Box objects. Concretely, the obs_space is Box(316, ), and the act_space is Box(2,). I am using a custom environment. My environments has observation space like:

obs_space = {k: Box(316,) for k in agents.keys()}

I use the observation_space of one of the agents as the obs_space here, in policy_graphs. Therefore I don't think my custom environment causes the problem.

Then running tune with:

    tune.run(
        "PPO",
        stop={"training_iteration": args.num_iters},
        config={
            ........
            "multiagent": {
                "policy_graphs": policy_graphs,
                "policy_mapping_fn": tune.function(
                    lambda agent_id: "policy"),
            },
        },
    )

which is cloned from the example codes: https://github.com/ray-project/ray/blob/master/python/ray/rllib/examples/multiagent_cartpole.py

Then I got the error:

Traceback (most recent call last):
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 286, in step
    self.checkpoint()
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 177, in checkpoint
    json.dump(runner_state, f, indent=2, cls=_TuneFunctionEncoder)
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/__init__.py", line 179, in dump
    for chunk in iterable:
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 430, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 404, in _iterencode_dict
    yield from chunks
  [Previous line repeated 1 more time]
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 325, in _iterencode_list
    yield from chunks
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 437, in _iterencode
    o = _default(o)
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 52, in default
    return super(_TuneFunctionEncoder, self).default(obj)
  File "/xxx/anaconda3/envs/raydev/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'Box' is not JSON serializable

I tried to dive in to the error. And find that at ray/tune/trial_runner.py, the function checkpoint gather runner_state and uses json.dump to store them.

    def checkpoint(self):
        if not self._metadata_checkpoint_dir:
            return
        metadata_checkpoint_dir = self._metadata_checkpoint_dir
        if not os.path.exists(metadata_checkpoint_dir):
            os.makedirs(metadata_checkpoint_dir)
        runner_state = {
            "checkpoints": list(
                self.trial_executor.get_checkpoints().values()),
            "runner_data": self.__getstate__(),
            "timestamp": time.time()
        }
        tmp_file_name = os.path.join(metadata_checkpoint_dir,
                                     ".tmp_checkpoint")
        with open(tmp_file_name, "w") as f:
            json.dump(runner_state, f, indent=2, cls=_TuneFunctionEncoder) <<<Here!
    ...

I find that the runner_state have structure like:

runner_state = {
    "checkpoints" : [
    {  'config':{
                'multiagent':{
                        'policy_graphs': {
                             'policy': [None, Box(316,), Box(2,), ..]
                            }
                    }
            }
    }
 ]
}

Therefore, it is the policy graph in the multiagent key in the agent.config cause the problem.

I have try the same codes using Ray 0.6.5, with only one modification, the definition of policy graph:

    policy_graphs = {
        "policy": (PPOPolicyGraph, 
        # "policy": (None, # ray 0.7
                   obs_space,
                   act_space,
                   {"model": {"custom_model": "CnnPolicy"},
                    "vf_share_layers": True})
    }

and it turn out that no error is printed.

The error message pop up in the checkpoint frequency. And it do not stop the main process of training. I am not sure is that a common phenomena introduced by Ray 0.7.0 or just a trivial error.

The text was updated successfully, but these errors were encountered:

pengzhenghao · 2019-04-12T14:18:20Z

By the way, I have try the example codes without any argument / modification at multiagent_cartpole.py. The same error printed each time Ray checkpointing.

pengzhenghao · 2019-04-12T14:51:12Z

I have looked up the same codes in Ray 0.6.5, the runner_state there only contain:

runner_state = {
"checkpoints": [{
"config": "80049526....A Very Very Long, And Unreadable String!.....948c0d4d75752e"
}
]
}

I think this explain why the Ray 0.6.5 do not raise such error.

ericl · 2019-04-12T21:40:32Z

Ah, this seems to be due to #4519

Should we revert that @richardliaw ?

richardliaw · 2019-04-12T22:39:01Z

Hm, so we need to have some part of the config properly jsonified for CLI viewing. I think the proper fix is to have the encoder fall back to cloudpickle if fails?

ericl · 2019-04-12T23:22:23Z

There's the safe fallback encoder?

…

On Fri, Apr 12, 2019, 3:39 PM Richard Liaw ***@***.***> wrote: Hm, so we need to have some part of the config properly jsonified for CLI viewing. I think the proper fix is to have the encoder fall back to cloudpickle if fails? — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#4613 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAA6SniWzkPZ5B5TGSAvB6Edp7qRY7hUks5vgQsOgaJpZM4csNM7> .

richardliaw · 2019-04-13T00:11:19Z

Looks like safe fallback gets around this by removing the config (in `pretty_print`). Let me push a PR for a cloudpickle fallback...

…

On Fri, Apr 12, 2019 at 4:22 PM Eric Liang ***@***.***> wrote: There's the safe fallback encoder? On Fri, Apr 12, 2019, 3:39 PM Richard Liaw ***@***.***> wrote: > Hm, so we need to have some part of the config properly jsonified for CLI > viewing. I think the proper fix is to have the encoder fall back to > cloudpickle if fails? > > — > You are receiving this because you were assigned. > Reply to this email directly, view it on GitHub > <#4613 (comment)>, > or mute the thread > < https://github.com/notifications/unsubscribe-auth/AAA6SniWzkPZ5B5TGSAvB6Edp7qRY7hUks5vgQsOgaJpZM4csNM7 > > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4613 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEUc5YOdRUO_REtt99aM3jo3Oq9X6v_Rks5vgRU3gaJpZM4csNM7> .

ericl added the regression label Apr 12, 2019

ericl self-assigned this Apr 12, 2019

ericl added the bug Something that is supposed to be working; but isn't label Apr 12, 2019

pengzhenghao closed this as completed Apr 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rllib] "TypeError: Object of type 'Box' is not JSON serializable" happen when checkpointing at Ray 0.7 #4613

[rllib] "TypeError: Object of type 'Box' is not JSON serializable" happen when checkpointing at Ray 0.7 #4613

pengzhenghao commented Apr 12, 2019

pengzhenghao commented Apr 12, 2019

pengzhenghao commented Apr 12, 2019 •

edited

Loading

ericl commented Apr 12, 2019

richardliaw commented Apr 12, 2019

ericl commented Apr 12, 2019 via email

richardliaw commented Apr 13, 2019 via email

[rllib] "TypeError: Object of type 'Box' is not JSON serializable" happen when checkpointing at Ray 0.7 #4613

[rllib] "TypeError: Object of type 'Box' is not JSON serializable" happen when checkpointing at Ray 0.7 #4613

Comments

pengzhenghao commented Apr 12, 2019

System information

pengzhenghao commented Apr 12, 2019

pengzhenghao commented Apr 12, 2019 • edited Loading

ericl commented Apr 12, 2019

richardliaw commented Apr 12, 2019

ericl commented Apr 12, 2019 via email

richardliaw commented Apr 13, 2019 via email

pengzhenghao commented Apr 12, 2019 •

edited

Loading