[Feature] Align and simplify policy/ML evaluation in ManiSkill (#544)
* work

* standard metrics

* code simplification

* update ppo baseline code

* Update bc.py

* work

* make push t more markov

* work

* work

* Update push_t.py

* Update push_t.py

* fixes

* better hparams for ppo for quadruped reach

* clean up bc code

* fixes

* Update README.md

* code cleanup

* fixes

* fixes

* fixes

* docs update

* Update README.md

* Update replay_trajectory.py
StoneT2000 authored Sep 5, 2024
1 parent 510393b commit 5753e3a
Showing 31 changed files with 1,019 additions and 864 deletions.
14 changes: 1 addition & 13 deletions docs/source/user_guide/reinforcement_learning/baselines.md
@@ -42,16 +42,4 @@ add large collage image of all tasks

## Evaluation

Since GPU simulation is available, there are a few differences compared to past ManiSkill versions / CPU-based gym environments. Namely, for efficiency, environments by default do not *reconfigure* on each environment reset. Reconfiguration allows the environment to randomize loaded assets, which is necessary for some tasks that procedurally generate objects (PegInsertionSide-v1) or sample random real-world objects (PickSingleYCB-v1).

Thus, for a fairer comparison between different RL algorithms, when evaluating an RL policy the environment must reconfigure on reset and have partial resets turned off (i.e. environments do not reset upon success/fail/termination, only upon episode truncation).

For vectorized environments, the code to create a correct evaluation environment for a given environment ID looks like this:

```python
env_id = "PickCube-v1"
num_eval_envs = 16
env_kwargs = dict(obs_mode="state")
eval_envs = gym.make(env_id, num_envs=num_eval_envs, reconfiguration_freq=1, **env_kwargs)
eval_envs = ManiSkillVectorEnv(eval_envs, ignore_terminations=True)
```
For proper evaluation of RL policies, see how that code is set up in the [evaluation section in the RL setup page](./setup.md#evaluation). All results linked above follow the same evaluation setup.
74 changes: 74 additions & 0 deletions docs/source/user_guide/reinforcement_learning/setup.md
@@ -3,6 +3,7 @@
This page documents key things to know when setting up ManiSkill environments for reinforcement learning, including:

- How to convert ManiSkill environments to gymnasium API compatible environments, both [single](#gym-environment-api) and [vectorized](#gym-vectorized-environment-api) APIs.
- How to [**correctly** evaluate RL policies fairly](#evaluation)
- [Useful Wrappers](#useful-wrappers)

ManiSkill environments are created by gymnasium's `make` function. The result is by default a "batched" environment where every input and output is batched. Note that this is not the standard gymnasium API. If you want the standard gymnasium environment / vectorized environment API, see the next sections.
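
As a minimal sketch of what this batched API looks like (the environment ID and options here are only illustrative, and `num_envs > 1` assumes the GPU simulation backend is available):

```python
import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments with gymnasium

# a single env handle whose inputs/outputs all carry a leading batch dimension
env = gym.make("PickCube-v1", num_envs=4, obs_mode="state")
obs, _ = env.reset(seed=0)
print(obs.shape)  # e.g. torch.Size([4, obs_dim]) -- batched even though this is one env object
env.close()
```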
@@ -61,6 +62,79 @@ You may also notice that there are two additional options when creating a vector

Note that for efficiency, everything returned by the environment will be a batched torch tensor on the GPU and not a batched numpy array on the CPU. This is the only difference you may need to account for between ManiSkill vectorized environments and gymnasium vectorized environments.
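
As a rough illustration of this (assuming `envs` is a ManiSkill vectorized environment created as in the evaluation snippets below), the outputs of `step` are batched torch tensors that you may need to move off the GPU yourself if downstream code expects numpy arrays:

```python
# outputs are batched torch tensors on the simulation device (e.g. cuda:0), not numpy arrays
obs, rew, terminated, truncated, info = envs.step(envs.action_space.sample())
print(type(rew), rew.shape, rew.device)  # e.g. <class 'torch.Tensor'> torch.Size([num_envs]) cuda:0
rew_np = rew.cpu().numpy()  # convert only where your training/logging code needs numpy
```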

## Evaluation

Given the many different environments, algorithms, and approaches to evaluation, we describe below a consistent, standardized way to fairly evaluate all kinds of policies in ManiSkill. In summary, the following setup is necessary to ensure fair evaluation:

- Partial resets are turned off and environments do not reset upon success/fail/termination (`ignore_terminations=True`). Instead we record multiple types of success/fail metrics.
- All parallel environments reconfigure on reset (`reconfiguration_freq=1`), which randomizes object geometries if the task has object randomization.


The code to fairly evaluate policies and record standard metrics in ManiSkill is shown below. For GPU vectorized environments, the following setup is recommended to evaluate policies for a given environment ID:

```python
from collections import defaultdict

import gymnasium as gym
import torch

from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv

env_id = "PushCube-v1"
num_eval_envs = 64
env_kwargs = dict(obs_mode="state") # modify your env_kwargs here
eval_envs = gym.make(env_id, num_envs=num_eval_envs, reconfiguration_freq=1, **env_kwargs)
# add any other wrappers here
eval_envs = ManiSkillVectorEnv(eval_envs, ignore_terminations=True, record_metrics=True)

# evaluation loop, which will record metrics for complete episodes only
obs, _ = eval_envs.reset(seed=0)
eval_metrics = defaultdict(list)
for _ in range(400):
    action = eval_envs.action_space.sample() # replace with your policy action
    obs, rew, terminated, truncated, info = eval_envs.step(action)
    # note: since there are no partial resets, truncated is True for all environments at the same time
    if truncated.any():
        for k, v in info["final_info"]["episode"].items():
            eval_metrics[k].append(v.float())
for k in eval_metrics.keys():
    print(f"{k}_mean: {torch.mean(torch.stack(eval_metrics[k])).item()}")
```

And for CPU vectorized environments, the following is recommended for evaluation:

```python
from collections import defaultdict

import gymnasium as gym
import numpy as np

from mani_skill.utils.wrappers import CPUGymWrapper

env_id = "PickCube-v1"
num_eval_envs = 8
env_kwargs = dict(obs_mode="state") # modify your env_kwargs here

def cpu_make_env(env_id, env_kwargs = dict()):
    def thunk():
        env = gym.make(env_id, reconfiguration_freq=1, **env_kwargs)
        env = CPUGymWrapper(env, ignore_terminations=True, record_metrics=True)
        # add any other wrappers here
        return env
    return thunk

vector_cls = gym.vector.SyncVectorEnv if num_eval_envs == 1 else lambda x: gym.vector.AsyncVectorEnv(x, context="forkserver")
eval_envs = vector_cls([cpu_make_env(env_id, env_kwargs) for _ in range(num_eval_envs)])

# evaluation loop, which will record metrics for complete episodes only
obs, _ = eval_envs.reset(seed=0)
eval_metrics = defaultdict(list)
for _ in range(400):
    action = eval_envs.action_space.sample() # replace with your policy action
    obs, rew, terminated, truncated, info = eval_envs.step(action)
    # note: since there are no partial resets, truncated is True for all environments at the same time
    if truncated.any():
        for final_info in info["final_info"]:
            for k, v in final_info["episode"].items():
                eval_metrics[k].append(v)
for k in eval_metrics.keys():
    print(f"{k}_mean: {np.mean(eval_metrics[k])}")
```

The following metrics are recorded:
- `success_once`: Whether the task was successful at any point in the episode.
- `success_at_end`: Whether the task was successful at the final step of the episode.
- `fail_once/fail_at_end`: Same as the above two but for failures. Note that not all tasks have success/fail criteria.
- `return`: The total reward accumulated over the course of the episode.
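
If you are not using `record_metrics=True`, similar quantities can be tracked by hand. Below is a rough sketch (not the wrapper's actual implementation), assuming the GPU vectorized `eval_envs` from above, a hypothetical `policy` callable mapping batched observations to batched actions, and that the task reports a boolean `success` tensor in `info`:

```python
import torch

obs, _ = eval_envs.reset(seed=0)
returns, success_once, success_at_end = None, None, None
for _ in range(400):
    obs, rew, terminated, truncated, info = eval_envs.step(policy(obs))
    returns = rew.clone() if returns is None else returns + rew  # per-env accumulated reward
    if "success" in info:  # tasks with a success criterion report a boolean tensor here
        success_at_end = info["success"]  # updated every step with the latest success flags
        success_once = success_at_end.clone() if success_once is None else success_once | success_at_end
print("return_mean:", returns.mean().item())
if success_once is not None:
    print("success_once_mean:", success_once.float().mean().item())
    print("success_at_end_mean:", success_at_end.float().mean().item())
```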

## Useful Wrappers

RL practitioners often use wrappers to modify and augment environments. These are documented in the [wrappers](../wrappers/index.md) section. Some commonly used ones include:
4 changes: 4 additions & 0 deletions examples/baselines/bc/.gitignore
@@ -0,0 +1,4 @@
__pycache__/
runs/
wandb/
*.egg-info/
50 changes: 34 additions & 16 deletions examples/baselines/bc/README.md
@@ -2,35 +2,53 @@

This behavior cloning implementation is adapted from [here](https://github.com/corl-team/CORL/blob/main/algorithms/offline/any_percent_bc.py).

## Running the script
## Installation

1. Install dependencies
To get started, we recommend using conda/mamba to create a new environment and install the dependencies:

```shell
# removed in this commit:
pip install tyro wandb

# added in this commit:
conda create -n behavior-cloning-ms python=3.9
conda activate behavior-cloning-ms
pip install -e .
```

2. Download trajectories for the selected task.
## Demonstration Download and Preprocessing

By default, for fast downloads and smaller file sizes, ManiSkill demonstrations are stored in a highly reduced/compressed format that does not keep any observation data. Run the following commands to download the demonstrations and then convert them to a format that includes observation data and the desired action space.

```shell
python -m mani_skill.utils.download_demo "PickCube-v1"
python -m mani_skill.utils.download_demo "PushCube-v1"
```

3. Replay the trajectories with the correct control mode. The example below performs this on the Pick Cube task with the `pd_ee_delta_pose` control mode and state observations. For the RGBD example, change `state` to `rgbd` to record the correct type of observations.

```shell
env_id="PickCube-v1"
python -m mani_skill.trajectory.replay_trajectory \
--traj-path ~/.maniskill/demos/${env_id}/motionplanning/trajectory.h5 \
--use-first-env-state --allow-failure \
-c pd_ee_delta_pose -o state \
--save-traj --num-procs 4 -b cpu
--traj-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.h5 \
--use-first-env-state -c pd_ee_delta_pos -o state \
--save-traj --num-procs 10 -b cpu
```

Set `-o` to `rgbd` for RGBD observations. Note that the control mode can heavily influence how well Behavior Cloning performs. By default we recommend `pd_joint_delta_pos` as the control mode, since all tasks can be solved with it, although it is harder to learn with BC than `pd_ee_delta_pos` or `pd_ee_delta_pose` for robots that support those control modes. Finally, the type of demonstration data also impacts performance: neural-network-generated demonstrations are typically easier to learn from than human- or motion-planning-generated demonstrations.

## Training

We provide scripts for state-based and RGBD-based training. Make sure to use the same simulation backend that the demonstrations were collected with.

Moreover, some demonstrations are slow and can exceed the default max episode steps. In this case, you can use the `--max-episode-steps` flag to set a higher value. Most of the time 2x the default value is sufficient.

For state-based training:

```shell
python bc.py --env-id "PushCube-v1" \
--demo-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.state.pd_ee_delta_pos.cpu.h5 \
--control-mode "pd_ee_delta_pos" --sim-backend "cpu" --max-episode-steps 100 \
--total-iters 10000
```

4. Run the script and modify the necessary arguments. A full list of arguments can be found in both of the files.
For RGBD-based training:

```shell
# removed in this commit:
python bc.py --env "PickCube-v1" \
  --demo-path ~/.maniskill/demos/PickCube-v1/motionplanning/trajectory.state.pd_ee_delta_pose.cpu.h5 \
  --video --wandb --sim-backend "cpu"

# added in this commit:
python bc_rgbd.py --env-id "PushCube-v1" \
  --demo-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.state.pd_ee_delta_pos.cpu.h5 \
  --control-mode "pd_ee_delta_pos" --sim-backend "cpu" --max-episode-steps 100 \
  --total-iters 10000
```