[Feature] Align and simplify policy/ML evaluation in ManiSkill (#544)
* work

* standard metrics

* code simplification

* update ppo baseline code

* Update bc.py

* work

* make push t more markov

* work

* work

* Update push_t.py

* Update push_t.py

* fixes

* better hparams for ppo for quadruped reach

* clean up bc code

* fixes

* Update README.md

* code cleanup

* fixes

* fixes

* fixes

* docs update

* Update README.md

* Update replay_trajectory.py
StoneT2000 authored Sep 5, 2024
1 parent 510393b commit 5753e3a
Showing 31 changed files with 1,019 additions and 864 deletions.
14 changes: 1 addition & 13 deletions docs/source/user_guide/reinforcement_learning/baselines.md
@@ -42,16 +42,4 @@ add large collage image of all tasks

## Evaluation

Since GPU simulation is available, there are a few differences compared to past ManiSkill versions / CPU-based gym environments. Namely, for efficiency, environments by default do not *reconfigure* on each environment reset. Reconfiguration allows the environment to randomize loaded assets, which is necessary for some tasks that procedurally generate objects (PegInsertionSide-v1) or sample random real-world objects (PickSingleYCB-v1).

Thus, for a fairer comparison between different RL algorithms, when evaluating an RL policy the environment must reconfigure on reset and have partial resets turned off (i.e. environments do not reset upon success/fail/termination, only upon episode truncation).

For vectorized environments, the code to create a correct evaluation environment for a given environment ID looks like this:

```python
env_id = "PickCube-v1"
num_eval_envs = 16
env_kwargs = dict(obs_mode="state")
eval_envs = gym.make(env_id, num_envs=num_eval_envs, reconfiguration_freq=1, **env_kwargs)
eval_envs = ManiSkillVectorEnv(eval_envs, ignore_terminations=True)
```
For proper evaluation of RL policies, see how that code is set up in the [evaluation section in the RL setup page](./setup.md#evaluation). All results linked above follow the same evaluation setup.
74 changes: 74 additions & 0 deletions docs/source/user_guide/reinforcement_learning/setup.md
@@ -3,6 +3,7 @@
This page documents key things to know when setting up ManiSkill environments for reinforcement learning, including:

- How to convert ManiSkill environments to gymnasium API compatible environments, both [single](#gym-environment-api) and [vectorized](#gym-vectorized-environment-api) APIs.
- How to [**correctly** evaluate RL policies fairly](#evaluation)
- [Useful Wrappers](#useful-wrappers)

ManiSkill environments are created by gymnasium's `make` function. The result is by default a "batched" environment where every input and output is batched. Note that this is not the standard gymnasium API. If you want the standard gymnasium environment / vectorized environment API, see the next sections.
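
As a minimal sketch of what this batched API looks like (the environment ID and options here are only illustrative, and `num_envs > 1` assumes the GPU simulation backend is available):

```python
import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments with gymnasium

# a single env handle whose inputs/outputs all carry a leading batch dimension
env = gym.make("PickCube-v1", num_envs=4, obs_mode="state")
obs, _ = env.reset(seed=0)
print(obs.shape)  # e.g. torch.Size([4, obs_dim]) -- batched even though this is one env object
env.close()
```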
@@ -61,6 +62,79 @@ You may also notice that there are two additional options when creating a vector

Note that for efficiency, everything returned by the environment will be a batched torch tensor on the GPU and not a batched numpy array on the CPU. This is the only difference you may need to account for between ManiSkill vectorized environments and gymnasium vectorized environments.
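
As a rough illustration of this (assuming `envs` is a ManiSkill vectorized environment created as in the evaluation snippets below), the outputs of `step` are batched torch tensors that you may need to move off the GPU yourself if downstream code expects numpy arrays:

```python
# outputs are batched torch tensors on the simulation device (e.g. cuda:0), not numpy arrays
obs, rew, terminated, truncated, info = envs.step(envs.action_space.sample())
print(type(rew), rew.shape, rew.device)  # e.g. <class 'torch.Tensor'> torch.Size([num_envs]) cuda:0
rew_np = rew.cpu().numpy()  # convert only where your training/logging code needs numpy
```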

## Evaluation

Given the many different environments, algorithms, and approaches to evaluation, we describe below a consistent, standardized way to fairly evaluate all kinds of policies in ManiSkill. In summary, the following setup is necessary to ensure fair evaluation:

- Partial resets are turned off and environments do not reset upon success/fail/termination (`ignore_terminations=True`). Instead we record multiple types of success/fail metrics.
- All parallel environments reconfigure on reset (`reconfiguration_freq=1`), which randomizes object geometries if the task has object randomization.


The code to fairly evaluate policies and record standard metrics in ManiSkill is shown below. For GPU vectorized environments, the following setup is recommended to evaluate policies for a given environment ID:

```python
from collections import defaultdict

import gymnasium as gym
import torch

from mani_skill.vector.wrappers.gymnasium import ManiSkillVectorEnv

env_id = "PushCube-v1"
num_eval_envs = 64
env_kwargs = dict(obs_mode="state") # modify your env_kwargs here
eval_envs = gym.make(env_id, num_envs=num_eval_envs, reconfiguration_freq=1, **env_kwargs)
# add any other wrappers here
eval_envs = ManiSkillVectorEnv(eval_envs, ignore_terminations=True, record_metrics=True)

# evaluation loop, which will record metrics for complete episodes only
obs, _ = eval_envs.reset(seed=0)
eval_metrics = defaultdict(list)
for _ in range(400):
    action = eval_envs.action_space.sample() # replace with your policy action
    obs, rew, terminated, truncated, info = eval_envs.step(action)
    # note: since there are no partial resets, truncated is True for all environments at the same time
    if truncated.any():
        for k, v in info["final_info"]["episode"].items():
            eval_metrics[k].append(v.float())
for k in eval_metrics.keys():
    print(f"{k}_mean: {torch.mean(torch.stack(eval_metrics[k])).item()}")
```

And for CPU vectorized environments, the following is recommended for evaluation:

```python
from collections import defaultdict

import gymnasium as gym
import numpy as np

from mani_skill.utils.wrappers import CPUGymWrapper

env_id = "PickCube-v1"
num_eval_envs = 8
env_kwargs = dict(obs_mode="state") # modify your env_kwargs here

def cpu_make_env(env_id, env_kwargs = dict()):
    def thunk():
        env = gym.make(env_id, reconfiguration_freq=1, **env_kwargs)
        env = CPUGymWrapper(env, ignore_terminations=True, record_metrics=True)
        # add any other wrappers here
        return env
    return thunk

vector_cls = gym.vector.SyncVectorEnv if num_eval_envs == 1 else lambda x: gym.vector.AsyncVectorEnv(x, context="forkserver")
eval_envs = vector_cls([cpu_make_env(env_id, env_kwargs) for _ in range(num_eval_envs)])

# evaluation loop, which will record metrics for complete episodes only
obs, _ = eval_envs.reset(seed=0)
eval_metrics = defaultdict(list)
for _ in range(400):
    action = eval_envs.action_space.sample() # replace with your policy action
    obs, rew, terminated, truncated, info = eval_envs.step(action)
    # note: since there are no partial resets, truncated is True for all environments at the same time
    if truncated.any():
        for final_info in info["final_info"]:
            for k, v in final_info["episode"].items():
                eval_metrics[k].append(v)
for k in eval_metrics.keys():
    print(f"{k}_mean: {np.mean(eval_metrics[k])}")
```

The following metrics are recorded:
- `success_once`: Whether the task was successful at any point in the episode.
- `success_at_end`: Whether the task was successful at the final step of the episode.
- `fail_once/fail_at_end`: Same as the above two but for failures. Note that not all tasks have success/fail criteria.
- `return`: The total reward accumulated over the course of the episode.
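
If you are not using `record_metrics=True`, similar quantities can be tracked by hand. Below is a rough sketch (not the wrapper's actual implementation), assuming the GPU vectorized `eval_envs` from above, a hypothetical `policy` callable mapping batched observations to batched actions, and that the task reports a boolean `success` tensor in `info`:

```python
import torch

obs, _ = eval_envs.reset(seed=0)
returns, success_once, success_at_end = None, None, None
for _ in range(400):
    obs, rew, terminated, truncated, info = eval_envs.step(policy(obs))
    returns = rew.clone() if returns is None else returns + rew  # per-env accumulated reward
    if "success" in info:  # tasks with a success criterion report a boolean tensor here
        success_at_end = info["success"]  # updated every step with the latest success flags
        success_once = success_at_end.clone() if success_once is None else success_once | success_at_end
print("return_mean:", returns.mean().item())
if success_once is not None:
    print("success_once_mean:", success_once.float().mean().item())
    print("success_at_end_mean:", success_at_end.float().mean().item())
```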

## Useful Wrappers

RL practitioners often use wrappers to modify and augment environments. These are documented in the [wrappers](../wrappers/index.md) section. Some commonly used ones include:
4 changes: 4 additions & 0 deletions examples/baselines/bc/.gitignore
@@ -0,0 +1,4 @@
__pycache__/
runs/
wandb/
*.egg-info/
50 changes: 34 additions & 16 deletions examples/baselines/bc/README.md
@@ -2,35 +2,53 @@

This behavior cloning implementation is adapted from [here](https://github.com/corl-team/CORL/blob/main/algorithms/offline/any_percent_bc.py).

## Running the script
## Installation

1. Install dependencies
To get started, we recommend using conda/mamba to create a new environment and install the dependencies:

```shell
# removed in this commit:
pip install tyro wandb

# added in this commit:
conda create -n behavior-cloning-ms python=3.9
conda activate behavior-cloning-ms
pip install -e .
```

2. Download trajectories for the selected task.
## Demonstration Download and Preprocessing

By default, for fast downloads and smaller file sizes, ManiSkill demonstrations are stored in a highly reduced/compressed format that does not keep any observation data. Run the following commands to download the demonstrations and then convert them to a format that includes observation data and the desired action space.

```shell
python -m mani_skill.utils.download_demo "PickCube-v1"
python -m mani_skill.utils.download_demo "PushCube-v1"
```

3. Replay the trajectories with the correct control mode. The example below performs this on the Pick Cube task with the `pd_ee_delta_pose` control mode and state observations. For the RGBD example, change `state` to `rgbd` to record the correct type of observations.

```shell
env_id="PickCube-v1"
python -m mani_skill.trajectory.replay_trajectory \
--traj-path ~/.maniskill/demos/${env_id}/motionplanning/trajectory.h5 \
--use-first-env-state --allow-failure \
-c pd_ee_delta_pose -o state \
--save-traj --num-procs 4 -b cpu
--traj-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.h5 \
--use-first-env-state -c pd_ee_delta_pos -o state \
--save-traj --num-procs 10 -b cpu
```

Set `-o` to `rgbd` for RGBD observations. Note that the control mode can heavily influence how well Behavior Cloning performs. By default we recommend `pd_joint_delta_pos` as the control mode, since all tasks can be solved with it, although it is harder to learn with BC than `pd_ee_delta_pos` or `pd_ee_delta_pose` for robots that support those control modes. Finally, the type of demonstration data also impacts performance: neural-network-generated demonstrations are typically easier to learn from than human- or motion-planning-generated demonstrations.

## Training

We provide scripts for state-based and RGBD-based training. Make sure to use the same simulation backend that the demonstrations were collected with.

Moreover, some demonstrations are slow and can exceed the default max episode steps. In this case, you can use the `--max-episode-steps` flag to set a higher value. Most of the time 2x the default value is sufficient.

For state-based training:

```shell
python bc.py --env-id "PushCube-v1" \
--demo-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.state.pd_ee_delta_pos.cpu.h5 \
--control-mode "pd_ee_delta_pos" --sim-backend "cpu" --max-episode-steps 100 \
--total-iters 10000
```

4. Run the script and modify the necessary arguments. A full list of arguments can be found in both of the files.
For RGBD-based training:

```shell
# removed in this commit:
python bc.py --env "PickCube-v1" \
  --demo-path ~/.maniskill/demos/PickCube-v1/motionplanning/trajectory.state.pd_ee_delta_pose.cpu.h5 \
  --video --wandb --sim-backend "cpu"

# added in this commit:
python bc_rgbd.py --env-id "PushCube-v1" \
  --demo-path ~/.maniskill/demos/PushCube-v1/motionplanning/trajectory.state.pd_ee_delta_pos.cpu.h5 \
  --control-mode "pd_ee_delta_pos" --sim-backend "cpu" --max-episode-steps 100 \
  --total-iters 10000
```