
Commit

chore(on-policy): update benchmark performance for first-order algorithms (#148)

Co-authored-by: borong <[email protected]>
Co-authored-by: Gaiejj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zmsn-2077 <[email protected]>
5 people authored Mar 14, 2023
1 parent 02cd790 commit 8058f78
Showing 8 changed files with 169 additions and 13 deletions.
182 changes: 169 additions & 13 deletions omnisafe/algorithms/on_policy/README.md
OmniSafe's Mujoco Velocity Benchmark evaluates the performance of OmniSafe's algorithm implementations on the Safety-Gymnasium velocity tasks.

Supported algorithms are listed below:

**First-Order**

- **[NIPS 1999]** [Policy Gradient (PG)](https://papers.nips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf)
- [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
- [The Lagrange version of PPO (PPO-Lag)](https://cdn.openai.com/safexp-short.pdf)
- **[IJCAI 2022]** [Penalized Proximal Policy Optimization for Safe Reinforcement Learning (P3O)](https://arxiv.org/pdf/2205.11814.pdf)
- **[NeurIPS 2020]** [First Order Constrained Optimization in Policy Space (FOCOPS)](https://arxiv.org/abs/2002.06506)
- **[NeurIPS 2022]** [Constrained Update Projection Approach to Safe Policy Optimization (CUP)](https://arxiv.org/abs/2209.07089)

**Second-Order**

- **[NeurIPS 2001]** [A Natural Policy Gradient (NaturalPG)](https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf)
- **[PMLR 2015]** [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477)
- [The Lagrange version of TRPO (TRPO-Lag)](https://cdn.openai.com/safexp-short.pdf)
- **[ICML 2017]** [Constrained Policy Optimization (CPO)](https://proceedings.mlr.press/v70/achiam17a)
- **[ICLR 2020]** [Projection-Based Constrained Policy Optimization (PCPO)](https://arxiv.org/abs/2010.03152)
- **[ICLR 2019]** [Reward Constrained Policy Optimization (RCPO)](https://openreview.net/forum?id=SkfrvsA9FX)

## Safety-Gymnasium

You can install Safety-Gymnasium via pip:

```bash
pip install safety_gymnasium
```
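To verify the installation, a quick smoke test may help. This is a sketch assuming Safety-Gymnasium's Gymnasium-style API, where ``step`` additionally returns a ``cost`` signal; the environment id matches those used in this benchmark:

```python
import safety_gymnasium

# build one of the velocity environments used in this benchmark
env = safety_gymnasium.make('SafetyAntVelocity-v4')
obs, info = env.reset(seed=0)
# unlike plain Gymnasium, step also returns the safety cost
obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```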

## Run the Benchmark

You can set the main function of ``examples/benchmarks/experiment_grid.py`` as:

```python
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='On-Policy-Benchmarks')

    # set up the algorithms.
    base_policy = ['PolicyGradient', 'NaturalPG', 'TRPO', 'PPO']
    naive_lagrange_policy = ['PPOLag', 'TRPOLag', 'RCPO', 'OnCRPO', 'PDO']
    first_order_policy = ['CUP', 'FOCOPS', 'P3O']
    second_order_policy = ['CPO', 'PCPO']
    eg.add('algo', base_policy + naive_lagrange_policy + first_order_policy + second_order_policy)

    # set up the environments: the six velocity tasks benchmarked below.
    eg.add('env_id', [
        'SafetyAntVelocity-v4',
        'SafetyHalfCheetahVelocity-v4',
        'SafetyHopperVelocity-v4',
        'SafetyWalker2dVelocity-v4',
        'SafetySwimmerVelocity-v4',
        'SafetyHumanoidVelocity-v4',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # the total number of experiments must be divisible by num_pool;
    # choose num_pool according to the resources of your machine.
    eg.run(train, num_pool=5)
```
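If you only want to reproduce a single cell of the benchmark rather than the whole grid, OmniSafe's Python API can train one algorithm on one environment directly. A minimal sketch, assuming the standard ``omnisafe.Agent`` entry point; the ``train_cfgs:total_steps`` key path and its value are assumptions chosen to match the 1M-step results below:

```python
import omnisafe

# train one (algorithm, environment, seed) cell of the benchmark
custom_cfgs = {
    'seed': 0,
    'train_cfgs': {'total_steps': 1_000_000},  # assumed key path; 1M steps as in the tables below
}
agent = omnisafe.Agent('PPOLag', 'SafetyAntVelocity-v4', custom_cfgs=custom_cfgs)
agent.learn()
```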


## Example benchmark

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_ant.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyAntVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_halfcheetah.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHalfCheetahVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_hopper.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHopperVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_humanoid.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHumanoidVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_walker2d.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyWalker2dVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_swimmer.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetySwimmerVelocity-v4</div>
</center>

## Experiment Analysis

### Hyperparameters

#### First-Order Methods Specific Hyperparameters

**We are continuously improving the first-order algorithms and searching for better hyperparameters, and we will release an updated version as soon as possible. Meanwhile, we are happy to receive advice from users; feel free to open a PR or an issue.**

#### Second-Order Methods Specific Hyperparameters

- ``algo_cfgs:kl_early_stop``: whether to stop the update early when the KL divergence grows too large. The second-order methods already use a line search to find a proper step size: if a candidate step pushes the KL divergence past its limit, the line search shrinks the step and falls back to the previous step size (see the sketch after this list). Since the line search already enforces the KL constraint, we always set this hyperparameter to ``False``.

- ``model_cfgs:actor:lr``: the learning rate of the actor network. The second-order methods update the policy by directly setting the parameters of the policy network rather than by gradient descent on an actor loss, so no actor learning rate is needed and it is set to ``None``.
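For intuition, here is a minimal sketch of such a backtracking line search. It is illustrative only, not OmniSafe's exact implementation; ``loss_fn`` and ``kl_fn`` stand in for the surrogate loss and the KL divergence between the old and candidate policies:

```python
import torch

def backtracking_line_search(params, step_dir, loss_fn, kl_fn,
                             max_kl=0.01, max_backtracks=10):
    """Shrink a candidate natural-gradient step until the KL trust region holds."""
    old_loss = loss_fn(params)
    for i in range(max_backtracks):
        frac = 0.5 ** i                       # halve the step on each backtrack
        candidate = params + frac * step_dir
        # accept only if KL stays inside the trust region and the loss improves
        if kl_fn(candidate) <= max_kl and loss_fn(candidate) < old_loss:
            return candidate
    return params                             # no acceptable step: keep the old params

# toy usage: quadratic loss, squared distance as a stand-in for KL
params = torch.zeros(2)
step_dir = torch.tensor([1.0, 1.0])
new_params = backtracking_line_search(
    params, step_dir,
    loss_fn=lambda p: ((p - 0.1) ** 2).sum(),
    kl_fn=lambda p: ((p - params) ** 2).sum(),
)
```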

### Some Hints

In our experiments, we found that some hyperparameters, such as ``obs_normlize`` and ``rew_normlize``, are important for the performance of the algorithms.

We have done some experiments to show the effect of these hyperparameters, and we log the best configuration for each algorithm in each environment; you can find them in ``omnisafe/configs/on_policy``.

In our experiments, we found that ``obs_normlize=True`` always performs better than ``obs_normlize=False`` in the second-order methods: the reward increases faster when the observations are normalized. So we set ``obs_normlize=True`` in almost all the second-order methods.

Importantly, we found that ``rew_normlize=True`` does not always perform better than ``rew_normlize=False``, especially in the ``SafetyHopperVelocity-v4`` and ``SafetyWalker2dVelocity-v4`` environments.
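To check the effect of these flags yourself, you can sweep them with the same ``ExperimentGrid`` tool shown above. This is a sketch assuming the same imports and ``train`` function as ``examples/benchmarks/experiment_grid.py``; the ``algo_cfgs:obs_normlize`` and ``algo_cfgs:rew_normlize`` key paths follow the notation used in this section, but the exact config layout is an assumption:

```python
eg = ExperimentGrid(exp_name='Normalization-Ablation')
eg.add('algo', ['CPO'])
eg.add('env_id', ['SafetyHopperVelocity-v4', 'SafetyWalker2dVelocity-v4'])
eg.add('algo_cfgs:obs_normlize', [True, False])  # assumed key path
eg.add('algo_cfgs:rew_normlize', [True, False])  # assumed key path
eg.add('seed', [0, 5, 10])
# 2 envs x 2 x 2 x 3 seeds = 24 experiments, divisible by num_pool=4
eg.run(train, num_pool=4)
```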

This hyperparameter (``num_pool``) depends on the number of CPU cores of your machine. We set it to 8 in our experiments.
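For example, a small helper like the one below (not part of OmniSafe) picks the largest pool size that both divides the experiment count and fits within the machine's CPU cores:

```python
import os

def pick_num_pool(total_experiments: int) -> int:
    """Largest divisor of the experiment count that fits the CPU budget."""
    cores = os.cpu_count() or 1
    return max(d for d in range(1, cores + 1) if total_experiments % d == 0)

# 14 algorithms x 6 environments x 5 seeds = 420 experiments in the grid above
print(pick_num_pool(420))  # e.g. 7 on an 8-core machine
```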

If you find that other hyperparameters perform better, please feel free to open an issue or pull request.

### First-Order Algorithms Experiment Results

#### PG(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1128.4±654.6 | 155.0±96.5 |
| SafetyHalfCheetahVelocity-v4 | 1700.2±902.4 | 422.2±234.1 |
| SafetyHopperVelocity-v4 | 674.4±127.2 | 180.5±26.4 |
| SafetyWalker2dVelocity-v4 | 624.2±301.4 | 125.8±67.5 |
| SafetySwimmerVelocity-v4 | 37.7±8.2 | 695.0±230.3 |
| SafetyHumanoidVelocity-v4 | 612.7±131.6 | 38.9±17.8 |

#### PPO(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 3012.2±1167.0 | 618.3±255.0 |
| SafetyHalfCheetahVelocity-v4 | 3641.1±1202.3 | 812.8±219.1 |
| SafetyHopperVelocity-v4 | 685.2±132.8 | 170.2±25.7 |
| SafetyWalker2dVelocity-v4 | 723.0±175.3 | 141.0±30.8 |
| SafetySwimmerVelocity-v4 | 52.4±19.9 | 472.9±300.3 |
| SafetyHumanoidVelocity-v4 | 633.3±128.7 | 45.9±16.1 |

#### PPOLag(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 2256.6±315.1 | 29.8±54.7 |
| SafetyHalfCheetahVelocity-v4 | 2065.5±234.5 | 4.7±5.1 |
| SafetyHopperVelocity-v4 | 415.8±367.9 | 47.2±28.4 |
| SafetyWalker2dVelocity-v4 | 310.4±44.7 | 19.9±9.9 |
| SafetySwimmerVelocity-v4 | 22.0±7.8 | 63.2±16.3 |
| SafetyHumanoidVelocity-v4 | 623.0±173.7 | 17.0±19.7 |

#### P3O(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1837.5±331.2 | 35.5±28.2 |
| SafetyHalfCheetahVelocity-v4 | 1251.2±117.4 | 14.7±15.3 |
| SafetyHopperVelocity-v4 | 779.0±383.2 | 21.4±13.9 |
| SafetyWalker2dVelocity-v4 | 1493.1±515.5 | 27.9±26.7 |
| SafetySwimmerVelocity-v4 | -8.8±14.3 | 125.0±58.5 |
| SafetyHumanoidVelocity-v4 | 1027.3±404.7 | 0.4±2.0 |

#### FOCOPS(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 2022.0±226.6 | 4.5±5.6 |
| SafetyHalfCheetahVelocity-v4 | 1759.8±414.4 | 31.3±55.2 |
| SafetyHopperVelocity-v4 | 255.4±190.0 | 10.2±12.4 |
| SafetyWalker2dVelocity-v4 | 346.3±100.2 | 22.1±16.1 |
| SafetySwimmerVelocity-v4 | 9.0±17.1 | 86.6±80.8 |
| SafetyHumanoidVelocity-v4 | 703.5±188.0 | 14.4±16.5 |

#### CUP(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1530.4±723.3 | 60.6±48.4 |
| SafetyHalfCheetahVelocity-v4 | 1217.6±288.0 | 15.2±14.6 |
| SafetyHopperVelocity-v4 | 249.8±308.5 | 32.2±21.1 |
| SafetyWalker2dVelocity-v4 | 673.3±608.6 | 22.2±21.6 |
| SafetySwimmerVelocity-v4 | 1.2±19.3 | 113.9±57.0 |
| SafetyHumanoidVelocity-v4 | 535.0±78.2 | 16.3±13.6 |



### Second-Order Algorithms Experiment Results

#### NaturalPG(1M)
