
Commit

chore(on-policy): update benchmark performance for first-order algorithms (#148)

Co-authored-by: borong <[email protected]>
Co-authored-by: Gaiejj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zmsn-2077 <[email protected]>
5 people authored Mar 14, 2023
1 parent 02cd790 commit 8058f78
Showing 8 changed files with 169 additions and 13 deletions.
182 changes: 169 additions & 13 deletions omnisafe/algorithms/on_policy/README.md
OmniSafe's Mujoco Velocity Benchmark evaluates the performance of OmniSafe's algorithm implementations on the Safety-Gymnasium velocity tasks.

Supported algorithms are listed below:

**First-Order**

- **[NIPS 1999]** [Policy Gradient (PG)](https://papers.nips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf)
- [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
- [The Lagrange version of PPO (PPO-Lag)](https://cdn.openai.com/safexp-short.pdf)
- **[IJCAI 2022]** [Penalized Proximal Policy Optimization for Safe Reinforcement Learning (P3O)](https://arxiv.org/pdf/2205.11814.pdf)
- **[NeurIPS 2020]** [First Order Constrained Optimization in Policy Space (FOCOPS)](https://arxiv.org/abs/2002.06506)
- **[NeurIPS 2022]** [Constrained Update Projection Approach to Safe Policy Optimization (CUP)](https://arxiv.org/abs/2209.07089)

**Second-Order**

- **[NeurIPS 2001]** [A Natural Policy Gradient (NaturalPG)](https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf)
- **[PMLR 2015]** [Trust Region Policy Optimization (TRPO)](https://arxiv.org/abs/1502.05477)
- [The Lagrange version of TRPO (TRPO-Lag)](https://cdn.openai.com/safexp-short.pdf)
- **[ICML 2017]** [Constrained Policy Optimization (CPO)](https://proceedings.mlr.press/v70/achiam17a)
- **[ICLR 2020]** [Projection-Based Constrained Policy Optimization (PCPO)](https://arxiv.org/abs/2010.03152)
- **[ICLR 2019]** [Reward Constrained Policy Optimization (RCPO)](https://openreview.net/forum?id=SkfrvsA9FX)

## Safety-Gymnasium

You can install Safety-Gymnasium via pip:

```bash
pip install safety_gymnasium
```
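To verify the installation, a quick smoke test may help. This is a sketch assuming Safety-Gymnasium's Gymnasium-style API, where ``step`` additionally returns a ``cost`` signal; the environment id matches those used in this benchmark:

```python
import safety_gymnasium

# build one of the velocity environments used in this benchmark
env = safety_gymnasium.make('SafetyAntVelocity-v4')
obs, info = env.reset(seed=0)
# unlike plain Gymnasium, step also returns the safety cost
obs, reward, cost, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```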

## Run the Benchmark

You can set the main function of ``examples/benchmarks/experiment_grid.py`` as:

```python
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='On-Policy-Benchmarks')

    # set up the algorithms.
    base_policy = ['PolicyGradient', 'NaturalPG', 'TRPO', 'PPO']
    naive_lagrange_policy = ['PPOLag', 'TRPOLag', 'RCPO', 'OnCRPO', 'PDO']
    first_order_policy = ['CUP', 'FOCOPS', 'P3O']
    second_order_policy = ['CPO', 'PCPO']
    eg.add('algo', base_policy + naive_lagrange_policy + first_order_policy + second_order_policy)

    # set up the environments: the six velocity tasks benchmarked below.
    eg.add('env_id', [
        'SafetyAntVelocity-v4',
        'SafetyHalfCheetahVelocity-v4',
        'SafetyHopperVelocity-v4',
        'SafetyWalker2dVelocity-v4',
        'SafetySwimmerVelocity-v4',
        'SafetyHumanoidVelocity-v4',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # the total number of experiments must be divisible by num_pool;
    # choose num_pool according to the resources of your machine.
    eg.run(train, num_pool=5)
```
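If you only want to reproduce a single cell of the benchmark rather than the whole grid, OmniSafe's Python API can train one algorithm on one environment directly. A minimal sketch, assuming the standard ``omnisafe.Agent`` entry point; the ``train_cfgs:total_steps`` key path and its value are assumptions chosen to match the 1M-step results below:

```python
import omnisafe

# train one (algorithm, environment, seed) cell of the benchmark
custom_cfgs = {
    'seed': 0,
    'train_cfgs': {'total_steps': 1_000_000},  # assumed key path; 1M steps as in the tables below
}
agent = omnisafe.Agent('PPOLag', 'SafetyAntVelocity-v4', custom_cfgs=custom_cfgs)
agent.learn()
```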


## Example benchmark

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_ant.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyAntVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_halfcheetah.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHalfCheetahVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_hopper.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHopperVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_humanoid.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyHumanoidVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_walker2d.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetyWalker2dVelocity-v4</div>
</center>

<center>
<img style="border-radius: 0.3125em;
box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
src="./benchmarks/first_order_swimmer.png">
<br>
<div style="color:orange; border-bottom: 1px solid #d9d9d9;
display: inline-block;
color: #999;
padding: 2px;">SafetySwimmerVelocity-v4</div>
</center>

## Experiment Analysis

### Hyperparameters

#### First-Order Methods Specific Hyperparameters

**We are continuously improving the first-order algorithms and searching for better hyperparameters, and we will release an updated version as soon as possible. Meanwhile, we are happy to receive advice from users; feel free to open a PR or an issue.**

#### Second-Order Methods Specific Hyperparameters

- ``algo_cfgs:kl_early_stop``: whether to stop the update early when the KL divergence grows too large. The second-order methods already use a line search to find a proper step size: if a candidate step pushes the KL divergence past its limit, the line search shrinks the step and falls back to the previous step size (see the sketch after this list). Since the line search already enforces the KL constraint, we always set this hyperparameter to ``False``.

- ``model_cfgs:actor:lr``: the learning rate of the actor network. The second-order methods update the policy by directly setting the parameters of the policy network rather than by gradient descent on an actor loss, so no actor learning rate is needed and it is set to ``None``.
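For intuition, here is a minimal sketch of such a backtracking line search. It is illustrative only, not OmniSafe's exact implementation; ``loss_fn`` and ``kl_fn`` stand in for the surrogate loss and the KL divergence between the old and candidate policies:

```python
import torch

def backtracking_line_search(params, step_dir, loss_fn, kl_fn,
                             max_kl=0.01, max_backtracks=10):
    """Shrink a candidate natural-gradient step until the KL trust region holds."""
    old_loss = loss_fn(params)
    for i in range(max_backtracks):
        frac = 0.5 ** i                       # halve the step on each backtrack
        candidate = params + frac * step_dir
        # accept only if KL stays inside the trust region and the loss improves
        if kl_fn(candidate) <= max_kl and loss_fn(candidate) < old_loss:
            return candidate
    return params                             # no acceptable step: keep the old params

# toy usage: quadratic loss, squared distance as a stand-in for KL
params = torch.zeros(2)
step_dir = torch.tensor([1.0, 1.0])
new_params = backtracking_line_search(
    params, step_dir,
    loss_fn=lambda p: ((p - 0.1) ** 2).sum(),
    kl_fn=lambda p: ((p - params) ** 2).sum(),
)
```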

### Some Hints

In our experiments, we found that some hyperparameters, such as ``obs_normlize`` and ``rew_normlize``, are important for the performance of the algorithms.

We have done some experiments to show the effect of these hyperparameters, and we log the best configuration for each algorithm in each environment; you can find them in ``omnisafe/configs/on_policy``.

In our experiments, we found that ``obs_normlize=True`` always performs better than ``obs_normlize=False`` in the second-order methods: the reward increases faster when the observations are normalized. So we set ``obs_normlize=True`` in almost all the second-order methods.

Importantly, we found that ``rew_normlize=True`` does not always perform better than ``rew_normlize=False``, especially in the ``SafetyHopperVelocity-v4`` and ``SafetyWalker2dVelocity-v4`` environments.
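To check the effect of these flags yourself, you can sweep them with the same ``ExperimentGrid`` tool shown above. This is a sketch assuming the same imports and ``train`` function as ``examples/benchmarks/experiment_grid.py``; the ``algo_cfgs:obs_normlize`` and ``algo_cfgs:rew_normlize`` key paths follow the notation used in this section, but the exact config layout is an assumption:

```python
eg = ExperimentGrid(exp_name='Normalization-Ablation')
eg.add('algo', ['CPO'])
eg.add('env_id', ['SafetyHopperVelocity-v4', 'SafetyWalker2dVelocity-v4'])
eg.add('algo_cfgs:obs_normlize', [True, False])  # assumed key path
eg.add('algo_cfgs:rew_normlize', [True, False])  # assumed key path
eg.add('seed', [0, 5, 10])
# 2 envs x 2 x 2 x 3 seeds = 24 experiments, divisible by num_pool=4
eg.run(train, num_pool=4)
```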

This hyperparameter (``num_pool``) depends on the number of CPU cores of your machine. We set it to 8 in our experiments.
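For example, a small helper like the one below (not part of OmniSafe) picks the largest pool size that both divides the experiment count and fits within the machine's CPU cores:

```python
import os

def pick_num_pool(total_experiments: int) -> int:
    """Largest divisor of the experiment count that fits the CPU budget."""
    cores = os.cpu_count() or 1
    return max(d for d in range(1, cores + 1) if total_experiments % d == 0)

# 14 algorithms x 6 environments x 5 seeds = 420 experiments in the grid above
print(pick_num_pool(420))  # e.g. 7 on an 8-core machine
```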

If you find that other hyperparameters perform better, please feel free to open an issue or pull request.

### First-Order Algorithms Experiment Results

#### PG(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1128.4±654.6 | 155.0±96.5 |
| SafetyHalfCheetahVelocity-v4 | 1700.2±902.4 | 422.2±234.1 |
| SafetyHopperVelocity-v4 | 674.4±127.2 | 180.5±26.4 |
| SafetyWalker2dVelocity-v4 | 624.2±301.4 | 125.8±67.5 |
| SafetySwimmerVelocity-v4 | 37.7±8.2 | 695.0±230.3 |
| SafetyHumanoidVelocity-v4 | 612.7±131.6 | 38.9±17.8 |

#### PPO(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 3012.2±1167.0 | 618.3±255.0 |
| SafetyHalfCheetahVelocity-v4 | 3641.1±1202.3 | 812.8±219.1 |
| SafetyHopperVelocity-v4 | 685.2±132.8 | 170.2±25.7 |
| SafetyWalker2dVelocity-v4 | 723.0±175.3 | 141.0±30.8 |
| SafetySwimmerVelocity-v4 | 52.4±19.9 | 472.9±300.3 |
| SafetyHumanoidVelocity-v4 | 633.3±128.7 | 45.9±16.1 |

#### PPOLag(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 2256.6±315.1 | 29.8±54.7 |
| SafetyHalfCheetahVelocity-v4 | 2065.5±234.5 | 4.7±5.1 |
| SafetyHopperVelocity-v4 | 415.8±367.9 | 47.2±28.4 |
| SafetyWalker2dVelocity-v4 | 310.4±44.7 | 19.9±9.9 |
| SafetySwimmerVelocity-v4 | 22.0±7.8 | 63.2±16.3 |
| SafetyHumanoidVelocity-v4 | 623.0±173.7 | 17.0±19.7 |

#### P3O(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1837.5±331.2 | 35.5±28.2 |
| SafetyHalfCheetahVelocity-v4 | 1251.2±117.4 | 14.7±15.3 |
| SafetyHopperVelocity-v4 | 779.0±383.2 | 21.4±13.9 |
| SafetyWalker2dVelocity-v4 | 1493.1±515.5 | 27.9±26.7 |
| SafetySwimmerVelocity-v4 | -8.8±14.3 | 125.0±58.5 |
| SafetyHumanoidVelocity-v4 | 1027.3±404.7 | 0.4±2.0 |

#### FOCOPS(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 2022.0±226.6 | 4.5±5.6 |
| SafetyHalfCheetahVelocity-v4 | 1759.8±414.4 | 31.3±55.2 |
| SafetyHopperVelocity-v4 | 255.4±190.0 | 10.2±12.4 |
| SafetyWalker2dVelocity-v4 | 346.3±100.2 | 22.1±16.1 |
| SafetySwimmerVelocity-v4 | 9.0±17.1 | 86.6±80.8 |
| SafetyHumanoidVelocity-v4 | 703.5±188.0 | 14.4±16.5 |

#### CUP(1M)

| Environment | Reward (OmniSafe) | Cost (OmniSafe) |
| :--------------------------: | :---------------: | :-------------: |
| SafetyAntVelocity-v4 | 1530.4±723.3 | 60.6±48.4 |
| SafetyHalfCheetahVelocity-v4 | 1217.6±288.0 | 15.2±14.6 |
| SafetyHopperVelocity-v4 | 249.8±308.5 | 32.2±21.1 |
| SafetyWalker2dVelocity-v4 | 673.3±608.6 | 22.2±21.6 |
| SafetySwimmerVelocity-v4 | 1.2±19.3 | 113.9±57.0 |
| SafetyHumanoidVelocity-v4 | 535.0±78.2 | 16.3±13.6 |



### Second-Order Algorithms Experiment Results

#### NaturalPG(1M)
