ray-project · sven1977 · Jun 14, 2022 · Jun 10, 2022
@@ -23,6 +23,7 @@ Algorithm                      Frameworks Discrete Actions              Continuo
 `Bandits`_ (`TS`_ & `LinUCB`_) torch      **Yes** `+parametric`_        No                 **Yes**                                                                   No
 `BC`_                          tf + torch **Yes** `+parametric`_        **Yes**            **Yes**     `+RNN`_                                                       torch
 `CQL`_                         tf + torch No                            **Yes**            No                                                                        tf + torch
+`CRR`_                         torch      **Yes** `+parametric`_        **Yes**            **Yes**                                                                   torch
 `DDPG`_                        tf + torch No                            **Yes**            **Yes**                                                                   torch
 `APEX-DDPG`_                   tf + torch No                            **Yes**            **Yes**                                                                   torch
 `ES`_                          tf + torch **Yes**                       **Yes**            No                                                                        No
@@ -634,6 +635,28 @@ Tuned examples: `HalfCheetah Random <https://github.com/ray-project/ray/blob/mas
    :start-after: __sphinx_doc_begin__
    :end-before: __sphinx_doc_end__
 
+
+.. _crr:
+
+Critic Regularized Regression (CRR)
+-----------------------------------
+|pytorch|
+`[paper] <https://arxiv.org/abs/2006.15134>`__ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/crr/crr.py>`__
+
+CRR is another offline RL algorithm based on Q-learning that can learn from an offline experience replay dataset.
+The challenge in applying existing Q-learning algorithms to offline RL lies in the overestimation of the Q-function, as well as the lack of exploration beyond the observed data.
+The latter becomes increasingly important during bootstrapping in the Bellman equation, where the Q-function is queried for next-state Q-value(s) that have no support in the observed data.
+To mitigate these issues, CRR implements a simple yet powerful idea called "value-filtered regression".
+The key idea is to use a learned critic to filter out non-promising transitions from the replay dataset. For more details, please refer to the paper (see link above).
+
+Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/crr/cartpole-v0-crr.yaml>`__, `Pendulum-v1 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/crr/pendulum-v1-crr.yaml>`__
+
+.. literalinclude:: ../../../rllib/algorithms/crr/crr.py
+   :language: python
+   :start-after: __sphinx_doc_begin__
+   :end-before: __sphinx_doc_end__
+
+
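Note on the "value-filtered regression" description in the CRR section added above: a minimal, framework-free sketch of the idea may help readers. It is not the RLlib implementation. The function names, the `temperature` and `max_weight` parameters, and the choice to estimate the advantage against the mean Q-value of policy-sampled actions are illustrative assumptions loosely following the paper. The critic compares each dataset action to actions sampled from the current policy, and the resulting weight decides how strongly (or whether) the transition contributes to the policy's regression loss.

```python
import math
from typing import List, Sequence


def crr_weights(
    q_taken: Sequence[float],
    q_sampled: Sequence[Sequence[float]],
    mode: str = "binary",
    temperature: float = 1.0,  # hypothetical knob for the "exp" variant
    max_weight: float = 20.0,  # hypothetical cap to keep weights bounded
) -> List[float]:
    """Compute one filter weight per transition from critic estimates.

    q_taken[i]   : Q(s_i, a_i) for the action stored in the dataset.
    q_sampled[i] : Q(s_i, a_j) for actions a_j sampled from the current policy.
    """
    weights = []
    for q_a, q_samples in zip(q_taken, q_sampled):
        # Advantage estimate: dataset action vs. mean over policy samples.
        advantage = q_a - sum(q_samples) / len(q_samples)
        if mode == "binary":
            # Keep the transition only if the critic rates it above average.
            weights.append(1.0 if advantage > 0 else 0.0)
        elif mode == "exp":
            # Soft filter: up-weight good transitions, capped at max_weight.
            weights.append(min(math.exp(advantage / temperature), max_weight))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return weights


def filtered_regression_loss(
    log_probs: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted negative log-likelihood of the dataset actions."""
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / len(log_probs)


# Two transitions, four policy-sampled actions each.
q_taken = [1.0, 0.2]
q_sampled = [[0.5, 0.5, 0.5, 0.5], [0.4, 0.6, 0.5, 0.5]]
log_probs = [-0.5, -2.0]  # log pi(a_i | s_i) under the current policy

binary_w = crr_weights(q_taken, q_sampled, mode="binary")
loss = filtered_regression_loss(log_probs, binary_w)
```

With the binary filter, the second transition (estimated advantage below zero) drops out of the loss entirely, so the policy only imitates the first one. The four sampled actions per state mirror `n_action_sample = 4` in the config hunk further down, assuming that setting plays this role.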
 Derivative-free
 ~~~~~~~~~~~~~~~
 

@@ -60,7 +60,8 @@ Offline RL:
 
 - `Behavior Cloning (BC; derived from MARWIL implementation) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#bc>`__ 
 - `Conservative Q-Learning (CQL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#cql>`__ 
-- `Importance Sampling and Weighted Importance Sampling (OPE) <https://docs.ray.io/en/latest/rllib/rllib-offline.html#is>`__ 
+- `Critic Regularized Regression (CRR) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#crr>`__
+- `Importance Sampling and Weighted Importance Sampling (OPE) <https://docs.ray.io/en/latest/rllib/rllib-offline.html#is>`__
 - `Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) <https://docs.ray.io/en/master/rllib/rllib-algorithms.html#marwil>`__ 
 
 Model-free On-policy RL (for Games):

@@ -40,6 +40,8 @@ def __init__(self, trainer_class=None):
         self.n_action_sample = 4
         self.twin_q = True
         self.target_update_grad_intervals = 100
+        # __sphinx_doc_end__
+        # fmt: on
         self.replay_buffer_config = {
             "type": "ReplayBuffer",
             "capacity": 50000,
@@ -57,8 +59,6 @@ def __init__(self, trainer_class=None):
         self.critic_lr = 3e-4
         self.actor_lr = 3e-4
         self.tau = 5e-3
-        # __sphinx_doc_end__
-        # fmt: on
 
         # overriding the trainer config default
         self.num_workers = 0  # offline RL does not need rollout workers
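Note on `tau` and `target_update_grad_intervals` in the hunks above: a hedged sketch of how such settings are commonly combined, namely a Polyak (soft) target-network update applied every fixed number of gradient steps. That RLlib's CRR uses them exactly this way is an assumption based on the names alone. The helpers `soft_update` and `run_updates` and the toy weight vectors are hypothetical.

```python
from typing import List, Tuple


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Polyak averaging: move each target weight a fraction tau toward the online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def run_updates(
    num_grad_steps: int,
    target_update_grad_intervals: int = 100,  # default value shown in the diff
    tau: float = 5e-3,                        # default value shown in the diff
) -> Tuple[List[float], int]:
    """Count gradient steps and soft-update the target at a fixed interval."""
    online = [1.0, -2.0]  # toy stand-in for the online critic's weights
    target = [0.0, 0.0]   # toy stand-in for the target critic's weights
    num_target_updates = 0
    for step in range(1, num_grad_steps + 1):
        # A real learner would take a gradient step on the online critic here.
        if step % target_update_grad_intervals == 0:
            target = soft_update(target, online, tau)
            num_target_updates += 1
    return target, num_target_updates


target, num_target_updates = run_updates(num_grad_steps=250)
```

With these values, 250 gradient steps trigger two target updates, and each update moves the target weights only 0.5% of the way toward the online weights, which keeps the bootstrapping targets slow-moving.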