Commit: add rnn state info

ericl committed May 26, 2019
1 parent dfaf616 commit 5e8fced
Showing 2 changed files with 61 additions and 4 deletions.
61 changes: 59 additions & 2 deletions doc/source/rllib-concepts.rst
@@ -8,7 +8,7 @@ Policies

Policy classes encapsulate the core numerical components of RL algorithms. This typically includes the policy model that determines actions to take, a trajectory postprocessor for experiences, and a loss function to improve the policy given postprocessed experiences. For a simple example, see the policy gradients `graph definition <https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/pg/pg_policy.py>`__.

Most interaction with deep learning frameworks is isolated to the `Policy interface <https://github.com/ray-project/ray/blob/master/python/ray/rllib/policy/policy.py>`__, allowing RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes `Tensorflow <https://github.com/ray-project/ray/blob/master/python/ray/rllib/policy/tf_policy_template.py>`__ and `PyTorch-specific <https://github.com/ray-project/ray/blob/master/python/ray/rllib/policy/torch_policy_template.py>`__ templates. You can also write your own from scratch. Here is an example:
Most interaction with deep learning frameworks is isolated to the `Policy interface <https://github.com/ray-project/ray/blob/master/python/ray/rllib/policy/policy.py>`__, allowing RLlib to support multiple frameworks. To simplify the definition of policies, RLlib includes `Tensorflow <#building-policies-in-tensorflow>`__ and `PyTorch-specific <#building-policies-in-pytorch>`__ templates. You can also write your own from scratch. Here is an example:

.. code-block:: python
@@ -46,6 +46,63 @@ Most interaction with deep learning frameworks is isolated to the `Policy interf
    def set_weights(self, weights):
        self.w = weights["w"]

The above basic policy, when run, will produce batches of observations with the basic ``obs``, ``new_obs``, ``actions``, ``rewards``, ``dones``, and ``infos`` columns. There are two more mechanisms to pass along and emit extra information:
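
For intuition, one such batch can be pictured as a dict of equal-length columns. The snippet below is only an illustrative sketch (the observation shapes and values are made up, not produced by any particular environment):

.. code-block:: python

    import numpy as np

    # Three timesteps of fake experience; every column has the same length.
    fake_batch = {
        "obs": np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]),
        "new_obs": np.array([[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]),
        "actions": np.array([0, 1, 0]),
        "rewards": np.array([1.0, 1.0, 1.0]),
        "dones": np.array([False, False, True]),
        "infos": np.array([{}, {}, {}]),
    }
    assert all(len(col) == 3 for col in fake_batch.values())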

**Policy recurrent state**: Suppose you want to compute actions based on the current timestep of the episode. While it is possible to have the environment provide this as part of the observation, we can instead compute and store it as part of the Policy recurrent state:

.. code-block:: python

    def get_initial_state(self):
        """Returns initial RNN state for the current policy."""
        return [0]  # list of single state element (t=0)
        # you could also return multiple values, e.g., [0, "foo"]

    def compute_actions(self,
                        obs_batch,
                        state_batches,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        assert len(state_batches) == len(self.get_initial_state())
        new_state_batches = [[
            t + 1 for t in state_batches[0]
        ]]
        return ..., new_state_batches, {}

    def learn_on_batch(self, samples):
        # can access array of the state elements at each timestep
        # or state_in_1, 2, etc. if there are multiple state elements
        assert "state_in_0" in samples.keys()
        assert "state_out_0" in samples.keys()

**Extra action info output**: You can also emit extra outputs at each step that will be available for learning. For example, you might want to output the behaviour policy logits as extra action info, which can be used for importance weighting; in general, arbitrary values can be stored here (as long as they are convertible to numpy arrays):

.. code-block:: python

    def compute_actions(self,
                        obs_batch,
                        state_batches,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        action_info_batch = {
            "some_value": ["foo" for _ in obs_batch],
            "other_value": [12345 for _ in obs_batch],
        }
        return ..., [], action_info_batch

    def learn_on_batch(self, samples):
        # can access array of the extra values at each timestep
        assert "some_value" in samples.keys()
        assert "other_value" in samples.keys()
Building Policies in TensorFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -427,7 +484,7 @@ Trainers

Trainers are the boilerplate classes that put the above components together, making algorithms accessible via Python API and the command line. They manage algorithm configuration, setup of the rollout workers and optimizer, and collection of training metrics. Trainers also implement the `Trainable API <https://ray.readthedocs.io/en/latest/tune-usage.html#training-api>`__ for easy experiment management.

Example of three equivalent ways of interacting with the PPO trainer:
Example of three equivalent ways of interacting with the PPO trainer, all of which log results in ``~/ray_results``:

.. code-block:: python
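
    # NOTE: hedged sketch of the three equivalent usages described above; the
    # PPOTrainer, tune.run, and ``rllib train`` entry points are assumed from
    # the RLlib API of this period, not copied from the original example.

    # 1) Direct Python API: build the trainer and call train() yourself.
    import ray
    from ray import tune
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()
    trainer = PPOTrainer(env="CartPole-v0")
    for _ in range(2):
        print(trainer.train())  # results are also written under ~/ray_results

    # 2) Via Tune, which drives the training loop and logging for you.
    tune.run("PPO", config={"env": "CartPole-v0"}, stop={"training_iteration": 2})

    # 3) Equivalent command line invocation (shell, not Python):
    #    rllib train --run=PPO --env=CartPole-v0
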
4 changes: 2 additions & 2 deletions doc/source/rllib.rst
@@ -95,8 +95,8 @@ Offline Datasets
* `Input API <rllib-offline.html#input-api>`__
* `Output API <rllib-offline.html#output-api>`__

Building Custom Algorithms
--------------------------
Concepts and Building Custom Algorithms
---------------------------------------
* `Policies <rllib-concepts.html>`__

- `Building Policies in TensorFlow <rllib-concepts.html#building-policies-in-tensorflow>`__
