
[wingman -> rllib] IMPALA MultiDiscrete changes #3967

Merged
merged 48 commits on Mar 2, 2019
Changes from 33 commits
Commits (48)
a6b1b7a
impala changes
Feb 6, 2019
088985b
fixed newlines
Feb 6, 2019
23029d3
Merge branch 'master' into impala
Feb 7, 2019
3b38ebb
reformatting impalla.py
Feb 7, 2019
1858404
aligned vtrace.py formatting some more
Feb 7, 2019
9840eb6
aligned formatting some more
Feb 7, 2019
e48f9ae
aligned formatting some more
Feb 7, 2019
26eed71
Merge branch 'master' into impala
Feb 8, 2019
3171c8a
fixed impala stuff
Feb 8, 2019
9d62dd1
Address vtrace comments (#6)
pimpke Feb 8, 2019
6597295
Made APPO work with VTrace
stefanpantic Feb 11, 2019
1d31991
Variable is no longer a member
stefanpantic Feb 11, 2019
252f6b3
Optimized imports
stefanpantic Feb 11, 2019
5ef2e30
Changed is_discrete to is_multidiscrete, fixed KL distribution
stefanpantic Feb 11, 2019
cf5c1c5
Fixed KL divergence
stefanpantic Feb 11, 2019
54f4f79
Removed if statement
stefanpantic Feb 11, 2019
38c1896
Merge branch 'master' into impala
Feb 11, 2019
2dd604f
Merge branch 'impala' of https://github.com/wingman-ai/ray into impala
Feb 11, 2019
7cb1f97
revert appo file
Feb 14, 2019
65c82d4
revered stefans appo changes
Feb 14, 2019
946df01
old appo policy graph
Feb 14, 2019
b6b2c52
returned stefan appo changes and returned newline
Feb 14, 2019
56fe32e
fixed newlines in appo_policy_graph
Feb 14, 2019
1d46d7a
Merge branch 'master' into impala
Feb 14, 2019
76046f3
aligned with action_dist changes in ray master
Feb 14, 2019
d017c9f
small appo fixes
Feb 14, 2019
c02b9f5
add vtrace test
ericl Feb 15, 2019
fbbed63
fix appo impala integration
ericl Feb 15, 2019
bcb2113
add to jenkins
ericl Feb 15, 2019
e020527
merged with master
Feb 18, 2019
63e119a
Merge branch 'impala' of https://github.com/wingman-ai/ray into impala
Feb 18, 2019
6e06ba6
fixing appo policy graph changes
Feb 18, 2019
f161edf
fixed vtrace tests
Feb 18, 2019
1dbed08
lint and py2 compat
ericl Feb 18, 2019
aa50f98
kl
ericl Feb 18, 2019
57594c0
Merge branch 'master' into impala
Feb 19, 2019
3f4883b
Merge branch 'impala' of https://github.com/wingman-ai/ray into impala
Feb 19, 2019
d32d253
removed dist_type as it is actually not needed for IMPALA
Feb 19, 2019
967db5c
fixing issue with new gym version
Feb 19, 2019
1da470f
Merge branch 'master' into impala
Feb 20, 2019
9584a7c
lint
ericl Feb 20, 2019
8999621
fix multigpu test
ericl Feb 20, 2019
0cbeb7c
merged with master
Feb 25, 2019
abe797a
Merge branch 'impala' of https://github.com/wingman-ai/ray into impala
Feb 25, 2019
280b21c
Merge branch 'master' into impala
Feb 26, 2019
e0e3060
Merge branch 'master' into impala
Feb 27, 2019
afb462f
Merge remote-tracking branch 'upstream/master' into impala
ericl Mar 1, 2019
eb18cff
fix tests
ericl Mar 1, 2019
3 changes: 3 additions & 0 deletions python/ray/rllib/agents/impala/impala.py
@@ -72,6 +72,9 @@
# max number of workers to broadcast one set of weights to
"broadcast_interval": 1,

# Actions are chosen based on this distribution, if provided
"dist_type": None,
Contributor
This isn't needed right?

@bjg2 (Contributor, Author) Feb 8, 2019
It is, as default is DiagGaussian, and we use Categorical (which is MultiCategorical for MultiDiscrete action space). We didn't change the default, which is DiagGaussian and Discrete action space (that's how IMPALA operates at the moment).

Contributor
I'm a bit confused, isn't it Categorical for discrete spaces?

        elif isinstance(action_space, gym.spaces.Discrete):
            return Categorical, action_space.n

I don't think DiagGaussian ever gets used in IMPALA does it (maybe you meant APPO?)


# Learning params.
"grad_clip": 40.0,
# either "adam" or "rmsprop"
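
To make the new option concrete, here is a minimal sketch of the relevant config entries, with values copied from the diff above; the dict name and its use as an override are illustrative only, not the full IMPALA config:

# "dist_type" only needs to be set to override the action distribution that
# would otherwise be chosen from the action space (per the discussion above:
# Categorical for Discrete, MultiCategorical for MultiDiscrete); None keeps
# that default behaviour.
impala_config_overrides = {
    "dist_type": None,
    # max number of workers to broadcast one set of weights to
    "broadcast_interval": 1,
    # Learning params.
    "grad_clip": 40.0,
}
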
181 changes: 136 additions & 45 deletions python/ray/rllib/agents/impala/vtrace.py
@@ -20,6 +20,12 @@
by Espeholt, Soyer, Munos et al.

See https://arxiv.org/abs/1802.01561 for the full paper.

In addition to the original paper's code, changes have been made
to support MultiDiscrete action spaces. The behaviour_policy_logits,
target_policy_logits and actions parameters of the entry-point
multi_from_logits method accept lists of tensors instead of single
tensors.
"""

from __future__ import absolute_import
@@ -41,29 +47,47 @@


def log_probs_from_logits_and_actions(policy_logits, actions):
return multi_log_probs_from_logits_and_actions(
[policy_logits], [actions])[0]


def multi_log_probs_from_logits_and_actions(policy_logits, actions):
"""Computes action log-probs from policy logits and actions.

In the notation used throughout documentation and comments, T refers to the
time dimension ranging from 0 to T-1. B refers to the batch size and
NUM_ACTIONS refers to the number of actions.
ACTION_SPACE refers to the list of integers, each giving the number of actions in the corresponding sub-action space.

Args:
policy_logits: A float32 tensor of shape [T, B, NUM_ACTIONS] with
un-normalized log-probabilities parameterizing a softmax policy.
actions: An int32 tensor of shape [T, B] with actions.
policy_logits: A list with length of ACTION_SPACE of float32
tensors of shapes
[T, B, ACTION_SPACE[0]],
...,
[T, B, ACTION_SPACE[-1]]
with un-normalized log-probabilities parameterizing a softmax policy.
actions: A list with length of ACTION_SPACE of int32
tensors of shapes
[T, B],
...,
[T, B]
with actions.

Returns:
A float32 tensor of shape [T, B] corresponding to the sampling log
probability of the chosen action w.r.t. the policy.
A list with length of ACTION_SPACE of float32
tensors of shapes
[T, B],
...,
[T, B]
corresponding to the sampling log probability
of the chosen action w.r.t. the policy.
"""
policy_logits = tf.convert_to_tensor(policy_logits, dtype=tf.float32)
actions = tf.convert_to_tensor(actions, dtype=tf.int32)

policy_logits.shape.assert_has_rank(3)
actions.shape.assert_has_rank(2)
log_probs = []
for i in range(len(policy_logits)):
log_probs.append(-tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=policy_logits[i], labels=actions[i]))

return -tf.nn.sparse_softmax_cross_entropy_with_logits(
logits=policy_logits, labels=actions)
return log_probs
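
As an illustration of the shape convention documented above, a minimal sketch (hypothetical T, B and ACTION_SPACE values; TensorFlow 1.x API, with the function imported from ray.rllib.agents.impala.vtrace) of calling the multi variant for a MultiDiscrete space with sub-action sizes [2, 3]:

import tensorflow as tf
from ray.rllib.agents.impala.vtrace import \
    multi_log_probs_from_logits_and_actions

T, B, action_space = 5, 4, [2, 3]  # hypothetical sizes
policy_logits = [tf.random_normal([T, B, n]) for n in action_space]
actions = [
    tf.random_uniform([T, B], maxval=n, dtype=tf.int32) for n in action_space
]
log_probs = multi_log_probs_from_logits_and_actions(policy_logits, actions)
# log_probs is a list of len(action_space) tensors, each of shape [T, B].
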


def from_logits(behaviour_policy_logits,
@@ -76,6 +100,40 @@ def from_logits(behaviour_policy_logits,
clip_rho_threshold=1.0,
clip_pg_rho_threshold=1.0,
name='vtrace_from_logits'):
"""multi_from_logits wrapper used only for tests"""

res = multi_from_logits(
[behaviour_policy_logits],
[target_policy_logits],
[actions],
discounts,
rewards,
values,
bootstrap_value,
clip_rho_threshold=clip_rho_threshold,
clip_pg_rho_threshold=clip_pg_rho_threshold,
name=name)

return VTraceFromLogitsReturns(
vs=res.vs,
pg_advantages=res.pg_advantages,
log_rhos=res.log_rhos,
behaviour_action_log_probs=tf.squeeze(res.behaviour_action_log_probs, axis=0),
target_action_log_probs=tf.squeeze(res.target_action_log_probs, axis=0),
)
Contributor
Ah, thanks for fixing this bit.



def multi_from_logits(
behaviour_policy_logits,
target_policy_logits,
actions,
discounts,
rewards,
values,
bootstrap_value,
clip_rho_threshold=1.0,
clip_pg_rho_threshold=1.0,
name='vtrace_from_logits'):
r"""V-trace for softmax policies.

Calculates V-trace actor critic targets for softmax policies as described in
@@ -90,16 +148,27 @@ def from_logits(behaviour_policy_logits,

In the notation used throughout documentation and comments, T refers to the
time dimension ranging from 0 to T-1. B refers to the batch size and
NUM_ACTIONS refers to the number of actions.
ACTION_SPACE refers to the list of integers, each giving the number of actions in the corresponding sub-action space.

Args:
behaviour_policy_logits: A float32 tensor of shape [T, B, NUM_ACTIONS] with
un-normalized log-probabilities parametrizing the softmax behaviour
policy.
target_policy_logits: A float32 tensor of shape [T, B, NUM_ACTIONS] with
un-normalized log-probabilities parametrizing the softmax target policy.
actions: An int32 tensor of shape [T, B] of actions sampled from the
behaviour policy.
behaviour_policy_logits: A list with length of ACTION_SPACE of float32
tensors of shapes
[T, B, ACTION_SPACE[0]],
...,
[T, B, ACTION_SPACE[-1]]
with un-normalized log-probabilities parameterizing the softmax behaviour policy.
target_policy_logits: A list with length of ACTION_SPACE of float32
tensors of shapes
[T, B, ACTION_SPACE[0]],
...,
[T, B, ACTION_SPACE[-1]]
with un-normalized log-probabilities parameterizing the softmax target policy.
actions: A list with length of ACTION_SPACE of int32
tensors of shapes
[T, B],
...,
[T, B]
with actions sampled from the behaviour policy.
discounts: A float32 tensor of shape [T, B] with the discount encountered
when following the behaviour policy.
rewards: A float32 tensor of shape [T, B] with the rewards generated by
@@ -128,29 +197,31 @@ def from_logits(behaviour_policy_logits,
target_action_log_probs: A float32 tensor of shape [T, B] containing
target policy action probabilities (log \pi(a_t)).
"""
behaviour_policy_logits = tf.convert_to_tensor(
behaviour_policy_logits, dtype=tf.float32)
target_policy_logits = tf.convert_to_tensor(
target_policy_logits, dtype=tf.float32)
actions = tf.convert_to_tensor(actions, dtype=tf.int32)

# Make sure tensor ranks are as expected.
# The rest will be checked by from_action_log_probs.
behaviour_policy_logits.shape.assert_has_rank(3)
target_policy_logits.shape.assert_has_rank(3)
actions.shape.assert_has_rank(2)

with tf.name_scope(
name,
values=[
behaviour_policy_logits, target_policy_logits, actions,
discounts, rewards, values, bootstrap_value
]):
target_action_log_probs = log_probs_from_logits_and_actions(
for i in range(len(behaviour_policy_logits)):
behaviour_policy_logits[i] = tf.convert_to_tensor(
behaviour_policy_logits[i], dtype=tf.float32)
target_policy_logits[i] = tf.convert_to_tensor(
target_policy_logits[i], dtype=tf.float32)
actions[i] = tf.convert_to_tensor(actions[i], dtype=tf.int32)

# Make sure tensor ranks are as expected.
# The rest will be checked by from_action_log_probs.
behaviour_policy_logits[i].shape.assert_has_rank(3)
target_policy_logits[i].shape.assert_has_rank(3)
actions[i].shape.assert_has_rank(2)

with tf.name_scope(
name,
values=[
behaviour_policy_logits, target_policy_logits, actions,
discounts, rewards, values, bootstrap_value
]):
target_action_log_probs = multi_log_probs_from_logits_and_actions(
target_policy_logits, actions)
behaviour_action_log_probs = log_probs_from_logits_and_actions(
behaviour_action_log_probs = multi_log_probs_from_logits_and_actions(
behaviour_policy_logits, actions)
log_rhos = target_action_log_probs - behaviour_action_log_probs

log_rhos = get_log_rhos(target_action_log_probs,
behaviour_action_log_probs)

vtrace_returns = from_importance_weights(
log_rhos=log_rhos,
discounts=discounts,
Expand All @@ -159,6 +230,7 @@ def from_logits(behaviour_policy_logits,
bootstrap_value=bootstrap_value,
clip_rho_threshold=clip_rho_threshold,
clip_pg_rho_threshold=clip_pg_rho_threshold)

return VTraceFromLogitsReturns(
log_rhos=log_rhos,
behaviour_action_log_probs=behaviour_action_log_probs,
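
A hedged usage sketch of the new entry point (hypothetical shapes, TensorFlow 1.x API) for the same MultiDiscrete([2, 3]) example; everything other than the function itself is made up for illustration:

import tensorflow as tf
from ray.rllib.agents.impala.vtrace import multi_from_logits

T, B, action_space = 5, 4, [2, 3]  # hypothetical sizes
behaviour_logits = [tf.random_normal([T, B, n]) for n in action_space]
target_logits = [tf.random_normal([T, B, n]) for n in action_space]
actions = [
    tf.random_uniform([T, B], maxval=n, dtype=tf.int32) for n in action_space
]
vtrace_returns = multi_from_logits(
    behaviour_logits, target_logits, actions,
    discounts=tf.fill([T, B], 0.99),
    rewards=tf.random_normal([T, B]),
    values=tf.random_normal([T, B]),
    bootstrap_value=tf.random_normal([B]),
    clip_rho_threshold=1.0,
    clip_pg_rho_threshold=1.0)
# vtrace_returns.vs and vtrace_returns.pg_advantages have shape [T, B].
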
@@ -183,13 +255,13 @@ def from_importance_weights(log_rhos,
by Espeholt, Soyer, Munos et al.

In the notation used throughout documentation and comments, T refers to the
time dimension ranging from 0 to T-1. B refers to the batch size and
NUM_ACTIONS refers to the number of actions. This code also supports the
case where all tensors have the same number of additional dimensions, e.g.,
`rewards` is [T, B, C], `values` is [T, B, C], `bootstrap_value` is [B, C].
time dimension ranging from 0 to T-1. B refers to the batch size. This code
also supports the case where all tensors have the same number of additional
dimensions, e.g., `rewards` is [T, B, C], `values` is [T, B, C],
`bootstrap_value` is [B, C].

Args:
log_rhos: A float32 tensor of shape [T, B, NUM_ACTIONS] representing the
log_rhos: A float32 tensor of shape [T, B] representing the
log importance sampling weights, i.e.
log(target_policy(a) / behaviour_policy(a)). V-trace performs operations
on rhos in log-space for numerical stability.
@@ -246,6 +318,14 @@ def from_importance_weights(log_rhos,
if clip_rho_threshold is not None:
clipped_rhos = tf.minimum(
clip_rho_threshold, rhos, name='clipped_rhos')

tf.summary.histogram('clipped_rhos_1000', tf.minimum(1000.0, rhos))
tf.summary.scalar(
'num_of_clipped_rhos',
tf.reduce_sum(tf.cast(
tf.equal(clipped_rhos, clip_rho_threshold), tf.int32))
)
tf.summary.scalar('size_of_clipped_rhos', tf.size(clipped_rhos))
else:
clipped_rhos = rhos

@@ -298,3 +378,14 @@ def scanfunc(acc, sequence_item):
return VTraceReturns(
vs=tf.stop_gradient(vs),
pg_advantages=tf.stop_gradient(pg_advantages))


def get_log_rhos(target_action_log_probs, behaviour_action_log_probs):
"""With the selected log_probs for multi-discrete actions of behaviour
and target policies we compute the log_rhos for calculating the vtrace."""
log_rhos = [
t - b
for t, b in zip(target_action_log_probs, behaviour_action_log_probs)
]
log_rhos = [tf.convert_to_tensor(l, dtype=tf.float32) for l in log_rhos]
log_rhos = tf.reduce_sum(tf.stack(log_rhos), axis=0)

return log_rhos
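
A tiny numeric sketch of what get_log_rhos computes (made-up log-probs; T=1, B=1, two sub-actions), using the (target, behaviour) argument order from the call site above:

import tensorflow as tf
from ray.rllib.agents.impala.vtrace import get_log_rhos

target_lp = [tf.constant([[-0.5]]), tf.constant([[-1.0]])]  # [T, B] each
behaviour_lp = [tf.constant([[-0.7]]), tf.constant([[-0.9]])]
log_rhos = get_log_rhos(target_lp, behaviour_lp)
# Shape [1, 1], value (-0.5 + 0.7) + (-1.0 + 0.9) = 0.1: the log of the
# product of the per-component ratios pi_target(a_i) / pi_behaviour(a_i).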