-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wingman -> rllib] IMPALA MultiDiscrete changes #3967
Conversation
Test FAILed. |
Do you know what's going on with the formatting changes? It makes it kind of hard to review. |
Test FAILed. |
Aligned formatting with yours impala.py, it should be ok to take a look at merge request now. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're changing the vtrace code now, I think it's also good to copy over the vtrace_test file from here: https://github.com/deepmind/scalable_agent/blob/master/vtrace_test.py
and at least get it to run in the normal Discrete() case. It would also be good to add a new test for the MultiDiscrete() spaces.
I would also add an end-to-end runnable script that just verifies MultiDiscrete doesn't crash with a real env (you can add the file to rllib/test/, and add an entry in multi_node_tests.sh)
@@ -101,6 +109,11 @@ def __init__(self, | |||
"Must use `truncate_episodes` batch mode with V-trace." | |||
self.config = config | |||
self.sess = tf.get_default_session() | |||
self._is_discrete = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_multidiscrete is more clear I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, fixed.
log(target_policy(a) / behaviour_policy(a)). V-trace performs operations | ||
on rhos in log-space for numerical stability. | ||
rhos: A float32 tensor of shape [T, B, NUM_ACTIONS] representing the | ||
importance sampling weights, i.e. target_policy(a) / behaviour_policy(a). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change it from log space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were experimenting with both log prob and prob spaces, will return it to log space.
'vs', 'pg_advantages', 'log_rhos', 'behaviour_action_log_probs', | ||
'target_action_log_probs' | ||
'vs', 'pg_advantages', 'rhos', 'behaviour_action_policy', | ||
'target_action_policy' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep these in log space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
@@ -70,6 +70,9 @@ | |||
# max number of workers to broadcast one set of weights to | |||
"broadcast_interval": 1, | |||
|
|||
# Actions are chosen based on this distribution, if provided | |||
"dist_type": None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't needed right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, as default is DiagGaussian, and we use Categorical (which is MultiCategorical for MultiDiscrete action space). We didn't change the default, which is DiagGaussian and Discrete action space (that's how IMPALA operates at the moment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused, isn't it Categorical for discrete spaces?
elif isinstance(action_space, gym.spaces.Discrete):
return Categorical, action_space.n
I don't think DiagGaussian ever gets used in IMPALA does it (maybe you meant APPO?)
]) | ||
|
||
VTraceReturns = collections.namedtuple('VTraceReturns', 'vs pg_advantages') | ||
|
||
|
||
def log_probs_from_logits_and_actions(policy_logits, actions): | ||
def select_policy_values_using_actions(policy_logits, actions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should update doc comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -93,10 +92,10 @@ def from_logits(behaviour_policy_logits, | |||
NUM_ACTIONS refers to the number of actions. | |||
|
|||
Args: | |||
behaviour_policy_logits: A float32 tensor of shape [T, B, NUM_ACTIONS] with | |||
behaviour_policy: A float32 tensor of shape [T, B, NUM_ACTIONS] with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should update doc comments here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -33,14 +33,14 @@ | |||
nest = tf.contrib.framework.nest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should update the file comment to say modified to support MultiDiscrete spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Test FAILed. |
Test FAILed. |
* Address vtrace comments * Update get_log_rhos method's comment
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Is this pull request waiting on something? |
Ah looks like test_supported_spaces is failing on APPO somehow. I can take a look in a bit. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Merged, thanks! |
…project#3967)" (ray-project#4332)" This reverts commit 3c41cb9.
NOTE: This is the beginning of the pull request, so we can align. No unit tests were run, nor the changes are tested broader than we need them.
What do these changes do?
Enables the MultiDiscrete actions space in IMPALA.
Adds the MultiCategorical action distribution - we use IMPALA with that distribution - out model outputs categorical logits for each action. Possibly not all action spaces work for all action distributions, tested is that Discrete works with DiagGaussian (default) and that MultiDiscrete works with categorical (MultiCategorical). Note that model actually outputs a 1-dimensional tensor, that is reshaped to logit sizes according to provided MultiDiscrete action space.
VTrace is replaced with an implementation that works for MultiDiscrete action space as well.