[RLlib] Update autoregressive actions example. #47829
Conversation
rllib/core/models/configs.py
Outdated
@@ -160,6 +160,13 @@ class _MLPConfig(ModelConfig):
            "_" are allowed.
        output_layer_bias_initializer_config: Configuration to pass into the
            initializer defined in `output_layer_bias_initializer`.
        clip_log_std: If the log std should be clipped by `log_std_clip_param`.
nit: I feel like this comment is confusing. We should write that clipping is only applied to those action distribution parameters that encode the log-std for a DiagGaussian action distribution. Any other node's output (or if there is no DiagGaussian) is not clipped.
Mentioning the value function makes it confusing.
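Something along those lines could read as follows (only a sketch of the suggested wording, not final text):

```
clip_log_std: Whether to clip log-std outputs via `log_std_clip_param`.
    Clipping is applied only to those output nodes that encode the
    log-std of a DiagGaussian action distribution; any other node's
    output (or the whole output, if no DiagGaussian distribution is
    used) is left unclipped.
```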
rllib/algorithms/sac/sac_catalog.py
Outdated
@@ -187,6 +194,8 @@ def build_pi_head(self, framework: str) -> Model:
            hidden_layer_activation=self.pi_and_qf_head_activation,
            output_layer_dim=required_output_dim,
            output_layer_activation="linear",
            clip_log_std=is_diag_gaussian,
            log_std_clip_param=self._model_config_dict["log_std_clip_param"],
Should we do `.get` here to be defensive against any custom models that use custom model_config_dicts that are NOT derived from our gigantic (old) model config?
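For instance, something like this (a sketch only; the 20.0 fallback is an assumed placeholder, not necessarily the actual default):

```python
# Defensive lookup: fall back to a default if a custom model_config_dict
# does not define "log_std_clip_param" (20.0 is an assumed placeholder).
log_std_clip_param=self._model_config_dict.get("log_std_clip_param", 20.0),
```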
@@ -100,7 +102,7 @@
# exceeds 150 in evaluation.
stop = {
    f"{NUM_ENV_STEPS_SAMPLED_LIFETIME}": 100000,
    f"{EVALUATION_RESULTS}/{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": 150.0,
    f"{EVALUATION_RESULTS}/{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": -0.012,
nice!
Where does it roughly start?
It roughly starts at around -0.55 to -0.6.
Niiice!!
super().reset(seed=seed)

# Randomly initialize the state between -1 and 1
self.state = np.random.uniform(-1, 1, size=(1,))
Nice that this can be negative, too. Makes sense!
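As an aside on the snippet above: a variant that draws from the RNG seeded by `super().reset(seed=seed)` (gymnasium's `self.np_random`) would look like the sketch below. This is only an illustrative alternative for reproducible resets, not what the PR changes (the PR uses the global `np.random`, as shown above).

```python
def reset(self, *, seed=None, options=None):
    super().reset(seed=seed)
    # Randomly initialize the state between -1 and 1, drawing from the env's
    # own, seed-controlled RNG instead of the global np.random module.
    self.state = self.np_random.uniform(-1, 1, size=(1,))
    return self.state, {}
```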
…annot only watch the state but needs to also watch the first action. Furthermore, implemented the 'ValueFunctionAPI' in the 'AutoregressiveActionsRLM' and ran some tests. Signed-off-by: simonsays1980 <[email protected]>
LGTM! Thanks @simonsays1980 :)
Why are these changes needed?

The autoregressive actions example had an environment in which the agent could cheat by looking only at the state when defining both actions, `a1` and `a2`. This PR proposes a new environment for testing autoregressive action modules in which the agent has to watch both the state and the action `a1` to define the action `a2` optimally. Rewards are based on the negative absolute deviation between the desired action for `a2` and its actual counterpart.

Furthermore, this PR introduces the `ValueFunctionAPI` for the `AutoregressiveActionsRLM` in the corresponding example, which simplifies the code and fixes an error caused by the old `_compute_values` definition.
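To illustrate the idea, here is a minimal, self-contained sketch of such an environment. All names, spaces, and the concrete target function for `a2` are assumptions for illustration only; this is not the PR's actual `AutoregressiveActionsEnv`.

```python
import gymnasium as gym
import numpy as np


class ToyAutoregressiveEnv(gym.Env):
    """Toy env: the optimal a2 depends on both the state and the chosen a1."""

    def __init__(self, config=None):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        # a1 is discrete, a2 is continuous.
        self.action_space = gym.spaces.Tuple(
            (gym.spaces.Discrete(2), gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32))
        )
        self.state = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Randomly initialize the state between -1 and 1.
        self.state = np.random.uniform(-1, 1, size=(1,)).astype(np.float32)
        return self.state, {}

    def step(self, action):
        a1, a2 = action
        # Hypothetical target: the desired a2 depends on BOTH the state and a1,
        # so an agent that ignores a1 cannot act optimally.
        desired_a2 = float(self.state[0]) if a1 == 0 else -float(self.state[0])
        # Reward: negative absolute deviation between desired and actual a2.
        reward = -abs(desired_a2 - float(a2[0]))
        self.state = np.random.uniform(-1, 1, size=(1,)).astype(np.float32)
        terminated, truncated = True, False
        return self.state, reward, terminated, truncated, {}
```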
Related issue number

Closes #44662
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.