-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Eps greedy ope #28837
[RLlib] Eps greedy ope #28837
Conversation
… OPE and feature importance 2. introduced estimate_multi_step vs. estimate_single_step Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
new_prob = np.exp(convert_to_numpy(log_likelihoods)) | ||
|
||
if self.epsilon_greedy > 0.0: | ||
if not hasattr(self.policy.action_space, "n"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issubclass(self.policy.action_space, (Discrete, MultiDiscrete))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want this to depend on those classes. The minimum abstraction needs is that it should have an n
attribute. This is more flexible for down the line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is this a public spec though?
a general principle is not to rely on internal implementation details. for example, MultiBinary also has member variable named n ...
also, I kind of feel like we need to check for policy.action_space. original_space first, something like that, in case people are doing action space flattening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point on MultiBinary. So this only works for simple discrete action space. Should we just assume the spaces would be gym.Space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, not ideal, but it seems to be everywhere already, so probably fine:
https://github.com/ray-project/ray/blob/master/rllib/algorithms/appo/appo_tf_policy.py#L167
@@ -91,6 +192,21 @@ def check_action_prob_in_batch(self, batch: SampleBatchType) -> None: | |||
"`off_policy_estimation_methods: {}` to disable estimation." | |||
) | |||
|
|||
def _compute_action_probs(self, batch: SampleBatch): | |||
log_likelihoods = compute_log_likelihoods_from_input_dict(self.policy, batch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put these 2 lines in an else clause so we don't spend time on them if epsilon greedy is specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two should run regardless of eps? you modify new_prob in case of epsilon greedy?
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. tests look good too.
* 1. Introduced new abstraction: OfflineEvaluator that is the parent of OPE and feature importance 2. introduced estimate_multi_step vs. estimate_single_step Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * algorithm ope evaluation is now able to skip split_by_episode Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * fixed some unittests Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * added eps greedy exploration to ope methods Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * wip Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * wip Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * wip Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * fixed dm and dr variance issues Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * cleaned up the inheritance Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * lint Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * fixed test Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * nit Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * fixed nits Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * fixed the typos Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * nit Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * wip Signed-off-by: Kourosh Hakhamaneshi <[email protected]> * wip Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
feature requested :)
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.