[RLlib] Eps greedy ope #28837

kouroshHakha · 2022-09-28T01:39:27Z

Why are these changes needed?

feature requested :)

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(


gjoliver

nice!

gjoliver · 2022-09-28T09:56:10Z

rllib/offline/estimators/off_policy_estimator.py

+        new_prob = np.exp(convert_to_numpy(log_likelihoods))
+
+        if self.epsilon_greedy > 0.0:
+            if not hasattr(self.policy.action_space, "n"):


isinstance(self.policy.action_space, (Discrete, MultiDiscrete))?

I don't want this to depend on those classes. The minimum abstraction needed is an n attribute, which is more flexible down the line.

Hmm, is this a public spec, though?
A general principle is not to rely on internal implementation details. For example, MultiBinary also has a member variable named n.
Also, I feel we need to check policy.action_space.original_space first, or something like that, in case people are flattening the action space.

I see your point on MultiBinary. So this only works for simple discrete action spaces. Should we just assume the spaces are gym.Space instances?

yeah, not ideal, but it seems to be everywhere already, so probably fine:
https://github.com/ray-project/ray/blob/master/rllib/algorithms/appo/appo_tf_policy.py#L167
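A minimal sketch of the stricter check this thread converges on, written without a gym dependency so it runs standalone. DiscreteSpace and MultiBinarySpace are hypothetical stand-ins for gym.spaces.Discrete and gym.spaces.MultiBinary, and num_discrete_actions is a hypothetical helper, not an RLlib function. It looks at an original_space attribute first (as the reviewer suggests for flattened action spaces) and then requires the Discrete type instead of merely an n attribute, since MultiBinary also has n.

```python
class DiscreteSpace:
    """Hypothetical stand-in for gym.spaces.Discrete: actions 0..n-1."""

    def __init__(self, n: int):
        self.n = n


class MultiBinarySpace:
    """Hypothetical stand-in for gym.spaces.MultiBinary: n independent bits.

    It also has an `n` attribute, but here `n` is the number of bits,
    not the number of distinct actions.
    """

    def __init__(self, n: int):
        self.n = n


def num_discrete_actions(space) -> int:
    """Return the action count needed for epsilon-greedy OPE (hypothetical helper)."""
    # If a wrapper flattened the space and kept the original around,
    # inspect the original instead (attribute name as suggested in review).
    space = getattr(space, "original_space", space)
    # hasattr(space, "n") would also accept MultiBinarySpace, so check the type.
    if not isinstance(space, DiscreteSpace):
        raise ValueError(
            "Epsilon-greedy OPE requires a discrete action space, got "
            f"{type(space).__name__}."
        )
    return space.n
```

With real gym spaces, the isinstance check would target gym.spaces.Discrete rather than the stand-in class.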

gjoliver · 2022-09-28T09:59:23Z

rllib/offline/estimators/off_policy_estimator.py

@@ -91,6 +192,21 @@ def check_action_prob_in_batch(self, batch: SampleBatchType) -> None:
                "`off_policy_estimation_methods: {}` to disable estimation."
            )

+    def _compute_action_probs(self, batch: SampleBatch):
+        log_likelihoods = compute_log_likelihoods_from_input_dict(self.policy, batch)


Put these 2 lines in an else clause so we don't spend time on them if epsilon greedy is specified?

Shouldn't these two lines run regardless of eps? In the epsilon-greedy case, new_prob is modified, not replaced.
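Why the likelihood is needed in both branches: an epsilon-greedy target policy is a mixture, pi_eps(a|s) = (1 - eps) * pi(a|s) + eps / n for n discrete actions, so the policy's own probability pi(a|s) still enters the result. A minimal NumPy sketch, assuming a discrete action space with n actions; epsilon_greedy_probs is a hypothetical helper, not the RLlib method under review.

```python
import numpy as np


def epsilon_greedy_probs(log_likelihoods, epsilon: float, num_actions: int) -> np.ndarray:
    """Probability of each logged action under an epsilon-greedy target policy.

    log_likelihoods: log pi(a_t | s_t) of the logged actions under the
    target policy without exploration.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError(f"epsilon must be in [0, 1], got {epsilon}")
    # Both branches need pi(a|s), so it is computed unconditionally.
    new_prob = np.exp(np.asarray(log_likelihoods, dtype=np.float64))
    if epsilon > 0.0:
        # With prob. 1 - epsilon act as the policy would; with prob. epsilon
        # pick one of num_actions uniformly at random.
        new_prob = (1.0 - epsilon) * new_prob + epsilon / num_actions
    return new_prob


# A greedy (deterministic) target policy: the first logged action is the
# greedy one (log 1 = 0), the second is not (log 0 = -inf).
probs = epsilon_greedy_probs([0.0, -np.inf], epsilon=0.1, num_actions=4)
print(probs)
```

In an importance-sampling estimator these probabilities would act as the numerator of the weight new_prob / old_prob, where old_prob is the behavior policy's logged action probability. The eps / n term keeps the target probability strictly positive, so a deterministic target no longer assigns zero weight to non-greedy logged actions.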


gjoliver

looks good. tests look good too.

* 1. Introduced new abstraction: OfflineEvaluator that is the parent of OPE and feature importance; 2. introduced estimate_multi_step vs. estimate_single_step
* algorithm ope evaluation is now able to skip split_by_episode
* lint
* lint
* fixed some unittests
* added eps greedy exploration to ope methods
* wip
* lint
* wip
* wip
* fixed dm and dr variance issues
* lint
* cleaned up the inheritance
* lint
* lint
* fixed test
* nit
* fixed nits
* fixed the typos
* nit
* wip
* wip

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>

kouroshHakha added 9 commits September 27, 2022 16:31

1. Introduced new abstraction: OfflineEvaluator that is the parent of…

d68327a

… OPE and feature importance 2. introduced estimate_multi_step vs. estimate_single_step Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

algorithm ope evaluation is now able to skip split_by_episode

c9f82c7

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

601e92b

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

676babd

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Merge branch 'master' into fix_ope_speed

7f0983a

fixed some unittests

5e4d9e0

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

added eps greedy exploration to ope methods

12a3ef2

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

f0680b4

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

cbaebf2

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

kouroshHakha requested review from sven1977, gjoliver, avnishn, ArturNiederfahrenhorst, smorad, maxpumperla and krfricke as code owners September 28, 2022 01:39

gjoliver reviewed Sep 28, 2022


kouroshHakha added 13 commits September 28, 2022 10:21

wip

e4e53f6

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

34cd602

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

fixed dm and dr variance issues

ddf2910

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

240e2be

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

cleaned up the inheritance

33401da

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

e340b25

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

lint

80bb48b

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

fixed test

b1db2ec

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

nit

b576b83

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

fixed nits

7d59e37

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

fixed the typos

0c3d09d

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Merge branch 'fix_ope_speed' into eps_greedy_ope

28fb313

nit

c642cc4

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

kouroshHakha added 3 commits September 28, 2022 17:58

wip

7f0e8cb

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

wip

db5cc31

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

Merge branch 'master' into eps_greedy_ope

73d8bc0

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

gjoliver approved these changes Sep 29, 2022


gjoliver merged commit c1e0d39 into ray-project:master Sep 29, 2022


