[RLlib] Fix ope speed #28834
Conversation
1. Introduced new abstraction: OfflineEvaluator that is the parent of OPE and feature importance 2. introduced estimate_multi_step vs. estimate_single_step Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
```
@@ -936,10 +939,20 @@ def duration_fn(num_units_done):
        if self.reward_estimators:
            # Compute off-policy estimates
            metrics["off_policy_estimator"] = {}
            total_batch = concat_samples(all_batches)
```
We shouldn't concatenate and then split by episode here. In the case of bandits we don't even need to split_by_episodes, since each row is already one episode. In the case of RL, each batch already ends at some episode, so we gain nothing by concatenating all batches together.
rllib/algorithms/algorithm.py
Outdated
```python
for batch in all_batches:
    for name, estimator in self.reward_estimators.items():
        estimate_result = estimator.estimate(
            batch, split_by_episode=self.config["ope_split_by_episode"]
```
added this optional parameter, which defaults to True in the algorithm config. This will allow bandit users to gain a speedup by setting it to False.
should we name this `split_batch_by_episode`?
also a random suggestion: I think we can reduce a level of nesting, since the outermost `if self.reward_estimators` is not necessary:
```python
estimates = defaultdict(list)
for name, estimator in self.reward_estimators.items():
    for batch in all_batches:
        estimate_result = estimator.estimate(...)
        estimates[name].append(estimate_result)

if estimates:
    metrics["off_policy_estimator"] = {}
    for name, estimate_list in estimates.items():
        avg_estimate = tree....
        metrics["off_policy_estimator"][name] = avg_estimate
```
just to make the code look a bit nicer.
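A runnable sketch of the suggested aggregation pattern. The `DummyEstimator` class, its `estimate` signature, and the plain-dict averaging standing in for the elided `tree....` call are all illustrative assumptions, not RLlib's actual classes:

```python
from collections import defaultdict

# Hypothetical stand-in for an RLlib reward estimator; `estimate` here just
# scales a numeric "batch" so the averaging logic below has something to chew on.
class DummyEstimator:
    def __init__(self, scale):
        self.scale = scale

    def estimate(self, batch):
        return {"v_target": self.scale * batch}

reward_estimators = {"is": DummyEstimator(1.0), "dm": DummyEstimator(2.0)}
all_batches = [1.0, 3.0]

# The suggested flattening: collect per-batch results first, then aggregate,
# so no outer `if reward_estimators:` nesting is needed.
estimates = defaultdict(list)
for name, estimator in reward_estimators.items():
    for batch in all_batches:
        estimates[name].append(estimator.estimate(batch))

metrics = {}
if estimates:
    metrics["off_policy_estimator"] = {}
    for name, results in estimates.items():
        # Plain-dict averaging in place of the elided `tree....` utility call.
        keys = results[0].keys()
        avg = {k: sum(r[k] for r in results) / len(results) for k in keys}
        metrics["off_policy_estimator"][name] = avg

print(metrics["off_policy_estimator"]["is"]["v_target"])  # mean of 1.0 and 3.0 -> 2.0
```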
done.
for the name: I don't know, I think "ope" should be part of it, since it's an evaluation() variable and users need to know from the name that this is for OPE only. So something like ope_split_batch_by_episode is kinda the name I want to put here, but it's more verbose, which I think is ok?
yeah, makes sense, ope_split_batch_by_episode sounds much better.
done
```
@@ -925,10 +932,12 @@ def evaluation(
        self.evaluation_num_workers = evaluation_num_workers
        if custom_evaluation_function is not None:
            self.custom_evaluation_function = custom_evaluation_function
-       if always_attach_evaluation_results:
+       if always_attach_evaluation_results is not None:
```
I noticed that these were buggy. I.e., previously, if always_attach_evaluation_results was set to False (and by default it was True), this call would not have overridden it.
You have to explicitly check that these variables are not None, otherwise False would also not get assigned.
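A minimal, self-contained demonstration of the truthiness pitfall being fixed here; the `configure_buggy`/`configure_fixed` helpers are hypothetical illustrations, not RLlib code:

```python
# The bug: `if flag:` treats an explicit False the same as "not provided",
# so a user's override to False is silently dropped.
def configure_buggy(current, always_attach_evaluation_results=None):
    if always_attach_evaluation_results:  # buggy: False is falsy
        current = always_attach_evaluation_results
    return current

# The fix: only skip assignment when the argument was genuinely left unset.
def configure_fixed(current, always_attach_evaluation_results=None):
    if always_attach_evaluation_results is not None:
        current = always_attach_evaluation_results
    return current

# Default is True; the user explicitly passes False.
print(configure_buggy(True, False))  # True  -> the user's override was lost
print(configure_fixed(True, False))  # False -> the override took effect
```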
👍
```python
@@ -9,6 +9,8 @@
from ray.rllib.offline.output_writer import OutputWriter, NoopOutput
from ray.rllib.offline.resource import get_offline_io_resource_bundles
from ray.rllib.offline.shuffled_input import ShuffledInput
from ray.rllib.offline.feature_importance import FeatureImportance
```
Moving feature importance out of its previous place, because it doesn't really fit the definition of off-policy evaluation in the literature. It is now a subclass of OfflineEvaluator, which OffPolicyEstimator is also a subclass of.
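A rough sketch of the hierarchy described above, assuming only what the comment states; the method bodies are placeholders, not RLlib's implementation:

```python
# Both OffPolicyEstimator and FeatureImportance hang off a shared
# OfflineEvaluator base, so feature importance no longer masquerades as OPE.
class OfflineEvaluator:
    """Abstract base: anything that scores a policy from offline data."""

    def estimate(self, batch):
        raise NotImplementedError

class OffPolicyEstimator(OfflineEvaluator):
    """OPE methods (IS, DM, DR, ...) that fit the literature's definition."""

    def estimate(self, batch):
        return {"v_target": 0.0}  # placeholder value for illustration

class FeatureImportance(OfflineEvaluator):
    """Perturbs features and measures policy sensitivity; not OPE."""

    def estimate(self, batch):
        return {"feature_importance": [0.0]}  # placeholder value

assert issubclass(OffPolicyEstimator, OfflineEvaluator)
assert issubclass(FeatureImportance, OfflineEvaluator)
```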
this should be a separate pr
If this is not done here, it will break the feature_importance code.
```
@@ -62,50 +63,38 @@ def __init__(
        ), "self.model must implement `estimate_v`!"

    @override(OffPolicyEstimator)
-   def estimate(self, batch: SampleBatchType) -> Dict[str, Any]:
-       """Compute off-policy estimates.
+   def estimate_multi_step(self, episode: SampleBatch) -> Dict[str, float]:
```
I am now separating episodic RL logic from bandits, which will be easier to maintain and debug. It will also focus on our current use case, which is bandits.
cool, only a few minor issues.
more significantly, I want to raise the soul-searching question: should we add some unit tests for these mathy estimate_xxx() functions, now that they are nicely separated?
```python
    @override(OffPolicyEstimator)
    def estimate_single_step(self, batch: SampleBatch) -> Dict[str, float]:
        estimates_per_epsiode = {"v_behavior": None, "v_target": None}
```
this is nice. rename this variable to something else since we are not dealing with an episode here?
what should I name it? it's a batch of single time steps
estimate_per_sample?
how about estimate_on_single_episode vs. estimate_on_single_step_samples?
👍
done
```python
    @override(OffPolicyEstimator)
    def estimate_single_step(self, batch: SampleBatchType) -> Dict[str, float]:
        estimates_per_epsiode = {"v_behavior": None, "v_target": None}
```
same here.
```python
        estimates["v_delta"] = estimates["v_target"] - estimates["v_behavior"]

        return estimates

    def estimate_multi_step(self, episode: SampleBatch) -> Dict[str, float]:
```
is this a good time to add some unit tests for these single/multi-step util functions? we want to double-check the math next, right?
done.
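As an illustration of the kind of unit test being discussed, here is a tiny self-contained check of the `v_delta` arithmetic from the diff above; the `compute_estimates` helper is a stub for illustration, not RLlib's estimator:

```python
# Stub reproducing just the v_delta arithmetic: the gap between the
# target policy's estimated value and the behavior policy's value.
def compute_estimates(v_behavior, v_target):
    estimates = {"v_behavior": v_behavior, "v_target": v_target}
    estimates["v_delta"] = estimates["v_target"] - estimates["v_behavior"]
    return estimates

def test_v_delta():
    est = compute_estimates(v_behavior=1.5, v_target=2.0)
    assert est["v_delta"] == 0.5

test_v_delta()
print("ok")
```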
```python
if is_overridden(self.estimate_single_step):
    estimates_per_epsiode.append(self.estimate_single_step(batch))
else:
    raise NotImplementedError(
```
why do you have to do this? if it's not overridden, the default implementation will just raise the error anyway?
Good point. I don't know what the best way to do this is. I wanted to give a more informative error message, because they can also go on with multi_step by setting ope_split_by_episode=True.
so why not just raise this specific message in estimate_single_step() right above?
because that function does not need to know about how it's called? like, based on ope_split_by_episode?
Another way would be to try and catch the error and change the error message.
I see. huh, ok.
but however we want to do this, should we do the same for the self.estimate_multi_step() call above as well?
estimate_multi_step() always works; the other one only works on bandits. An estimator down the line may implement only one of the two, or both, and should raise NotImplementedError otherwise. The error message should tell the user: if you unintentionally ended up in estimate_single_step() because split_by_eps is False and estimate_single_step is not implemented, you should set it to True. I'll think of a better way.
wait, I think we are mixing some things up here.
if you look at the two API calls above, they are both NotImplemented.
this makes me feel like the knowledge that "multi-step will always work" is coming from the children implementations we already have, not from the perspective of an abstract API class. like, if someone is only reading the OffPolicyEstimator class by itself, they will get curious why we catch in one place but not the other.
either it's fine to have the knowledge leak from the children classes into the API here, in which case we should be able to write a more specific error message for the API, or we keep the API class pure and make no assumptions about children implementations.
not trying to nit-pick, you get my idea, just trying to tell the story from a reader of our code.
thanks man.
oh, you are right. I agree with you on the leakage problem. We don't want this info to leak from the children into the base abstract class. I'll remove the override check then.
These nit-pickings are actually important imo. Shoot more :)
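For illustration, here is one way the "keep the base class pure" resolution could look. The class and method names mirror the discussion (`estimate_on_single_episode` / `estimate_on_single_step_samples`, a `split_batch_by_episode` flag), but the bodies are hypothetical sketches, not the PR's actual code:

```python
# The base class makes no assumption about which path a child implements:
# both paths simply raise NotImplementedError by default, and estimate()
# dispatches on the flag without any is_overridden() check.
class OffPolicyEstimatorSketch:
    def estimate(self, batch, split_batch_by_episode=True):
        if split_batch_by_episode:
            return self.estimate_on_single_episode(batch)
        return self.estimate_on_single_step_samples(batch)

    def estimate_on_single_episode(self, episode):
        raise NotImplementedError

    def estimate_on_single_step_samples(self, batch):
        raise NotImplementedError

class BanditEstimator(OffPolicyEstimatorSketch):
    # A bandit estimator only needs the single-step path.
    def estimate_on_single_step_samples(self, batch):
        return {"v_behavior": sum(batch) / len(batch)}

est = BanditEstimator()
print(est.estimate([1.0, 2.0, 3.0], split_batch_by_episode=False))  # {'v_behavior': 2.0}
try:
    est.estimate([1.0], split_batch_by_episode=True)
except NotImplementedError:
    print("episode path not implemented")
```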
nice man. a few nits left, and also lint is failing.
```python
for class_module in ope_classes:
    for policy_tag in ["good", "bad"]:
        target_policy = self.policies[policy_tag]
        estimator_good = class_module(target_policy, gamma=0)
```
got confused for a second. seems this should just be estimator, since it can be good or bad depending on policy_tag.
oh good catch. misnomer :)
```python
        return all_episodes

    @OverrideToImplementCustomLogic
    def peak_on_single_episode(self, episode: SampleBatch) -> None:
```
typo, peek_on_single_episode?
fixed.
```python
        eps_id = episode[SampleBatch.EPS_ID][0]
        if eps_id not in self.p:
            raise ValueError(f"Episode {eps_id} not passed through the fit function")
```
update the error message? it's not fit anymore. maybe say:
"Can not find target weight for episode {eps_id}. Did it go through the peek_on_single_episode() function?"
done
```python
eps_id = episode[SampleBatch.EPS_ID][0]
if eps_id in self.p:
    raise ValueError(
        f"Episode {eps_id} already paseed through the fit function"
```
same here
fixed.
cool, cool! let's merge this after tests pass.
@gjoliver Let's merge?
done. like this change a lot.
* 1. Introduced new abstraction: OfflineEvaluator that is the parent of OPE and feature importance 2. introduced estimate_multi_step vs. estimate_single_step
* algorithm ope evaluation is now able to skip split_by_episode
* lint
* lint
* fixed some unittests
* wip
* wip
* fixed dm and dr variance issues
* lint
* cleaned up the inheritance
* lint
* lint
* fixed test
* nit
* fixed nits
* fixed the typos

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
The OPE abstraction is updated to separate multi-timestep OPE (RL) from single-timestep OPE (bandits), to make bandits faster.
Related issue number
Checks
- Signed off every commit (git commit -s) in this PR.
- Ran scripts/format.sh to lint the changes in this PR.