[RLlib] Add metrics to IMPALA/APPO/PPO (prototype) to measure off-policy'ness for performed updates. #29983
Conversation
This change pretty much makes sense to me.
I had some minor suggested changes that I think should improve the readability of the change.
rllib/utils/test_utils.py (Outdated)
# Roughly: Reaches up to 0.4 for 2 rollout workers and up to 0.2 for
# 1 rollout worker.
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise AssertionError()
Raise a ValueError here instead; it needs a message.
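For example, something along these lines (just a sketch of the suggestion; the exact message wording here is made up):

```python
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise ValueError(
        f"`off_policy_ness` ({off_policy_ness}) is outside the expected "
        f"range [{lower_limit}, {upper_limit}]!"
    )
```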
Yes, you are absolutely right. Forgot to add this. ...
done
@@ -100,6 +100,13 @@ def _all_tower_reduce(path, *tower_data):
        return None

    if isinstance(path[-1], str):
        # TODO(sven): We need to fix this terrible dependency on `str.starts_with`
fair enough
rllib/policy/sample_batch.py (Outdated)
@@ -103,6 +103,9 @@ def __init__(self, *args, **kwargs):
        self.zero_padded = kwargs.pop("_zero_padded", False)
        # Whether this batch is used for training (vs inference).
        self._is_training = kwargs.pop("_is_training", None)
        # Average number of grad updates that have been performed
Is it really the average number of updates? I think it's the total sum of gradient updates that have occurred over the life of the trained policy.
We need to update this comment, unless you intended to say "average", in which case it needs to be reflected in the code.
Good question. So it's actually the weighted average over the different samples (timesteps) in that batch.
Imagine you have two rollout workers: one of them has a policy that has undergone 2 updates thus far, the other uses a policy that has undergone 3 updates thus far. Both workers produce rollouts of length 50 (rollout_fragment_length==50). The train batch size is 100, so we concatenate these 2x len=50 batches into one that's 100 ts long. This new 100-ts batch will have its `num_grad_updates` property set to 2.5, as it's the weighted average (both original batches contribute 50%).
Let me try making this clearer in the comment here ..
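For illustration, a minimal sketch of that weighted average (the dict layout below is made up; the real SampleBatch concatenation logic does more than this):

```python
# Two rollout fragments of length 50, produced by policies that had undergone
# 2 and 3 grad updates, respectively (hypothetical, simplified representation).
fragments = [
    {"len": 50, "num_grad_updates": 2},
    {"len": 50, "num_grad_updates": 3},
]

# When concatenating into a single train batch, `num_grad_updates` becomes the
# length-weighted average over the contributing fragments.
total_len = sum(f["len"] for f in fragments)
num_grad_updates = sum(f["num_grad_updates"] * f["len"] for f in fragments) / total_len
print(num_grad_updates)  # -> 2.5
```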
Enhanced the comment to make this more clear.
oh I see. sg
rllib/policy/torch_policy.py (Outdated)
            }
        )
        self.num_grad_updates += 1
Move this to before line 75? Then compute `DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY` as `self.num_grad_updates - (postprocessed_batch.num_grad_updates or 0) - 1`.
It makes more sense from a readability perspective, because the assumption is that the most recent gradient computation has already happened.
Alternatively, make a new temporary var called `previous_num_grad_updates`, which is `self.num_grad_updates`, and use that for computing `DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY`.
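For illustration, a rough, self-contained sketch of the two equivalent alternatives (the `SimpleNamespace` stand-ins are made up; this is not the actual torch_policy.py code):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the policy (4 updates so far) and the train batch
# (sampled by a policy version that had undergone 3 updates).
policy = SimpleNamespace(num_grad_updates=4)
postprocessed_batch = SimpleNamespace(num_grad_updates=3)

# Alternative 1: bump the counter first, then correct the metric by -1.
policy.num_grad_updates += 1
diff_1 = policy.num_grad_updates - (postprocessed_batch.num_grad_updates or 0) - 1

# Alternative 2: remember the pre-update value and use it directly.
policy = SimpleNamespace(num_grad_updates=4)
previous_num_grad_updates = policy.num_grad_updates
policy.num_grad_updates += 1
diff_2 = previous_num_grad_updates - (postprocessed_batch.num_grad_updates or 0)

assert diff_1 == diff_2 == 1  # Both variants yield the same off-policy'ness value.
```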
You are right, we should increase the counter as soon as we do the actual update. Fixed.
Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates.
Adds a `num_grad_updates` counter to the Policy base class. This counter is incremented by 1 by the policy itself upon `learn_on_batch` or `learn_on_loaded_batch`. It can be overridden by a global_vars update being pushed down from the Algorithm -> Worker -> Policy.

Adds `num_grad_updates_per_policy` to the global vars of each RolloutWorker. The number of grad updates per policy is passed down to the sampler policies from the trained policies (after the training step), along with the weights. Note that only those workers that had samples, and only those policies that were actually updated, get this `num_grad_updates` broadcast.

Uses this new counter to track how much off-policy'ness we have in our individual updates: off-policy'ness is measured by taking the difference between the number of grad updates that the policy used for sampling has already undergone vs. the number of grad updates that the actually trained policy version has already gone through. So for a 100% on-policy algo (like PG), we would expect values of 0.0. For IMPALA and APPO, we should expect small'ish values (between 0.0 and 1.0), and for PPO (with its `num_sgd_iter` setting), we might even see values >> 1.0, depending on the config, of course.
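As a minimal, hypothetical sketch of this metric (plain integers standing in for the real Policy/RolloutWorker counters):

```python
# The trained ("learner") policy has undergone 10 grad updates so far.
learner_num_grad_updates = 10

# After a training step, this counter is broadcast down to the sampler
# policies together with the new weights; batches collected afterwards
# carry that value as their `num_grad_updates`.
sampler_num_grad_updates = learner_num_grad_updates

# By the time the next batch from that sampler is trained on, the learner
# may have performed additional updates (e.g. 2 more).
learner_num_grad_updates += 2

# Off-policy'ness of the update performed on that batch:
# 0 would mean fully on-policy (like PG); IMPALA/APPO typically stay small.
off_policyness = learner_num_grad_updates - sampler_num_grad_updates
print(off_policyness)  # -> 2
```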
Added some checks for this to the PPO/IMPALA/APPO unit tests.
Why are these changes needed?
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.