
[RLlib] Add metrics to IMPALA/APPO/PPO (prototype) to measure off-policy'ness for performed updates. #29983

Merged

Conversation

@sven1977 (Contributor) commented Nov 3, 2022


Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates.

  • Adds a num_grad_updates counter to the Policy base class. This counter is incremented by 1 by the policy itself upon each learn_on_batch or learn_on_loaded_batch call.

  • It can be overridden by a global_vars update pushed down from the Algorithm -> Worker -> Policy.

  • Adds num_grad_updates_per_policy to the global vars of each RolloutWorker.

  • The number of grad updates per policy is passed down to the sampler policies from the trained policies (after the training step). This happens along with the weights. Note that only those workers that provided samples, and only those policies that were actually updated, receive this num_grad_updates broadcast.

  • Uses this new counter to track how much off-policy'ness we have in our individual updates:

  • Off-policy'ness is measured as the difference between the number of grad updates that the policy used for sampling has already undergone and the number of grad updates that the actually trained policy version has already gone through. So for a 100% on-policy algo (like PG), we would expect values of 0.0. For IMPALA and APPO, we should expect small-ish values (between 0.0 and 1.0), and for PPO (with its num_sgd_iter setting), we might even see values >> 1.0, depending, of course, on the config (see the sketch after this list).

  • Added some unit checks to the PPO/IMPALA/APPO unit tests.
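
To illustrate the metric described in the list above, here is a minimal, self-contained sketch of the computation (this is not RLlib's actual implementation; the function name and the example numbers are assumptions made purely for illustration):

```python
# Hypothetical sketch of the off-policy'ness metric described above.
def off_policyness(num_grad_updates_trained, num_grad_updates_sampler):
    """Difference between the grad-update count of the policy being trained
    and the (possibly weighted-average) grad-update count of the policy
    version that collected the batch."""
    return num_grad_updates_trained - num_grad_updates_sampler

# 100% on-policy algo (e.g. PG): sampler policy == trained policy -> 0.0.
print(off_policyness(10, 10))    # 0.0
# IMPALA/APPO: sampler lags slightly behind the learner -> small-ish values.
print(off_policyness(10, 9.6))   # ~0.4
# PPO with num_sgd_iter > 1: many updates between weight syncs -> values >> 1.0.
print(off_policyness(40, 10))    # 30
```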

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 changed the title [RLlib] Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates. → [WIP; RLlib] Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates. (Nov 3, 2022)
@sven1977 changed the title [WIP; RLlib] Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates. → [RLlib] Add metrics to IMPALA/APPO/PPO (prototype) to measure off-policy'ness for performed updates. (Nov 8, 2022)
@sven1977 assigned kouroshHakha and avnishn and unassigned kouroshHakha on Nov 8, 2022
@avnishn (Member) left a comment:

This change pretty much makes sense to me.

I had some minor suggested changes that I think will improve the readability of the change.

# Roughly: Reaches up to 0.4 for 2 rollout workers and up to 0.2 for
# 1 rollout worker.
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise AssertionError()
Member:

Raise a ValueError here instead; it needs a message.

Contributor Author (sven1977):

Yes, you are absolutely right. Forgot to add this. ...

Contributor Author (sven1977):

done
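
For context, a minimal sketch of what the resolved check could look like (the variable names are taken from the snippet above; the exact error message is an assumption):

```python
# Roughly: Reaches up to 0.4 for 2 rollout workers and up to 0.2 for
# 1 rollout worker.
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise ValueError(
        f"Expected off-policy'ness to be in [{lower_limit}, {upper_limit}], "
        f"but got {off_policy_ness}!"
    )
```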

@@ -100,6 +100,13 @@ def _all_tower_reduce(path, *tower_data):
        return None

    if isinstance(path[-1], str):
        # TODO(sven): We need to fix this terrible dependency on `str.starts_with`
Member:

fair enough

@@ -103,6 +103,9 @@ def __init__(self, *args, **kwargs):
        self.zero_padded = kwargs.pop("_zero_padded", False)
        # Whether this batch is used for training (vs inference).
        self._is_training = kwargs.pop("_is_training", None)
        # Average number of grad updates that have been performed
Member:

Is it really the average number of updates? I think it's the total sum of gradient updates that have occurred over the life of the trained policy.

We need to update this comment, unless you intended to say "average", in which case it needs to be reflected in the code.

Contributor Author (sven1977):

Good question. It's actually the weighted average over the different samples (timesteps) in that batch.

Imagine you have two rollout workers: one of them has a policy that has undergone 2 updates so far, and the other uses a policy that has undergone 3 updates so far. Both workers produce rollouts of length 50 (rollout_fragment_length=50). The train batch size is 100, so we concatenate these two len=50 batches into one that is 100 timesteps long. This new 100-timestep batch will have its num_grad_updates property set to 2.5, as that is the weighted average (both original batches contribute 50%).

Let me try to make this clearer in the comment here ...
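
For illustration, a standalone sketch of the weighted-average computation described above (plain Python, not RLlib's actual batch-concatenation code; the dict layout is an assumption):

```python
# Two rollouts of 50 timesteps each, produced by policy versions that have
# undergone 2 and 3 grad updates, respectively.
rollouts = [
    {"count": 50, "num_grad_updates": 2},  # worker 1
    {"count": 50, "num_grad_updates": 3},  # worker 2
]
total_timesteps = sum(r["count"] for r in rollouts)
# Each rollout contributes proportionally to its number of timesteps.
weighted_avg = (
    sum(r["num_grad_updates"] * r["count"] for r in rollouts) / total_timesteps
)
print(weighted_avg)  # -> 2.5 for the concatenated 100-timestep train batch
```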

Contributor Author (sven1977):

Enhanced the comment to make this clearer.

Member:

Oh I see. Sounds good.

            }
        )
        self.num_grad_updates += 1
Member:

Move this to before line 75? Then compute DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY as self.num_grad_updates - (postprocessed_batch.num_grad_updates or 0) - 1.

It makes more sense from a readability perspective, because the assumption is that the most recent gradient computation has already happened.

Alternatively, make a new temporary variable called previous_num_grad_updates, set it to self.num_grad_updates, and use that for computing DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY.

Contributor Author (sven1977):

You are right, we should increment the counter as soon as we do the actual update. Fixed.
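
As a rough illustration of the resolved ordering (a sketch only, not RLlib's code: PolicySketch, the dict-based batch, and the lower-case stats key stand in for the real Policy class, SampleBatch, and the DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY key discussed above):

```python
class PolicySketch:
    """Illustrative stand-in for a Policy; not RLlib's actual class."""

    def __init__(self):
        self.num_grad_updates = 0

    def learn_on_batch(self, postprocessed_batch):
        stats = {}
        # ... gradient computation and application would happen here ...
        # Increment right after the actual update has been applied, so the
        # metric below accounts for the update just performed.
        self.num_grad_updates += 1
        # Trained-policy update count vs. the update count of the policy
        # version that collected the batch (the "sampler policy").
        stats["diff_num_grad_updates_vs_sampler_policy"] = (
            self.num_grad_updates
            - 1
            - (postprocessed_batch.get("num_grad_updates") or 0)
        )
        return stats


# Usage: a batch sampled by a policy version that had undergone 2 updates,
# trained by a policy that has already undergone 5.
policy = PolicySketch()
policy.num_grad_updates = 5
print(policy.learn_on_batch({"num_grad_updates": 2}))
# -> {'diff_num_grad_updates_vs_sampler_policy': 3}
```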

@sven1977 merged commit 7563211 into ray-project:master on Nov 9, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
@sven1977 deleted the add_off_policyness_to_metrics_for_appo branch on June 2, 2023