[RLlib] Add metrics to IMPALA/APPO/PPO (prototype) to measure off-policy'ness for performed updates. #29983
Conversation
This change pretty much makes sense to me.
I had some minor suggested changes that I think should improve the readability of the change.
rllib/utils/test_utils.py (Outdated)
# Roughly: Reaches up to 0.4 for 2 rollout workers and up to 0.2 for
# 1 rollout worker.
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise AssertionError()
Raise a ValueError here instead; it needs a message.
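For example, something along these lines (just a sketch of the suggestion; the exact message wording here is made up):

```python
if not (lower_limit <= off_policy_ness <= upper_limit):
    raise ValueError(
        f"`off_policy_ness` ({off_policy_ness}) is outside the expected "
        f"range [{lower_limit}, {upper_limit}]!"
    )
```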
Yes, you are absolutely right. Forgot to add this. ...
done
@@ -100,6 +100,13 @@ def _all_tower_reduce(path, *tower_data):
        return None

    if isinstance(path[-1], str):
        # TODO(sven): We need to fix this terrible dependency on `str.starts_with`
fair enough
rllib/policy/sample_batch.py (Outdated)
@@ -103,6 +103,9 @@ def __init__(self, *args, **kwargs):
        self.zero_padded = kwargs.pop("_zero_padded", False)
        # Whether this batch is used for training (vs inference).
        self._is_training = kwargs.pop("_is_training", None)
        # Average number of grad updates that have been performed
Is it really the average number of updates? I think it's the total sum of gradient updates that have occurred over the life of the trained policy.
We need to update this comment, unless you intended to say "average", in which case it needs to be reflected in the code.
Good question. So it's actually the weighted average over the different samples (timesteps) in that batch.
Imagine you have two rollout workers: one of them has a policy that has undergone 2 updates thus far, the other uses a policy that has undergone 3 updates thus far. Both workers produce rollouts of length 50 (rollout_fragment_length==50). The train batch size is 100, so we concatenate these 2x len=50 batches into one that's 100 ts long. This new 100-ts batch will have its `num_grad_updates` property set to 2.5, as it's the weighted average (both original batches contribute 50%).
Let me try making this clearer in the comment here ..
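For illustration, a minimal sketch of that weighted average (the dict layout below is made up; the real SampleBatch concatenation logic does more than this):

```python
# Two rollout fragments of length 50, produced by policies that had undergone
# 2 and 3 grad updates, respectively (hypothetical, simplified representation).
fragments = [
    {"len": 50, "num_grad_updates": 2},
    {"len": 50, "num_grad_updates": 3},
]

# When concatenating into a single train batch, `num_grad_updates` becomes the
# length-weighted average over the contributing fragments.
total_len = sum(f["len"] for f in fragments)
num_grad_updates = sum(f["num_grad_updates"] * f["len"] for f in fragments) / total_len
print(num_grad_updates)  # -> 2.5
```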
Enhanced the comment to make this more clear.
oh I see. sg
rllib/policy/torch_policy.py (Outdated)
            }
        )
        self.num_grad_updates += 1
Move this to before line 75? Then compute `DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY` as `self.num_grad_updates - (postprocessed_batch.num_grad_updates or 0) - 1`.
It makes more sense from a readability perspective, because the assumption is that the most recent gradient computation has already happened.
Alternatively, make a new temporary var called `previous_num_grad_updates`, which is `self.num_grad_updates`, and use that for computing `DIFF_NUM_GRAD_UPDATES_VS_SAMPLER_POLICY`.
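For illustration, a rough, self-contained sketch of the two equivalent alternatives (the `SimpleNamespace` stand-ins are made up; this is not the actual torch_policy.py code):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the policy (4 updates so far) and the train batch
# (sampled by a policy version that had undergone 3 updates).
policy = SimpleNamespace(num_grad_updates=4)
postprocessed_batch = SimpleNamespace(num_grad_updates=3)

# Alternative 1: bump the counter first, then correct the metric by -1.
policy.num_grad_updates += 1
diff_1 = policy.num_grad_updates - (postprocessed_batch.num_grad_updates or 0) - 1

# Alternative 2: remember the pre-update value and use it directly.
policy = SimpleNamespace(num_grad_updates=4)
previous_num_grad_updates = policy.num_grad_updates
policy.num_grad_updates += 1
diff_2 = previous_num_grad_updates - (postprocessed_batch.num_grad_updates or 0)

assert diff_1 == diff_2 == 1  # Both variants yield the same off-policy'ness value.
```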
You are right, we should increase the counter as soon as we do the actual update. Fixed.
Add metrics to IMPALA/APPO (prototype) to measure off-policy'ness for performed updates.
Adds a `num_grad_updates` counter to the Policy base class. This counter is incremented by 1 by the policy itself upon `learn_on_batch` or `learn_on_loaded_batch`. It can be overridden by a global_vars update being pushed down from the Algorithm -> Worker -> Policy.

Adds `num_grad_updates_per_policy` to the global vars of each RolloutWorker. The number of grad updates per policy is passed down to the sampler policies from the trained policies (after the training step), along with the weights. Note that only those workers that had samples, and only those policies that were actually updated, get this `num_grad_updates` broadcast.

Uses this new counter to track how much off-policy'ness we have in our individual updates: off-policy'ness is measured by taking the difference between the number of grad updates that the policy used for sampling has already undergone vs. the number of grad updates that the actually trained policy version has already gone through. So for a 100% on-policy algo (like PG), we would expect values of 0.0. For IMPALA and APPO, we should expect small'ish values (between 0.0 and 1.0), and for PPO (with its `num_sgd_iter` setting), we might even see values >> 1.0, depending on the config, of course.
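As a minimal, hypothetical sketch of this metric (plain integers standing in for the real Policy/RolloutWorker counters):

```python
# The trained ("learner") policy has undergone 10 grad updates so far.
learner_num_grad_updates = 10

# After a training step, this counter is broadcast down to the sampler
# policies together with the new weights; batches collected afterwards
# carry that value as their `num_grad_updates`.
sampler_num_grad_updates = learner_num_grad_updates

# By the time the next batch from that sampler is trained on, the learner
# may have performed additional updates (e.g. 2 more).
learner_num_grad_updates += 2

# Off-policy'ness of the update performed on that batch:
# 0 would mean fully on-policy (like PG); IMPALA/APPO typically stay small.
off_policyness = learner_num_grad_updates - sampler_num_grad_updates
print(off_policyness)  # -> 2
```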
Added some checks for this to the PPO/IMPALA/APPO unit tests.
Why are these changes needed?
Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.