Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

Merged
merged 3 commits into from
Dec 9, 2021

Conversation

kk-55
Copy link
Contributor

@kk-55 kk-55 commented Dec 2, 2021

I've made a proposal for an adapted, time-dependent computation of advantages (GAE, PPO algo).
I've packed my proposal into an example where the advantages are computed in the callback function on_postprocess_trajectory, but could also be included directly in a postprocess_fn.

Why are these changes useful and interesting?

See this short document, a review of my proposal is much appreciated (potentially there are shortcomings/issues I've not seen).

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…t) GAE used by the PPO algorithm (via callback on_postprocess_trajectory)
@simonsays1980
Copy link
Collaborator

simonsays1980 commented Dec 3, 2021

@kk-55 Great work! Thank you for contributing!

I see in the checks that there are some trailing whitespaces. If you run scripts/format.sh in your ray repo the linter should take care of these. Be sure you installed flake8 before (can be done by installing the ray/python/requirements_linters.txt).

I guess than all CI tests should pass

Copy link
Collaborator

@simonsays1980 simonsays1980 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kk-55 I guess the problem in the CI checks are trailing whitespace in your comments.

rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:2:75: W291 trailing whitespace
--
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:3:77: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:5:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:8:78: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:9:69: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:91:68: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:98:69: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:103:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:108:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:109:79: W291 trailing whitespace
  | 🚨 Error: The command exited with status 1

My suggestion is to go explicitly to the lines and erase the trailing whitespace manually. It comes usually from writing a space before the next word is typed in and breaking line before the word is written.

Keep up the good work!

@sven1977
Copy link
Contributor

sven1977 commented Dec 7, 2021

Hey @kk-55 , thanks for this PR! Looks great.

Some questions:

  • Using the new callback (e.g. in a PPO setup) would result in advantages calculations being done twice, correct? Once in the e.g. PPOTFPolicy, the other time in your new callback. I'm guessing this is ok for now as the callback postprocessing happens last: 1) policy.exploration.on_postprocess_trajectory, 2) policy.postprocess_trajectory, 3) custom on_postprocess_trajectory.

  • We should add an actual example script/test that uses the new callback.

  • We could maybe even move the new callback class into examples/callbacks/..., then add a new example script (e.g. gae_with_non_equidistant_timesteps.py) that uses the callback and add this new example script to our BUILD (CI-tests) as well.

  • We should mention this somewhere in our docs (as we are currently overhauling them, not sure where exactly, though) :)

@kk-55
Copy link
Contributor Author

kk-55 commented Dec 7, 2021

Hey @sven1977, thanks for your response!

Some questions:

* Using the new callback (e.g. in a PPO setup) would result in advantages calculations being done twice, correct? Once in the e.g. PPOTFPolicy, the other time in your new callback. I'm guessing this is ok for now as the callback postprocessing happens last: 1) `policy.exploration.on_postprocess_trajectory`, 2) `policy.postprocess_trajectory`, 3) custom `on_postprocess_trajectory`.

Yes, that's absolutely correct! Calculation of advantages is being done twice, first time in compute_advantages and then in my custom callback.
I had started with an implementation as an additional option in compute_advantages (if use_adapted_gae: ...), but RLlib makes this dummy initialization during start and that would cause a special handling when accessing values of SampleBatch.INFOS (once it's an array, but in general it's a dict).
I guess one could also update the policy with a custom postprocessing function implementing this time-dependent GAE in it.

* We should add an actual example script/test that uses the new callback.

I've actually started to use this time-dependent GAE, but I guess my example/env is too complex and much too large. Perhaps, one could construct a fictive example from the CartPole env, but I'm not sure if this makes sense.

* We could maybe even move the new callback class into `examples/callbacks/...`, then add a new example script (e.g. `gae_with_non_equidistant_timesteps.py`) that uses the callback and add this new example script to our BUILD (CI-tests) as well.

Good idea! I agree if you would like to do it this way.

* We should mention this somewhere in our docs (as we are currently overhauling them, not sure where exactly, though) :)

Also good idea! ;-) I guess you could mention this somewhere in the callbacks section as an example or directly in the examples chapter, but it could also find its place somewhere in the algorithms/PPO section.

For me, I see this time-dependent GAE as an opportunity to adapt PPO to problems that are modeled as semi-MDPs rather than MDPs. It's just my experimental work and not verified by any theory or proofs, but John Schulman meant that it looks right to him ;-)

@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Dec 9, 2021
@sven1977
Copy link
Contributor

sven1977 commented Dec 9, 2021

Hey @kk-55 , this is great and thanks for your detailed answers!
Let's merge this for now, there is no harm in adding the standalone example code.
Please consider providing w follow-up PR with the suggested enhancements (example script that actually uses the new GAE method, adding this script to our CI tests, some documentation on this, etc..).

@sven1977 sven1977 merged commit 9acf2f9 into ray-project:master Dec 9, 2021
@kk-55 kk-55 deleted the patch_02/Dec/2021 branch December 9, 2021 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants