Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

kk-55 · 2021-12-02T16:59:23Z

I've made a proposal for an adapted, time-dependent computation of advantages (GAE, PPO algo).
I've packed my proposal into an example where the advantages are computed in the callback function on_postprocess_trajectory, but could also be included directly in a postprocess_fn.

Why are these changes useful and interesting?

See this short document, a review of my proposal is much appreciated (potentially there are shortcomings/issues I've not seen).

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…t) GAE used by the PPO algorithm (via callback on_postprocess_trajectory)

simonsays1980 · 2021-12-03T15:51:04Z

@kk-55 Great work! Thank you for contributing!

I see in the checks that there are some trailing whitespaces. If you run scripts/format.sh in your ray repo the linter should take care of these. Be sure you installed flake8 before (can be done by installing the ray/python/requirements_linters.txt).

I guess than all CI tests should pass

simonsays1980

@kk-55 I guess the problem in the CI checks are trailing whitespace in your comments.

rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:2:75: W291 trailing whitespace
--
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:3:77: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:5:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:8:78: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:9:69: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:91:68: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:98:69: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:103:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:108:76: W291 trailing whitespace
  | rllib/examples/compute_adapted_gae_on_postprocess_trajectory.py:109:79: W291 trailing whitespace
  | 🚨 Error: The command exited with status 1

My suggestion is to go explicitly to the lines and erase the trailing whitespace manually. It comes usually from writing a space before the next word is typed in and breaking line before the word is written.

Keep up the good work!

sven1977 · 2021-12-07T12:46:57Z

Hey @kk-55 , thanks for this PR! Looks great.

Some questions:

Using the new callback (e.g. in a PPO setup) would result in advantages calculations being done twice, correct? Once in the e.g. PPOTFPolicy, the other time in your new callback. I'm guessing this is ok for now as the callback postprocessing happens last: 1) policy.exploration.on_postprocess_trajectory, 2) policy.postprocess_trajectory, 3) custom on_postprocess_trajectory.
We should add an actual example script/test that uses the new callback.
We could maybe even move the new callback class into examples/callbacks/..., then add a new example script (e.g. gae_with_non_equidistant_timesteps.py) that uses the callback and add this new example script to our BUILD (CI-tests) as well.
We should mention this somewhere in our docs (as we are currently overhauling them, not sure where exactly, though) :)

kk-55 · 2021-12-07T14:30:56Z

Hey @sven1977, thanks for your response!

Some questions:

* Using the new callback (e.g. in a PPO setup) would result in advantages calculations being done twice, correct? Once in the e.g. PPOTFPolicy, the other time in your new callback. I'm guessing this is ok for now as the callback postprocessing happens last: 1) `policy.exploration.on_postprocess_trajectory`, 2) `policy.postprocess_trajectory`, 3) custom `on_postprocess_trajectory`.

Yes, that's absolutely correct! Calculation of advantages is being done twice, first time in compute_advantages and then in my custom callback.
I had started with an implementation as an additional option in compute_advantages (if use_adapted_gae: ...), but RLlib makes this dummy initialization during start and that would cause a special handling when accessing values of SampleBatch.INFOS (once it's an array, but in general it's a dict).
I guess one could also update the policy with a custom postprocessing function implementing this time-dependent GAE in it.

* We should add an actual example script/test that uses the new callback.

I've actually started to use this time-dependent GAE, but I guess my example/env is too complex and much too large. Perhaps, one could construct a fictive example from the CartPole env, but I'm not sure if this makes sense.

* We could maybe even move the new callback class into `examples/callbacks/...`, then add a new example script (e.g. `gae_with_non_equidistant_timesteps.py`) that uses the callback and add this new example script to our BUILD (CI-tests) as well.

Good idea! I agree if you would like to do it this way.

* We should mention this somewhere in our docs (as we are currently overhauling them, not sure where exactly, though) :)

Also good idea! ;-) I guess you could mention this somewhere in the callbacks section as an example or directly in the examples chapter, but it could also find its place somewhere in the algorithms/PPO section.

For me, I see this time-dependent GAE as an opportunity to adapt PPO to problems that are modeled as semi-MDPs rather than MDPs. It's just my experimental work and not verified by any theory or proofs, but John Schulman meant that it looks right to him ;-)

sven1977 · 2021-12-09T13:48:11Z

Hey @kk-55 , this is great and thanks for your detailed answers!
Let's merge this for now, there is no harm in adding the standalone example code.
Please consider providing w follow-up PR with the suggested enhancements (example script that actually uses the new GAE method, adding this script to our CI tests, some documentation on this, etc..).

Example containing a proposal for computing an adapted (time-dependen…

129fd96

…t) GAE used by the PPO algorithm (via callback on_postprocess_trajectory)

Ran script format.sh

912db93

simonsays1980 reviewed Dec 4, 2021

View reviewed changes

Manually removed trailing whitespaces

99e3192

sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Dec 9, 2021

sven1977 approved these changes Dec 9, 2021

View reviewed changes

sven1977 merged commit 9acf2f9 into ray-project:master Dec 9, 2021

kk-55 deleted the patch_02/Dec/2021 branch December 9, 2021 14:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

kk-55 commented Dec 2, 2021 •

edited by sven1977

Loading

simonsays1980 commented Dec 3, 2021 •

edited

Loading

simonsays1980 left a comment

sven1977 commented Dec 7, 2021

kk-55 commented Dec 7, 2021 •

edited

Loading

sven1977 commented Dec 9, 2021

Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

Example containing a proposal for computing an adapted (time-dependent) GAE used by the PPO algorithm (via callback on_postprocess_trajectory) #20850

Conversation

kk-55 commented Dec 2, 2021 • edited by sven1977 Loading

Why are these changes useful and interesting?

Checks

simonsays1980 commented Dec 3, 2021 • edited Loading

simonsays1980 left a comment

Choose a reason for hiding this comment

sven1977 commented Dec 7, 2021

kk-55 commented Dec 7, 2021 • edited Loading

sven1977 commented Dec 9, 2021

kk-55 commented Dec 2, 2021 •

edited by sven1977

Loading

simonsays1980 commented Dec 3, 2021 •

edited

Loading

kk-55 commented Dec 7, 2021 •

edited

Loading