
[RLlib] Added benchmark experiment for SAC with MuJoCo, PPO with MuJoCo and DQN with Atari. #44262

Merged

Conversation


@simonsays1980 simonsays1980 commented Mar 25, 2024

Why are these changes needed?

This PR adds some benchmark runs for the following algorithms:

  1. DQN: Atari & Atari with PB2
  2. SAC: MuJoCo & MuJoCo with PB2 (HalfCheetah-v4, Hopper-v4, Humanoid-v4, Ant-v4, Walker2d-v4)
  3. PPO: MuJoCo & MuJoCo with PB2

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

… for all 5 environments and uses the best configuration to run a final trial.

Signed-off-by: Simon Zehnder <[email protected]>
… for all 7 environments and uses the best configuration to run a final trial. In addition added a benchmark run using parameters from the paper.

Signed-off-by: Simon Zehnder <[email protected]>
…noisy encoders. Furthermore, finished benchmark script for Atari with DQN Rainbow.

Signed-off-by: Simon Zehnder <[email protected]>
@simonsays1980 simonsays1980 self-assigned this Mar 27, 2024
@sven1977 sven1977 changed the title Added benchmark experiment for SAC with MuJoCo. [RLlib] Added benchmark experiment for SAC with MuJoCo. Apr 1, 2024
@sven1977 sven1977 marked this pull request as ready for review April 1, 2024 19:43
# CleanRL: https://wandb.ai/cleanrl/cleanrl.benchmark/reports/Mujoco--VmlldzoxODE0NjE
# AgileRL: https://github.com/AgileRL/AgileRL?tab=readme-ov-file#benchmarks

benchmark_envs = {
Contributor:

Awesome PR @simonsays1980! Did we actually run these on RLlib and get these results? Or how did you arrive at these episode-return numbers?

Collaborator Author:

Thanks ;) No, we did not - I collected the results from the papers. Take a look at my Slack comment about it, too.

# Stop training if the mean reward is reached.
if (
result["sampler_results/episode_reward_mean"]
>= self.benchmark_envs[result["env"]]["sampler_results/episode_reward_mean"]
Contributor:

Had no idea "env" is always part of the results dict. Is it?

Collaborator Author:

I think it is only there when the environment is chosen via `tune.grid_search`.
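
For illustration, here is a minimal, hypothetical sketch (algorithm, environments, and the final print are placeholders, not the PR's actual benchmark setup) of how the environment becomes a grid-searched value that is resolved per trial and therefore ends up in each trial's reported config:

from ray import tune
from ray.rllib.algorithms.sac import SACConfig

# Grid-search over environments: Tune expands this into one trial per env,
# so the resolved env name is part of each trial's config/results.
config = SACConfig().environment(
    env=tune.grid_search(["HalfCheetah-v4", "Hopper-v4"])
)

tuner = tune.Tuner(
    "SAC",
    param_space=config,
    tune_config=tune.TuneConfig(num_samples=1),
)
results = tuner.fit()

# Each trial's config carries the resolved environment name.
for result in results:
    print(result.config["env"])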


# See the following links for benchmark results of other libraries:
# Original paper: https://arxiv.org/abs/1812.05905
# CleanRL: https://wandb.ai/cleanrl/cleanrl.benchmark/reports/Mujoco--VmlldzoxODE0NjE
Contributor:

These are MuJoCo benchmarks. Are there respective results reported for CleanRL for Atari?

Collaborator Author:

Yes there are - but not for all of the Atari games I think.

@@ -0,0 +1,366 @@
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing
Contributor:

Could we do a quick comparison of this wrapper (including the settings we use below, dim=84, noop=0, grayscale=True) vs our own Atari wrapper func in rllib.env.wrappers.atari_wrappers::wrap_atari_for_new_api_stack()?

RLlib nowadays (new API stack) uses a DreamerV3 style Atari setup:
dim=64
grayscale=True
noop=30
action_repeat_prob=0.0 (<- deterministic!)
small action space=True

I'd be totally happy with replacing our function with this gymnasium wrapper here, but I want to make sure we don't mess up our existing benchmarks.
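
For reference, here is a minimal sketch of the gymnasium-based preprocessing being discussed (the env name and exact settings are illustrative; the RLlib call at the bottom is only indicated, since its exact keyword arguments may differ between Ray versions):

import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing

# Gymnasium-based preprocessing roughly as referenced above: 84x84 frames,
# grayscale, no no-op resets. The base env is created with frameskip=1 so
# that the wrapper's frame_skip=4 handles the action repeat.
def make_gymnasium_preprocessed_env(name: str = "ALE/Pong-v5") -> gym.Env:
    env = gym.make(name, frameskip=1)
    return AtariPreprocessing(
        env,
        noop_max=0,
        frame_skip=4,
        screen_size=84,
        grayscale_obs=True,
    )

# RLlib's own wrapper mentioned above (DreamerV3-style, dim=64); treat the
# keyword arguments as placeholders:
# from ray.rllib.env.wrappers.atari_wrappers import wrap_atari_for_new_api_stack
# env = wrap_atari_for_new_api_stack(gym.make("ALE/Pong-v5"), dim=64)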

Collaborator Author:

We have to keep in mind that I took the benchmarks from the DQN Rainbow paper, which is already some years old - those are the settings used there. DreamerV3 is newer and has therefore - maybe - taken advantage of some developments since then.

The ultimate question is: to what do we want to compare our algorithms?

Contributor:

I think we should just be bullish and:
a) Use our well-proven hyperparams and network architectures (e.g. the DreamerV3 CNN stack mentioned above, which has proven to be strong on Atari tasks with e.g. PPO).
b) The reason users choose RLlib is not because it runs 20% faster or 2% slower than SB3 or another library; it's because it's scalable, multi-agent capable, multi-GPU capable, etc. So as long as we show proper benchmark results (with one set of hyperparams or another) and maybe how scaling (adding more workers) affects these results, we should be solid.

Contributor:

c) If we do this, then we won't have the problem of always having to "look back" and backward-support these quite old settings (DQN paper 2015 -> almost 10 years old). We can freely find new and better parameters and scaling settings as long as they improve the performance.

Collaborator Author:

I agree with your arguments. I added the PB2 run files to find such parameters and perform as well as possible with our setup, and - as discussed - we should run it with 1, 2, 4, x GPUs to show scalability (a rough sketch of such a PB2 setup is shown below).
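
A rough sketch of such a PB2 setup (metric name, bounds, and intervals are illustrative placeholders, not the values from the PR's run files):

from ray import tune
from ray.tune.schedulers.pb2 import PB2

# Illustrative PB2 scheduler: perturbs lr and gamma within the given bounds
# every 50 training iterations, maximizing the mean episode return.
pb2 = PB2(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=50,
    hyperparam_bounds={
        "lr": [1e-5, 1e-3],
        "gamma": [0.95, 0.999],
    },
)

tuner = tune.Tuner(
    "PPO",
    param_space={"env": "HalfCheetah-v4", "lr": 3e-4, "gamma": 0.99},
    tune_config=tune.TuneConfig(scheduler=pb2, num_samples=4),
)
tuner.fit()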

Collaborator Author:

Regarding the Atari preprocessing: let's make two single runs - one with our preprocessing and another with the gymnasium one - and see which performs better. Running for maybe 100-200 iterations should suffice, imo.

Collaborator Author:

@sven1977 Tbh, I cannot find the DreamerV3 settings in wrap_atari_for_new_api_stack. The function also has no grayscale option to set. It uses a specific env_config:

# [2]: "We follow the evaluation protocol of Machado et al. (2018) with 200M
# environment steps, action repeat of 4, a time limit of 108,000 steps per
# episode that correspond to 30 minutes of game play, no access to life
# information, full action space, and sticky actions. Because the world model
# integrates information over time, DreamerV2 does not use frame stacking.
# The experiments use a single-task setup where a separate agent is trained
# for each game. Moreover, each agent uses only a single environment instance.
env_config={
    # "sticky actions" but not according to Danijar's 100k configs.
    "repeat_action_probability": 0.0,
    # "full action space" but not according to Danijar's 100k configs.
    "full_action_space": False,
    # Already done by MaxAndSkip wrapper: "action repeat" == 4.
    "frameskip": 1,
}

@simonsays1980 simonsays1980 changed the title [RLlib] Added benchmark experiment for SAC with MuJoCo. [RLlib] Added benchmark experiment for SAC with MuJoCo, PPO with MuJoCo and DQN with Atari. Apr 2, 2024
evaluation_num_workers=1,
evaluation_parallel_to_training=True,
evaluation_config={
    "explore": True,
Contributor:

Should this be False?

Collaborator Author:

Shouldn't PPO - having a stochastic policy - explore in evaluation to maintain performance? I remember discussions on the forum about the old-stack PPO where users had this set to False and encountered a drop in performance.
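
For context, a minimal sketch of the evaluation setup under discussion (env name and evaluation interval are placeholders); flipping "explore" to False would evaluate the greedy/deterministic policy instead:

from ray.rllib.algorithms.ppo import PPOConfig

# PPO keeps sampling from its stochastic policy during evaluation as long as
# "explore" stays True in the evaluation config override.
config = (
    PPOConfig()
    .environment("HalfCheetah-v4")
    .evaluation(
        evaluation_interval=1,
        evaluation_num_workers=1,
        evaluation_parallel_to_training=True,
        evaluation_config={"explore": True},
    )
)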


sven1977 commented Apr 5, 2024

Just a few nits before we can merge:

  • question on explore=True/False in eval config
  • use our own Atari wrapper in the Atari benchmarks

@sven1977 sven1977 left a comment

Approved! Thanks @simonsays1980. Can be merged pending comments/nits.

… added resources to all benchmark scripts. Fixed minor bug with Tune stopper.

Signed-off-by: Simon Zehnder <[email protected]>
@sven1977 sven1977 merged commit 61ef56f into ray-project:master Apr 11, 2024
5 checks passed