[RLlib] Added benchmark experiment for SAC with MuJoCo, PPO with MuJoCo and DQN with Atari. #44262
Conversation
… for all 5 environments and uses the best configuration to run a final trial. Signed-off-by: Simon Zehnder <[email protected]>
… for all 7 environments and uses the best configuration to run a final trial. In addition added a benchmark run using parameters from the paper. Signed-off-by: Simon Zehnder <[email protected]>
…noisy encoders. Furthermore, finished benchmark script for Atari with DQN Rainbow. Signed-off-by: Simon Zehnder <[email protected]>
# CleanRL: https://wandb.ai/cleanrl/cleanrl.benchmark/reports/Mujoco--VmlldzoxODE0NjE
# AgileRL: https://github.com/AgileRL/AgileRL?tab=readme-ov-file#benchmarks

benchmark_envs = {
Awesome PR @simonsays1980! Did we actually run these on RLlib and get these results? Or how did you get to these episode-return numbers?
Thanks ;) No, we did not - I collected the results from the papers. Take a look at my Slack comment about it, too.
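For context, a minimal hedged sketch of what such a mapping from env name to target return could look like (the env names and numbers below are purely illustrative, not the values collected from the papers):

benchmark_envs = {
    # Illustrative targets only; the PR takes the actual values from the papers.
    "HalfCheetah-v4": {"sampler_results/episode_reward_mean": 12000.0},
    "Hopper-v4": {"sampler_results/episode_reward_mean": 3500.0},
    "Walker2d-v4": {"sampler_results/episode_reward_mean": 5000.0},
}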
# Stop training if the mean reward is reached.
if (
    result["sampler_results/episode_reward_mean"]
    >= self.benchmark_envs[result["env"]]["sampler_results/episode_reward_mean"]
Had no idea "env" is always part of the results dict?
I think it is only there when it is chosen by Tune's grid_search.
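A hedged sketch of how the environment can be made part of the Tune search space via grid_search (the algorithm, env names, and metric below are illustrative):

from ray import tune
from ray.rllib.algorithms.sac import SACConfig

# Hedged sketch: selecting the env via tune.grid_search makes it part of each
# trial's config, which is what allows a per-env stopping threshold (as in the
# snippet above) to be looked up per trial.
config = SACConfig().environment(env=tune.grid_search(["Hopper-v4", "Walker2d-v4"]))
tuner = tune.Tuner(
    "SAC",
    param_space=config,
    tune_config=tune.TuneConfig(
        metric="sampler_results/episode_reward_mean", mode="max"
    ),
)
results = tuner.fit()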
# See the following links for benchmark results of other libraries:
# Original paper: https://arxiv.org/abs/1812.05905
# CleanRL: https://wandb.ai/cleanrl/cleanrl.benchmark/reports/Mujoco--VmlldzoxODE0NjE
These are MuJoCo benchmarks. Are there respective results reported for CleanRL for Atari?
Yes there are - but not for all of the Atari games I think.
@@ -0,0 +1,366 @@
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing
Could we do a quick comparison of this wrapper (including the settings we use below, dim=84, noop=0, grayscale=True) vs our own Atari wrapper func in rllib.env.wrappers.atari_wrappers::wrap_atari_for_new_api_stack()?
RLlib nowadays (new API stack) uses a DreamerV3 style Atari setup:
dim=64
grayscale=True
noop=30
action_repeat_prob=0.0 (<- deterministic!)
small action space=True
I'd be totally happy with replacing our function with this gymnasium wrapper here, but I want to make sure we don't mess up our existing benchmarks.
We have to keep in mind that I took the benchmarks from the DQN Rainbow paper, which is already some years old - those are the settings used there. DreamerV3 is newer and has therefore - maybe - taken advantage of some developments since then.
The ultimate question is: what do we want to compare our algorithms against?
I think we should just be bullish and:
a) Use our well-proven hyperparams and network architectures (e.g. the DreamerV3 CNN stack mentioned above, which has proven to be strong on Atari tasks with e.g. PPO).
b) The reason users use RLlib is not b/c it runs 20% faster or 2% slower than SB3 or another library. It's b/c it's scalable, multi-agent capable, multi-GPU capable, etc. So as long as we show proper benchmark results (with one set of hyperparams or another) and maybe how scaling (adding more workers) affects these results, we should be solid.
c) If we do this, then we won't have the problem of always having to "look back" and backward-support these quite old settings (DQN paper 2015 -> almost 10 years old). We can freely find new and better parameters and scaling settings as long as they improve the performance.
I agree with your arguments. I added the PB2 run files to find such parameters and perform as well as possible with our setup, and - as discussed - we should run it with 1, 2, 4, x GPUs to show scalability.
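A hedged sketch of how such a PB2 search could be wired up with Tune (the bounds, metric, hyperparameters, and env below are illustrative, not the PR's actual run-file values):

from ray import tune
from ray.tune.schedulers.pb2 import PB2

# Illustrative PB2 scheduler: perturbs lr and gamma within the given bounds.
pb2 = PB2(
    time_attr="training_iteration",
    metric="sampler_results/episode_reward_mean",
    mode="max",
    perturbation_interval=50,
    hyperparam_bounds={
        "lr": [1e-5, 1e-3],
        "gamma": [0.95, 0.999],
    },
)
tuner = tune.Tuner(
    "PPO",
    param_space={"env": "Hopper-v4", "lr": 1e-4, "gamma": 0.99},
    tune_config=tune.TuneConfig(scheduler=pb2, num_samples=4),
)
results = tuner.fit()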
In regard to the Atari preprocessing: let's make two single runs - one with our preprocessing and another with the gymnasium one - and see which performs better. It suffices to run for maybe 100-200 iterations IMO.
@sven1977 Tbh, I cannot find the settings for DreamerV3 using this wrap_atari_for_new_api_stack. The function also has no grayscale option to set. It uses a specific env_config:
# [2]: "We follow the evaluation protocol of Machado et al. (2018) with 200M
# environment steps, action repeat of 4, a time limit of 108,000 steps per
# episode that correspond to 30 minutes of game play, no access to life
# information, full action space, and sticky actions. Because the world model
# integrates information over time, DreamerV2 does not use frame stacking.
# The experiments use a single-task setup where a separate agent is trained
# for each game. Moreover, each agent uses only a single environment instance.
env_config={
# "sticky actions" but not according to Danijar's 100k configs.
"repeat_action_probability": 0.0,
# "full action space" but not according to Danijar's 100k configs.
"full_action_space": False,
# Already done by MaxAndSkip wrapper: "action repeat" == 4.
"frameskip": 1,
}
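For the comparison proposed above, a hedged sketch of the two preprocessing variants expressed with gymnasium's AtariPreprocessing wrapper (the game is illustrative, and the second variant only approximates the DreamerV3-style settings listed earlier, not the exact RLlib wrapper behavior):

import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing

# Variant 1: DQN/Rainbow-paper style settings as used in this PR's script.
# frameskip=1 in the base env because AtariPreprocessing applies the frame skip itself.
env_dqn = gym.make("ALE/Breakout-v5", frameskip=1)
env_dqn = AtariPreprocessing(
    env_dqn, screen_size=84, noop_max=0, grayscale_obs=True, frame_skip=4
)

# Variant 2: approximation of the DreamerV3-style setup mentioned above
# (dim=64, noop=30, deterministic actions via repeat_action_probability=0.0).
env_dreamer = gym.make("ALE/Breakout-v5", frameskip=1, repeat_action_probability=0.0)
env_dreamer = AtariPreprocessing(
    env_dreamer, screen_size=64, noop_max=30, grayscale_obs=True, frame_skip=4
)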
evaluation_num_workers=1,
evaluation_parallel_to_training=True,
evaluation_config={
    "explore": True,
Should this be False?
Shouldn't PPO - having a stochastic policy - explore in evaluation to keep performance? I remember such discussions on the forum about the old stack PPO where users had this set to False and encountered a drop in performance.
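A hedged sketch of the evaluation setup under discussion, mirroring the diff above (the algorithm and env are illustrative):

from ray.rllib.algorithms.ppo import PPOConfig

# "explore": True keeps the stochastic policy sampling actions during evaluation,
# as discussed above; setting it to False would evaluate the greedy policy instead.
config = (
    PPOConfig()
    .environment("Hopper-v4")
    .evaluation(
        evaluation_interval=1,
        evaluation_num_workers=1,
        evaluation_parallel_to_training=True,
        evaluation_config={"explore": True},
    )
)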
Just a few nits before we can merge:
Approved! Thanks @simonsays1980. Can be merged pending comments/nits.
… added resources to all benchmark scripts. Fixed minor bug with Tune stopper. Signed-off-by: Simon Zehnder <[email protected]>
Why are these changes needed?
This PR adds some benchmark runs for the following algorithms:
- SAC on MuJoCo with hyperparameter search via PB2
- PPO on MuJoCo (HalfCheetah-v4, Hopper-v4, Humanoid-v4, Ant-v4, Walker2d-v4) with hyperparameter search via PB2
- DQN Rainbow on Atari with hyperparameter search via PB2
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.