[RLlib] Add separate learning rates for policy and alpha to SAC. #47078

Merged
Commits
43 commits
ffe5048
Added separate learning rates for policy, critic, and alpha to SAC. T…
simonsays1980 Aug 12, 2024
3fc16cf
Merge branch 'master' into add-actor-specific-learning-rate
sven1977 Aug 12, 2024
cd5450c
Added an additional 'critic_lr', changed 'policy_lr' to 'actor_lr', an…
simonsays1980 Aug 13, 2024
38cac91
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 13, 2024
8f64c7c
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 13, 2024
15e2898
Apply suggestions from code review
sven1977 Aug 14, 2024
d0679b7
[RLlib; Offline RL] Implement twin-Q net option for CQL. (#47105)
simonsays1980 Aug 13, 2024
e15ef64
[core] remove unused GcsAio(Publisher|Subscriber) methods and subclas…
rynewang Aug 13, 2024
fc0f1fe
[Core] Fix a bug where we submit the actor creation task to the wrong…
jjyao Aug 13, 2024
387a083
[doc][build] Update all changed files timestamp to latest (#47115)
khluu Aug 13, 2024
326eaae
[serve] split `test_proxy.py` into unit and e2e tests (#47112)
zcin Aug 13, 2024
33d574a
[Utility] add `env_float` utility into `ray._private.ray_constants` (…
hongpeng-guo Aug 13, 2024
eff647b
[Data] Fix progress bars not showing % progress (#47120)
scottjlee Aug 13, 2024
ca98c7f
[data] change data17 to datal (#47082)
aslonnie Aug 14, 2024
530f511
[ci] change data build for all python versions to arrow 17 (#47121)
can-anyscale Aug 14, 2024
cbaad59
[doc][rllib] add missing public api references (#47111)
can-anyscale Aug 14, 2024
ce283ad
[Core] Clarify docstring for get_gpu_ids() that it is only called ins…
petern48 Aug 14, 2024
88031fe
[doc][rllib] the rest of missing api references + lint checker (#47114)
can-anyscale Aug 14, 2024
a137979
Light up Ask AI button When Search is Open (#47054)
cristianjd Aug 14, 2024
aac0a6e
[serve] immediately send ping in router when receiving new replica se…
zcin Aug 14, 2024
d560f3e
[data] Add label to indicate if operator is backpressured (#47095)
omatthew98 Aug 14, 2024
2626ea7
[Core] Add ray[adag] option to pip install (#47009)
ruisearch42 Aug 14, 2024
bb5c322
[Doc] Run pre-commit on tune docs (#47108)
peytondmurray Aug 14, 2024
7ae5621
[release tests] update anyscale service utils (#46397)
zcin Aug 14, 2024
2f77ddd
[core][experimental] Build an operation-based execution schedule for …
kevin85421 Aug 14, 2024
9c17e13
[serve] remove warnings about ongoing requests default change (#47085)
zcin Aug 14, 2024
7cc321a
[serve] `__init__` functions have no return values (#47144)
aslonnie Aug 15, 2024
8e6f9cc
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 15, 2024
8d98825
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 15, 2024
6808cbb
Turned off test 'self_play_with_policy_checkpoint' b/c it was failing…
simonsays1980 Aug 15, 2024
3a75d15
Uncommented 'pretrain_bc_single_agent_evaluate_as_multi_agent' b/c hy…
simonsays1980 Aug 16, 2024
f1dde3e
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 16, 2024
39f90a6
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 19, 2024
79f34ff
Switched test for old stack CQL to 'torch-only' b/c 'tf2' fails perma…
simonsays1980 Aug 19, 2024
3944b31
Fixed a small bug with uninitialized learning rates on old stack SAC.
simonsays1980 Aug 19, 2024
0a7aaa5
Merged master.
simonsays1980 Aug 20, 2024
3d670c5
Added actor- and critic-specific learning rates to HalfCheetah tests …
simonsays1980 Aug 20, 2024
380e8e6
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 20, 2024
179fce4
Fixed error in 'test_worker_failures' due to the base 'lr' not set to…
simonsays1980 Aug 20, 2024
d6d4d5a
Fixed error in doc codes not implementing 'lr=None' and adapted learn…
simonsays1980 Aug 20, 2024
44336ae
Tuned learning rates on multi-agent SAC example.
simonsays1980 Aug 20, 2024
2dbb93d
Merge branch 'master' into add-actor-specific-learning-rate
simonsays1980 Aug 20, 2024
351b0a8
Added tuned learning rates to single agent SAC tuned example and Half…
simonsays1980 Aug 20, 2024
74 changes: 73 additions & 1 deletion rllib/algorithms/sac/sac.py
@@ -14,7 +14,7 @@
deprecation_warning,
)
from ray.rllib.utils.framework import try_import_tf, try_import_tfp
from ray.rllib.utils.typing import RLModuleSpecType, ResultDict
from ray.rllib.utils.typing import LearningRateOrSchedule, RLModuleSpecType, ResultDict

tf1, tf, tfv = try_import_tf()
tfp = try_import_tfp()
@@ -82,6 +82,11 @@ def __init__(self, algo_class=None):
"critic_learning_rate": 3e-4,
"entropy_learning_rate": 3e-4,
}
self.actor_lr = 3e-5
self.critic_lr = 3e-4
self.alpha_lr = 3e-4
# Set `lr` parameter to `None` and ensure it is not used.
self.lr = None
self.grad_clip = None
self.target_network_update_freq = 0

@@ -135,6 +140,9 @@ def training(
clip_actions: Optional[bool] = NotProvided,
grad_clip: Optional[float] = NotProvided,
optimization_config: Optional[Dict[str, Any]] = NotProvided,
actor_lr: Optional[LearningRateOrSchedule] = NotProvided,
critic_lr: Optional[LearningRateOrSchedule] = NotProvided,
alpha_lr: Optional[LearningRateOrSchedule] = NotProvided,
target_network_update_freq: Optional[int] = NotProvided,
_deterministic_loss: Optional[bool] = NotProvided,
_use_beta_distribution: Optional[bool] = NotProvided,
@@ -239,6 +247,56 @@ def training(
optimization_config: Config dict for optimization. Set the supported keys
`actor_learning_rate`, `critic_learning_rate`, and
`entropy_learning_rate` in here.
actor_lr: The learning rate (float) or learning rate schedule for the
policy in the format of
[[timestep, lr-value], [timestep, lr-value], ...]. In case of a
schedule, intermediary timesteps will be assigned to linearly
interpolated learning rate values. A schedule config's first entry
must start with timestep 0, i.e.: [[0, initial_value], [...]].
Note: It is common practice (two-timescale approach) to use a smaller
learning rate for the policy than for the critic to ensure that the
critic gives adequate values for improving the policy.
Note: If you require a) more than one optimizer (per RLModule),
b) optimizer types that are not Adam, c) a learning rate schedule that
is not a linearly interpolated, piecewise schedule as described above,
or d) specifying c'tor arguments of the optimizer that are not the
learning rate (e.g. Adam's epsilon), then you must override your
Learner's `configure_optimizers_for_module()` method and handle
lr-scheduling yourself.
The default value is 3e-5, one order of magnitude smaller than the
critic's learning rate (see `critic_lr`).
critic_lr: The learning rate (float) or learning rate schedule for the
critic in the format of
[[timestep, lr-value], [timestep, lr-value], ...]. In case of a
schedule, intermediary timesteps will be assigned to linearly
interpolated learning rate values. A schedule config's first entry
must start with timestep 0, i.e.: [[0, initial_value], [...]].
Note: It is common practice (two-timescale approach) to use a smaller
learning rate for the policy than for the critic to ensure that the
critic gives adequate values for improving the policy.
Note: If you require a) more than one optimizer (per RLModule),
b) optimizer types that are not Adam, c) a learning rate schedule that
is not a linearly interpolated, piecewise schedule as described above,
or d) specifying c'tor arguments of the optimizer that are not the
learning rate (e.g. Adam's epsilon), then you must override your
Learner's `configure_optimizers_for_module()` method and handle
lr-scheduling yourself.
The default value is 3e-4, one order of magnitude larger than the
actor (policy) learning rate (see `actor_lr`).
alpha_lr: The learning rate (float) or learning rate schedule for the
hyperparameter alpha in the format of
[[timestep, lr-value], [timestep, lr-value], ...]. In case of a
schedule, intermediary timesteps will be assigned to linearly
interpolated learning rate values. A schedule config's first entry
must start with timestep 0, i.e.: [[0, initial_value], [...]].
Note: If you require a) more than one optimizer (per RLModule),
b) optimizer types that are not Adam, c) a learning rate schedule that
is not a linearly interpolated, piecewise schedule as described above,
or d) specifying c'tor arguments of the optimizer that are not the
learning rate (e.g. Adam's epsilon), then you must override your
Learner's `configure_optimizers_for_module()` method and handle
lr-scheduling yourself.
The default value is 3e-4, identical to the critic learning rate (`critic_lr`).
target_network_update_freq: Update the target network every
`target_network_update_freq` steps.
_deterministic_loss: Whether the loss should be calculated deterministically
@@ -289,6 +347,12 @@ def training(
self.grad_clip = grad_clip
if optimization_config is not NotProvided:
self.optimization = optimization_config
if actor_lr is not NotProvided:
self.actor_lr = actor_lr
if critic_lr is not NotProvided:
self.critic_lr = critic_lr
if alpha_lr is not NotProvided:
self.alpha_lr = alpha_lr
if target_network_update_freq is not NotProvided:
self.target_network_update_freq = target_network_update_freq
if _deterministic_loss is not NotProvided:
@@ -362,6 +426,14 @@ def validate(self) -> None:
"`EpisodeReplayBuffer`."
)

if self.enable_rl_module_and_learner and self.lr:
raise ValueError(
"Basic learning rate parameter `lr` is not `None`. For SAC "
"use the specific learning rate parameters `actor_lr`, `critic_lr` "
"and `alpha_lr`, for the actor, critic, and the hyperparameter "
"`alpha`, respectively."
)

@override(AlgorithmConfig)
def get_rollout_fragment_length(self, worker_index: int = 0) -> int:
if self.rollout_fragment_length == "auto":
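The resulting user-facing API looks roughly as follows. This is a hedged sketch, not code from this PR; the learning-rate values are illustrative and Pendulum-v1 is just a placeholder environment.

from ray.rllib.algorithms.sac import SACConfig

# Sketch: configure SAC with the per-component learning rates added above.
config = (
    SACConfig()
    .environment("Pendulum-v1")
    .training(
        # Two-timescale setup: smaller learning rate for the actor (policy)
        # than for the critic, plus a separate rate for the entropy
        # temperature alpha.
        actor_lr=3e-5,
        critic_lr=3e-4,
        alpha_lr=1e-4,
        # The generic `lr` now defaults to None for SAC; setting it on the
        # new API stack makes `validate()` raise the ValueError shown above.
    )
)

# Building and training then works as for any other RLlib algorithm.
algo = config.build()
algo.train()

Note that mixing `lr` with the new parameters is intentionally rejected on the new API stack, so configs that still set `lr` must switch to the per-component parameters.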
8 changes: 4 additions & 4 deletions rllib/algorithms/sac/torch/sac_torch_learner.py
@@ -58,7 +58,7 @@ def configure_optimizers_for_module(
optimizer_name="qf",
optimizer=optim_critic,
params=params_critic,
lr_or_lr_schedule=config.lr,
lr_or_lr_schedule=config.critic_lr,
)
# If necessary register also an optimizer for a twin Q network.
if config.twin_q:
@@ -72,7 +72,7 @@
optimizer_name="qf_twin",
optimizer=optim_twin_critic,
params=params_twin_critic,
lr_or_lr_schedule=config.lr,
lr_or_lr_schedule=config.critic_lr,
)

# Define the optimizer for the actor.
@@ -86,7 +86,7 @@
optimizer_name="policy",
optimizer=optim_actor,
params=params_actor,
lr_or_lr_schedule=config.lr,
lr_or_lr_schedule=config.actor_lr,
)

# Define the optimizer for the temperature.
@@ -97,7 +97,7 @@
optimizer_name="alpha",
optimizer=optim_temperature,
params=[temperature],
lr_or_lr_schedule=config.lr,
lr_or_lr_schedule=config.alpha_lr,
)

@override(DQNRainbowTorchLearner)
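Because the learner wires each optimizer through `lr_or_lr_schedule`, the new parameters accept either a float or a piecewise schedule in the format documented in sac.py above. A hedged sketch with illustrative values:

from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .training(
        # Constant actor learning rate (two-timescale: smaller than the critic's).
        actor_lr=3e-5,
        # Schedule format: [[timestep, lr-value], ...]; the first entry must
        # start at timestep 0 and intermediate timesteps are linearly
        # interpolated. Here the critic rate anneals from 3e-4 to 3e-5 over
        # the first one million timesteps.
        critic_lr=[[0, 3e-4], [1_000_000, 3e-5]],
        # Constant learning rate for the entropy temperature alpha.
        alpha_lr=3e-4,
    )
)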
4 changes: 4 additions & 0 deletions rllib/tuned_examples/sac/benchmark_sac_mujoco.py
@@ -95,6 +95,10 @@ def stop_all(self):
.training(
initial_alpha=1.001,
lr=3e-4,
# Choose a smaller learning rate for the actor (policy).
actor_lr=3e-5,
critic_lr=3e-4,
alpha_lr=1e-4,
target_entropy="auto",
n_step=1,
tau=0.005,
8 changes: 6 additions & 2 deletions rllib/tuned_examples/sac/benchmark_sac_mujoco_pb2.py
@@ -44,7 +44,9 @@
# Copy bottom % with top % weights.
quantile_fraction=0.25,
hyperparam_bounds={
"lr": [1e-5, 1e-3],
"actor_lr": [1e-5, 1e-3],
"critic_lr": [1e-6, 1e-4],
"alpha_lr": [1e-6, 1e-3],
"gamma": [0.95, 0.99],
"n_step": [1, 3],
"initial_alpha": [1.0, 1.5],
@@ -80,7 +82,9 @@
# TODO (simon): Adjust to new model_config_dict.
.training(
initial_alpha=tune.choice([1.0, 1.5]),
lr=tune.uniform(1e-5, 1e-3),
actor_lr=tune.uniform(1e-5, 1e-3),
critic_lr=tune.uniform(1e-6, 1e-4),
alpha_lr=tune.uniform(1e-6, 1e-3),
target_entropy=tune.choice([-10, -5, -1, "auto"]),
n_step=tune.choice([1, 3, (1, 3)]),
tau=tune.uniform(0.001, 0.1),
5 changes: 4 additions & 1 deletion rllib/tuned_examples/sac/multi_agent_pendulum_sac.py
@@ -27,7 +27,10 @@
.environment(env="multi_agent_pendulum")
.training(
initial_alpha=1.001,
lr=8e-4,
# Use a smaller learning rate for the policy.
actor_lr=3e-5,
critic_lr=3e-4,
alpha_lr=8e-4,
target_entropy="auto",
n_step=1,
tau=0.005,
5 changes: 4 additions & 1 deletion rllib/tuned_examples/sac/pendulum_sac.py
@@ -18,7 +18,10 @@
.environment(env="Pendulum-v1")
.training(
initial_alpha=1.001,
lr=3e-4,
# Use a smaller learning rate for the policy.
actor_lr=3e-5,
critic_lr=3e-4,
alpha_lr=1e-4,
target_entropy="auto",
n_step=1,
tau=0.005,
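For readers migrating existing SAC configs, here is a hedged sketch of how the old-stack `optimization_config` keys (still present in the default config shown in sac.py above) correspond to the new per-component parameters; the deprecation path for `optimization_config` itself is not part of this diff.

from ray.rllib.algorithms.sac import SACConfig

# Old API stack: learning rates nested in `optimization_config`.
old_style = SACConfig().training(
    optimization_config={
        "actor_learning_rate": 3e-5,
        "critic_learning_rate": 3e-4,
        "entropy_learning_rate": 3e-4,
    },
)

# New API stack: the top-level parameters added by this PR.
new_style = SACConfig().training(
    actor_lr=3e-5,   # corresponds to "actor_learning_rate"
    critic_lr=3e-4,  # corresponds to "critic_learning_rate"
    alpha_lr=3e-4,   # corresponds to "entropy_learning_rate"
)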