refactor(algo_wrapper, configs): rename update cycle and refactor structure #213

Merged (14 commits) on Apr 17, 2023
32 changes: 32 additions & 0 deletions CHANGELOG.md
@@ -11,13 +11,45 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

### Features

- Feat(pid-lagrange, test): add algo and update test by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#210](https://github.com/OmniSafeAI/omnisafe/pull/210).

- Feat(saute, simmer): support saute rl and clean the code by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#209](https://github.com/OmniSafeAI/omnisafe/pull/209).

- Feat(off-policy): support off-policy lag by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#204](https://github.com/OmniSafeAI/omnisafe/pull/204).

- Chore: upload tutorial by [@Borong Zhang](https://github.com/muchvo) in PR [#201](https://github.com/OmniSafeAI/omnisafe/pull/201).

- Chore(pre-commit): [pre-commit.ci] autoupdate by [@pre-commit.ci](https://github.com/apps/pre-commit-ci) in PR [#200](https://github.com/OmniSafeAI/omnisafe/pull/200).

- Feat: update CLI for gpu and statistics tools by [@Borong Zhang](https://github.com/muchvo) in PR [#192](https://github.com/OmniSafeAI/omnisafe/pull/192).

- Feat: add `ruff` and `codespell` integration by [@XuehaiPan](https://github.com/XuehaiPan) in PR [#186](https://github.com/OmniSafeAI/omnisafe/pull/186).

### Fixes

- Fix: enable smooth param in Costs when plotting by [@Borong Zhang](https://github.com/muchvo) in PR [#208](https://github.com/OmniSafeAI/omnisafe/pull/208).

- Fix(off-policy): fix log when not update by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#206](https://github.com/OmniSafeAI/omnisafe/pull/206).

- Fix: check duplicated parameters and values which are specified in experiment grid by [@Borong Zhang](https://github.com/muchvo) in PR [#203](https://github.com/OmniSafeAI/omnisafe/pull/203).

- Fix(experiment grid): fix file path problem when using gpu in experiment grid by [@Borong Zhang](https://github.com/muchvo) in PR [#194](https://github.com/OmniSafeAI/omnisafe/pull/194).

### Documentation

- Docs: fix small typo in README.md by [@mickelliu](https://github.com/mickelliu) in PR [#211](https://github.com/OmniSafeAI/omnisafe/pull/211).

- Docs: change link to OmniSafeAI by [@Jiaming Ji](https://github.com/zmsn-2077) in PR [#205](https://github.com/OmniSafeAI/omnisafe/pull/205).

- Docs: update api documents by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#191](https://github.com/OmniSafeAI/omnisafe/pull/191).

### Refactor

- Refactor(algo_wrapper, configs): rename update cycle and refactor structure by [@Jiayi Zhou](https://github.com/Gaiejj) in PR [#213](https://github.com/OmniSafeAI/omnisafe/pull/213).

- Refactor: update hyper-parameters for first-order algorithms by [@Borong Zhang](https://github.com/muchvo) in PR [#199](https://github.com/OmniSafeAI/omnisafe/pull/199).

- Refactor: condense top-level benchmarks by [@Jiaming Ji](https://github.com/zmsn-2077) in PR [#198](https://github.com/OmniSafeAI/omnisafe/pull/198).

## v0.2.2

2 changes: 1 addition & 1 deletion README.md
@@ -243,7 +243,7 @@ omnisafe eval ./saved_source/PPO-{SafetyPointGoal1-v0} --num-episode 1

# Quickly train an algorithm to validate your ideas
# Note: with `key1:key2` you can select nested hyperparameter keys, and with `--custom-cfgs` you can pass custom configs via the CLI
omnisafe train --algo PPO --total-steps 2048 --vector-env-nums 1 --custom-cfgs algo_cfgs:update_cycle --custom-cfgs 1024
omnisafe train --algo PPO --total-steps 2048 --vector-env-nums 1 --custom-cfgs algo_cfgs:steps_per_epoch --custom-cfgs 1024

# Quickly train an algorithm from a saved config file; the format is the same as the default format
omnisafe train-config ./saved_source/train_config.yaml
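For readers migrating scripts along with this rename, here is a minimal sketch of the programmatic counterpart of the CLI command above. It assumes the `omnisafe.Agent` entry point and simply mirrors the renamed `algo_cfgs:steps_per_epoch` key; treat the argument names as illustrative rather than authoritative.

```python
# Sketch only: programmatic counterpart of the CLI call above, assuming the
# omnisafe.Agent entry point. The nested dict mirrors the renamed key
# `algo_cfgs:steps_per_epoch`; names and values are illustrative.
import omnisafe

custom_cfgs = {
    'train_cfgs': {'total_steps': 2048, 'vector_env_nums': 1},
    'algo_cfgs': {'steps_per_epoch': 1024},
}
agent = omnisafe.Agent('PPO', 'SafetyPointGoal1-v0', custom_cfgs=custom_cfgs)
agent.learn()
```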
4 changes: 2 additions & 2 deletions docs/source/baserl/ppo.rst
@@ -340,7 +340,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -472,7 +472,7 @@ Configs

- clip (float): Clipping parameter for PPO.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
4 changes: 2 additions & 2 deletions docs/source/baserl/trpo.rst
@@ -494,7 +494,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -757,7 +757,7 @@ Configs
- cg_iters (int): Number of iterations for conjugate gradient.
- fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -264,7 +264,7 @@ We give an example below:
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
4 changes: 2 additions & 2 deletions docs/source/saferl/cpo.rst
@@ -460,7 +460,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -715,7 +715,7 @@ Configs
- cg_iters (int): Number of iterations for conjugate gradient.
- fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
4 changes: 2 additions & 2 deletions docs/source/saferl/focops.rst
@@ -448,7 +448,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -590,7 +590,7 @@ Configs

- clip (float): Clipping parameter for FOCOPS.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
4 changes: 2 additions & 2 deletions docs/source/saferl/lag.rst
@@ -311,7 +311,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -450,7 +450,7 @@ Configs

- clip (float): Clipping parameter for PPOLag.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
4 changes: 2 additions & 2 deletions docs/source/saferl/pcpo.rst
@@ -438,7 +438,7 @@ Quick start
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
@@ -674,7 +674,7 @@ Configs
- cg_iters (int): Number of iterations for conjugate gradient.
- fvp_sample_freq (int): Frequency of sampling for Fisher vector product.

- update_cycle (int): Number of steps to update the policy network.
- steps_per_epoch (int): Number of steps to update the policy network.
- update_iters (int): Number of iterations to update the policy network.
- batch_size (int): Batch size for each iteration.
- target_kl (float): Target KL divergence.
4 changes: 2 additions & 2 deletions docs/source/start/usage.rst
@@ -33,7 +33,7 @@ Train policy
--algo PPO
--total-steps 1024
--vector-env-nums 1
--custom-cfgs algo_cfgs:update_cycle
--custom-cfgs algo_cfgs:steps_per_epoch
--custom-cfgs 512

Here we provide a video example:
@@ -44,7 +44,7 @@ Train policy


.. hint::
The above command will train a policy with PPO algorithm, and the total training steps is 1024. The vector environment number is 1, which means that the training process will use 1 CPU core. The ``algo_cfgs:update_cycle`` is the update cycle of the PPO algorithm, which means that the policy will be updated every 512 steps.
    The above command trains a policy with the PPO algorithm for a total of 1024 training steps. The vector environment number is 1, which means the training process uses 1 CPU core. ``algo_cfgs:steps_per_epoch`` sets the number of environment steps per epoch for the PPO algorithm, so the policy is updated every 512 steps.

Customize Configuration
-----------------------
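To make the hint above concrete, here is a small sketch (not part of the docs themselves) of how the number of policy updates follows from these two settings; it mirrors the `total_steps // steps_per_epoch` computation visible in `algo_wrapper.py` further down in this diff.

```python
# Sketch: how total_steps and the renamed steps_per_epoch determine the
# number of policy updates (mirrors total_steps // steps_per_epoch).
total_steps = 1024      # --total-steps 1024
steps_per_epoch = 512   # --custom-cfgs algo_cfgs:steps_per_epoch ... 512

epochs = total_steps // steps_per_epoch
print(epochs)  # 2 -> the policy is updated twice, once every 512 steps
```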
2 changes: 1 addition & 1 deletion examples/benchmarks/example_cli_benchmark_config.yaml
@@ -25,7 +25,7 @@ train_cfgs:torch_threads:
[1]
train_cfgs:total_steps:
1024
algo_cfgs:update_cycle:
algo_cfgs:steps_per_epoch:
512
seed:
[0]
4 changes: 2 additions & 2 deletions examples/benchmarks/run_experiment_grid.py
@@ -101,8 +101,8 @@ def train(
eg.add('logger_cfgs:use_wandb', [False])
eg.add('train_cfgs:vector_env_nums', [4])
eg.add('train_cfgs:torch_threads', [1])
eg.add('algo_cfgs:update_cycle', [2048])
eg.add('train_cfgs:total_steps', [1024000])
eg.add('algo_cfgs:steps_per_epoch', [20000])
eg.add('train_cfgs:total_steps', [10000000])
eg.add('seed', [0])
    # the total number of experiments must be divisible by num_pool
    # users should set this value according to their machine's resources
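As a migration aid, a minimal sketch of an experiment grid that uses the renamed key. It assumes `ExperimentGrid` is importable as in this example script; the experiment name and value lists are illustrative, not recommended settings.

```python
# Sketch: minimal grid with the renamed key. Assumes ExperimentGrid is
# importable as in this example script; names and values are illustrative.
from omnisafe.common.experiment_grid import ExperimentGrid

eg = ExperimentGrid(exp_name='steps-per-epoch-demo')  # hypothetical experiment name
eg.add('algo', ['PPOLag'])
eg.add('algo_cfgs:steps_per_epoch', [20000])          # was algo_cfgs:update_cycle
eg.add('train_cfgs:total_steps', [10000000])
eg.add('seed', [0])
# then eg.run(train, num_pool=...) as in the full example script
```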
2 changes: 1 addition & 1 deletion examples/train_from_custom_dict.py
@@ -25,7 +25,7 @@
'parallel': 1,
},
'algo_cfgs': {
'update_cycle': 2048,
'steps_per_epoch': 2048,
'update_iters': 1,
},
'logger_cfgs': {
2 changes: 1 addition & 1 deletion images/CLI_example.svg
10 changes: 7 additions & 3 deletions omnisafe/algorithms/algo_wrapper.py
@@ -54,6 +54,7 @@ def __init__(
self._plotter: Plotter = None
self.cfgs = self._init_config()
self._init_checks()
self._init_algo()

def _init_config(self):
"""Init config."""
@@ -94,7 +95,7 @@ def _init_config(self):
exp_name = f'{self.algo}-{{{self.env_id}}}'
cfgs.recurisve_update({'exp_name': exp_name, 'env_id': self.env_id, 'algo': self.algo})
cfgs.train_cfgs.recurisve_update(
{'epochs': cfgs.train_cfgs.total_steps // cfgs.algo_cfgs.update_cycle},
{'epochs': cfgs.train_cfgs.total_steps // cfgs.algo_cfgs.steps_per_epoch},
)
return cfgs

@@ -107,8 +108,8 @@ def _init_checks(self):
self.env_id in support_envs()
), f"{self.env_id} doesn't exist. Please choose from {support_envs()}."

def learn(self):
"""Agent Learning."""
def _init_algo(self):
"""Init algo."""
# Use number of physical cores as default.
# If also hardware threading CPUs should be used
# enable this by the use_number_of_threads=True
@@ -129,6 +130,9 @@ def learn(self):
env_id=self.env_id,
cfgs=self.cfgs,
)

def learn(self):
"""Agent Learning."""
ep_ret, ep_cost, ep_len = self.agent.learn()

self._init_statistical_tools()
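A short sketch of the wrapper lifecycle after this refactor, as it reads from the diff above: the algorithm instance is now built eagerly in `__init__` via `_init_algo`, and `learn()` only runs training. It assumes `omnisafe.Agent` wraps `AlgoWrapper`; the usage is illustrative, not an official API reference.

```python
# Sketch of the refactored lifecycle (method names taken from the diff above):
#   __init__ -> _init_config -> _init_checks -> _init_algo
#   learn()  -> training only; the underlying agent already exists
import omnisafe

agent = omnisafe.Agent('PPO', 'SafetyPointGoal1-v0')  # algorithm constructed here
agent.learn()                                         # no lazy construction inside learn()
```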
27 changes: 13 additions & 14 deletions omnisafe/algorithms/off_policy/ddpg.py
@@ -52,23 +52,25 @@ def _init_env(self) -> None:
self._seed,
self._cfgs,
)
assert (self._cfgs.algo_cfgs.update_cycle) % (
assert (self._cfgs.algo_cfgs.steps_per_epoch) % (
distributed.world_size() * self._cfgs.train_cfgs.vector_env_nums
) == 0, 'The number of steps per epoch is not divisible by the number of environments.'

assert (
int(self._cfgs.train_cfgs.total_steps) % self._cfgs.algo_cfgs.update_cycle == 0
int(self._cfgs.train_cfgs.total_steps) % self._cfgs.algo_cfgs.steps_per_epoch == 0
), 'The total number of steps is not divisible by the number of steps per epoch.'
self._epochs = int(self._cfgs.train_cfgs.total_steps // self._cfgs.algo_cfgs.update_cycle)
self._epochs = int(
self._cfgs.train_cfgs.total_steps // self._cfgs.algo_cfgs.steps_per_epoch,
)
self._epoch = 0
self._update_cycle = self._cfgs.algo_cfgs.update_cycle // (
self._steps_per_epoch = self._cfgs.algo_cfgs.steps_per_epoch // (
distributed.world_size() * self._cfgs.train_cfgs.vector_env_nums
)
self._steps_per_sample = self._cfgs.algo_cfgs.steps_per_sample
self._update_cycle = self._cfgs.algo_cfgs.update_cycle
assert (
self._update_cycle % self._steps_per_sample == 0
self._steps_per_epoch % self._update_cycle == 0
), 'The number of steps per epoch is not divisible by the number of steps per sample.'
self._samples_per_epoch = self._update_cycle // self._steps_per_sample
self._samples_per_epoch = self._steps_per_epoch // self._update_cycle
self._update_count = 0

def _init_model(self) -> None:
@@ -80,9 +82,6 @@ def _init_model(self) -> None:
epochs=self._epochs,
).to(self._device)

if distributed.world_size() > 1:
distributed.sync_params(self._actor_critic)

def _init(self) -> None:
self._buf = VectorOffPolicyBuffer(
obs_space=self._env.observation_space,
@@ -161,7 +160,7 @@ def learn(self) -> tuple[int | float, ...]:
epoch * self._samples_per_epoch,
(epoch + 1) * self._samples_per_epoch,
):
step = sample_step * self._steps_per_sample * self._cfgs.train_cfgs.vector_env_nums
step = sample_step * self._update_cycle * self._cfgs.train_cfgs.vector_env_nums

roll_out_start = time.time()
# set noise for exploration
@@ -170,7 +169,7 @@

# collect data from environment
self._env.roll_out(
roll_out_step=self._steps_per_sample,
roll_out_step=self._update_cycle,
agent=self._actor_critic,
buffer=self._buf,
logger=self._logger,
@@ -204,8 +203,8 @@

self._logger.store(
**{
'TotalEnvSteps': step,
'Time/FPS': self._cfgs.algo_cfgs.update_cycle / (time.time() - epoch_time),
'TotalEnvSteps': step + 1,
'Time/FPS': self._cfgs.algo_cfgs.steps_per_epoch / (time.time() - epoch_time),
'Time/Total': (time.time() - start_time),
'Time/Epoch': (time.time() - epoch_time),
'Train/Epoch': epoch,
Expand Down
3 changes: 0 additions & 3 deletions omnisafe/algorithms/off_policy/sac.py
@@ -52,9 +52,6 @@ def _init_model(self) -> None:
epochs=self._epochs,
).to(self._device)

if distributed.world_size() > 1:
distributed.sync_params(self._actor_critic)

def _init(self) -> None:
super()._init()
if self._cfgs.algo_cfgs.auto_alpha:
3 changes: 0 additions & 3 deletions omnisafe/algorithms/off_policy/td3.py
@@ -44,9 +44,6 @@ def _init_model(self) -> None:
epochs=self._epochs,
).to(self._device)

if distributed.world_size() > 1:
distributed.sync_params(self._actor_critic)

def _update_reward_critic(
self,
obs: torch.Tensor,
12 changes: 6 additions & 6 deletions omnisafe/algorithms/on_policy/base/policy_gradient.py
@@ -62,11 +62,11 @@ def _init_env(self) -> None:
self._seed,
self._cfgs,
)
assert (self._cfgs.algo_cfgs.update_cycle) % (
assert (self._cfgs.algo_cfgs.steps_per_epoch) % (
distributed.world_size() * self._cfgs.train_cfgs.vector_env_nums
) == 0, 'The number of steps per epoch is not divisible by the number of environments.'
self._steps_per_epoch = (
self._cfgs.algo_cfgs.update_cycle
self._cfgs.algo_cfgs.steps_per_epoch
// distributed.world_size()
// self._cfgs.train_cfgs.vector_env_nums
)
@@ -199,15 +199,15 @@ def _init_log(self) -> None:
self._logger.setup_torch_saver(what_to_save)
self._logger.torch_save()

self._logger.register_key('Metrics/EpRet', window_length=50, min_and_max=True)
self._logger.register_key('Metrics/EpRet', window_length=50)
self._logger.register_key('Metrics/EpCost', window_length=50)
self._logger.register_key('Metrics/EpLen', window_length=50)

self._logger.register_key('Train/Epoch')
self._logger.register_key('Train/Entropy')
self._logger.register_key('Train/KL')
self._logger.register_key('Train/StopIter')
self._logger.register_key('Train/PolicyRatio')
self._logger.register_key('Train/PolicyRatio', min_and_max=True)
self._logger.register_key('Train/LR')
if self._cfgs.model_cfgs.actor_type == 'gaussian_learning':
self._logger.register_key('Train/PolicyStd')
@@ -270,8 +270,8 @@ def learn(self) -> tuple[int | float, ...]:

self._logger.store(
**{
'TotalEnvSteps': (epoch + 1) * self._cfgs.algo_cfgs.update_cycle,
'Time/FPS': self._cfgs.algo_cfgs.update_cycle / (time.time() - epoch_time),
'TotalEnvSteps': (epoch + 1) * self._cfgs.algo_cfgs.steps_per_epoch,
'Time/FPS': self._cfgs.algo_cfgs.steps_per_epoch / (time.time() - epoch_time),
'Time/Total': (time.time() - start_time),
'Time/Epoch': (time.time() - epoch_time),
'Train/Epoch': epoch,
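A quick sketch of what the per-epoch logging arithmetic above reports under the renamed key; the numbers are illustrative.

```python
# Sketch: per-epoch logging arithmetic with the renamed key; values are
# illustrative.
steps_per_epoch = 2048
epoch = 3                 # zero-based epoch index
epoch_duration_s = 10.0   # wall-clock time of this epoch

total_env_steps = (epoch + 1) * steps_per_epoch    # 8192 env steps logged so far
fps = steps_per_epoch / epoch_duration_s           # 204.8 env steps per second
```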