This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This significantly reduces the computational cost compared to MuZero.
Paper: Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2)
You can run this code via the Colab demo link: train the agent, monitor it with TensorBoard, and play the LunarLander-v2 environment with the trained network. The agent can solve LunarLander-v2 within about 1~2 hours on the Google Colab CPU backend, reaching an average score above 250.
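As a rough usage sketch (not this repo's actual evaluation code), a trained policy can be rolled out in LunarLander-v2 with the classic gym API (gym < 0.26); the random action below is a placeholder for the action chosen by the trained network:

```python
import gym

# Minimal evaluation loop for LunarLander-v2 (classic gym API, gym < 0.26).
env = gym.make("LunarLander-v2")
obs = env.reset()
done, episode_return = False, 0.0

while not done:
    # Placeholder: sample a random action. With the trained agent, you would
    # stack the recent observations, run the representation + policy networks,
    # and pick an action from the resulting policy instead.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    episode_return += reward

print("episode return:", episode_return)
```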
- MuZero network
- 5 step unroll
- L_pg+cmpo
- L_v
- L_r
- L_m (5 step)
- Stacking 8 observations
- Mini-batch update
- Hidden state scaled within [-1,1]
- Gradient clipping by value [-1,1]
- Dynamics network gradient scale 1/2
- Target network (prior parameters) moving average update (this and the three items above are sketched in code after this list)
- Categorical representation (value, reward model)
- Normalized advantage
- Tensorboard monitoring
- Retrace estimator
- CNN representation network
- LSTM dynamics network
- Atari environment
- Self-play uses the agent network (originally the target network)
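Below is a minimal PyTorch sketch of a few of the implemented items above (hidden-state scaling, gradient clipping by value, the 1/2 dynamics gradient scale, and the target-network moving average update). The helper names are illustrative, and the exact formulas are assumptions that may differ slightly from this repo's code:

```python
import torch


def scale_hidden_state(h: torch.Tensor) -> torch.Tensor:
    """Min-max scale each hidden state vector into [-1, 1] (MuZero-style)."""
    h_min = h.min(dim=-1, keepdim=True).values
    h_max = h.max(dim=-1, keepdim=True).values
    return 2.0 * (h - h_min) / (h_max - h_min + 1e-8) - 1.0


def halve_dynamics_gradient(h: torch.Tensor) -> torch.Tensor:
    """Scale the gradient flowing back through the dynamics network by 1/2."""
    return 0.5 * h + 0.5 * h.detach()


@torch.no_grad()
def update_target_network(target, online, beta: float = 0.99) -> None:
    """Moving-average (EMA) update of the target (prior) parameters."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(beta).add_((1.0 - beta) * p_o)


# Gradient clipping by value into [-1, 1], applied before optimizer.step():
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```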
Target network 1-step unroll: used when calculating v_pi_prior(s) and the second term of L_pg+cmpo.
Agent network 5-step unroll: the agent network is unrolled 5 steps for optimization.
Target network 1-step unrolls for L_m: used when calculating the pi_cmpo targets of L_m.
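For reference, here is a sketch of how the pi_cmpo target can be built from the target network's one-step lookahead. This is a hypothetical helper, not the repo's exact code; it assumes advantages are divided by a running standard-deviation estimate, which the "var variables of advantage normalization" graphs track:

```python
import torch


def cmpo_policy_target(prior_logits: torch.Tensor,
                       q_prior: torch.Tensor,
                       v_prior: torch.Tensor,
                       adv_std: torch.Tensor,
                       clip: float = 1.0) -> torch.Tensor:
    """Clipped-advantage CMPO policy target from one-step lookahead values.

    prior_logits : target-network policy logits over actions, shape [A]
    q_prior      : one-step lookahead values r(s, a) + gamma * v(s'), shape [A]
    v_prior      : v_pi_prior(s), scalar tensor
    adv_std      : running scale used to normalize advantages
    """
    adv = (q_prior - v_prior) / (adv_std + 1e-8)        # normalized advantage
    clipped_adv = torch.clamp(adv, -clip, clip)         # clip to [-1, 1]
    pi_prior = torch.softmax(prior_logits, dim=-1)
    weights = pi_prior * torch.exp(clipped_adv)         # unnormalized pi_cmpo
    return weights / weights.sum(dim=-1, keepdim=True)  # normalize to a distribution
```

This pi_cmpo is the distribution that the second term of L_pg+cmpo regresses the agent policy toward, and the same construction on the 1-step unrolled hidden states gives the targets for L_m.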
TensorBoard graphs: score, loss, LunarLander play length and last rewards, and the var variables of advantage normalization.
Need your help! Contributions, advice, and questions are all welcome.
Contact: [email protected] (available languages: English, Korean)
Author's presentation: https://icml.cc/virtual/2021/poster/10769
LunarLander-v2 environment documentation: https://www.gymlibrary.dev/environments/box2d/lunar_lander/