Diversity Is All You Need (DIAYN) is a deep reinforcement learning algorithm for automatically discovering a set of useful skills. The algorithm is based on the paper Diversity is All You Need: Learning Skills without a Reward Function, with additional videos on the project website.
Follow the installation instructions in the README.
Every time DIAYN is run, the set of learned skills is saved in a `.pkl` file. We provide a few utilities for visualizing the learned skills.
The script `visualize_skills.py` creates videos of the learned skills.

```bash
# Usage: python scripts/visualize_skills.py <snapshot.pkl>
python scripts/visualize_skills.py data/demo/half_cheetah/seed_1/itr_3000.pkl
```
Use the `--separate_videos=True` flag to create a separate video for each skill, and the `--max-path-length=100` flag to specify the number of steps per episode.
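For example, to render each skill to its own video with 100-step episodes, the two flags can be combined:

```bash
python scripts/visualize_skills.py data/demo/half_cheetah/seed_1/itr_3000.pkl \
    --separate_videos=True --max-path-length=100
```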
The script `plot_traces.py` plots the location of the agent throughout an episode.

```bash
# Usage: python scripts/plot_traces.py <snapshot.pkl>
python scripts/plot_traces.py data/demo/half_cheetah/seed_1/itr_3000.pkl
```
Additional flags allow the user to choose the number of rollouts per skill, the length of each rollout, and which dimensions of the state to plot.
Training a new set of skills is as easy as running:

```bash
# Usage: python examples/mujoco_all_diayn.py --env=<ENV> --log_dir=<LOG_DIR>
python examples/mujoco_all_diayn.py --env=half-cheetah --log_dir=data/demo
```
The following environments are currently supported: `swimmer`, `hopper`, `walker`, `half-cheetah`, `ant`, `humanoid`, `point`, `point-maze`, `inverted-pendulum`, `inverted-double-pendulum`, `mountain-car`, `lunar-lander`, and `bipedal-walker`. To add a new environment, simply add a new entry to the `ENV_PARAMS` dictionary on line 54 of `mujoco_all_diayn.py`.
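As a rough sketch of what such an entry might look like (the key names below are assumptions for illustration; copy the exact keys from an existing entry in your checkout, such as `half-cheetah`):

```python
# Hypothetical sketch of adding an entry to ENV_PARAMS in
# examples/mujoco_all_diayn.py. The key names are assumptions; mirror an
# existing entry for the keys your version of the script expects.
ENV_PARAMS['my-new-env'] = dict(
    env_name='MyNewEnv-v0',    # assumed key: environment id to construct
    max_path_length=1000,      # assumed key: steps per episode
    num_skills=50,             # assumed key: number of skills to learn
)
```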
The log directory specifies where checkpoints and the `progress.csv` log will be saved. Set this to `/dev/null` if you don't want to save these.
The rllab library includes a script, `frontend.py`, that can be used for plotting training progress in real time (using the `progress.csv` log). Note that the script must be re-run every time the log is updated:
```bash
# Usage: python rllab/viskit/frontend.py <progress.csv>
python rllab/viskit/frontend.py data/demo/half_cheetah/seed_1/progress.csv
```
The imitation learning experiments require two checkpoints (`.pkl` files), one to use as the expert and another to use as the student. To run the imitation experiment:
```bash
# Usage: python scripts/imitate_skills.py --expert_snapshot=<expert.pkl> --student_snapshot=<student.pkl>
python scripts/imitate_skills.py --expert_snapshot=data/demo/half_cheetah/seed_1/itr_2500.pkl --student_snapshot=data/demo/half_cheetah/seed_2/itr_2500.pkl
```
This script saves a JSON file storing the results of the experiment:
- `M`: A 2D array with size (num. expert skills x num. student skills). Entry `M[i, j]` is the predicted probability that student skill `j` matches expert skill `i`.
- `L`: A 2D array with size (num. expert skills x num. student skills). Entry `L[i, j]` is a list of the log-probability that student skill `j` matches expert skill `i` at every step in a rollout. Note that rollouts may have different lengths, so do not attempt to reshape `L` into a 3D tensor.
- `dist_vec`: A 1D array with size (num. expert skills). Entry `dist_vec[i]` is the distance between expert skill `i` and the retrieved student skill. Distance is computed as the Euclidean distance between states in a rollout.
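A minimal sketch of inspecting these results, assuming the script wrote them to a file named `results.json` (the actual filename is set by the script) and that the retrieved student skill for expert skill `i` is the one with the highest probability in `M[i]`:

```python
import json

import numpy as np

# Hypothetical filename; use whatever path imitate_skills.py actually wrote.
with open('results.json') as f:
    results = json.load(f)

M = np.array(results['M'])                 # (num. expert skills, num. student skills)
dist_vec = np.array(results['dist_vec'])   # (num. expert skills,)

# Assumption: the retrieved student skill for expert skill i is argmax of M[i].
retrieved = M.argmax(axis=1)
for i, (j, d) in enumerate(zip(retrieved, dist_vec)):
    print('expert skill %d -> student skill %d (distance %.3f)' % (i, j, d))
```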
The finetuning experiments require a checkpoint of saved skills to use as initialization. To run the finetuning experiment:
```bash
# Usage: python examples/mujoco_all_diayn_finetune.py --env=<ENV> --snapshot=<snapshot.pkl> --log_dir=<log_dir>
python examples/mujoco_all_diayn_finetune.py --env=half-cheetah --snapshot=data/demo/half_cheetah/seed_1/itr_2500.pkl --log_dir=data/finetune/half_cheetah
```
Feel free to modify any part of the code to test new hypotheses and find better algorithms for unsupervised skill discovery. We offer two pieces of advice:
- When doing hyperparameter searches, include the names of the hyperparameters in the list `TAG_KEYS`. Names included in `TAG_KEYS` will be appended to the checkpoint filenames (a hypothetical example follows this list).
- Create a dedicated `logs` folder and number your experiments. In our experiments, we used `log_dir=<log_dir>/DIAYN_XXX`, where `XXX` was the experiment number.
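A hypothetical illustration of the first point (the names below are placeholders; use the hyperparameter names actually defined in the script):

```python
# Hypothetical hyperparameter names; substitute the names used in
# examples/mujoco_all_diayn.py. Every name listed here is appended to the
# checkpoint filenames, which makes sweep runs easy to tell apart.
TAG_KEYS = ['env', 'num_skills', 'seed']
```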
There are many open problems raised by DIAYN; we list some of them below. If you are interested in working on any of these, feel free to reach out to [email protected].
- Try running DIAYN on a new environment. Some good candidates are the environments in roboschool, the Box2D environments in Gym, and the environments in pybullet. Send GIFs of trained skills to [email protected] for inclusion on the project website (we will make sure you get credit).
- Use pretrained skills to create a robot that dances to some song.
- Use pretrained skills to create a game, where the human player decides which skill to use.
- There is a chicken-and-egg dilemma in DIAYN: skills learn to be diverse by using the discriminator's decision function, but the discriminator cannot learn to discriminate skills if they are not diverse. We found that synchronous updates of the policy and discriminator worked well enough for our experiments. We expect that more careful balancing of discriminator and policy updates would accelerate learning, leading to more diverse states. A first step in this direction is to modify the training loop to do N discriminator updates for every M policy updates, where N and M are tunable hyperparameters (a schematic sketch follows this list).
- While DIAYN was developed on top of Soft Actor-Critic, it could be applied on top of any RL algorithm. What are the benefits and limitations of applying DIAYN on top of (say) PPO or Evolution Strategies?
- In preliminary experiments, we tried using a continuous prior distribution p(z). Neither a uniform nor Gaussian prior distribution allowed us to learn skills as well as a Categorical prior. Can continuous priors be made to work? One initial idea is to use a Dirichlet distribution as the prior, as the Dirichlet can interpolate between a categorical distribution (which we know works) and a uniform distribution (which doesn't work yet).
- Can skills discovered by DIAYN be used for hierarchical reinforcement learning? We suspect the answer is yes, but have not yet gotten it working. A good first step would be to learn a meta-policy that chooses which skill to use for the next (say) 100 steps.
- Can we "evolve" an expanding set of diverse skills? In our experiments, we fixed the number of skills a priori, and sampled uniformly from this distribution during training. Taking inspiration from evolutionary algorithms, we might be able to learn a set of skills that expands over time. One approach is to start with two skills, and add a new skill every time the existing skills are sufficiently distinguishable.
- Can new skills be initialized more intelligently? In our experiments, we randomly initialize a skill the first time it is used. An alternative would be to initialize the new skill with whatever existing skill is currently most distinguishable. This has the interpretation of doing a tree-search, where the most diverse leaf is split in two.
- How can large-scale, distributed training be used to more quickly learn a larger set of more diverse skills? A straightforward way to apply DIAYN in the distributed setting is to send a discriminator to each worker and have each worker independently learn a set of diverse skills. The skills are then aggregated by a central coordinator, which learns a new global discriminator. The new discriminator is sent to each worker, and the process is repeated.
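A schematic sketch of the N-to-M update schedule mentioned above. The functions below are placeholders standing in for the repository's actual sampling and update steps, not its API:

```python
# Schematic only: stand-ins for the real replay sampling and update steps.
def sample_batch():
    """Placeholder: sample a batch of (state, skill) pairs from the replay buffer."""

def update_discriminator(batch):
    """Placeholder: one gradient step on the skill discriminator."""

def update_policy(batch):
    """Placeholder: one gradient step on the skill-conditioned policy (e.g. SAC)."""

N_DISCRIMINATOR_UPDATES = 4  # hypothetical value of N
M_POLICY_UPDATES = 1         # hypothetical value of M
num_iterations = 1000        # hypothetical training length

for iteration in range(num_iterations):
    for _ in range(N_DISCRIMINATOR_UPDATES):
        update_discriminator(sample_batch())
    for _ in range(M_POLICY_UPDATES):
        update_policy(sample_batch())
```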
The DIAYN algorithm was developed by Benjamin Eysenbach, in collaboration with Abhishek Gupta, Julian Ibarz, and Sergey Levine. Work was completed while Benjamin was a member of the Google AI Residency. We thank Tuomas Haarnoja for the implementation of SAC on top of which this project is built.
This project is released under the Apache 2.0 License. This is not an official Google product.
```
@article{eysenbach2018,
  author = "Eysenbach, Benjamin and Gupta, Abhishek and Ibarz, Julian and Levine, Sergey",
  title = "Diversity is All You Need: Learning Skills without a Reward Function",
  year = "2018"
}
```