Summary:
They explain why rigorous experiments are crucial for research progress, point out several issues with Deep RL experiments, and illustrate their points with extensive experiments.
They demonstrate the effect of:
- Random seeds: evaluating on 5 seeds is not enough; more are needed.
- Hyper-parameters: huge effect, hence the need for grid search.
- Environments: some algorithms only work well in particular environments.
- Codebase: different implementations of the same algorithm lead to different results.
- Evaluation metrics: the max return is not meaningful; there is no single best metric.
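The seed and metric points can be illustrated with a small sketch. The per-seed returns below are made-up numbers (not from the article), chosen so that one algorithm "wins" on max return thanks to a single lucky seed while losing on the mean; the bootstrap confidence interval makes the uncertainty visible:

```python
import random
import statistics

# Hypothetical final returns over 10 seeds for two algorithms (illustrative numbers only).
returns_a = [210, 195, 260, 188, 240, 205, 198, 230, 215, 192]
returns_b = [180, 175, 310, 170, 185, 178, 182, 176, 179, 181]

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples)])

# B has the higher max (one lucky seed hit 310), but A has the higher mean:
print("max  A:", max(returns_a), " B:", max(returns_b))
print("mean A:", statistics.mean(returns_a), " B:", statistics.mean(returns_b))
print("95% CI A:", bootstrap_ci(returns_a))
print("95% CI B:", bootstrap_ci(returns_b))
```

Reporting only the max would rank B above A here; the mean with a confidence interval tells the opposite story, which is exactly the kind of reporting pitfall the article warns about.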
Final thoughts:
Great article because what they show is striking. I need to follow their recommendations, i.e.:
- open-source code, make it reproducible
- grid-search
- use many random seeds (> 50)
- be careful with evaluation metrics