Horovod is an open-source library for distributed deep learning.
It uses the Ring-AllReduce algorithm for efficient distributed training of neural networks.
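To make the idea concrete, here is a minimal pure-Python simulation of ring all-reduce. It is only an illustrative sketch, not Horovod's actual implementation (which delegates to NCCL or MPI). The function name `ring_allreduce` and the restriction that the vector length be divisible by the number of workers are assumptions made for simplicity. Each worker splits its vector into as many chunks as there are workers, then passes chunks around the ring in two phases: a scatter-reduce phase that accumulates sums, and an all-gather phase that distributes the finished chunks.

```python
def ring_allreduce(worker_data):
    """Simulate a sum ring all-reduce over equal-length lists, one list per worker."""
    n = len(worker_data)
    length = len(worker_data[0])
    if length % n != 0:
        raise ValueError("sketch assumes vector length is divisible by worker count")
    size = length // n

    # Split every worker's vector into n chunks.
    bufs = [[list(vec[c * size:(c + 1) * size]) for c in range(n)]
            for vec in worker_data]

    # Phase 1 (scatter-reduce): after n - 1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot all messages first, since real workers send simultaneously.
        sends = [(i, (i - step) % n, list(bufs[i][(i - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n  # right-hand neighbour in the ring
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], payload)]

    # Phase 2 (all-gather): circulate the finished chunks so every worker
    # ends up with the complete result.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, list(bufs[i][(i + 1 - step) % n]))
                 for i in range(n)]
        for src, c, payload in sends:
            dst = (src + 1) % n
            bufs[dst][c] = payload  # overwrite with the already-reduced chunk

    # Flatten the chunks back into one vector per worker.
    return [[x for chunk in buf for x in chunk] for buf in bufs]


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_allreduce(gradients))  # every worker ends with [111, 222, 333]
```

Each worker only ever talks to its neighbour and sends one chunk per step, which is why the per-worker communication volume stays roughly constant as workers are added. Dividing the summed result by the number of workers would give the averaged gradients used in data-parallel training.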
This repository is a simple hands-on guide to using Horovod with PyTorch inside NVIDIA-Docker.
The aim is to provide a template for other projects that use Horovod with PyTorch.
It also attempts to provide a more detailed explanation of what is going on.
As of February 2020, the Horovod documentation leaves much to the imagination.
Here, I try to explain the details as much as I can.
Please star/fork my repository if you find this tutorial helpful!
To run this project, please install NVIDIA-Docker first.
Unfortunately for Windows users, NVIDIA-Docker is available only for Linux at the time of writing.
NVIDIA-Docker has many dependencies, such as the NVIDIA driver and Docker.
These are all necessary for this project.
I am using Docker because I have found that local installation often fails.
This is likely due to complicated dependency issues.
Also, catastrophic errors are easier to handle in a Docker container than on a local machine.
Please review the basic Docker concepts before working with this project.
Don't be afraid! It's not that difficult to understand!
The Dockerfile builds an Ubuntu 18.04 LTS image with CUDA 10.0, cuDNN 7.6.0.64-1, NCCL 2.4.7-1, and OpenMPI 4.0.2.
The Python version is 3.6.7, PyTorch is 1.4.0, and torchvision is 0.5.0.
The settings were modified from the official Horovod image available at the time of writing.
That official Horovod Docker image has a problem: Pillow 7 is incompatible with torchvision 0.4.2.
The example task is deliberately simple: CIFAR10 classification with ResNet34.
Its main purpose is to explain what is going on, not to reach the best accuracy.
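For orientation, the following is a minimal sketch of what a Horovod training function for this kind of task can look like. It is not necessarily the script shipped in this repository. The names `scale_learning_rate` and `train`, the hyperparameter values, and the per-rank data directory are all assumptions made for illustration. The Horovod calls themselves (`hvd.init`, `hvd.DistributedOptimizer`, `hvd.broadcast_parameters`, `hvd.broadcast_optimizer_state`) are the standard `horovod.torch` API.

```python
def scale_learning_rate(base_lr, world_size):
    """Scale the single-GPU learning rate linearly with the worker count.

    This is a common heuristic, assumed here; it is not a requirement.
    """
    return base_lr * world_size


def train(epochs=1, batch_size=128, base_lr=0.0125, data_dir="./data"):
    # Imports live inside the function so the helper above can be used
    # on machines without the GPU stack installed.
    import torch
    import torch.nn.functional as F
    import torchvision
    import torchvision.transforms as T
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    import horovod.torch as hvd

    hvd.init()                               # one process per GPU
    torch.cuda.set_device(hvd.local_rank())  # pin this process to its own GPU

    # A separate download directory per rank avoids processes racing
    # to write the same files (hypothetical layout).
    dataset = torchvision.datasets.CIFAR10(
        f"{data_dir}-{hvd.rank()}", train=True, download=True,
        transform=T.ToTensor())
    # Each worker reads a distinct shard of the dataset.
    sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    model = torchvision.models.resnet34(num_classes=10).cuda()
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=scale_learning_rate(base_lr, hvd.size()), momentum=0.9)
    # Wrap the optimizer so gradients are averaged across workers.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    # Start every worker from identical weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    model.train()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for images, labels in loader:
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
        if hvd.rank() == 0:  # log from a single worker only
            print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```

Inside the container, a script that calls `train()` would typically be launched with something like `horovodrun -np 4 python train.py`, where `train.py` is a hypothetical file name and `-np` is the number of processes (usually one per GPU).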