Reproducible ImageNet training with Ignite

In this example, we provide a script and tools to run reproducible experiments on training neural networks on the ImageNet dataset.

Features:

Distributed training with mixed precision by nvidia/apex
Experiment tracking with MLflow, Polyaxon, or TRAINS

There are three options for experiment tracking: 1) MLflow, 2) Polyaxon, or 3) TRAINS.

Experiment tracking with TRAINS or MLflow is better suited to a local machine with GPU(s). Experiment tracking with Polyaxon requires Polyaxon to be installed on a machine, cluster, or cloud, where experiments can be scheduled with polyaxon-cli. Choose one option and skip the descriptions of the others.

Notes for experiments tracking with MLflow
Notes for experiments tracking with Polyaxon
Notes for experiments tracking with TRAINS

Implementation details

Files tree description:

code
  |___ dataflow : module provides data loaders and various transformers
  |___ scripts : executable training script
  |___ utils : other helper modules

configs
  |___ train : training python configuration files  
  
experiments 
  |___ mlflow : MLflow related files
  |___ plx : Polyaxon related files
  |___ trains : requirements.txt to install Trains python package
 
notebooks : jupyter notebooks to check specific parts from code modules

Code and configs

py_config_runner

We use py_config_runner package to execute python scripts with python configuration files.

Training script

The training script is located in code/scripts and contains

training.py, a single training script that can use any one of the MLflow / Polyaxon / Trains experiment tracking systems.

The training script contains the run method required by py_config_runner to run a script with a configuration.

The split between the training script and the python configuration file is the following. The configuration file, itself a python script, defines the components necessary for neural network training:

Dataflow: training/validation/train evaluation data loaders with custom data augmentations
Model
Optimizer
Criterion
LR scheduler
other parameters: device, number of epochs, etc
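The split described above can be illustrated with a minimal, standard-library-only sketch. It does not use py_config_runner itself; it only mimics the idea of loading a python configuration file and passing it to a run function. The helper `load_config`, the file name `toy_config.py`, the signature `run(config)`, and the toy one-parameter model are all assumptions made for illustration, not part of this example's code.

```python
import runpy
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Hypothetical configuration file: plain Python that defines the components.
# A real config would build torch data loaders, a model, an optimizer, etc.
CONFIG_SOURCE = '''
num_epochs = 2
device = "cpu"
learning_rate = 0.1
train_loader = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs
model = {"w": 0.0}  # toy one-parameter model: prediction = w * x

def criterion(prediction, target):
    return (prediction - target) ** 2
'''


def load_config(path):
    """Execute a Python config file and expose its public names as attributes."""
    values = runpy.run_path(str(path))
    public = {k: v for k, v in values.items() if not k.startswith("_")}
    return SimpleNamespace(**public)


def run(config):
    """Entry point of the script: builds the training loop from config components."""
    w = config.model["w"]
    for _ in range(config.num_epochs):
        for x, y in config.train_loader:
            grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
            w -= config.learning_rate * grad
    config.model["w"] = w
    # Report the mean loss over the training data after the last epoch.
    losses = [config.criterion(w * x, y) for x, y in config.train_loader]
    return sum(losses) / len(losses)


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "toy_config.py"
    config_path.write_text(CONFIG_SOURCE)
    config = load_config(config_path)
    final_loss = run(config)
```

The point of the design is that the script stays unchanged between experiments: a different model or optimizer only requires a different configuration file.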

The training script uses these components to set up and run the training and validation loops. By default, a process group with the "nccl" backend is initialized for the distributed configuration (even for a single GPU).

The training script is generic, uses the ignite.distributed API, and adapts the training components to the provided distributed configuration (e.g. uses the DistributedDataParallel model wrapper, uses distributed sampling, scales the batch size, etc.).
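To make "distributed sampling" and "scales batch size" more concrete, here is a small pure-Python sketch of the underlying arithmetic. It is not the ignite.distributed implementation; the function names `per_process_batch_size` and `shard_indices` are hypothetical. It assumes that the configured batch size is a global one divided evenly across processes, and that the dataset is sharded by striding over indices with wrap-around padding.

```python
import math


def per_process_batch_size(global_batch_size, world_size):
    """Split a global batch size evenly across processes (assumed policy)."""
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world size")
    return global_batch_size // world_size


def shard_indices(num_samples, rank, world_size):
    """Return the sample indices assigned to `rank`.

    Indices are padded by wrap-around so that every process receives the
    same number of samples. Assumes num_samples >= world_size.
    """
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # pad with the leading indices
    return indices[rank:total:world_size]


# Example: 10 samples and a global batch of 256 on 4 processes.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
local_batch = per_process_batch_size(256, 4)
```

Each process then iterates only over its own shard with the smaller local batch, so the effective batch size per optimization step stays equal to the global one.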

Configurations

baseline_resnet50.py : trains ResNet50

Results

Model	Training Top-1 Accuracy	Training Top-5 Accuracy	Test Top-1 Accuracy	Test Top-5 Accuracy
ResNet-50	78%	92%	77%	94%

Acknowledgements

Part of the training was done within the Tesla GPU Test Drive on 2 Nvidia V100 GPUs.
