You can run ImageNet training on a local machine and track experiments/runs with the MLflow tracking system.
We use conda and MLflow to handle experiments/runs and all Python dependencies.
Please install these tools:
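For example, a minimal setup could look like this (assuming a Linux x86_64 machine; the Miniconda installer and a pip-based MLflow install are one option among several):
# download and run the Miniconda installer (pick the one matching your platform)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# install MLflow into the active environment
pip install mlflow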
We also need to install NVIDIA/apex and libraries for OpenCV. APEX is installed automatically on the first run.
Alternatively, everything can be installed manually with the following commands.
Important: please check the content of experiments/setup_opencv.sh before running it.
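For example, print the script to the terminal to review it first:
cat experiments/setup_opencv.sh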
sh experiments/setup_apex.sh
sh experiments/setup_opencv.sh
Since 10/2019, an account registration is required to download the dataset. To request access, use the following form: http://www.image-net.org/download.php
To configure the path to an already downloaded ImageNet dataset, set the DATASET_PATH environment variable:
export DATASET_PATH=/path/to/imagenet
# export DATASET_PATH=$PWD/input/imagenet
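As a quick sanity check, you can verify the path exists and contains the usual train/val splits (an assumption about the expected layout; verify it against the data loading code in configs):
# check the dataset root (assumption: it contains train/ and val/ subfolders)
ls "$DATASET_PATH"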
Set up the MLflow output path as local storage (remote storage is not supported):
export MLFLOW_TRACKING_URI=/path/to/output/mlruns
# e.g export MLFLOW_TRACKING_URI=$PWD/output/mlruns
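If the output folder does not exist yet, you can create it up front (MLflow's local file store usually creates missing folders itself, so this is only a precaution):
mkdir -p "$MLFLOW_TRACKING_URI"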
Create the "Trainings" experiment (this is needed only once):
mlflow experiments create -n Trainings
or check existing experiments:
mlflow experiments list
Please make sure to adapt the training data loader batch size to your GPU type; a way to locate the setting is sketched after the run command below. By default, the batch size is 64.
export MLFLOW_TRACKING_URI=/path/to/output/mlruns
# e.g export MLFLOW_TRACKING_URI=$PWD/output/mlruns
mlflow run experiments/mlflow --experiment-name=Trainings -P config_path=configs/train/baseline_r50.py -P num_gpus=1
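As noted above, you may need to change the batch size. A quick way to locate the setting, assuming the config file defines a parameter literally named batch_size (an assumption; the real name may differ):
# find where the batch size is defined in the training config
grep -n "batch_size" configs/train/baseline_r50.py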
For optimal device usage, please make sure to adapt the training data loader batch size to your infrastructure. By default, the batch size is 64 per process.
export MLFLOW_TRACKING_URI=/path/to/output/mlruns
# e.g export MLFLOW_TRACKING_URI=$PWD/output/mlruns
mlflow run experiments/mlflow --experiment-name=Trainings -P config_path=configs/train/baseline_r50.py -P num_gpus=2
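Before picking num_gpus, you can check which GPUs are visible on the machine:
# list available NVIDIA GPUs and their current memory usage
nvidia-smi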
To visualize experiments and runs, you can start the MLflow dashboard:
mlflow server --backend-store-uri /path/to/output/mlruns --default-artifact-root /path/to/output/mlruns -p 6026 -h 0.0.0.0
# e.g mlflow server --backend-store-uri $PWD/output/mlruns --default-artifact-root $PWD/output/mlruns -p 6026 -h 0.0.0.0
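Once the server is up, the dashboard is reachable in a browser on port 6026; on a Linux desktop session, for example:
# open the MLflow UI (assumption: the server runs on the local machine)
xdg-open http://localhost:6026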
To visualize experiments and runs, you can also start TensorBoard:
tensorboard --logdir /path/to/output/mlruns/1
# e.g tensorboard --logdir $PWD/output/mlruns/1
where /1 points to the "Trainings" experiment.
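The numeric folder names under the tracking URI are MLflow experiment IDs; you can list them to find the one matching your experiment:
# each subfolder of the local file store corresponds to one experiment ID
ls "$MLFLOW_TRACKING_URI"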
File tree description:
- code
- configs
- experiments/mlflow : MLflow-related files
  - conda.yaml: defines all Python dependencies necessary for our experiments
  - MLproject: defines the types of experiments we would like to perform via "entry points" (a hypothetical sketch follows the list):
    - main: starts the single-node multi-GPU training script
- notebooks
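Since MLproject drives what mlflow run executes, here is a hedged sketch of its general shape; the parameter names and command in the comments are illustrative assumptions, not the actual file content:
# MLproject files generally look like this (hypothetical sketch):
#
#   name: imagenet-training
#   conda_env: conda.yaml
#   entry_points:
#     main:
#       parameters:
#         config_path: string
#         num_gpus: {default: 1}
#       command: "python main.py {config_path} --num_gpus {num_gpus}"
#
# print the actual definition:
cat experiments/mlflow/MLproject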
When we execute
mlflow run experiments/mlflow --experiment-name=Trainings -P config_path=configs/train/baseline_r50.py -P num_gpus=2
MLflow executes the main entry point from MLproject and runs the command defined there with the provided parameters.